I have HTML like this:
<h1> Header is here</h1>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
<h2> Next Header 2</h2>
<p>not interested</p>
<p>not interested</p>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
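I have some basic Nokogiri CSS node searches that return content, but I can't find an example of how to target all the text between the nth closing H2 and the next opening H2. I'm writing the output to a CSV, so I'd also like to read in a list of files and put the URL as the first field of each row.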
require 'rubygems'
require 'nokogiri'
h = '<h1> Header is here</h1>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
<h2> Next Header 2</h2>
<p>not interested</p>
<p>not interested</p>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
'
doc = Nokogiri::HTML(h)
# Specify the ranges of sections you want to extract. Sections are numbered by
# counting delimiter tags from the top: here h1 is 1, the first h2 is 2, and so on.
# The triple dot excludes the end point, so 2...3 covers only section 2.
EXTRACT_RANGES = [
  2...3,
  4...5
]
# Tags which count as delimiters, not to be extracted
DELIMITER_TAGS = [
  "h1",
  "h2"
]
extracted_text = []
i = 0
# Change /"html"/"body" to the correct path of the tag which contains this list
(doc/"html"/"body").children.each do |el|
  if DELIMITER_TAGS.include?(el.name)
    # Each delimiter tag starts a new section
    i += 1
  elsif EXTRACT_RANGES.any? { |cur_range| cur_range.include?(i) }
    s = el.inner_text.strip
    # Skip whitespace-only text nodes between the tags
    extracted_text << s unless s.empty?
  end
end
# Print out extracted text (each element's inner text is separated by newlines)
puts extracted_text.join("\n")
Sometimes you can use NodeSet's & operator to get the nodes in between:
doc.xpath('//h2[1]/following-sibling::p') & doc.xpath('//h2[2]/preceding-sibling::p')
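If the start and stop elements share the same parent, a single XPath is all you need. First I'll show it with a simplified document for clarity, then with your sample document: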
XML = "<root>
<a/><a1/><a2/>
<b/><b1/><b2/>
<c/><c1/><c2/>
</root>"
require 'nokogiri'
xml = Nokogiri::XML(XML)
# Find all elements between 'a' and 'c'
p xml.xpath('//*[preceding-sibling::a][following-sibling::c]').map(&:name)
#=> ["a1", "a2", "b", "b1", "b2"]
# Find all elements between 'a' and 'b'
p xml.xpath('//*[preceding-sibling::a][following-sibling::b]').map(&:name)
#=> ["a1", "a2"]
# Find all elements after 'c'
p xml.xpath('//*[preceding-sibling::c]').map(&:name)
#=> ["c1", "c2"]
Now, here is your use case (finding by index):
HTML = "<h1> Header is here</h1>
<h2>Header 2 is here</h2>
<p>Extract me!</p>
<p>Extract me too!</p>
<h2> Next Header 2</h2>
<p>not interested</p>
<p>not interested</p>
<h2>Header 2 is here</h2>
<p>Extract me three!</p>
<p>Extract me four!</p>"
require 'nokogiri'
html = Nokogiri::HTML(HTML)
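# Find all elements between the first and second h2s.
# Note: each predicate is evaluated relative to the candidate element, so this
# selects elements with an h2 before them and at least two h2s after them.
# That matches the first section only because this document has exactly three h2s.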
p html.xpath('//*[preceding-sibling::h2[1]][following-sibling::h2[2]]').map(&:content)
#=> ["Extract me!", "Extract me too!"]
# Find all elements between the third h2 and the end
p html.xpath('//*[preceding-sibling::h2[3]]').map(&:content)
#=> ["Extract me three!", "Extract me four!"]
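Here's a simple (naïve) implementation that assumes the start and stop elements share the same parent, and lets you specify the start and stop XPaths independently: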
HTML = "<h1>Header is here</h1>
<h2>Header 2 is here</h2>
<p>Extract me!</p>
<p>Extract me too!</p>
<h2> Next Header 2</h2>
<p>not interested</p>
<p>not interested</p>
<h2>Header 2 is here</h2>
<p>Extract me three!</p>
<p>Extract me four!</p>"
require 'nokogiri'
class Nokogiri::XML::Node
  # Naive implementation; assumes found elements will share the same parent
  def content_between( start_xpath, stop_xpath=nil )
    node = at_xpath(start_xpath).next_element
    stop = stop_xpath && at_xpath(stop_xpath)
    [].tap do |content|
      while node && node!=stop
        content << node
        node = node.next_element
      end
    end
  end
end
html = Nokogiri::HTML(HTML)
puts html.content_between('//h2[1]','//h2[2]').map(&:content)
#=> Extract me!
#=> Extract me too!
puts html.content_between('//h2[3]').map(&:content)
#=> Extract me three!
#=> Extract me four!
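This code may help you, but it still needs more information about where the tags are located (it would be better if the information you need to extract sat between some specific tags):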
require 'rubygems'
require 'nokogiri'
require 'pp'
html = '<h1> Header is here</h1>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
<h2> Next Header 2</h2>
<p>not interested</p>
<p>not interested</p>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
'
doc = Nokogiri::HTML(html)
# Print every p element in the document, regardless of which h2 it follows
doc.xpath("//p").each do |el|
  pp el
end