I need to parse image URLs out of HTML like the following:
<p><a href="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" ><img class="aligncenter size-full wp-image-12313" alt="Example image Name" src="http://blog.website.com/wp-content/uploads/2012/02/image_name.jpg" width="630" height="119" /></a></p>
So far I am using Nokogiri to parse the <h2> tags:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
# On Ruby 3.0+ open-uri no longer patches Kernel#open, so call URI.open.
page = Nokogiri::HTML(URI.open("http://blog.website.com/"))
headers = page.css('h2')
puts headers.text
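I have two questions:
- How do I parse the image URLs?
- Ideally, I would print to the console in the following format:
1. Header 1
   image_url 1
   image_url 2 (if any)
2. Header 2
   image_url 1
   image_url 2 (if any)
So far I have not been able to print my headers in this nice format. How would I do that?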
<h2><a href="http://blog.website.com/2013/02/15/images/" rel="bookmark" title="Permanent Link to Blog Post">Blog Post</a></h2>
<p class="post_author"><em>by</em> author</p>
<div class="format_text">
<p style="text-align: left;">Blog Content </p>
<p style="text-align: left;"> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><a href="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" ><img class="alignnone size-full wp-image-23382" alt="image2" src="http://blog.website.com/wp-content/uploads/2012/02/image21.jpg" width="630" height="210" /></a></p>
<p style="text-align: left;">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Items: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvaf812e3" target="_blank">Items for Spring</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">More Items: <a href="http://www.website.com/threads#/show/thread/A_abv2a6822e2" target="_blank">Lorem Ipsum</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Still more items: <a href="http://www.website.com/threads#/show/thread/A_abv7af882e3" target="_blank">Items:</a></b></p>
<p style="text-align: center;">Lorem Ipsum.</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">Lorem ipsum: <a href="http://www.website.com/threads?src=login#/show/thread/A_abvea6832e8" target="_blank">Items</a></b></p>
<p style="text-align: center;">Lorem Ipusm</p>
<p style="text-align: center;"><b id="internal-source-marker_0.054238131968304515">
</div>
<p class="to_comments"><span class="date">February 15, 2013</span> <span class="num_comments"><a href="http://blog.website.com/2013/02/15/Blog-post/#respond" title="Comment on Blog Post">No Comments</a></span></p>
I think it makes more sense to group by h2 first:
doc.search('h2').each_with_index do |h2, i|
  puts "#{i+1}."
  puts h2.text
  h2.search('+ p + div > p[3] img').each do |img|
    puts img['src']
  end
end
To get the images, just look for img tags that have a src attribute.
If you want the h2 associated with each image, you can do this:
doc.xpath('//img').each do |img|
  puts "Header: #{img.xpath('preceding::h2[1]').text}"
  puts " Image: #{img['src']}"
end
Note that the switch to XPath is for the preceding:: axis.
Edit
To group by header, you can put them into a hash:
headers = Hash.new{|h,k| h[k] = []}
doc.xpath('//img').each do |img|
  header = img.xpath('preceding::h2[1]').text
  image = img['src']
  headers[header] << image
end
To get the output you specified:
headers.each do |h,urls|
  puts "#{h} #{urls.join(' ')}"
end
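If the numbered, one-URL-per-line layout from the question is wanted instead of a single line per header, something along these lines might work. The format_posts helper and the sample hash are hypothetical; the hash only mimics the shape produced by the grouping step above.

```ruby
# Hypothetical data in the shape built by the hash-grouping step.
headers = {
  'Post One' => ['http://example.com/a.jpg'],
  'Post Two' => ['http://example.com/b.jpg', 'http://example.com/c.jpg']
}

# Builds "N. title" followed by one indented line per image URL.
def format_posts(headers)
  headers.each_with_index.map do |(title, urls), i|
    lines = ["#{i + 1}. #{title}"]
    urls.each { |url| lines << "   #{url}" }
    lines.join("\n")
  end.join("\n")
end

puts format_posts(headers)
```

Since the hash is filled while iterating over images, a post that contains no image never gets a key and therefore never appears in this output.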
Here is the code I ended up using. Feel free to critique it (I might learn something):
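require 'rubygems'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open("http://blog.website.com/"))
# Each post title is the bookmark link inside an h2.
doc.xpath('//h2/a[@rel = "bookmark"]').each_with_index do |header, i|
  puts i+1
  puts " Title: #{header.text}"
  # following::img[n] is the n-th img after the header in document order;
  # [0] is nil (and ["src"] raises) when fewer than n images follow.
  puts " Image 1: #{header.xpath('following::img[1]')[0]["src"]}"
  puts " Image 2: #{header.xpath('following::img[2]')[0]["src"]}"
end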
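I once did something similar (in fact I wanted exactly the same output). This solution is easy to follow.
Depending on the structure of your DOM, you could do something like this: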
body = page.css('div.format_text')
headers = page.css('div#content_inner h2 a')
post_counter = 1
# Pair each post body with the header at the same index.
body.each_with_index do |post, index|
  header = headers[index]
  puts "#{post_counter}. " + header.text
  # Print only absolute URLs, i.e. src values that start with "http".
  post.css('p a img, div > img').each{|img| puts img['src'] if img['src'].match(/\Ahttp/) }
  post_counter += 1
end
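So basically, you walk through each header and print the one or more images that belong to it. On the page I was parsing, the headers sit outside the div that holds the images, which is why I use two different variables to find them (body/headers). Also, when looking for images I target two selectors (p a img and div > img), because that is how this particular DOM is structured.
This should give you nice, clean output, just like you wanted.
Hope this helps!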