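I'm trying to work out the best way to parse a search results screen made up of 25 repeating chunks similar to the following:
Name: JOHN DOE / COMPANY NAME
Status: ACTIVE
Date Joined: 2007-08-17
Address: 123 MAIN STREET
City: ANYTOWN State/Territory/Other: NEW YORK Country: US
Postal Code/Zip Code: 10101
I've managed to parse and clean up this page and return one of the 25 result sets, but I'm stuck on how to return the rest. I thought about using a variable that increments from 9 to 33, but couldn't get it to work. The code I'm using is: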
require "nokogiri"
class String
  # Remove tabs, carriage returns, newlines and non-breaking spaces, then trim.
  def astrip
    self.gsub(/([\x09|\x0D|\n|\t])|(\xc2\xa0){1,}/u, '').strip
  end
end
i = 9
f = File.open("testpage.html", "r:iso-8859-1:utf-8")
doc = Nokogiri::HTML(f)
NAME = doc.css(":nth-child(" + i.to_s + ") div:nth-child(1) a").text.astrip.split("/")
NAME_URL = doc.css(":nth-child(" + i.to_s + ") div:nth-child(1) a").map { |link| link['href'] }
STATUS = doc.css(":nth-child(" + i.to_s + ") div:nth-child(2) a").text
JOINED = doc.css(":nth-child(" + i.to_s + ") div:nth-child(3)").text.gsub("Date Joined:", "").astrip.strip
ADDRESS1 = doc.css(":nth-child(" + i.to_s + ") div:nth-child(4)").text.gsub("Address:", "").astrip.strip
ADDRESS2 = doc.css(":nth-child(" + i.to_s + ") div:nth-child(5)").text.astrip.gsub("City:", "").gsub("State/Territory/Other", "").gsub("Country", "").split(":")
ZIPCODE = doc.css(":nth-child(" + i.to_s + ") div:nth-child(6)").text.gsub("Postal Code/Zip Code:", "").astrip.strip
Output = NAME[0].strip, NAME[1].strip, NAME_URL[0].to_s.strip, STATUS, JOINED, ADDRESS1, ADDRESS2[0].strip, ADDRESS2[1].strip, ADDRESS2[2].strip, ZIPCODE
p Output
It returns output I'm happy with, which looks like this:
["JOHN DOE", "COMPANY NAME", "http://linktoprofile/johndoe", "ACTIVE", "2007-08-17", "123 MAIN STREET", "ANYTOWN", "NEW YORK", "US", "10101"]
Without sample HTML, there's only so much we can do to give you a working solution.
This should give you a starting point to work from:
require 'nokogiri'
html = <<EOT
<html>
<body>
<div>
<p><b>Name:</b> JOHN DOE / COMPANY NAME</p>
<p><b>Status:</b> ACTIVE</p>
<p><b>Date Joined:</b> 2007-08-17</p>
<p><b>Address:</b> 123 MAIN STREET</p>
<p><b>City:</b> ANYTOWN <b>State/Territory/Other:</b> NEW YORK <b>Country:</b> US</p>
<p><b>Postal Code/Zip Code:</b> 10101</p>
</div>
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
data = doc.search('div').map { |div|
  # './p[n]' is relative to the current div; '//p[n]' would search the
  # whole document and return the first card's values for every div.
  name = div.at('./p[1]').text[/:(.+)/, 1].strip
  status = div.at('./p[2]').text[/:(.+)/, 1].strip
  date_joined = div.at('./p[3]').text[/:(.+)/, 1].strip
  address = div.at('./p[4]').text[/:(.+)/, 1].strip
  city_state_country = div.at('./p[5]').text
  postal_code = div.at('./p[6]').text[/:(.+)/, 1].strip

  # The fifth paragraph holds three labelled values, so capture each one.
  city, state, country = city_state_country.match(%r{City:(.+) State/Territory/Other:(.+) Country:(.+)}).captures.map { |s| s.strip }

  {
    :name => name,
    :status => status,
    :date_joined => date_joined,
    :address => address,
    :city => city,
    :state => state,
    :country => country,
    :postal_code => postal_code
  }
}
The resulting output looks like:
require 'pp'
pp data
# >> [{:name=>"JOHN DOE / COMPANY NAME",
# >> :status=>"ACTIVE",
# >> :date_joined=>"2007-08-17",
# >> :address=>"123 MAIN STREET",
# >> :city=>"ANYTOWN",
# >> :state=>"NEW YORK",
# >> :country=>"US",
# >> :postal_code=>"10101"}]
If you want an array of arrays instead, use this in the map block:
[
name,
status,
date_joined,
address,
city,
state,
country,
postal_code
]
which will generate:
# >> [["JOHN DOE / COMPANY NAME",
# >> "ACTIVE",
# >> "2007-08-17",
# >> "123 MAIN STREET",
# >> "ANYTOWN",
# >> "NEW YORK",
# >> "US",
# >> "10101"]]
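The output in the question has the person and the company as separate elements. A small sketch of that last step, assuming the name field always uses `/` as the separator (`split_name` is a made-up helper name for this example, and the row is shortened to three fields):

```ruby
# Hypothetical helper: split "PERSON / COMPANY" into two trimmed strings.
# The limit of 2 keeps any later slashes inside the company name.
def split_name(name)
  person, company = name.split('/', 2).map { |s| s.strip }
  [person, company.to_s] # company is "" when there is no slash
end

row = ["JOHN DOE / COMPANY NAME", "ACTIVE", "2007-08-17"]
person, company = split_name(row.first)
flat = [person, company] + row.drop(1)
p flat
# >> ["JOHN DOE", "COMPANY NAME", "ACTIVE", "2007-08-17"]
```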
Another way to do the lookups, which I think is easier to maintain because it's DRYer, is:
data = doc.search('div').map { |div|
  name,
  status,
  date_joined,
  address,
  city,
  state,
  country,
  postal_code = [
    'Name',
    'Status',
    'Date Joined',
    'Address',
    'City',
    'State/Territory/Other',
    'Country',
    'Postal Code/Zip Code'
  ].map { |t|
    # './p/b' keeps the lookup inside the current div; the value is the
    # text node that follows the matching <b> label.
    div.at( %Q(./p/b[text()="#{t}:"]) ).next.text.strip
  }