Extracting optional address components with Nokogiri



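This is my first attempt at parsing a web page with Nokogiri.

I am trying to extract addresses from a web page and store them in a CSV file. So far I can only extract the City, State, and Zip fields.

I do not know how to extract the facility name, street address, phone number, and company information. The address may contain one or two street components.

For the phone, there may be one or more numbers. A number may be a regular phone number or a fax number, but the two are distinguished only by the text label, not by the markup. For the company, I would like to extract both the URL and the name.

Each address on the page looks like this: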

<!-- address entry -->
<div id='1234' class='address'> 
<div class='address_header'> 
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div> 
<div class='address_details'> 
<div class='info'> 
<p class='address'>
<span class='street'>123 ABC St</span><br />
<span class='street'>Unit 1</span><br />
<span class='city'>New York</span>, 
<span class='state'>NY</span> 
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>999.999.9999</span>
</p>
<p class='phone'>
Fax: <span class='tel'>888.888.8888</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>  
</div> 
</div>  
<!-- address entry -->
<!-- address entry -->
<div id='4567' class='address'> 
<div class='address_header'> 
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div> 
<div class='address_details'> 
<div class='info'> 
<p class='address'>
<span class='street'>456 DEF Rd</span><br />
<span class='city'>New York</span>, 
<span class='state'>NY</span> 
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>555.555.5555</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>  
</div> 
</div>  
<!-- address entry -->

Here is my basic setup.

require 'nokogiri'
require 'open-uri'
require 'csv'
doc = Nokogiri::HTML(open('[URL]'))
Cities = Array.new
States = Array.new
Zips = Array.new
doc.css("p[class='address']").css("span[class='city']").each do |city|
  Cities << city.content
end
doc.css("p[class='address']").css("span[class='state']").each do |state|
  States << state.content
end
doc.css("p[class='address']").css("span[class='zip']").each do |zip|
  Zips << zip.content
end
CSV.open("myCSV.csv", "wb") do |row|
  row << ["City", "State", "Zip"]
  (0..Cities.length - 1).each do |index|
    row << [Cities[index], States[index], Zips[index]]
  end
end

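Storing the information in separate arrays here seems very clunky. I basically want to create one row in the CSV for each address node that occurs in the source document, and then fill in the fields that exist: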

Facility  St_1  St_2  City  State  Zip  Phone  Fax  URL  Company
========  ===== ===== ===== ====== ==== ====== ==== ==== ============
xxxxxxxx  xxxx        xxxx  xxxxx  xxxx xxxxx       xxxx xxxxxxxx
xxxxxxxx  xxxx  xxxxx xxxx  xxxxx  xxxx xxxxx  xxxx xxxx xxxxxxxx

Can anyone help me?

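There are probably some edge cases this does not handle, but it takes care of your example. You will need to change doc to read from the actual page rather than from the DATA section, and change the CSV code to write to a file rather than building the string inline as I did.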

require 'nokogiri'
require 'open-uri'
require 'csv'
doc = Nokogiri::HTML(DATA.read)
CompanyInfo   = Struct.new :facility, :street1, :street2, :city, :state, :zip, :phone, :fax, :url, :company
company_infos = []
# One CompanyInfo per <div class='address'> entry.
doc.css("div.address").each do |address_div|
  facility         = address_div.at_css('.address_header .header_name').text.strip
  info             = address_div.css('div.address_details .info')
  # street2 is nil when the entry has only one street line.
  street1, street2 = info.css('.street').map(&:text)
  city             = info.at_css('.city').text
  state            = info.at_css('.state').text
  zip              = info.at_css('.zip').text
  # Assigned by position: the first .tel is treated as the phone, the second as the fax.
  phone, fax       = info.css('.phone .tel').map(&:text)
  url              = info.at_css('.company a')['href']
  company          = info.at_css('.company a').text
  company_infos << CompanyInfo.new(facility, street1, street2, city, state, zip, phone, fax, url, company)
end
csv = CSV.generate do |csv|
  csv << %w[Facility Street1 Street2 City State Zip Phone Fax URL Company]
  company_infos.each do |company_info|
    csv << company_info.to_a
  end
end
csv # => "Facility,Street1,Street2,City,State,Zip,Phone,Fax,URL,Company\nFacility Name,123 ABC St,Unit 1,New York,NY,10022,999.999.9999,888.888.8888,{URL},Company Name\n"

__END__
<!-- address entry -->
<div id='1234' class='address'> 
<div class='address_header'> 
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div> 
<div class='address_details'> 
<div class='info'> 
<p class='address'>
<span class='street'>123 ABC St</span><br />
<span class='street'>Unit 1</span><br />
<span class='city'>New York</span>, 
<span class='state'>NY</span> 
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>999.999.9999</span>
</p>
<p class='phone'>
Fax: <span class='tel'>888.888.8888</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>  
</div> 
</div>

You're asking for a lot, but I'll get you started:

# Assumes doc is the parsed Nokogiri document and csv is an open CSV writer.
fields = %w{street1 street2 phone fax city state zip}
doc.search('div.address').each do |div|
  address = {}
  # Missing values (a second street line, a fax number) come out as nil.
  address['street1'], address['street2'] = *div.search('span.street').map(&:text)
  address['phone'], address['fax'] = *div.search('span.tel').map(&:text)
  ['city', 'state', 'zip'].each{|f| address[f] = div.at("span.#{f}").text}
  csv << fields.map{|f| address[f]}
end
