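I'm experimenting with how to scrape websites for data.
This is what I've put together after a few days of research. However, the Nokogiri output isn't as "clean" as I expected. When I print the array, I get a lot of newline characters ("\n") in the output.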
require 'httparty'
require 'nokogiri'
require 'open-uri'
require 'pry'
require 'csv'
# Assigning the page to scrape
page = HTTParty.get('http://www.realtor.com/realestateandhomes-search/Atlanta_GA/type-single-family-home/price-na-500000')
# Transform the http response into a Nokogiri in order to parse it
parse_page = Nokogiri::HTML(page)
# Create an empty array for property details
details_array = []
parse_page.css('div.srp-item-body').each do |d|
  property_details = d.text
  details_array.push(property_details)
end
Pry.start(binding)
In pry, if I display details_array or address_array, the output looks like:
[2] pry(main)> details_array
=> ["\n \n \n \n 2265 Tanglewood Cir NE,\n Atlanta,\n GA\n 30345\n \n \n\n \n Dresden East\n \n \n\n $289,900\n \n \n \n 3 bd\n 2 ba\n 1,566 sq ft\n
0.3 acres lot\n \n \n \n \n Single Family Home\n \n \n \n \n
Brokered by Re/Max Town And Country\n \n \n
\n \n \n Brokered by \n Re/Max
Town And Country\n \n \n \n ", "\n \n
\n \n 2141 Dunwoody Gln,\n
Atlanta,\n GA\n 30338\n \n \n\n
\n \n $469,900\n \n \n
\n 4 bd\n 3 ba\n 2,850 sq
ft\n 0.3 acres lot\n 2 car\n
\n \n \n \n Single Family Home\n
\n \n \n \n Brokered by
Buckhead Home Realty Llc\n \n \n \n
\n \n Brokered by \n Buckhead Home
Realty Llc\n \n \n \n ", "\n \n
\n \n 1048 Martin St SE,\n
Atlanta,\n GA\n 30315\n \n \n\n
\n Intown South\n Peoplestown\n \n \n
\n $164,900\n \n \n \n
5 bd\n 3 ba\n 2,376 sq ft\n
7,405 sq ft lot\n \n \n \n \n
Single Family Home\n \n \n \n \n
Brokered by Greenlet Llc\n \n \n \n
\n \n Brokered by \n Greenlet Llc\n
\n \n \n ", "\n \n \n \n
1048 Martin St SE,\n Atlanta,\n GA\n
30315\n \n \n\n \n Intown South\n
Peoplestown\n \n \n \n $164,900\n
\n \n \n 5 bd\n 3
ba\n 2,055 sq ft\n 7,584 sq ft lot\n
\n \n \n \n Single Family Home\n
\n \n \n \n Brokered by
Greenlet, Llc\n \n \n \n \n
\n Brokered by \n Greenlet, Llc\n \n
\n \n ", "\n \n \n \n
1991 Woodbine Ter NE,\n Atlanta,\n GA\n
30329\n \n \n\n \n Sagamore Hills\n
\n \n \n $299,900\n \n \n
\n 3 bd\n 1+ ba\n 1,449
sq ft\n 0.8 acres lot\n \n \n
\n \n Single Family Home\n \n \n
\n :
It looks like your selector isn't drilling far enough into the document. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div>
      <p>foo</p>
      <p>bar</p>
    </div>
  </body>
</html>
EOT
doc.search('div').map(&:text) # => ["\n      foo\n      bar\n    "]
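When you look at the text of a parent tag, you get the whitespace text nodes that were used to format the HTML, along with the text of the <p> nodes you actually want.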
If you drill down to the actual nodes you want and then get their text, the tag-formatting whitespace is dropped:
doc.search('div p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping".