Ruby web scraping (Nokogiri) - cleaning up the output



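I'm experimenting with how to scrape a website for data.

This is what I've put together after a few days of research, but Nokogiri's output isn't as "clean" as I expected. When I print the array, I get a lot of newline characters ("\n") in the output.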

require 'httparty'
require 'nokogiri'
require 'open-uri'
require 'pry'
require 'csv'
# Assigning the page to scrape
page = HTTParty.get('http://www.realtor.com/realestateandhomes-search/Atlanta_GA/type-single-family-home/price-na-500000')
# Transform the http response into a Nokogiri in order to parse it
parse_page = Nokogiri::HTML(page)
# Create an empty array for property details
details_array = []
parse_page.css('div.srp-item-body').map do |d|
    property_details = d.text
    details_array.push(property_details)
end
Pry.start(binding)

In pry, if I display details_array, the output looks like:

[2] pry(main)> details_array
=> ["\n      \n        \n          \n                2265 Tanglewood Cir NE,\n            Atlanta,\n            GA\n            30345\n \n        \n\n        \n          Dresden East\n        \n        \n\n            $289,900\n          \n          \n            \n        3 bd\n                2 ba\n                1,566 sq ft\n
0.3 acres lot\n            \n          \n        \n          \n            Single Family Home\n          \n        \n          \n            \n
Brokered by Re/Max Town And Country\n            \n          \n
\n        \n          \n            Brokered by \n            Re/Max
Town And Country\n          \n        \n      \n    ",  "\n      \n
\n          \n                2141 Dunwoody Gln,\n
Atlanta,\n            GA\n            30338\n          \n        \n\n
\n          \n            $469,900\n          \n          \n
\n                4 bd\n                3 ba\n                2,850 sq
ft\n                0.3 acres lot\n                2 car\n
\n          \n        \n          \n            Single Family Home\n
\n        \n          \n            \n              Brokered by
Buckhead Home Realty Llc\n            \n          \n        \n
\n          \n            Brokered by \n            Buckhead Home
Realty Llc\n          \n        \n      \n    ",  "\n      \n
\n          \n                1048 Martin St SE,\n
Atlanta,\n            GA\n            30315\n          \n        \n\n
\n          Intown South\n          Peoplestown\n        \n        \n
\n            $164,900\n          \n          \n            \n
5 bd\n                3 ba\n                2,376 sq ft\n
7,405 sq ft lot\n            \n          \n        \n          \n
Single Family Home\n          \n        \n          \n            \n
Brokered by Greenlet Llc\n            \n          \n        \n
\n          \n            Brokered by \n            Greenlet Llc\n
\n        \n      \n    ",  "\n      \n        \n          \n
1048 Martin St SE,\n            Atlanta,\n            GA\n
30315\n          \n        \n\n        \n          Intown South\n
Peoplestown\n        \n        \n          \n            $164,900\n
\n          \n            \n                5 bd\n                3
ba\n                2,055 sq ft\n                7,584 sq ft lot\n
\n          \n        \n          \n            Single Family Home\n
\n        \n          \n            \n              Brokered by
Greenlet, Llc\n            \n          \n        \n        \n
\n            Brokered by \n            Greenlet, Llc\n          \n
\n      \n    ",  "\n      \n        \n          \n
1991 Woodbine Ter NE,\n            Atlanta,\n            GA\n
30329\n          \n        \n\n        \n          Sagamore Hills\n
\n        \n          \n            $299,900\n          \n          \n
\n                3 bd\n                1+ ba\n                1,449
sq ft\n                0.8 acres lot\n            \n          \n
\n          \n            Single Family Home\n          \n        \n
\n           :

It looks like your selector isn't digging far enough into the document. Consider this:

require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div>
      <p>foo</p>
      <p>bar</p>
    </div>
  </body>
</html>
EOT
doc.search('div').map(&:text) # => ["\n      foo\n      bar\n    "]

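When you look at the text of the parent tag, you get both the whitespace text nodes used to format the HTML and the text of the <p> nodes you actually want.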

If you drill down to the actual nodes you want and then get their text, the formatting whitespace goes away:

doc.search('div p').map(&:text) # => ["foo", "bar"]

See "How to avoid joining all text from nodes when scraping".
