Nokogiri and Mechanize help (navigating to pages via div class and scraping)

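I need help clicking some elements by their div class, rather than by the link text, to reach a page from which I can scrape some data.

Starting from the page http://www.salatomatic.com/b/United-States+125, how do I click each state name by div class rather than by the link's text?
After clicking a state, for example http://www.salatomatic.com/b/Alabama+7, I need to click a region within that state, again by div class and not by the link's text.
Inside a region, www [dot] salatomatic [dot] com/c/Birmingham+12, I want to loop through and click each item (11 mosques in this example).
Inside each item/mosque, I need to scrape the address (at the top, under the mosque's title) and store it in my database.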

Update:

I now have this:

require 'nokogiri'
require 'open-uri'
require 'mechanize'

agent = Mechanize.new
page = agent.get("http://www.salatomatic.com/b/United-States+125")

# Loop through all state links
page.search('.subtitleLink a').map { |a| page.uri.merge a[:href] }.each do |state_uri|
  page2 = agent.get state_uri
  # Loop through all regions in each state
  page2.search('.subtitleLink a').map { |a| page2.uri.merge a[:href] }.each do |region_uri|
    page3 = agent.get region_uri
    # Loop through all places in each region
    page3.search('.subtitleLink a').map { |a| page3.uri.merge a[:href] }.each do |place_uri|
      page4 = agent.get place_uri
      # I'm able to grab the title of the place but not sure how to get the address b/c there is no div around it.
      puts page4.at('.titleBM')
      # I'm guessing I would use some regex/xpath here to get the address, but how would that work?
      # This is the structure of the title/address in HTML:
      # <td width="100%"><div class="titleBM">BIS Hoover Crescent Islamic Center </div>2524 Hackberry Lane, Hoover, AL 35226</td>
      # This is the listing page: http://www.salatomatic.com/d/Hoover+12446+BIS-Hoover-Crescent-Islamic-Center
    end
  end
end

First, make sure you convert a[:href] to an absolute URL. So, perhaps:

page.search('.subtitleLink a').map{|a| page.uri.merge a[:href]}.each do |uri|
  page2 = agent.get uri
end
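
To see what that merge step does, here is a minimal sketch using only Ruby's standard URI library, since Mechanize's page.uri is a plain URI object. The href values below are assumptions for illustration; the real pages may use any of these forms.

```ruby
require 'uri'

# The page the crawl starts from (taken from the question).
base = URI.parse('http://www.salatomatic.com/b/United-States+125')

# Hypothetical href values a link on that page might carry.
root_relative = base.merge('/b/Alabama+7')   # href starting with a slash
path_relative = base.merge('Alabama+7')      # href relative to the current path
already_abs   = base.merge('http://www.salatomatic.com/b/Alabama+7')

# All three resolve to the same absolute URL.
puts root_relative, path_relative, already_abs
```

Because merge leaves an already absolute href untouched, it should be safe to apply to every link before passing it to agent.get.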

For the United States page and the region pages, you can do:

agent = Mechanize.new
page = agent.get('http://www.salatomatic.com/b/United-States+125')
page.search("#header a").each { |a| ... }

Inside this block, you can find the corresponding link and click it:

page.link_with(text: a.text).click

Or ask Mechanize to load the page via its href:

region_page = agent.get a[:href]

Inside the regions you can do the same; just search with

page.search(".tabTitle a").each ...

for the tabs (Restaurants, Markets, Schools, etc.), and with

page.search(".subtitleLink a").each ...

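How do you find these things? Try a bookmarklet such as SelectorGadget or similar, dig into the HTML source, and find the common parent/class of the links you're interested in.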
