Ruby Anemone spider: adding a tag to each URL visited

I have a crawl set up:

require 'anemone'
Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
    anemone.on_every_page do |page|
        puts page.url
    end
end

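However, I want the spider to use a Google Analytics anti-tracking tag on every URL it visits, rather than necessarily clicking the links as they appear on the page.

I could run the spider once, store all of the URLs, and then use WATIR to go through them adding the tag, but I want to avoid that because it is slow, and I like the skip_links_like and page depth functions.

How could I implement this?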

You want to add something to the URL before it is loaded, correct? You can use focus_crawl for that.

Anemone.crawl("http://www.website.co.uk", :depth_limit => 1) do |anemone|
    anemone.focus_crawl do |page|
        page.links.map do |url|
            # url will be a URI (probably URI::HTTP) so adjust
            # url.query as needed here and then return url from
            # the block.
            url
        end
    end
    anemone.on_every_page do |page|
        puts page.url
    end
end

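The focus_crawl method is intended to filter the URL list:

Specify a block which will select which links to follow on each page. The block should return an Array of URI objects.

But you can use it as a general-purpose URL filter as well.

For example, if you wanted to add atm_source=SiteCon&atm_medium=Mycampaign to all the links, then your page.links.map would look something like this: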

page.links.map do |uri|
    # Grab the query string, break it into components, throw out
    # any existing atm_source or atm_medium components. The to_s
    # does nothing if there is a query string but turns a nil into
    # an empty string to avoid some conditional logic.
    q = uri.query.to_s.split('&').reject { |x| x =~ /^atm_(source|medium)=/ }
    # Add the atm_source and atm_medium that you want.
    q << 'atm_source=SiteCon' << 'atm_medium=Mycampaign'
    # Rebuild the query string 
    uri.query = q.join('&')
    # And return the updated URI from the block
    uri
end

If your atm_source or atm_medium values contain characters that are not URL-safe, URI-encode them first.
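As a rough sketch of the encoding step, the query string can be rebuilt with Ruby's standard URI.decode_www_form and URI.encode_www_form instead of by hand. The with_tracking helper and the sample values below are hypothetical and are not part of Anemone; the same logic could be used inside the page.links.map block.

```ruby
require 'uri'

# Hypothetical helper (not part of Anemone): returns a copy of +uri+ whose
# query string carries the given tracking parameters, form-encoded.
def with_tracking(uri, params)
  tagged = uri.dup
  # decode_www_form turns "a=1&b=2" into [["a", "1"], ["b", "2"]];
  # to_s turns a nil query into an empty string first.
  pairs = URI.decode_www_form(tagged.query.to_s)
  # Drop any existing values for the keys we are about to set.
  pairs.reject! { |key, _| params.key?(key) }
  # encode_www_form escapes characters that are not URL-safe.
  tagged.query = URI.encode_www_form(pairs + params.to_a)
  tagged
end

# Sample values chosen only to show the escaping.
uri = URI.parse('http://www.website.co.uk/page?id=7&atm_source=old')
tagged = with_tracking(uri, 'atm_source' => 'Site Con', 'atm_medium' => 'My&campaign')
puts tagged
```

With form encoding, the space in the sample value becomes + and the & becomes %26, so the values can no longer be mistaken for query separators.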
