How to avoid duplicate entries when crawling a website

I want to crawl a shop with Ruby, Nokogiri and Mechanize.

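One page shows two articles. I know that every article URL contains .../p/..., which is why I collect those links in article_links. All /p/ links should be printed.

Normally I would expect to see two addresses: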

require 'mechanize'

agent = Mechanize.new
page = agent.get('https://exampleshop.com') # placeholder URL; the https scheme is assumed
article_links = page.links_with(href: %r{.*/p/})
article_links.map do |link|
  article = link.click
  target_url = page.uri + link.uri # full URL
  puts target_url
end
# crawling stuff on /p/ pages not included here

However, every link ends up duplicated, and this already happens before the loop, so what I see is:

exampleshop.com/p/productxy.html
exampleshop.com/p/productxy.html
exampleshop.com/p/productab.html
exampleshop.com/p/productab.html

I believe the site's markup contains two hrefs with /p/ for each product. Is there a good way to prevent this? Or is it possible to use Nokogiri CSS in links_with?

You can remove the duplicates before iterating over the list:

Instead of

article_links.map do |link|

write

article_links.uniq { |link| link.uri }.map do |link|

This drops any link whose URI has already been seen, keeping only the first link for each URI.
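To see how uniq with a block behaves without hitting a real site, here is a minimal sketch. The Link struct is a hypothetical stand-in for Mechanize's link objects (only its uri method matters), and the paths are the ones from the question.

```ruby
require 'uri'

# Hypothetical stand-in for a Mechanize link; only #uri is used here.
Link = Struct.new(:text, :uri)

links = [
  Link.new('image link', URI('/p/productxy.html')),
  Link.new('title link', URI('/p/productxy.html')),
  Link.new('image link', URI('/p/productab.html')),
  Link.new('title link', URI('/p/productab.html'))
]

# uniq with a block keeps the first element for each distinct block value.
unique_links = links.uniq { |link| link.uri }

unique_paths = unique_links.map { |link| link.uri.to_s }
puts unique_paths
```

Because uniq keeps the first occurrence, the order of the remaining links matches the order in which they appear on the page.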

You can use a CSS attribute selector (a substring match on href) instead of links_with, but you would still need to remove the duplicates in Ruby:

article_links = page.css("a[href*='/p/']")

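The reason you still need to remove the duplicates in Ruby is that CSS cannot select only the first element among those matching a selector. nth-of-type and nth-child will not work here.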
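With the CSS approach you get Nokogiri nodes rather than Mechanize links, so one option is to work on the href strings directly. The following is only a sketch: unique_article_urls is a hypothetical helper, the base URL is a placeholder, and the hrefs array is hard-coded to stand in for what page.css("a[href*='/p/']").map { |a| a['href'] } might return.

```ruby
require 'uri'

# Hypothetical helper: resolve each href against the page URL,
# then drop repeated results while keeping the original order.
def unique_article_urls(base_url, hrefs)
  hrefs.map { |href| URI.join(base_url, href).to_s }.uniq
end

# Stand-in for: page.css("a[href*='/p/']").map { |a| a['href'] }
hrefs = [
  '/p/productxy.html',
  '/p/productxy.html',
  'p/productab.html',  # relative form of the same product link
  '/p/productab.html'
]

urls = unique_article_urls('https://exampleshop.com/', hrefs)
puts urls
```

Resolving before calling uniq means that a relative and an absolute href pointing to the same page are treated as one entry.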
