How can I stop Scrapy from paginating once it reaches duplicate records?

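I'm using Scrapy to scrape a website with pagination, and it works fine. However, the site is updated regularly and new posts keep being added, so I need to run my code every day, and every run crawls all the pages again. Fortunately I'm using Django, and in my Django model I set

unique=True

so there are no duplicate records in my database. What I want, though, is to stop crawling the pagination as soon as a duplicate record is found. How can I do that? Here is a snippet of my spider code: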

class NewsSpider(scrapy.Spider):
    name = 'news'
    allowed_domains = ['....']
    start_urls = ['....']
    duplicate_record_flag = False

    def parse(self, response, **kwargs):
        next_page = response.xpath('//a[@class="next page-numbers"]/@href').get()
        news_links = response.xpath('//div[@class="content-column"]/div/article/div/div[1]/a/@href').getall()
        for link in news_links:
            if self.duplicate_record_flag:
                print("Closing Spider ...")
                raise CloseSpider('Duplicate records found')
            yield scrapy.Request(url=link, callback=self.parse_item)
        if next_page and not self.duplicate_record_flag:
            yield scrapy.Request(url=next_page, callback=self.parse)

    def parse_item(self, response):
        item = CryptocurrencyNewsItem()
        ...
        try:
            CryptocurrencyNews.objects.get(title=item['title'])
            self.duplicate_record_flag = True
            return
        except CryptocurrencyNews.DoesNotExist:
            item.save()
            return item

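I used a class variable (duplicate_record_flag) so that it is accessible from every method and tells me when I have hit a duplicate record. The problem is that the spider does not stop right away when the first duplicate is found. To clarify: in the for loop in parse, if there are 10 news_links and the first one turns out to be a duplicate, the flag does not change at that moment. If I print the flag inside the for loop, it prints False on all 10 iterations, whereas I expected it to become True on the first iteration.

In other words, the spider crawls every link on every page on each parse call. How can I prevent this?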
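The behavior described above can be reproduced without Scrapy. The following is only a simplified model, not Scrapy's actual engine: it assumes that the callbacks for the yielded requests run only after the responses arrive, which typically happens after the for loop in parse has already yielded all of its requests. ToySpider, run, and the "post-N" titles are hypothetical names made up for this sketch.

```python
from collections import deque


class ToySpider:
    """Simplified stand-in for a Scrapy spider (not Scrapy's real engine)."""

    def __init__(self, known_titles):
        self.known_titles = set(known_titles)  # hypothetical "database"
        self.duplicate_record_flag = False
        self.flags_seen_in_loop = []

    def parse(self, links):
        for link in links:
            # Record the value of the flag on every iteration of the loop.
            self.flags_seen_in_loop.append(self.duplicate_record_flag)
            yield link  # stands in for yielding scrapy.Request(...)

    def parse_item(self, title):
        # Stands in for the database lookup in the question.
        if title in self.known_titles:
            self.duplicate_record_flag = True


def run(spider, links):
    """Toy scheduler: queue every yielded request, then run the callbacks."""
    pending = deque(spider.parse(links))  # the loop in parse() finishes here
    while pending:
        spider.parse_item(pending.popleft())  # "responses" arrive only now


spider = ToySpider(known_titles={"post-1"})
run(spider, [f"post-{i}" for i in range(1, 11)])
print(spider.flags_seen_in_loop)     # ten False values
print(spider.duplicate_record_flag)  # True, but only after the loop ended
```

In this model the flag is set by parse_item, but by then every request from the page has already been queued, which matches the ten False values seen in the question.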

If you want to stop the spider once a certain condition is met, you can raise CloseSpider:

from scrapy.exceptions import CloseSpider

# Inside a spider callback:
if some_logic_to_check_duplicates:
    # This message shows up in the logs
    raise CloseSpider('Duplicate records found')
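Since the check in the question only runs in parse_item, after the requests for the whole page have been yielded, one option is to make the decision in parse itself. The sketch below is plain Python and rests on assumptions the question does not confirm: that duplicates can be recognized by URL before the item page is fetched, and that the listing shows the newest posts first. links_until_known, is_known, and the example.com URLs are hypothetical.

```python
def links_until_known(links, is_known):
    """Return (new_links, hit_duplicate) for one listing page.

    links    -- item URLs in page order (assumed newest first)
    is_known -- hypothetical callable: True if the URL is already stored
    """
    new_links = []
    for link in links:
        if is_known(link):
            # First already-stored link: everything after it is assumed old.
            return new_links, True
        new_links.append(link)
    return new_links, False


# Stand-in for the database: URLs that were saved on a previous run.
stored = {"https://example.com/news/3", "https://example.com/news/2"}
page = [f"https://example.com/news/{n}" for n in (5, 4, 3, 2)]

new_links, hit_duplicate = links_until_known(page, stored.__contains__)
print(new_links)      # the URLs for news/5 and news/4
print(hit_duplicate)  # True
```

In a spider, parse could yield requests only for new_links and follow next_page only when hit_duplicate is False, or raise CloseSpider at that point instead.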

If you only want to skip duplicate items, you can raise a DropItem exception from an item pipeline. Example code from the Scrapy documentation:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class DuplicatesPipeline:
    def __init__(self):
        # IDs of the items already seen during this crawl
        self.ids_seen = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['id'] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.ids_seen.add(adapter['id'])
            return item
