This is my custom_filters.py file:
from scrapy.dupefilter import RFPDupeFilter

class SeenURLFilter(RFPDupeFilter):
    """Request dupefilter that keys on the exact URL string."""

    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        if request.url in self.urls_seen:
            return True
        self.urls_seen.add(request.url)
        return False
I added the following line to settings.py:

DUPEFILTER_CLASS = 'crawl_website.custom_filters.SeenURLFilter'
When I check the generated CSV file, the same URL shows up multiple times. Am I doing something wrong?
A request dupefilter only deduplicates requests before they are downloaded; it does not stop your spider from yielding several items that carry the same URL, so duplicates can still reach the CSV exporter. To drop duplicate items, use an item pipeline instead. From http://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
Then add to your settings.py:

ITEM_PIPELINES = {
    'your_bot_name.pipelines.DuplicatesPipeline': 100
}

(The value 100 is the pipeline's order; pipelines run in ascending order of this number, which must be in the 0-1000 range.)
Edit:

To check for duplicate URLs instead, use:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        if item['url'] in self.urls_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.urls_seen.add(item['url'])
            return item
This requires a url = Field() in your items. Something like this (items.py):
from scrapy.item import Item, Field

class PageItem(Item):
    url = Field()
    scraped_field_a = Field()
    scraped_field_b = Field()
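Your spider callback then has to populate that url field, or the pipeline will raise a KeyError. Here is a minimal sketch using the scrapy.Spider API; the spider name, start URL, XPath expressions, and the crawl_website.items module path are assumptions for illustration, not taken from the question:

import scrapy
from crawl_website.items import PageItem  # assumed module path

class PageSpider(scrapy.Spider):
    name = 'pages'  # hypothetical spider name
    start_urls = ['http://example.com/']

    def parse(self, response):
        item = PageItem()
        # response.url is the value DuplicatesPipeline deduplicates on
        item['url'] = response.url
        item['scraped_field_a'] = response.xpath('//title/text()').extract_first()
        item['scraped_field_b'] = response.xpath('//h1/text()').extract_first()
        yield item

With this in place, the pipeline drops every item whose url it has already seen, so duplicates never reach the CSV feed exporter.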