How do I run multiple spiders through individual pipelines?



Newbie here, just getting started with Scrapy.

My directory structure is as follows…

#FYI: running on Scrapy 2.4.1
WebScraper/
  Webscraper/
     spiders/
        spider.py    # (NOTE: contains spider1 and spider2 classes.)
     items.py
     middlewares.py
     pipelines.py    # (NOTE: contains spider1_pipelines and spider2_pipelines)
     settings.py     # (NOTE: I wrote here:
                     #ITEM_PIPELINES = {
                     #  'WebScraper.pipelines.spider1_pipelines': 300,
                     #  'WebScraper.pipelines.spider2_pipelines': 300,
                     #} 
  scrapy.cfg

spider.py looks like…

import scrapy

class OneSpider(scrapy.Spider):
    name = "spider1"

    def start_requests(self):
        urls = ["http://url1.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        ## Scrape stuff, put it in a dict
        yield dictOfScrapedStuff

class TwoSpider(scrapy.Spider):
    name = "spider2"

    def start_requests(self):
        urls = ["http://url2.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        ## Scrape stuff, put it in a dict
        yield dictOfScrapedStuff

pipelines.py looks like…

import csv

class spider1_pipelines(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('spider1.csv', 'w', newline=''))
        self.csvwriter.writerow(['header1', 'header2'])

    def process_item(self, item, spider):
        row = []
        row.append(item['header1'])
        row.append(item['header2'])
        self.csvwriter.writerow(row)
        return item

class spider2_pipelines(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('spider2.csv', 'w', newline=''))
        self.csvwriter.writerow(['header_a', 'header_b'])

    def process_item(self, item, spider):
        row = []
        row.append(item['header_a'])  # NOTE: this is not the same as header1
        row.append(item['header_b'])  # NOTE: this is not the same as header2
        self.csvwriter.writerow(row)
        return item

My question is about running spider1 and spider2 on different URLs with a single terminal command:

nohup scrapy crawl spider1 -o spider1_output.csv --logfile spider1.log & scrapy crawl spider2 -o spider2_output.csv --logfile spider2.log
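
For reference, Scrapy can also launch several spiders from one small script in a single process, instead of chaining shell commands; a minimal sketch using CrawlerProcess (spider names taken from above; note the crawls run concurrently, not strictly one after the other):

# run_spiders.py -- hypothetical helper script at the project root
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads the project's settings.py
process.crawl("spider1")  # schedule each spider by its name attribute
process.crawl("spider2")
process.start()  # blocks until both crawls have finished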

Note: this is a follow-up to an earlier question from this Stack Overflow post (2018).

Desired result: spider1.csv contains data from spider1, and spider2.csv contains data from spider2.

Current result: spider1.csv gets data from spider1, but spider2.csv breaks; the error log contains spider2's data along with a KeyError: 'header1', even though spider2's items don't include header1 at all, only header_a.

Does anyone know how to run these spiders one after another on different URLs, with the data scraped by spider1, spider2, etc. going into the pipeline specific to that spider, i.e. spider1 -> spider1_pipelines -> spider1.csv and spider2 -> spider2_pipelines -> spider2.csv?

Or is this perhaps a matter of specifying spider1_item and spider2_item in items.py? I'm wondering whether I can specify where spider2's data should be inserted that way.
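
For context, separate item classes in items.py would look roughly like this (a sketch only; the class names spider1_item and spider2_item just mirror the question, and defining them does not by itself route items to a particular pipeline):

# items.py -- hypothetical item classes mirroring the field names used above
import scrapy

class spider1_item(scrapy.Item):
    header1 = scrapy.Field()
    header2 = scrapy.Field()

class spider2_item(scrapy.Item):
    header_a = scrapy.Field()
    header_b = scrapy.Field()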

Thanks!

You can use the custom_settings spider attribute to set settings individually for each spider:

# spider.py
class OneSpider(scrapy.Spider):
    name = "spider1"
    custom_settings = {
        'ITEM_PIPELINES': {'WebScraper.pipelines.spider1_pipelines': 300}
    }
    ...

class TwoSpider(scrapy.Spider):
    name = "spider2"
    custom_settings = {
        'ITEM_PIPELINES': {'WebScraper.pipelines.spider2_pipelines': 300}
    }
    ...
