I'm a newbie just getting started with Scrapy.
My directory structure is as follows…
#FYI: running on Scrapy 2.4.1
WebScraper/
    WebScraper/
        spiders/
            spider.py    # (NOTE: contains the spider1 and spider2 classes.)
        items.py
        middlewares.py
        pipelines.py     # (NOTE: contains spider1Pipeline and spider2Pipeline)
        settings.py      # (NOTE: here I wrote:
                         #   ITEM_PIPELINES = {
                         #       'WebScraper.pipelines.spider1_pipelines': 300,
                         #       'WebScraper.pipelines.spider2_pipelines': 300,
                         #   }
                         # )
    scrapy.cfg
And spider.py resembles…
import scrapy


class OneSpider(scrapy.Spider):
    name = "spider1"

    def start_requests(self):
        urls = ["http://url1.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Scrape stuff, put it in a dict
        yield dictOfScrapedStuff


class TwoSpider(scrapy.Spider):
    name = "spider2"

    def start_requests(self):
        urls = ["http://url2.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Scrape stuff, put it in a dict
        yield dictOfScrapedStuff
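For concreteness, the dict each parse yields is keyed by the same headers the pipelines below expect; for spider1 it would be something along these lines (the selectors here are made up):

    def parse(self, response):
        # keys must match what spider1_pipelines reads out of the item
        yield {
            "header1": response.css("h1::text").get(),
            "header2": response.css("p::text").get(),
        }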
And pipelines.py looks like…
import csv


class spider1_pipelines(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('spider1.csv', 'w', newline=''))
        self.csvwriter.writerow(['header1', 'header2'])

    def process_item(self, item, spider):
        row = [item['header1'], item['header2']]
        self.csvwriter.writerow(row)
        return item


class spider2_pipelines(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('spider2.csv', 'w', newline=''))
        self.csvwriter.writerow(['header_a', 'header_b'])

    def process_item(self, item, spider):
        # NOTE: header_a/header_b are not the same keys as header1/header2
        row = [item['header_a'], item['header_b']]
        self.csvwriter.writerow(row)
        return item
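(Side note: I open the CSV file in __init__ and never close it, so rows can sit in the write buffer. A cleaner shape, using the open_spider/close_spider hooks that Scrapy pipelines support, would be something like:

    import csv

    class spider1_pipelines(object):
        def open_spider(self, spider):
            self.file = open('spider1.csv', 'w', newline='')
            self.csvwriter = csv.writer(self.file)
            self.csvwriter.writerow(['header1', 'header2'])

        def process_item(self, item, spider):
            self.csvwriter.writerow([item['header1'], item['header2']])
            return item

        def close_spider(self, spider):
            self.file.close()

with spider2_pipelines shaped the same way.)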
I have a question about running spider1 and spider2 on different URLs with a single terminal command:
nohup scrapy crawl spider1 -o spider1_output.csv --logfile spider1.log & scrapy crawl spider2 -o spider2_output.csv --logfile spider2.log
Note: this is an extension of a previous question asked in this Stack Overflow post (2018).
Desired result: spider1.csv gets data from spider1, and spider2.csv gets data from spider2.
Current result: spider1.csv gets data from spider1, but spider2.csv comes out broken. The error log does contain spider2's data, along with a KeyError: 'header1', even though spider2's item doesn't include header1 at all (it only includes header_a).
Does anyone know how to run these spiders one after another on different URLs, with the data fetched by spider1, spider2, etc. routed to a pipeline specific to that spider, i.e. spider1 -> spider1Pipeline -> spider1.csv and spider2 -> spider2Pipeline -> spider2.csv?
Or is this perhaps a matter of specifying spider1_item and spider2_item in items.py? I'm wondering whether I could specify where to insert spider2's data that way, i.e. something like the sketch below.
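A minimal sketch of what I mean (the field names are taken from the CSV headers above; whether Scrapy can route items to pipelines by item class is exactly what I'm unsure about):

    import scrapy

    class spider1_item(scrapy.Item):
        header1 = scrapy.Field()
        header2 = scrapy.Field()

    class spider2_item(scrapy.Item):
        header_a = scrapy.Field()
        header_b = scrapy.Field()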
Thanks!
ITEM_PIPELINES in settings.py applies globally, so both pipelines run for every spider; that's why spider2's items end up in spider1_pipelines and raise KeyError: 'header1'. You can use the custom_settings spider attribute to set ITEM_PIPELINES individually for each spider:
# spider.py
class OneSpider(scrapy.Spider):
    name = "spider1"
    custom_settings = {
        'ITEM_PIPELINES': {'WebScraper.pipelines.spider1_pipelines': 300}
    }
    ...

class TwoSpider(scrapy.Spider):
    name = "spider2"
    custom_settings = {
        'ITEM_PIPELINES': {'WebScraper.pipelines.spider2_pipelines': 300}
    }
    ...
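With that change each spider only loads its own pipeline, so spider2's items never reach spider1_pipelines and the KeyError: 'header1' goes away. (An alternative, if you'd rather keep both pipelines registered globally in settings.py, is to check spider.name inside process_item and return items from the other spider untouched.)

As for running them one after another: your nohup ... & command launches both crawls at the same time. If you want strictly sequential runs from a single script, a sketch along the lines of the "run multiple spiders in the same process" pattern from the Scrapy docs (the module path is assumed from your layout above):

    # run.py
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    from WebScraper.spiders.spider import OneSpider, TwoSpider

    configure_logging()
    runner = CrawlerRunner(get_project_settings())

    @defer.inlineCallbacks
    def crawl():
        # spider1 finishes before spider2 starts; each crawl still picks
        # up its own custom_settings, and therefore its own pipeline
        yield runner.crawl(OneSpider)
        yield runner.crawl(TwoSpider)
        reactor.stop()

    crawl()
    reactor.run()  # blocks until both crawls are done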