How can I use 2 yield items from 2 different methods in Scrapy?

I'm new to Python and finding this tricky. I yield 2 items from 2 different methods: the first for the first page's data and the second for the second page's data. I can't save the data in the same order; the second item gets saved after the first one, but I need to save both together as one record. Thanks in advance.

from datetime import datetime

from scrapy import signals
from scrapy.exporters import CsvItemExporter

from .items import SecondItem


class FirstPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        current_date = datetime.now().strftime("%Y%m%d")
        filename = 'License_Vehicle Inspection Stations_NY_CurationReady_' + current_date + '_v1.csv'
        self.file = open(filename, 'w+b')
        self.exporter = CsvItemExporter(self.file, delimiter='|')
        # Write a human-readable header row before exporting the items.
        self.exporter.csv_writer.writerow(["Premises Name", "Principal's Name", "", "Trade Name", "Zone", "County", "Address/zone", "License Class", "License Type Code", "License Type", "Expiration Date", "License Status", "Serial Number", "Credit Group", "Filing Date", "Effective Date", " ", " ", " ", " "])
        self.exporter.fields_to_export = ["company_name", "mixed_name", "mixed_subtype", "dba_name", "zone", "county", "location_address_string", "licence_class", "licence_type_code", "permity_subtype", "permit_lic_exp_date", "permit_licence_status", "permit_lic_no", "credit_group", "permit_lic_eff_date", "permit_applied_date", "permit_type", "url", "source_name", "ingestion_timestamp"]
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        print("got the item in pipeline")
        self.exporter.export_item(item)
        return item


class SecondPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, SecondItem):
            pass
        return item

Raj725, your question is typical for someone new to Scrapy, and probably to Python. I had the same question before reading the Scrapy documentation; it is impossible to understand Scrapy without reading the docs. Start with the tutorial, then read the Items section and the Item Pipeline section.

Here is an example of how to yield several types of data.

1. To fix this, you need to declare your items in the items.py file:

from scrapy import Item, Field


class FirstItem(Item):
    field_one = Field()
    field_two = Field()


class SecondItem(Item):
    another_field_one = Field()
    another_field_two = Field()
    another_field_three = Field()
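
Scrapy Items behave like dictionaries, so as an alternative to filling the fields through the constructor (as in the next step) you can also set them one by one. A minimal sketch using the FirstItem declared above:

item = FirstItem()
item['field_one'] = 'value one'   # only declared fields may be set
item['field_two'] = 'value two'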

2. Now you can use the items in your spider code. You can yield an item anywhere you have data to save:

from ..items import FirstItem, SecondItem

item = FirstItem(
    field_one=response.css("div.one span::text").extract_first(),
    field_two=response.css("div.two span::text").extract_first()
)
yield item

item = SecondItem(
    another_field_one='some variable one',
    another_field_two='some variable two',
    another_field_three='some variable three'
)
yield item
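
Note that in the question the two yields live in two different callbacks, one per page, which is why the rows come out separately. If the two pages actually describe the same record, a common pattern is to fill part of the item in the first callback and pass it to the second callback via cb_kwargs, yielding a single combined item at the end. A minimal sketch with made-up selectors, URLs and spider name, not the asker's actual spider:

import scrapy

from ..items import FirstItem


class CombinedSpider(scrapy.Spider):
    name = 'combined_example'          # hypothetical spider name
    start_urls = ['https://example.com/page-one']

    def parse(self, response):
        # Collect the first-page data, but do not yield it yet.
        item = FirstItem(
            field_one=response.css("div.one span::text").get()
        )
        # Pass the partially filled item on to the second page.
        yield response.follow(
            'https://example.com/page-two',   # hypothetical second page
            callback=self.parse_second_page,
            cb_kwargs={'item': item},
        )

    def parse_second_page(self, response, item):
        # Complete the item with second-page data and yield it once,
        # so both pages end up in the same row.
        item['field_two'] = response.css("div.two span::text").get()
        yield item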

3. An example of the pipelines.py file. Don't forget to check the item type before saving. At the end of process_item you must return the item:

from .items import FirstItem, SecondItem


class FirstPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, FirstItem):
            # Save your data here. You can save it to a CSV file
            # or put it into any database you need.
            pass
        return item


class SecondPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, SecondItem):
            # Save your data here. You can save it to a CSV file
            # or put it into any database you need.
            pass
        return item

4. Don't forget to declare your pipelines in settings.py, otherwise Scrapy will not use them. The numbers set the order in which the pipelines run (lower values run first):

ITEM_PIPELINES = {
    'scrapy_project.pipelines.FirstPipeline': 300,
    'scrapy_project.pipelines.SecondPipeline': 400,
}
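
If you only want these pipelines for one particular spider rather than the whole project, Scrapy also lets you enable them per spider via custom_settings. A minimal sketch (the spider name is made up):

import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'   # hypothetical spider name
    # Overrides the project-wide ITEM_PIPELINES for this spider only.
    custom_settings = {
        'ITEM_PIPELINES': {
            'scrapy_project.pipelines.FirstPipeline': 300,
            'scrapy_project.pipelines.SecondPipeline': 400,
        }
    }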

I have not provided ready-made code, only code samples that show how this works. You can drop them into your code and make whatever changes you need. I have not shown how to save the items to a CSV file: you can import the csv module, or use CsvItemExporter from scrapy.exporters inside pipelines.py. I have provided a link to an example of how to save different items to different CSV files.
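
As a rough illustration of that idea, here is a minimal sketch (the pipeline class and CSV file names are made up, and it assumes the FirstItem/SecondItem and scrapy_project layout from above) in which one pipeline keeps a separate CsvItemExporter per item type and routes each item to its own CSV file. It would still need to be registered in ITEM_PIPELINES like the pipelines above:

from scrapy.exporters import CsvItemExporter

from .items import FirstItem, SecondItem


class MultiCsvPipeline(object):
    """Writes FirstItem and SecondItem instances to separate CSV files."""

    def open_spider(self, spider):
        # One file and one exporter per item type (file names are made up).
        self.files = {
            FirstItem: open('first_items.csv', 'wb'),
            SecondItem: open('second_items.csv', 'wb'),
        }
        self.exporters = {
            cls: CsvItemExporter(f) for cls, f in self.files.items()
        }
        for exporter in self.exporters.values():
            exporter.start_exporting()

    def close_spider(self, spider):
        for exporter in self.exporters.values():
            exporter.finish_exporting()
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        # Route the item to the exporter that matches its type, if any.
        exporter = self.exporters.get(type(item))
        if exporter is not None:
            exporter.export_item(item)
        return item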
