How to get/import the Scrapy item field list from items.py into pipelines.py?

In my items.py:

from scrapy.item import Item, Field

class NewAdsItem(Item):
    AdId        = Field()
    DateR       = Field()
    AdURL       = Field()

In my pipelines.py:

import sqlite3
from scrapy.conf import settings

con = None

class DbPipeline(object):

    def __init__(self):
        self.setupDBCon()
        self.createTables()

    def setupDBCon(self):
        # This is NOT OK!
        # I want to get the items already HERE!
        dbfile = settings.get('SQLITE_FILE')
        self.con = sqlite3.connect(dbfile)
        self.cur = self.con.cursor()

    def createTables(self):
        # OR optionally HERE.
        self.createDbTable()

    ...

    def process_item(self, item, spider):
        self.storeInDb(item)
        return item

    def storeInDb(self, item):
        # This is OK, I CAN get the items in here, using:
        # item.keys() and/or item.values()
        sql = "INSERT INTO {0} ({1}) VALUES ({2})".format(self.dbtable, ','.join(item.keys()), ','.join(['?'] * len(item.keys())) )
        ...

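How can I get the item field names (such as "AdId", etc.) from items.py before process_item() (in pipelines.py) is executed?

I run it with scrapy runspider myspider.py.

I already tried adding "item" and/or "spider" like this: def setupDBCon(self, item), but that didn't work and resulted in: TypeError: setupDBCon() missing 1 required positional argument: 'item'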

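UPDATE: 2018-10-08

Result (A):

Partially following @granitosaurus's solution, I found that I can get the item keys as a list by:

(a) Adding: from adbot.items import NewAdsItem to my main spider code.
(b) Adding: ikeys = NewAdsItem.fields.keys() within the spider class.
I could then access the keys from my pipelines.py with: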

def open_spider(self, spider):
    self.ikeys = list(spider.ikeys)
    print("Keys in pipelines: \t%s" % ",".join(self.ikeys) )
    #self.createDbTable(ikeys)

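However, there are 2 problems with this method:

1. I have not been able to get the ikeys list into createDbTable(). (I kept getting errors about missing arguments here and there.)
2. The ikeys list (as retrieved) is rearranged and does not keep the order in which the fields appear in items.py, which partly defeats the purpose. I still don't understand why these are out of order, when all the docs say that Python 3 should preserve the order of dicts, lists, etc. At the same time, when using process_item() and getting the fields via item.keys(), their order remains intact.

Result (B):

At the end of the day, fixing (A) turned out to be too laborious and complicated, so I just imported the relevant items.py class into my pipelines.py and used the item list as a global variable, like this: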

def createDbTable(self):
    self.ikeys = NewAdsItem.fields.keys()
    print("Keys in createDbTable: \t%s" % ",".join(self.ikeys) )
    ...

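In this case I just decided to accept that the obtained list seems to be sorted alphabetically, and worked around the issue by changing the key names. (Cheating!)

This is disappointing, because the code is ugly and contorted. Any better suggestions would be much appreciated.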

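Scrapy pipelines have 3 connected methods:

process_item(self, item, spider)
This method is called for every item pipeline component. process_item() must either: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.

open_spider(self, spider)
This method is called when the spider is opened.

close_spider(self, spider)
This method is called when the spider is closed.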

https://doc.scrapy.org/en/latest/topics/item-pipeline.html

Therefore, you can only access items in the process_item method.

However, if you want to get the item class, you can attach it to the spider class:

from scrapy import Spider

class MySpider(Spider):
    item_cls = MyItem  # MyItem is your own Item subclass

class MyPipeline:
    def open_spider(self, spider):
        fields = spider.item_cls.fields
        # fields is a dictionary of key: default value
        self.setup_table(fields)
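
setup_table is not defined in the snippet above. A minimal sketch of what it might look like with sqlite3, assuming a hypothetical table name "ads", TEXT columns throughout, and an in-memory database. The spider and item class here are stand-in objects (not Scrapy classes) so the example runs without Scrapy installed:

```python
import sqlite3
from types import SimpleNamespace


class MyPipeline:
    """Sketch only: table name, column type and in-memory DB are assumptions."""

    table = "ads"  # hypothetical table name

    def open_spider(self, spider):
        self.con = sqlite3.connect(":memory:")
        # spider.item_cls.fields is a dict-like mapping keyed by field name
        self.setup_table(spider.item_cls.fields)

    def setup_table(self, fields):
        # Quote each field name and store every column as TEXT for simplicity.
        columns = ", ".join('"{}" TEXT'.format(name) for name in fields)
        self.con.execute(
            'CREATE TABLE IF NOT EXISTS "{}" ({})'.format(self.table, columns)
        )

    def close_spider(self, spider):
        self.con.close()


# Stand-ins for a real spider and Item class (assumptions for this demo).
fake_item_cls = SimpleNamespace(fields={"AdId": {}, "DateR": {}, "AdURL": {}})
fake_spider = SimpleNamespace(item_cls=fake_item_cls)

pipeline = MyPipeline()
pipeline.open_spider(fake_spider)
# PRAGMA table_info returns one row per column; index 1 is the column name.
created = [row[1] for row in pipeline.con.execute('PRAGMA table_info("ads")')]
print(created)
```

With a real spider, the columns would come out in whatever order item_cls.fields yields them, which may not be the declaration order.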

Another approach is to lazy-load it during the process_item method itself:

class MyPipeline:
    item = None

    def process_item(self, item, spider):
        if not self.item:
            # first item seen: remember it and create the table from it
            self.item = item
            self.setup_table(item)
        return item
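
On the field order raised in the question: as far as I can tell, Scrapy builds Item.fields by walking dir() of the item class, and dir() returns names sorted alphabetically, which would explain the reordering. item.keys() on a populated item appears to follow the order in which values were assigned instead. The sketch below shows the dir() behaviour on a plain class, plus one possible workaround: an explicit tuple of column names. FakeField, FakeItem and COLUMNS are hypothetical names for illustration, not Scrapy API:

```python
class FakeField(dict):
    """Stand-in for scrapy.Field, which is likewise a dict subclass."""


class FakeItem:
    AdId = FakeField()
    DateR = FakeField()
    AdURL = FakeField()
    # Hypothetical workaround: state the desired column order explicitly.
    COLUMNS = ("AdId", "DateR", "AdURL")


# dir() returns names in sorted order, so declaration order is lost.
via_dir = [n for n in dir(FakeItem) if isinstance(getattr(FakeItem, n), FakeField)]

# The class namespace itself keeps declaration order (Python 3.6+).
via_dict = [n for n, v in vars(FakeItem).items() if isinstance(v, FakeField)]

print(via_dir)
print(via_dict)
print(list(FakeItem.COLUMNS))
```

If the order of the table columns matters, reading an explicit tuple like COLUMNS from the item class in open_spider is probably more robust than renaming fields to suit the alphabetical sort.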
