How to pass a second value to a Scrapy spider in Python



I want to pass a second variable (the ID) to my scraper. As a Python beginner, I am a bit stuck here. How can this be done? Here is my code:

Get all the needed values (ID, URL):

# SQL pseudo code: get the values
SELECT
    ID,
    URL
...

Append all the URLs to start_urls. I know that row[0] will be the ID, but how do I keep it associated with its URL?

results = curb.fetchall()
for row in results:
    start_urls.append(row[1])
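For illustration, the pairing problem can be reproduced with plain Python; the rows below are hypothetical stand-ins for what curb.fetchall() would return:

```python
# Hypothetical rows, standing in for curb.fetchall(): (ID, URL) pairs,
# matching the SELECT ID, URL order above.
results = [(1, "http://example.com/a"), (2, "http://example.com/b")]

# Appending only row[1] keeps the URL but drops the ID:
start_urls = []
for row in results:
    start_urls.append(row[1])

# Appending the whole row keeps both values paired:
rows = []
for row in results:
    rows.append(row)

for ID, url in rows:
    print(ID, url)  # both values are still associated
```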

When starting the requests with the URLs, I need to pass each ID along with its URL, so that later in the code I can access it as self.ID.

def start_requests(self):
    for url in self.start_urls:
        if validators.url(url):
            yield scrapy.Request(
                # ID=ID,
                url=url,
                meta={'handle_httpstatus_list': [301, 302]},
                callback=self.parse_item,
            )
        else:
            print("Invalid URL ", format(url))

You should do this in start_requests(); there you can use meta={'ID': ID, ...} in Request() to send the value along to parse_item():

def start_requests(self):
    results = curb.fetchall()
    #for url in self.start_urls:
    for row in results:
        ID = row[0]   # SELECT ID, URL -> row[0] is the ID
        url = row[1]  # and row[1] is the URL
        if validators.url(url):
            yield scrapy.Request(
                url=url,
                meta={'ID': ID, 'handle_httpstatus_list': [301, 302]},
                callback=self.parse_item,
            )
        else:
            print("Invalid URL ", format(url))

Later you can read it back in parse_item():

def parse_item(self, response):
    ID = response.meta['ID']
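The round trip through meta can be checked without running a crawl; FakeResponse below is only a stand-in for Scrapy's Response, which exposes the originating Request's meta dict as response.meta in the same way:

```python
# FakeResponse is a stand-in for scrapy.http.Response, which exposes
# the originating Request's meta dict as response.meta.
class FakeResponse:
    def __init__(self, meta):
        self.meta = meta

def parse_item(response):
    ID = response.meta['ID']  # same lookup as in parse_item() above
    return ID

resp = FakeResponse({'ID': 42, 'handle_httpstatus_list': [301, 302]})
print(parse_item(resp))  # -> 42
```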

EDIT:

If you have no other URLs in start_urls, you can even reuse it and keep the whole rows in start_urls:

results = curb.fetchall()
for row in results:
    start_urls.append(row)

def start_requests(self):
    #for row in results:
    for row in self.start_urls:
        ID = row[0]   # SELECT ID, URL -> row[0] is the ID
        url = row[1]  # and row[1] is the URL
        if validators.url(url):
            yield scrapy.Request(
                url=url,
                meta={'ID': ID, 'handle_httpstatus_list': [301, 302]},
                callback=self.parse_item,
            )
        else:
            print("Invalid URL ", format(url))

You can even assign it directly:

start_urls = curb.fetchall()

def start_requests(self):
    for row in self.start_urls:
        ID = row[0]   # row[0] is the ID, row[1] is the URL
        url = row[1]
        # ... code ...
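Under the same assumptions (rows in (ID, URL) order), the direct-assignment pattern can be sketched with plain Python, using a hypothetical list in place of the database cursor and (url, meta) tuples in place of scrapy.Request objects:

```python
# Hypothetical fetchall() result, in (ID, URL) order as in the SELECT above.
start_urls = [(1, "http://example.com/a"), (2, "http://example.com/b")]

def start_requests(start_urls):
    """Yield (url, meta) pairs, standing in for scrapy.Request objects."""
    for row in start_urls:
        ID = row[0]
        url = row[1]
        yield url, {'ID': ID, 'handle_httpstatus_list': [301, 302]}

requests = list(start_requests(start_urls))
print(requests[0][0])  # -> http://example.com/a
```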

The ID should be added to the meta argument:

def start_requests(self):
    for url in self.start_urls:
        if validators.url(url):
            yield scrapy.Request(
                # ID=ID,
                url=url,
                meta={'handle_httpstatus_list': [301, 302]},  # <<<< add the ID here
                callback=self.parse_item,
            )
        else:
            print("Invalid URL ", format(url))

The correct way is:

def start_requests(self):
    for url in self.start_urls:
        if validators.url(url):
            yield scrapy.Request(
                url=url,
                # note: ID must be fetched alongside url (e.g. from the same
                # database row); it is not defined when iterating over URLs alone
                meta={'ID': ID, 'handle_httpstatus_list': [301, 302]},
                callback=self.parse_item,
            )
        else:
            print("Invalid URL ", format(url))
