Python Scrapy callback function error



I have to crawl a lot of websites; is there a way to do this?

The code I tried raises an error on the callback function and I can't resolve it. Is there a way to make my code work, or to specify the callbacks in a list/dict format?

Thanks.

import scrapy
from ..items import AppItem

urls = {
    'fun1': 'http://example1.com',
    'fun2': 'https://example2.com',
    # to add link
    # to add link ...
}

item = AppItem()

class Bot(scrapy.Spider):
    name = 'app'

    def start_requests(self):
        for cb in urls:
            yield scrapy.Request(url=urls[cb], callback=cb)

    def fun1(self, response):
        item['title'] = response.css('title')
        yield item

    def fun2(self, response):
        item['title'] = response.css('title')
        yield item

The error:

C:/Python310/python.exe c:/zCode/News/newsScraper/startApp.py
2021-11-26 03:09:06 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "c:\zCode\News\newsScraper\newsScraper\spiders\app.py", line 18, in start_requests
    yield scrapy.Request(url=urls[cb], callback=cb)
  File "C:\Python310\lib\site-packages\scrapy\http\request\__init__.py", line 32, in __init__
    raise TypeError(f'callback must be a callable, got {type(callback).__name__}')
TypeError: callback must be a callable, got str

There is a simpler solution: use an item loader together with the keys of the dict. That way the keys are included in the output, and you don't need a separate callback function for each URL.

from scrapy.loader import ItemLoader
from scrapy.item import Field
from itemloaders.processors import TakeFirst
import scrapy

class BotItem(scrapy.Item):
    objects = Field(output_processor=TakeFirst())
    fun = Field(output_processor=TakeFirst())

class Bot(scrapy.Spider):
    name = 'app'

    start_urls = {
        'fun1': 'http://example1.com',
        'fun2': 'https://example2.com',
        # to add link
        # to add link ...
    }

    def start_requests(self):
        for keys, url in self.start_urls.items():
            yield scrapy.Request(
                url,
                callback=self.fun1,
                cb_kwargs={'keys': keys},
            )

    def fun1(self, response, keys):
        # cb_kwargs passes the dict key into the callback as a keyword argument
        item = response.xpath('//div[@class="container"]')
        for stuff in item:
            l = ItemLoader(BotItem(), selector=stuff)
            l.add_value('fun', keys)
            l.add_xpath('objects', '//div[@class="some_objects"]//text()')
            yield l.load_item()
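To see why a single callback is enough here, this is a minimal plain-Python sketch of the `cb_kwargs` idea (no Scrapy required; the function name and URLs are just the placeholders from the question):

```python
# Sketch: one callback receives the dict key as a keyword argument,
# the same way cb_kwargs passes 'keys' into fun1 above.
def fun1(response, keys):
    # branch on `keys` here if each site needs different parsing
    return {'fun': keys, 'objects': f'parsed {response}'}

start_urls = {
    'fun1': 'http://example1.com',
    'fun2': 'https://example2.com',
}

items = [fun1(url, keys=key) for key, url in start_urls.items()]
print(items[0]['fun'])  # fun1
```

Each yielded item then carries the key it came from, so downstream pipelines can tell the sources apart without separate callbacks.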

As the exception says, you are passing a string as the callback, but a callable is required.

Which means that doing this instead would work:

def start_requests(self):
    for cb in urls:
        yield scrapy.Request(url=urls[cb], callback=self.fun1)

Since you define both your URLs and the callbacks you want right in the code, I suggest you drop your `urls` dict and yield your requests directly, without a loop:

def start_requests(self):
    yield scrapy.Request(url='http://example1.com', callback=self.fun1)
    yield scrapy.Request(url='http://example2.com', callback=self.fun2)
    ...

That is at least the simplest solution. If you insist on referring to your methods by their string names, you will probably need `getattr`. For that, check this SO question.
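A minimal sketch of the `getattr` approach, using a plain class in place of the Scrapy spider (the method names and URLs are the ones from the question):

```python
# Sketch: resolve callback names (strings) to bound methods with getattr.
class Bot:
    def fun1(self, response):
        return f'fun1 parsed {response}'

    def fun2(self, response):
        return f'fun2 parsed {response}'

urls = {
    'fun1': 'http://example1.com',
    'fun2': 'https://example2.com',
}

bot = Bot()
results = []
for name, url in urls.items():
    callback = getattr(bot, name)  # 'fun1' -> bot.fun1 (a callable)
    results.append(callback(url))

print(results[0])  # fun1 parsed http://example1.com
```

In the spider from the question, the same idea would be `callback=getattr(self, cb)` inside `start_requests`, turning each dict key into the matching bound method before handing it to `scrapy.Request`.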
