I have to crawl many websites — is there a way to do this?
The code I tried raises an error from the callback, and I can't fix it. Is there any way to make my code work, or to supply the callbacks in list form?
Thank you.
import scrapy
from ..items import AppItem

urls = {
    'fun1': 'http://example1.com',
    'fun2': 'https://example2.com',
    # to add link
    # to add link ...
}

item = AppItem()

class Bot(scrapy.Spider):
    name = 'app'

    def start_requests(self):
        for cb in urls:
            yield scrapy.Request(url=urls[cb], callback=cb)

    def fun1(self, response):
        item['title'] = response.css('title')
        yield item

    def fun2(self, response):
        item['title'] = response.css('title')
        yield item
The error:
C:/Python310/python.exe c:/zCode/News/newsScraper/startApp.py
2021-11-26 03:09:06 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "c:\zCode\News\newsScraper\newsScraper\spiders\app.py", line 18, in start_requests
    yield scrapy.Request(url=urls[cb], callback=cb)
  File "C:\Python310\lib\site-packages\scrapy\http\request\__init__.py", line 32, in __init__
    raise TypeError(f'callback must be a callable, got {type(callback).__name__}')
TypeError: callback must be a callable, got str
: [<Selector xpath='descendant-or-self::title' data='<title>Daum</title>'>]
There is a simpler solution: use an item loader and pass each key of the dict along with its request. That covers all of the URLs without needing a separate function for each one.
from scrapy.loader import ItemLoader
from scrapy.item import Field
from itemloaders.processors import TakeFirst
import scrapy

class BotItem(scrapy.Item):
    objects = Field(output_processor=TakeFirst())
    fun = Field(output_processor=TakeFirst())

class Bot(scrapy.Spider):
    name = 'app'

    start_urls = {
        'fun1': 'http://example1.com',
        'fun2': 'https://example2.com',
        # to add link
        # to add link ...
    }

    def start_requests(self):
        for key, url in self.start_urls.items():
            yield scrapy.Request(
                url,
                callback=self.fun1,
                cb_kwargs={'key': key},
            )

    def fun1(self, response, key):
        for selector in response.xpath('//div[@class="container"]'):
            loader = ItemLoader(BotItem(), selector=selector)
            loader.add_value('fun', key)
            loader.add_xpath('objects', '//div[@class="some_objects"]//text()')
            yield loader.load_item()
As the exception says, you are passing a string as the callback, whereas a callable is needed.
This means that doing this instead will work:
def start_requests(self):
    for cb in urls:
        yield scrapy.Request(url=urls[cb], callback=self.fun1)
Since you already provide both the URL and the callback you want in your code, I suggest you drop the urls dict and yield your requests directly, without the loop:
def start_requests(self):
    yield scrapy.Request(url='http://example1.com', callback=self.fun1)
    yield scrapy.Request(url='http://example2.com', callback=self.fun2)
    ...
That is at least the simplest solution. If you insist on referring to the methods as strings, you will probably need getattr
. For that, check this SO question.
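For reference, here is the getattr mechanism in isolation — a plain-Python sketch (no Scrapy) of resolving a method name given as a string to the bound method it names. In the spider the same idea would be `callback=getattr(self, cb)`:

```python
class Bot:
    name = 'app'

    def fun1(self, response):
        return ('fun1', response)

    def fun2(self, response):
        return ('fun2', response)

urls = {
    'fun1': 'http://example1.com',
    'fun2': 'https://example2.com',
}

bot = Bot()
for cb, url in urls.items():
    callback = getattr(bot, cb)  # 'fun1' -> bound method bot.fun1
    assert callable(callback)    # now acceptable as a Request callback
    print(callback(url))
```

This keeps the dict-driven loop from the question while still handing the Request a real callable.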