Scrapinghub/Zyte: Unhandled error in Deferred: No module named 'scrapy_user_agents'



I deployed my Scrapy spider from my local machine to Zyte Cloud (formerly ScrapingHub). The deploy succeeded, but when I run the spider I get the output shown below.

I have already checked this. The Zyte team does not seem very responsive on their own site, but I have found that developers are usually more active here :)

My scrapinghub.yml looks like this:

projects:
  default: <myid>
requirements:
  file: requirements.txt

I tried adding each of these lines to requirements.txt, but whichever one I use, I get the same error with the same output:

  • git+git://github.com/scrapedia/scrapy-useragents
  • git+git://github.com/scrapedia/scrapy-useragents.git
  • git+https://github.com/scrapedia/scrapy-useragents.git
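(As an aside, those Git URLs point at Scrapedia's scrapy-useragents project, whose importable module is `scrapy_useragents`; the module named `scrapy_user_agents` is, as far as I can tell, provided by a different package, `scrapy-user-agents` on PyPI. Assuming the PyPI package is the one the middleware path refers to, the requirements line would simply be:)

```text
# requirements.txt — assuming the PyPI package "scrapy-user-agents"
# (which provides the module scrapy_user_agents) is what is wanted
scrapy-user-agents
```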

What am I doing wrong? By the way: the spider works fine when I run it on my local machine.

File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 177, in crawl
return self._crawl(crawler, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 181, in _crawl
d = crawler.crawl(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
return _cancellableInlineCallbacks(gen)
File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
_inlineCallbacks(None, g, status)
--- <exception caught here> ---
File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 89, in crawl
self.engine = self._create_engine()
File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 103, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/usr/local/lib/python3.8/site-packages/scrapy/core/engine.py", line 69, in __init__
self.downloader = downloader_cls(crawler)
File "/usr/local/lib/python3.8/site-packages/scrapy/core/downloader/__init__.py", line 83, in __init__
self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
File "/usr/local/lib/python3.8/site-packages/scrapy/middleware.py", line 53, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/usr/local/lib/python3.8/site-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/usr/local/lib/python3.8/site-packages/scrapy/utils/misc.py", line 50, in load_object
mod = import_module(module)
File "/usr/local/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
builtins.ModuleNotFoundError: No module named 'scrapy_user_agents'
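The traceback shows the failure happening inside Scrapy's `load_object`, which imports the module part of each dotted middleware path at engine startup, so a typo in a `DOWNLOADER_MIDDLEWARES` key surfaces exactly like this. A minimal sketch of that import step (`load_object_module` is a hypothetical helper mimicking Scrapy's behaviour, not part of its API):

```python
import importlib

def load_object_module(path: str):
    """Import the module part of a dotted path, as Scrapy's load_object does."""
    module, _, _name = path.rpartition('.')
    return importlib.import_module(module)

try:
    load_object_module('scrapy_user_agents.middlewares.RandomUserAgentMiddleware')
except ModuleNotFoundError as exc:
    # With the package absent, this reproduces the deploy-time error
    print(exc)
```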

Update 1

Following @Thiago Curvelo's suggestion:

Well, something strange is going on.

When I run the spider locally, the following works for me:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

Then I changed it to scrapy_useragents as you suggested:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
}

Running it locally now gives the error:

ModuleNotFoundError: No module named 'scrapy_useragents'

Nevertheless, I deployed it to Zyte again with shub deploy <myid>.

When running on Zyte, I now get a different error, specifically:

Connection was refused by other side: 111: Connection refused.

I have no idea what is going on here.
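(One guess: the log below shows the request going via http://localhost:8050/execute, i.e. a Splash instance on localhost. That exists on my machine but presumably not inside Zyte's container, which would explain the connection being refused. If so, the fix would be pointing SPLASH_URL at a Splash instance reachable from Zyte; a sketch only, with a placeholder host:)

```python
# settings.py — sketch only; <your-splash-host> is a placeholder for a
# Splash instance reachable from Zyte Cloud (e.g. a hosted Splash add-on)
SPLASH_URL = 'http://<your-splash-host>:8050'
```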

My log (CSV download):

time,level,message
01-10-2021 08:57,INFO,Log opened.
01-10-2021 08:57,INFO,[scrapy.utils.log] Scrapy 2.0.0 started (bot: foobar)
01-10-2021 08:57,INFO,"[scrapy.utils.log] Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.2 (default, Feb 26 2020, 15:09:34) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-72-generic-x86_64-with-glibc2.2.5"
01-10-2021 08:57,INFO,"[scrapy.crawler] Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
'BOT_NAME': 'foobar',
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'LOG_ENABLED': False,
'LOG_LEVEL': 'INFO',
'MEMUSAGE_LIMIT_MB': 950,
'NEWSPIDER_MODULE': 'foobar.spiders',
'SPIDER_MODULES': ['foobar.spiders'],
'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector',
'TELNETCONSOLE_HOST': '0.0.0.0'}"
01-10-2021 08:57,INFO,[scrapy.extensions.telnet] Telnet Password: <password>
01-10-2021 08:57,INFO,"[scrapy.middleware] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.spiderstate.SpiderState',
'scrapy.extensions.throttle.AutoThrottle',
'scrapy.extensions.debug.StackTraceDump',
'sh_scrapy.extension.HubstorageExtension']"
01-10-2021 08:57,INFO,"[scrapy.middleware] Enabled downloader middlewares:
['sh_scrapy.diskquota.DiskQuotaDownloaderMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'sh_scrapy.middlewares.HubstorageDownloaderMiddleware']"
01-10-2021 08:57,INFO,"[scrapy.middleware] Enabled spider middlewares:
['sh_scrapy.diskquota.DiskQuotaSpiderMiddleware',
'sh_scrapy.middlewares.HubstorageSpiderMiddleware',
'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']"
01-10-2021 08:57,INFO,"[scrapy.middleware] Enabled item pipelines:
[]"
01-10-2021 08:57,INFO,[scrapy.core.engine] Spider opened
01-10-2021 08:57,INFO,"[scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)"
01-10-2021 08:57,INFO,[scrapy_useragents.downloadermiddlewares.useragents] Load 0 user_agents from settings.
01-10-2021 08:57,INFO,TelnetConsole starting on 6023
01-10-2021 08:57,INFO,[scrapy.extensions.telnet] Telnet console listening on 0.0.0.0:6023
01-10-2021 08:57,WARNING,"[py.warnings] /usr/local/lib/python3.8/site-packages/scrapy_splash/request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
url = to_native_str(url)
"
01-10-2021 08:57,ERROR,[scrapy.downloadermiddlewares.retry] Gave up retrying <GET https://www.example.com/allobjects via http://localhost:8050/execute> (failed 3 times): Connection was refused by other side: 111: Connection refused.
01-10-2021 08:57,ERROR,"[scrapy.core.scraper] Error downloading <GET https://www.example.com/allobjects via http://localhost:8050/execute>
Traceback (most recent call last):
File ""/usr/local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py"", line 42, in process_request
defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 111: Connection refused."
01-10-2021 08:57,INFO,[scrapy.core.engine] Closing spider (finished)
01-10-2021 08:57,INFO,"[scrapy.statscollectors] Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 3,
'downloader/request_bytes': 3813,
'downloader/request_count': 3,
'downloader/request_method_count/POST': 3,
'elapsed_time_seconds': 12.989914,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 10, 1, 8, 57, 26, 273397),
'log_count/ERROR': 2,
'log_count/INFO': 11,
'log_count/WARNING': 1,
'memusage/max': 62865408,
'memusage/startup': 62865408,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/disk': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/disk': 4,
'splash/execute/request_count': 1,
'start_time': datetime.datetime(2021, 10, 1, 8, 57, 13, 283483)}"
01-10-2021 08:57,INFO,[scrapy.core.engine] Spider closed (finished)
01-10-2021 08:57,INFO,Main loop terminated.

It looks like there is a typo in your middleware settings. Scrapy is looking for a module named scrapy_user_agents, but the correct name is scrapy_useragents.

Double-check the contents of DOWNLOADER_MIDDLEWARES in your settings.py. It should look like this:

DOWNLOADER_MIDDLEWARES = {
    # ...
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
}
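One more thing worth checking: your log says "Load 0 user_agents from settings.", which suggests the middleware found no user-agent pool to rotate through. If I remember correctly, scrapy_useragents reads that pool from a USER_AGENTS setting; a minimal sketch (the exact entry format is described in the package's README, and the strings below are just example values):

```python
# settings.py — hypothetical minimal pool; check the scrapy-useragents
# README for the exact expected format
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]
```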
