How to handle timeouts with Scrapy



I want to record timeouts by using process_spider_exception in a middleware registered under DOWNLOADER_MIDDLEWARES. Here is the code:

class CambridgespiderSpiderMiddleware(object):
    def process_spider_exception(self, response, exception, spider):
        with open(r"error_url.txt", 'a') as f:
            f.write(str(exception) + ': ' + str(response.url))  
        return response

settings.py is:

DOWNLOADER_MIDDLEWARES = {
    'CambridgeSpider.middlewares.CambridgespiderSpiderMiddleware': 543,
}

I use the official demo spider to illustrate my problem:

class CambridgeSpider(CrawlSpider):
    name = "Cambridge"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]
    def start_requests(self):
        for u in self.start_urls:
            yield Request(u, callback=self.parse_httpbin,
                                    dont_filter=True)
    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))

The middleware is loaded successfully, but I don't know why it does not create the file error_url.txt. Here is the log:

2017-06-22 16:47:43 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: CambridgeSpider)
2017-06-22 16:47:43 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'CambridgeSpider.spiders', 'FEED_URI': 'Cambridge.csv', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36', 'SPIDER_MODULES': ['CambridgeSpider.spiders'], 'AUTOTHROTTLE_START_DELAY': 3, 'LOG_FILE': 'cambridge.log', 'BOT_NAME': 'CambridgeSpider', 'DOWNLOAD_TIMEOUT': 60, 'RETRY_TIMES': 3, 'FEED_FORMAT': 'csv', 'AUTOTHROTTLE_ENABLED': True, 'DOWNLOAD_DELAY': 2, 'AUTOTHROTTLE_DEBUG': True}
2017-06-22 16:47:43 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.throttle.AutoThrottle']
2017-06-22 16:47:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'CambridgeSpider.middlewares.CambridgespiderSpiderMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-22 16:47:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-22 16:47:44 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-06-22 16:47:44 [scrapy.core.engine] INFO: Spider opened
2017-06-22 16:47:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-22 16:47:44 [Cambridge] INFO: Spider opened: Cambridge
2017-06-22 16:47:44 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-06-22 16:47:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httphttpbinbin.org/> (failed 1 times): DNS lookup failed: no results for hostname lookup: www.httphttpbinbin.org.
2017-06-22 16:47:45 [scrapy.extensions.throttle] INFO: slot: www.httpbin.org | conc: 1 | delay: 2000 ms (-1000) | latency:  644 ms | size: 12793 bytes
2017-06-22 16:47:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.httpbin.org/> (referer: None)
2017-06-22 16:47:45 [Cambridge] INFO: Got successful response from http://www.httpbin.org/
2017-06-22 16:47:47 [scrapy.extensions.throttle] INFO: slot: www.httpbin.org | conc: 1 | delay: 2000 ms (+0) | latency:  321 ms | size:     0 bytes
2017-06-22 16:47:47 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.httpbin.org/status/404> (referer: None)
2017-06-22 16:47:47 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 http://www.httpbin.org/status/404>: HTTP status code is not handled or not allowed
2017-06-22 16:47:48 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httphttpbinbin.org/> (failed 2 times): DNS lookup failed: no results for hostname lookup: www.httphttpbinbin.org.
2017-06-22 16:47:50 [scrapy.extensions.throttle] INFO: slot: www.httpbin.org | conc: 1 | delay: 2000 ms (+0) | latency:  316 ms | size:     0 bytes
2017-06-22 16:47:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 1 times): 500 Internal Server Error
2017-06-22 16:47:51 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httphttpbinbin.org/> (failed 3 times): DNS lookup failed: no results for hostname lookup: www.httphttpbinbin.org.
2017-06-22 16:47:53 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.httphttpbinbin.org/> (failed 4 times): DNS lookup failed: no results for hostname lookup: www.httphttpbinbin.org.
2017-06-22 16:47:53 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.httphttpbinbin.org/>
Traceback (most recent call last):
  File "j:\python27\lib\site-packages\twisted\internet\defer.py", line 1299, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "j:\python27\lib\site-packages\twisted\python\failure.py", line 393, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "j:\python27\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "j:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "j:\python27\lib\site-packages\twisted\internet\endpoints.py", line 838, in startConnectionAttempts
    "no results for hostname lookup: {}".format(self._hostStr)
DNSLookupError: DNS lookup failed: no results for hostname lookup: www.httphttpbinbin.org.
2017-06-22 16:47:54 [scrapy.extensions.throttle] INFO: slot: www.httpbin.org | conc: 2 | delay: 2000 ms (+0) | latency:  346 ms | size:     0 bytes
2017-06-22 16:47:54 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 2 times): 500 Internal Server Error
2017-06-22 16:47:57 [scrapy.extensions.throttle] INFO: slot: www.httpbin.org | conc: 2 | delay: 2000 ms (+0) | latency:  250 ms | size:     0 bytes
2017-06-22 16:47:57 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 3 times): 500 Internal Server Error
2017-06-22 16:47:59 [scrapy.extensions.throttle] INFO: slot: www.httpbin.org | conc: 2 | delay: 2000 ms (+0) | latency:  250 ms | size:     0 bytes
2017-06-22 16:47:59 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.httpbin.org/status/500> (failed 4 times): 500 Internal Server Error
2017-06-22 16:47:59 [scrapy.core.engine] DEBUG: Crawled (500) <GET http://www.httpbin.org/status/500> (referer: None)
2017-06-22 16:47:59 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 http://www.httpbin.org/status/500>: HTTP status code is not handled or not allowed
2017-06-22 16:48:11 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org:12345/> (failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time..
2017-06-22 16:48:29 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org:12345/> (failed 2 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time..
2017-06-22 16:48:44 [scrapy.extensions.logstats] INFO: Crawled 3 pages (at 3 pages/min), scraped 0 items (at 0 items/min)
2017-06-22 16:48:48 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.httpbin.org:12345/> (failed 3 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time..
2017-06-22 16:49:07 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.httpbin.org:12345/> (failed 4 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time..
2017-06-22 16:49:07 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.httpbin.org:12345/>: TCP connection timed out: 10060: \u7531\u4e8e\u8fde\u63a5\u65b9\u5728\u4e00\u6bb5\u65f6\u95f4\u540e\u6ca1\u6709\u6b63\u786e\u7b54\u590d\u6216\u8fde\u63a5\u7684\u4e3b\u673a\u6ca1\u6709\u53cd\u5e94\uff0c\u8fde\u63a5\u5c1d\u8bd5\u5931\u8d25\u3002.
2017-06-22 16:49:07 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-22 16:49:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 8,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 4,
 'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 4,
 'downloader/request_bytes': 4124,
 'downloader/request_count': 14,
 'downloader/request_method_count/GET': 14,
 'downloader/response_bytes': 14468,
 'downloader/response_count': 6,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'downloader/response_status_count/500': 4,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 6, 22, 8, 49, 7, 613000),
 'log_count/DEBUG': 16,
 'log_count/ERROR': 2,
 'log_count/INFO': 18,
 'response_received_count': 3,
 'scheduler/dequeued': 14,
 'scheduler/dequeued/memory': 14,
 'scheduler/enqueued': 14,
 'scheduler/enqueued/memory': 14,
 'start_time': datetime.datetime(2017, 6, 22, 8, 47, 44, 413000)}
2017-06-22 16:49:07 [scrapy.core.engine] INFO: Spider closed (finished)

I know I can use

Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)  
def errback_httpbin(self, failure):
    if failure.check(TimeoutError, TCPTimedOutError):
        with open(r"error_url.txt", 'a') as f:
            f.write(str(failure) + ': ' + str(failure.request.url))

to do the same job. But my original spider uses

rules = (
    Rule(LinkExtractor(allow=r'/core/journals/ed')),
)

and I cannot attach an errback there, so please help me.

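Download errors such as timeouts are passed to the process_exception(request, exception, spider) method of downloader middlewares. process_spider_exception is a spider-middleware hook and is not called for them, which is why error_url.txt is never written. You can create a retry middleware instead. Make it a subclass of the default RetryMiddleware rather than writing a new class from scratch. It would look like this: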

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from twisted.internet.error import TCPTimedOutError, TimeoutError


class FakeUserAgentErrorRetryMiddleware(RetryMiddleware):
    def process_exception(self, request, exception, spider):
        # Retry only on timeouts; for any other exception this returns None,
        # so the remaining middlewares handle it.
        if isinstance(exception, (TimeoutError, TCPTimedOutError)):
            return self._retry(request, exception, spider)
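If the goal is only to save the timed-out URLs to error_url.txt, as in the question, a plain downloader middleware with a process_exception method may be enough. Below is a minimal sketch, not a definitive implementation. The class name TimeoutRecorderMiddleware and the TIMEOUT_EXCEPTION_NAMES set are hypothetical names introduced here, and the check compares exception class names only so that the sketch runs without importing Twisted; in a real project an isinstance() check against the twisted.internet.error classes, as in the code above, is more robust.

```python
# Class names taken from the crawl stats in the question's log
# (twisted.internet.error.TCPTimedOutError) and from the errback example.
# Assumption: matching by name is used here only to keep the sketch import-free.
TIMEOUT_EXCEPTION_NAMES = {"TimeoutError", "TCPTimedOutError"}


class TimeoutRecorderMiddleware(object):
    """Hypothetical downloader middleware that appends timed-out URLs to a file."""

    def __init__(self, path="error_url.txt"):
        # The default path matches the file name used in the question.
        self.path = path

    def process_exception(self, request, exception, spider):
        if type(exception).__name__ in TIMEOUT_EXCEPTION_NAMES:
            # One line per failure: "<error message>: <url>"
            with open(self.path, "a") as f:
                f.write("{}: {}\n".format(exception, request.url))
        # Returning None leaves the exception to the remaining middlewares.
        return None
```

It would be registered under DOWNLOADER_MIDDLEWARES the same way as in the question. Scrapy calls process_exception in decreasing order of the configured number, and the built-in RetryMiddleware sits at 550 by default, so at 543 this class should only see exceptions that RetryMiddleware has given up retrying, which would give one line per failed URL rather than one per attempt.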
