How to raise twisted.internet.error.TimeoutError directly, without making the request

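I have a database with thousands of URLs that I crawl with a spider. Many of them can share a domain; for example, 100 URLs might belong to the same one: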

http://notsame.com/1
http://notsame2.com/1
http://dom.com/1
http://dom.com/2
http://dom.com/3
...

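The problem is that sometimes a page/domain returns nothing at all, so I get <twisted.python.failure.Failure twisted.internet.error.TimeoutError: User timeout caused connection failure:. The same happens for every URL of that domain.

What I would like is to detect, say, 5 timed-out URLs from the same domain and then, once I am sure this host has a problem, stop requesting that domain and directly raise <twisted.python.failure.Failure twisted.internet.error.TimeoutError: User timeout caused connection failure:

Is this possible? If so, how?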

EDIT:

My idea (edited with the help of rrschmidt):

class TimeoutProcessMiddleware:
    _timeouted_domains = set()
    def process_request(self, request, spider):
        domain = get_domain(request.url)
        if domain in self._timeouted_domains:
            return twisted.internet.error.TimeoutError
        return request
    def process_response(self, request, exception, spider):
        # left out the code for counting timeouts for clarity
        if is_timeout_exception(exception):
            self._timeouted_domains.add(get_domain(request.url))

Your idea of building a TimeoutProcessMiddleware is on the right track. More specifically, I would build it as a downloader middleware.

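A downloader middleware can touch every outgoing request and every incoming response, and it can also handle every exception that comes up while a request/response is being processed. Details: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html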
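To make those three hooks concrete, here is a minimal sketch of a downloader middleware that does nothing but pass everything through. The class name PassThroughMiddleware is made up for illustration; the method names and return conventions are the ones described on the documentation page linked above.

```python
class PassThroughMiddleware:
    """Hypothetical no-op middleware showing the three downloader hooks."""

    def process_request(self, request, spider):
        # Called for every outgoing request.
        # Returning None tells Scrapy to keep processing the request.
        return None

    def process_response(self, request, response, spider):
        # Called for every incoming response.
        # Returning the response passes it on unchanged.
        return response

    def process_exception(self, request, exception, spider):
        # Called when the download (or a process_request hook) raises.
        # Returning None means "not handled here, let others try".
        return None
```

Note that a downloader middleware is a plain class; it only has to define the hooks it actually needs.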

So here is what I would do (an untested gist that may need some fine-tuning):

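from scrapy.exceptions import IgnoreRequest

class TimeoutProcessMiddleware:
    # Downloader middlewares are plain classes; no base class is required.
    # get_domain() and is_timeout_exception() are helpers you have to supply.
    _timeouted_domains = set()

    def process_request(self, request, spider):
        domain = get_domain(request.url)
        if domain in self._timeouted_domains:
            # skip the download for domains already known to time out
            raise IgnoreRequest(f"{domain} timed out earlier")
        return None  # continue processing the request normally

    def process_exception(self, request, exception, spider):
        # left out the code for counting timeouts for clarity
        if is_timeout_exception(exception):
            self._timeouted_domains.add(get_domain(request.url))
        return None  # let other middlewares handle the exception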
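The counting that the gist leaves out could look roughly like the sketch below. It uses only the standard library, so it can be tried without Scrapy. DomainTimeoutTracker, record_timeout and is_blocked are names invented for this example, get_domain is one possible implementation of the helper used above, and the threshold of 5 comes from the question.

```python
from collections import Counter
from urllib.parse import urlparse


def get_domain(url):
    """Return the host part of a URL, e.g. 'dom.com' for 'http://dom.com/1'."""
    return urlparse(url).hostname or ""


class DomainTimeoutTracker:
    """Counts timeouts per domain and flags domains that reach a threshold."""

    def __init__(self, max_timeouts=5):
        self.max_timeouts = max_timeouts
        self._timeouts = Counter()

    def record_timeout(self, url):
        # Call this whenever a request to `url` ended in a timeout.
        self._timeouts[get_domain(url)] += 1

    def is_blocked(self, url):
        # True once the domain of `url` has timed out max_timeouts times.
        return self._timeouts[get_domain(url)] >= self.max_timeouts


# Usage: five timeouts on dom.com block the domain, other domains stay open.
tracker = DomainTimeoutTracker(max_timeouts=5)
for i in range(1, 6):
    tracker.record_timeout(f"http://dom.com/{i}")
```

In the middleware, process_exception would call tracker.record_timeout(request.url) for timeout exceptions, and process_request would check tracker.is_blocked(request.url) before raising. One thing to verify in your setup: as far as I know, the built-in RetryMiddleware sits at priority 550 and retries timed-out requests, so this middleware probably needs a number above 550 in DOWNLOADER_MIDDLEWARES to see each timeout first.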
