Python - ClosedPoolError when crawling with urllib3



I am building a crawler with Python 3 and urllib3. I use a single PoolManager instance that is shared by 15 different threads. While crawling thousands of websites, I get many ClosedPoolError exceptions from different sites.

From the documentation - ClosedPoolError:

Raised when a request enters a pool after the pool has been closed.

So the PoolManager instance is apparently trying to use a connection whose pool has already been closed.

My code:

from urllib3 import PoolManager, util, Retry
from urllib3.exceptions import MaxRetryError
# Instance of PoolManager is started on init
manager = PoolManager(num_pools=15,
                      maxsize=6,
                      timeout=40.0,
                      retries=Retry(connect=2, read=2, redirect=10))
# Every thread execute download by using the pool manager instance
url_to_download = "**"
headers = util.make_headers(accept_encoding='gzip, deflate',
                            keep_alive=True,
                            user_agent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0")
headers['Accept-Language'] = "en-US,en;q=0.5"
headers['Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
try:
    response = manager.request('GET',
                               url_to_download,
                               preload_content=False,
                               headers=headers)
except MaxRetryError as ex:
    raise FailedToDownload()
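
For context, all the workers share that single manager instance, roughly like this (a sketch only; the actual worker code is not shown here, and the queue/sentinel handling is just illustrative):

import threading
from queue import Queue

# Sketch: 15 worker threads pull URLs from a shared queue and all reuse the
# one `manager` defined above. `download` wraps the request shown earlier.
urls = Queue()

def download(url):
    response = manager.request('GET', url,
                               preload_content=False,
                               headers=headers)
    data = response.read()       # consume the body...
    response.release_conn()      # ...then return the socket to the pool
    return data

def worker():
    while True:
        url = urls.get()
        if url is None:          # sentinel value: stop this worker
            break
        try:
            download(url)
        finally:
            urls.task_done()

threads = [threading.Thread(target=worker, daemon=True) for _ in range(15)]
for t in threads:
    t.start()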

How can I make the PoolManager reconnect and retry?

I had the same problem and ended up monkey-patching _get_conn to unblock myself, but this is far from an ideal solution:

import logging
import queue

from urllib3 import connectionpool
from urllib3.exceptions import EmptyPoolError
from urllib3.util.connection import is_connection_dropped

log = logging.getLogger(__name__)

def _get_conn(self, timeout=None):
    conn = None
    try:
        conn = self.pool.get(block=self.block, timeout=timeout)
    except AttributeError:
        # self.pool is None: the pool has been closed. The stock implementation
        # raises ClosedPoolError here; instead, hand out a fresh connection.
        return self._new_conn()
    except queue.Empty:
        if self.block:
            raise EmptyPoolError(
                self,
                "Pool reached maximum size and no more connections are allowed.",
            )
        pass  # No idle connection available; fall through and create a new one.

    # Discard connections that were dropped on the server side.
    if conn and is_connection_dropped(conn):
        log.debug("Resetting dropped connection: %s", self.host)
        conn.close()
        if getattr(conn, "auto_open", 1) == 0:
            # Tunneled (proxied) connections cannot be reused after closing.
            conn = None

    return conn or self._new_conn()

# Apply the patch to every HTTPConnectionPool created from now on.
connectionpool.HTTPConnectionPool._get_conn = _get_conn
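
If you go this route, make sure the patch is applied before the PoolManager creates any pools. As an extra safeguard you can also catch ClosedPoolError at the call site and retry the request; a rough sketch on top of the code above (the helper name and retry count are only illustrative, and `manager` / `headers` are the objects from the question):

from urllib3.exceptions import ClosedPoolError

def request_with_retry(url, attempts=2):
    # Retry a GET whose pool was closed underneath it; this complements the
    # _get_conn patch above rather than replacing it.
    for attempt in range(attempts):
        try:
            return manager.request('GET', url,
                                   preload_content=False,
                                   headers=headers)
        except ClosedPoolError:
            if attempt == attempts - 1:
                raise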
