I am building a crawler with Python 3 and urllib3. I use a single PoolManager instance that is shared by 15 different threads. While crawling thousands of websites, I get a lot of ClosedPoolError exceptions for different sites.
From the documentation on ClosedPoolError:

    Raised when a request enters a pool after the pool has been closed.

So the PoolManager instance is apparently trying to use a connection that has already been closed.
from urllib3 import PoolManager, util, Retry
from urllib3.exceptions import MaxRetryError

# A single PoolManager instance is created on init and shared by all threads
manager = PoolManager(num_pools=15,
                      maxsize=6,
                      timeout=40.0,
                      retries=Retry(connect=2, read=2, redirect=10))

# Every thread performs its downloads through the shared pool manager instance
url_to_download = "**"
headers = util.make_headers(accept_encoding='gzip, deflate',
                            keep_alive=True,
                            user_agent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0")
headers['Accept-Language'] = "en-US,en;q=0.5"
headers['Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
try:
    response = manager.request('GET',
                               url_to_download,
                               preload_content=False,
                               headers=headers)
except MaxRetryError as ex:
    raise FailedToDownload()  # FailedToDownload is a custom exception defined elsewhere
How can I make the PoolManager reconnect and retry the request?
I had the same problem and ended up monkey-patching _get_conn to get unblocked, but this is far from an ideal solution:
import queue
import logging

from urllib3 import connectionpool
from urllib3.exceptions import EmptyPoolError
from urllib3.util.connection import is_connection_dropped

log = logging.getLogger(__name__)


def _get_conn(self, timeout=None):
    conn = None
    try:
        conn = self.pool.get(block=self.block, timeout=timeout)
    except AttributeError:  # self.pool is None, i.e. the pool has been closed
        # Hand out a fresh connection instead of raising ClosedPoolError
        return self._new_conn()
    except queue.Empty:
        if self.block:
            raise EmptyPoolError(
                self,
                "Pool reached maximum size and no more connections are allowed.",
            )
        pass  # No idle connection available; fall through and create one

    # If this is a persistent connection, check whether it was dropped
    if conn and is_connection_dropped(conn):
        log.debug("Resetting dropped connection: %s", self.host)
        conn.close()
        if getattr(conn, "auto_open", 1) == 0:
            # Proxied connections mutated by _tunnel() cannot be reused
            conn = None

    return conn or self._new_conn()


# Apply the patch so every HTTPConnectionPool uses the modified method
connectionpool.HTTPConnectionPool._get_conn = _get_conn
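
For what it's worth, the closed pools are most likely a side effect of num_pools=15: the PoolManager keeps at most that many per-host connection pools and closes the least-recently-used one when a new host comes along, so with thousands of different sites a pool can be closed while another thread is still holding it. If patching urllib3 internals feels too invasive, a softer alternative is to catch ClosedPoolError at the call site and retry, since the manager should simply build a fresh pool for that host on the next request (raising num_pools may also make the error rarer). A minimal sketch of that approach, reusing the manager, headers and FailedToDownload from the question; the download helper and max_attempts are illustrative, not part of the original code:

from urllib3.exceptions import ClosedPoolError, MaxRetryError


def download(url, headers, max_attempts=3):
    # Retry when the pool behind this host was closed out from under us;
    # on the next call the PoolManager should create a fresh pool for the host.
    for attempt in range(max_attempts):
        try:
            return manager.request('GET',
                                   url,
                                   preload_content=False,
                                   headers=headers)
        except ClosedPoolError:
            if attempt == max_attempts - 1:
                raise FailedToDownload()
        except MaxRetryError:
            raise FailedToDownload()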