Why doesn't timeout work in the requests library?



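I want to parse data from many websites. Part of my code is a function that fetches the data from a requested URL.

Here is my function. As you can see, I set a timeout on the get call.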

import requests, re
from lxml import html
from requests_html import HTMLSession
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

def get_source(url):
    try:
        session = HTMLSession()
        # Disable retries on connection errors.
        retry = Retry(connect=0, backoff_factor=0.5)
        adapter = HTTPAdapter(max_retries=retry)
        session.mount('http://', adapter)
        session.mount('https://', adapter)
        response = session.get(url, verify=False, timeout=0.5)
        #response = session.get(url, verify=False, timeout=(0.5, 0.5))
        return response
    except requests.exceptions.RequestException as e:
        print(e)
        return None

But when I call this function, for example with the URL below, it takes more than 300 seconds to run.

https://www.bjcta.org/wp-content/uploads/2021/02/Unified-Certification-Program-DBEs-Alabama.xls

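I don't know what the underlying problem is, or how to set the timeout so that the execution time stays bounded.

I checked your code and it works as expected.
If I change it to timeout=0.1, it raises an exception.
The only thing I changed was the import: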

# this 
from requests.packages.urllib3.util.retry import Retry
# to this
from urllib3.util.retry import Retry

My code (actually yours):

import requests
from requests_html import HTMLSession
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def get_source(url):
    try:
        session = HTMLSession()
        # Disable retries on connection errors.
        retry = Retry(connect=0, backoff_factor=0.5)
        adapter = HTTPAdapter(max_retries=retry)
        session.mount('http://', adapter)
        session.mount('https://', adapter)
        print("start")
        response = session.get(url, verify=False, timeout=0.5)
        # response = session.get(url, verify=False, timeout=(0.5, 0.5))
        print("done")
        return response
    except requests.exceptions.RequestException as e:
        print(e)
        return None

url = "https://www.bjcta.org/wp-content/uploads/2021/02/Unified-Certification-Program-DBEs-Alabama.xls"
print(get_source(url))

Output:

$ python stackOverflow/requests_timeout.py 
start
/home/gad/notes/venv/lib/python3.8/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.bjcta.org'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
warnings.warn(
done
<Response [200]>

With timeout=0.1:

$ python stackOverflow/requests_timeout.py 
start
HTTPSConnectionPool(host='www.bjcta.org', port=443): Max retries exceeded with url: /wp-content/uploads/2021/02/Unified-Certification-Program-DBEs-Alabama.xls (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f8b11e056d0>, 'Connection to www.bjcta.org timed out. (connect timeout=0.1)'))
None
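
One possible explanation for the 300-second runs, offered as a guess since they did not reproduce here: in requests, timeout is not a limit on the whole download. A single number (or the (connect, read) pair from the commented-out line) bounds the connection attempt and each wait for bytes on the socket, so a server that keeps sending small pieces of data slowly never triggers it. If that is what happens, a rough way to cap the total time is to stream the body and check a clock between chunks. The sketch below assumes this; the function name get_with_deadline and its total_deadline parameter are made up for illustration and are not part of requests or requests_html.

```python
import time


def get_with_deadline(session, url, total_deadline=5.0, timeout=(0.5, 0.5),
                      chunk_size=8192, **kwargs):
    """Download url through session, giving up after total_deadline seconds.

    session is any requests-compatible session, for example the HTMLSession
    from the question. timeout is the usual (connect, read) pair and limits
    each individual socket wait; total_deadline (a hypothetical parameter of
    this sketch) limits the call as a whole.
    """
    start = time.monotonic()
    # stream=True makes get() return once the headers have arrived, so the
    # body can be read chunk by chunk and the clock checked in between.
    response = session.get(url, timeout=timeout, stream=True, **kwargs)
    try:
        chunks = []
        for chunk in response.iter_content(chunk_size=chunk_size):
            chunks.append(chunk)
            if time.monotonic() - start > total_deadline:
                raise TimeoutError(
                    f"{url} exceeded the total deadline of {total_deadline}s"
                )
        return b"".join(chunks)
    finally:
        # Always release the connection, whether or not the deadline was hit.
        response.close()
```

It could be called as get_with_deadline(HTMLSession(), url, total_deadline=10, verify=False). Note that it returns the raw body as bytes rather than a Response object, and that it raises the built-in TimeoutError, which is not a RequestException, so the except clause in get_source would need to catch it as well. The clock is only checked between chunks, so the worst case is roughly total_deadline plus one read timeout.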
