Getting a 'Max retries exceeded with url' error with requests set to 1 per 30 seconds (as required by the target site's robots.txt)



I'm trying to scrape www.eliteprospects.com, a hockey statistics website with player stats for junior (ages 16-20) players. When I run my Python script, I get an error I haven't been able to find a solution for.

I've read through many previous Stack Overflow questions about "Max retries exceeded with url", but none of them seem to fit my particular problem. The site I'm trying to scrape, www.eliteprospects.com, has a robots.txt page that limits requests to 1 per 30 seconds. I put a sleep(30) line in my code after the requests.get(url) call, but I still get the error.
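For reference, here's a minimal sketch of how I understand the pacing should work, using the standard library's urllib.robotparser to read the crawl delay directly (the '*' user agent and the 30-second fallback are assumptions on my part):

import time
from urllib import robotparser

import requests

# Read the site's robots.txt to see the declared crawl delay.
# '*' is a placeholder user agent; crawl_delay() returns None
# when robots.txt has no Crawl-delay directive for that agent.
rp = robotparser.RobotFileParser()
rp.set_url('https://www.eliteprospects.com/robots.txt')
rp.read()
delay = rp.crawl_delay('*') or 30  # fall back to the 30s limit mentioned above

_last_request = 0.0

def polite_get(url):
    """GET a URL, sleeping so requests stay at least `delay` seconds apart."""
    global _last_request
    wait = delay - (time.time() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.time()
    return requests.get(url)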

Many of the previous Stack Overflow questions are related to my problem, sometimes very similar, but whenever I implement one of the solutions offered, I keep getting these same errors.

I'm not sure what I'm doing wrong. Is it my code? Is it the website? Did I run the scraper with too many requests earlier and get banned? Is there something in my for loop that I can't see because I've stared at it too long? I don't know... Please help.

Cheers

The error I'm getting...

Traceback (most recent call last):
  File "D:\Analytics\EliteProspects\venv\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "D:\Analytics\EliteProspects\venv\lib\site-packages\urllib3\util\connection.py", line 57, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "C:\Users\TPCal\AppData\Local\Programs\Python\Python37-32\lib\socket.py", line 748, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11001] getaddrinfo failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Analytics\EliteProspects\venv\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "D:\Analytics\EliteProspects\venv\lib\site-packages\urllib3\connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "C:\Users\TPCal\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\TPCal\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "C:\Users\TPCal\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1224, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "C:\Users\TPCal\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1016, in _send_output
    self.send(msg)
  File "C:\Users\TPCal\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 956, in send
    self.connect()
  File "D:\Analytics\EliteProspects\venv\lib\site-packages\urllib3\connection.py", line 181, in connect
    conn = self._new_conn()
  File "D:\Analytics\EliteProspects\venv\lib\site-packages\urllib3\connection.py", line 168, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x0E241110>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Analytics\EliteProspects\venv\lib\site-packages\requests\adapters.py", line 449, in send
    timeout=timeout
  File "D:\Analytics\EliteProspects\venv\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "D:\Analytics\EliteProspects\venv\lib\site-packages\urllib3\util\retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.elitepospects.com', port=80): Max retries exceeded with url: /league/whl/stats/2005-2006?sort=tp&page=1 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0E241110>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:/Analytics/EliteProspects/EliteProspects.py", line 42, in <module>
    headers = headers)
  File "D:\Analytics\EliteProspects\venv\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "D:\Analytics\EliteProspects\venv\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\Analytics\EliteProspects\venv\lib\site-packages\requests\sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\Analytics\EliteProspects\venv\lib\site-packages\requests\sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "D:\Analytics\EliteProspects\venv\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.elitepospects.com', port=80): Max retries exceeded with url: /league/whl/stats/2005-2006?sort=tp&page=1 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0E241110>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

My Python script...

from requests import get
from bs4 import BeautifulSoup
from time import time
from time import sleep
from IPython.core.display import clear_output
from warnings import warn
import pandas as pd

leagues = ['whl', 'ohl', 'qmjhl']
leagues_url = [str(i) for i in leagues]
seasons = ['2005-2006', '2006-2007', '2007-2008', '2008-2009', '2009-2010',
           '2010-2011', '2011-2012', '2012-2013', '2013-2014', '2014-2015',
           '2016-2017', '2017-2018']
seasons_url = [str(i) for i in seasons]
pages = [str(i) for i in range(1, 5)]

players = []
games_played = []
goals = []
assists = []
penalty_minutes = []
plus_minus = []

start_time = time()
requests = 0

for league in leagues_url:
    for season in seasons_url:
        for page in pages:
            response = get('http://www.elitepospects.com/league/'
                           + league
                           + '/stats/'
                           + season
                           + '?sort=tp&page='
                           + page)
            sleep(30)
            requests += 1
            elapsed_time = time() - start_time
            print('Requests: {}; Frequency: {} requests/s'.format(requests, requests / elapsed_time))
            clear_output(wait=True)

            if response.status_code != 200:
                warn('Request: {}; Status code: {}'.format(requests, response.status_code))
            if requests > 180:
                warn('Number of requests was greater than expected.')
                break

            page_html = BeautifulSoup(response.text, 'html.parser')
            table = page_html.find('div', {'id': 'skater-stats'})
            table_rows = table.find_all('tr')
            for tr in table_rows:
                if tr.find('td', {'style': 'white-space: nowrap;'}) is not None:
                    player = tr.span.a.text
                    players.append(player)
                    gp = tr.find('td', {'class': 'gp'}).text
                    games_played.append(int(gp))
                    g = tr.find('td', {'class': 'g'})
                    goals.append(int(g))
                    a = tr.find('td', {'class': 'a'})
                    assists.append(a)
                    pim = tr.find('td', {'class': 'pim'})
                    penalty_minutes.append(int(pim))
                    pm = tr.find('td', {'class': 'pm'})
                    plus_minus.append(int(pm))

player_stats = pd.DataFrame({'player_name': players,
                             'gp': games_played,
                             'g': goals,
                             'a': assists,
                             'pim': penalty_minutes,
                             'plus_minus': plus_minus})

print(player_stats.info())
print(player_stats.describe())
print(player_stats.head(10))

player_stats.to_csv('CHL_player_stats.csv', index=False)

I think this is a typo problem:

prospects is misspelled in your script:

elitepospects -> eliteprospects
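You can confirm the failure is DNS and not rate limiting: the misspelled hostname never resolves, which is exactly the [Errno 11001] getaddrinfo failure at the bottom of your traceback. A quick check that involves no scraping at all:

import socket

# The misspelled host cannot be resolved -- same gaierror as the traceback.
try:
    socket.getaddrinfo('www.elitepospects.com', 80)
except socket.gaierror as e:
    print('misspelled host failed:', e)

# The correctly spelled host resolves without issue.
print('correct host:', socket.getaddrinfo('www.eliteprospects.com', 443)[0][4])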

from requests import get
from bs4 import BeautifulSoup
from time import time
from time import sleep
from IPython.core.display import clear_output
from warnings import warn
import pandas as pd

leagues = ['whl', 'ohl', 'qmjhl']
leagues_url = [str(i) for i in leagues]
seasons = ['2005-2006', '2006-2007', '2007-2008', '2008-2009', '2009-2010',
           '2010-2011', '2011-2012', '2012-2013', '2013-2014', '2014-2015',
           '2016-2017', '2017-2018']
seasons_url = [str(i) for i in seasons]
pages = [str(i) for i in range(1, 5)]

players = []
games_played = []
goals = []
assists = []
penalty_minutes = []
plus_minus = []

start_time = time()
requests = 0

for league in leagues_url:
    for season in seasons_url:
        for page in pages:
            print(page)
            print('https://www.eliteprospects.com/league/'
                  + str(league)
                  + '/stats/'
                  + str(season)
                  + '?sort=tp&page='
                  + str(page))
            response = get('https://www.eliteprospects.com/league/'
                           + str(league)
                           + '/stats/'
                           + str(season)
                           + '?sort=tp&page='
                           + str(page))
            sleep(30)
            requests += 1
            elapsed_time = time() - start_time
            print('Requests: {}; Frequency: {} requests/s'.format(requests, requests / elapsed_time))
            clear_output(wait=True)

            if response.status_code != 200:
                warn('Request: {}; Status code: {}'.format(requests, response.status_code))
            if requests > 180:
                warn('Number of requests was greater than expected.')
                break

            page_html = BeautifulSoup(response.text, 'html.parser')
            table = page_html.find('div', {'id': 'skater-stats'})
            table_rows = table.find_all('tr')
            for tr in table_rows:
                if tr.find('td', {'style': 'white-space: nowrap;'}) is not None:
                    player = tr.span.a.text
                    players.append(player)
                    gp = tr.find('td', {'class': 'gp'}).text
                    games_played.append(int(gp))
                    g = tr.find('td', {'class': 'g'})
                    goals.append(int(g))
                    a = tr.find('td', {'class': 'a'})
                    assists.append(a)
                    pim = tr.find('td', {'class': 'pim'})
                    penalty_minutes.append(int(pim))
                    pm = tr.find('td', {'class': 'pm'})
                    plus_minus.append(int(pm))

player_stats = pd.DataFrame({'player_name': players,
                             'gp': games_played,
                             'g': goals,
                             'a': assists,
                             'pim': penalty_minutes,
                             'plus_minus': plus_minus})

print(player_stats.info())
print(player_stats.describe())
print(player_stats.head(10))

player_stats.to_csv('CHL_player_stats.csv', index=False)
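As an aside, this class of typo is easier to avoid when the hostname appears exactly once and requests builds the query string for you. A rough sketch (fetch_stats_page is just an illustrative name, not part of your script):

import requests

BASE_URL = 'https://www.eliteprospects.com/league/{league}/stats/{season}'

def fetch_stats_page(league, season, page):
    # requests encodes the query string from params, so the URL
    # template above is the only place the hostname is written.
    url = BASE_URL.format(league=league, season=season)
    return requests.get(url, params={'sort': 'tp', 'page': page})

response = fetch_stats_page('whl', '2005-2006', 1)
print(response.url)          # the fully assembled URL
print(response.status_code)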

Edit:

I think you'll find there's also a problem with finding the player tag. I've also disabled clear_output to make the logging easier to follow in a Jupyter notebook.

from requests import get
from bs4 import BeautifulSoup
from time import time
from time import sleep
#from IPython.core.display import clear_output
from warnings import warn
import pandas as pd
#from selenium import webdriver

leagues = ['whl', 'ohl', 'qmjhl']
leagues_url = [str(i) for i in leagues]
seasons = ['2005-2006', '2006-2007', '2007-2008', '2008-2009', '2009-2010',
           '2010-2011', '2011-2012', '2012-2013', '2013-2014', '2014-2015',
           '2016-2017', '2017-2018']
seasons_url = [str(i) for i in seasons]
pages = [str(i) for i in range(1, 5)]

players = []
games_played = []
goals = []
assists = []
penalty_minutes = []
plus_minus = []

start_time = time()
requests = 0

for league in leagues_url:
    for season in seasons_url:
        for page in pages:
            print(page)
            print('https://www.eliteprospects.com/league/'
                  + str(league)
                  + '/stats/'
                  + str(season)
                  + '?sort=tp&page='
                  + str(page))

            response = get('https://www.eliteprospects.com/league/'
                           + str(league)
                           + '/stats/'
                           + str(season)
                           + '?sort=tp&page='
                           + str(page))
            sleep(3)
            requests += 1
            elapsed_time = time() - start_time
            print('Requests: {}; Frequency: {} requests/s'.format(requests, requests / elapsed_time))
            #clear_output(wait=True)

            if response.status_code != 200:
                warn('Request: {}; Status code: {}'.format(requests, response.status_code))
            if requests > 180:
                warn('Number of requests was greater than expected.')
                break

            page_html = BeautifulSoup(response.text, 'html.parser')
            table = page_html.find('div', {'id': 'skater-stats'})
            table_rows = table.find_all('tr')
            #for td in table.find_all('td', {'class': 'player'}):
            #    print(td.text)
            for tr in table_rows:
                if tr.find('td', {'style': 'white-space: nowrap;'}) is not None:
                    #print(tr.s)
                    try:
                        #player = tr.span.a.text
                        #players.append(player)
                        player = tr.find('td', {'class': 'player'}).text
                        players.append(player)
                        print(player)
                        gp = tr.find('td', {'class': 'gp'}).text
                        games_played.append(str(gp))
                        g = tr.find('td', {'class': 'g'})
                        goals.append(str(g))
                        a = tr.find('td', {'class': 'a'})
                        assists.append(a)
                        pim = tr.find('td', {'class': 'pim'})
                        penalty_minutes.append(str(pim))
                        pm = tr.find('td', {'class': 'pm'})
                        plus_minus.append(str(pm))
                    except:
                        print('X')

player_stats = pd.DataFrame({'player_name': players,
                             'gp': games_played,
                             'g': goals,
                             'a': assists,
                             'pim': penalty_minutes,
                             'plus_minus': plus_minus})

print(player_stats.info())
print(player_stats.describe())
print(player_stats.head(10))

player_stats.to_csv('CHL_player_stats.csv', index=False)
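One remaining caveat: str(g) on a BeautifulSoup Tag gives the tag's HTML (e.g. '<td class="g">12</td>'), not the cell's value, so the g/a/pim/pm columns will still come out wrong in the CSV. Extracting the text with a small guard looks roughly like this (cell_text is a hypothetical helper, not something in the script above):

def cell_text(tr, css_class):
    # Return the stripped text of the <td> with the given class,
    # or None when the cell is missing from the row.
    td = tr.find('td', {'class': css_class})
    return td.text.strip() if td is not None else None

# Inside the row loop, instead of appending Tag objects:
#     goals.append(cell_text(tr, 'g'))
#     assists.append(cell_text(tr, 'a'))
#     penalty_minutes.append(cell_text(tr, 'pim'))
#     plus_minus.append(cell_text(tr, 'pm'))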
