Max retries exceeded with url in Python



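The scraper runs until it hits an external address that is no longer reachable, and then it crashes.

I only want to scan herold.at and extract the email addresses.

I want it to stop scanning external addresses. I tried

r = requests.get('http://github.com', allow_redirects=False), but that does not work.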

import csv
import requests
import re
import time
from bs4 import BeautifulSoup
# Number of pages plus one
allLinks = [];mails=[];
url = 'https://www.herold.at/gelbe-seiten/wien/was_installateur/?page='
for page in range(3):
    time.sleep(5)
    print('---', page, '---')

    response = requests.get(url + str(page), timeout=1.001)
    soup=BeautifulSoup(response.text,'html.parser')
    links = [a.attrs.get('href') for a in soup.select('a[href]') ]
    for i in links:
        #time.sleep(15)
        if(("Kontakt" in i or "Porträt")):
            allLinks.append(i)
allLinks=set(allLinks)
def findMails(soup):
    #time.sleep(15)
    for name in soup.find_all("a", "ellipsis"):
        if(name is not None):
            emailText=name.text
            match=bool(re.match('[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+',emailText))
            if('@' in emailText and match==True):
                emailText=emailText.replace(" ",'').replace('\r','')
                emailText=emailText.replace('\n','').replace('\t','')
                if(len(mails)==0)or(emailText not in mails):
                    print(emailText)
                    mails.append(emailText)
for link in allLinks:
    if(link.startswith("http") or link.startswith("www")):
        r=requests.get(link)
        data=r.text
        soup=BeautifulSoup(data,'html.parser')
        findMails(soup)
    else:
        newurl=url+link
        r=requests.get(newurl)
        data=r.text
        soup=BeautifulSoup(data,'html.parser')
        findMails(soup)
mails=set(mails)
if(len(mails)==0):
    print("NO MAILS FOUND")

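Error:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='ww.gebrueder-ember.at', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000021A24AA7308>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))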

The error is in this line: if(link.startswith("http") or link.startswith("www")): Change "http" to "https" and it should work. I tried it, and it retrieved all of the email addresses.

--- 0 ---
--- 1 ---
--- 2 ---
office@smutny-installationen.at
office@offnerwien.at
office@remes-gmbh.at
wien13@lugar.at
office@rossbacher-at.com
office@weiner-gmbh.at
office@wojtek-installateur.at
office@b-gas.at
office@blasl-gmbh.at
gsht@aon.at
office@ertl-installationen.at
office@jakubek.co.at
office@peham-installateur.at
office@installateur-weber.co.at
office@gebrueder-lamberger.at
office@ar-allround-installationen.at
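
If the goal is only to stay on herold.at, a more explicit option than switching the prefix check is to compare each link's host name before requesting it. The following is a minimal sketch, not the answerer's method: the helper name to_internal_url and its parameters are made up for illustration, and it assumes that relative links should be resolved against the listing page.

```python
from urllib.parse import urljoin, urlparse

# Listing page from the question; used to resolve relative links.
BASE_URL = "https://www.herold.at/gelbe-seiten/wien/was_installateur/"


def to_internal_url(link, base_url=BASE_URL, allowed_domain="herold.at"):
    """Return an absolute URL if `link` stays on `allowed_domain`, else None.

    Hypothetical helper for illustration; it is not part of requests or bs4.
    """
    absolute = urljoin(base_url, link)  # relative links become absolute
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):
        return None  # skips mailto:, tel:, javascript: and similar
    host = (parts.hostname or "").lower()
    if host == allowed_domain or host.endswith("." + allowed_domain):
        return absolute
    return None  # external host: do not request it


if __name__ == "__main__":
    for candidate in ("/gelbe-seiten/wien/", "http://ww.gebrueder-ember.at/"):
        print(candidate, "->", to_internal_url(candidate))
```

In the crawl loop, each link would be passed through this helper and skipped when it returns None. Separately, the condition "Kontakt" in i or "Porträt" in the question is always true, because a non-empty string literal is truthy, so every link on the page is collected; "Kontakt" in i or "Porträt" in i is presumably what was intended.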

In addition, you could try using urllib3 to set up a connection pool.
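
One possible reading of that suggestion, sketched here with requests' own adapter layer (which sits on top of urllib3's connection pool), is a shared Session with a bounded retry policy plus a try/except so that one dead host does not end the whole run. The function names make_session and fetch, the retry counts, and the timeout are assumptions chosen for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=2, backoff=0.5):
    """Build a Session whose adapters reuse pooled connections and retry a little."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,
        status_forcelist=(500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry, pool_connections=10, pool_maxsize=10)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def fetch(session, url, timeout=5):
    """Return the page text, or None if the request fails for any reason."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as exc:
        # ConnectionError ("Max retries exceeded") is a subclass of this.
        print("skipping", url, "->", exc)
        return None
```

With this in place, the loop over allLinks could call fetch(session, link) and only parse the result when it is not None, so an unreachable external site is logged and skipped instead of raising.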
