I keep running into this error when I try to download the Wikipedia data dumps. Is it because I am making too many download requests? I am using 100 threads.
On code 1:
    def multithread_download_files_func(self, download_file):
        filename = download_file[download_file.rfind("/")+1:]
        save_file_w_submission_path = self.ptsf + filename
        if not os.path.exists(save_file_w_submission_path):
            opener = urllib.request.build_opener()
            opener.addheaders = [('User-agent', 'Mozilla/5.0')]
            urllib.request.install_opener(opener)
            response = urllib.request.urlopen(download_file)
            data_content = response.read()
            with open(save_file_w_submission_path, 'wb') as wf:
                wf.write(data_content)
        return filename
and even on code 2:
    request = urllib.request.Request(download_file)
    response = urllib.request.urlopen(request)
    data_content = response.read()
The threading:
    p = ThreadPool(100)
    results = p.map(self.multithread_download_files_func, matching_fnmatch_list)
    for r in results:
        print(r)
The consistent error:
    File "D:\Users\Jonathan\Anaconda3\lib\urllib\request.py", line 650, in http_error_default
      raise HTTPError(req.full_url, code, msg, hdrs, fp)
HTTPError: Service Temporarily Unavailable
The URL:
https://dumps.wikimedia.org/other/pagecounts-raw/
I don't know if anyone has a better solution, but I found some code and adapted it to my needs. It keeps retrying the link until it gives me a result.
    if not os.path.exists(save_file_w_submission_path):
        data_content = None
        try:
            request = urllib.request.Request(download_file)
            response = urllib.request.urlopen(request)
            data_content = response.read()
        except urllib.error.HTTPError:
            retries = 1
            success = False
            while not success:
                try:
                    response = urllib.request.urlopen(download_file)
                    # Read the body here, otherwise data_content stays None
                    # even after a successful retry.
                    data_content = response.read()
                    success = True
                except Exception:
                    wait = retries * 30
                    print('Error! Waiting %s secs and re-trying...\n' % wait)
                    sys.stdout.flush()
                    time.sleep(wait)
                    retries += 1
        if data_content:
            with open(save_file_w_submission_path, 'wb') as wf:
                wf.write(data_content)
                print(filename)
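The same retry logic can be factored into a standalone helper, which is easier to reuse from each worker thread. This is just a sketch under my own naming: `fetch_with_retry`, its parameters, and the injectable `opener` argument are assumptions, not part of the original code. It backs off linearly (30 s, 60 s, ...) on HTTP 503, which is what dumps.wikimedia.org returns when it throttles you, and gives up after a bounded number of attempts instead of looping forever.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, max_retries=5, base_wait=30,
                     opener=urllib.request.urlopen):
    """Fetch `url` and return the response body as bytes.

    On HTTP 503 ("Service Temporarily Unavailable"), wait
    base_wait * attempt seconds and retry, up to max_retries
    attempts. Any other error, or exhausting the retries,
    re-raises the exception. `opener` is injectable so the
    function can be tested without touching the network.
    """
    for attempt in range(1, max_retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as e:
            if e.code != 503 or attempt == max_retries:
                raise
            wait = base_wait * attempt
            print('HTTP 503, waiting %s secs before retry %d...'
                  % (wait, attempt))
            time.sleep(wait)
```

A worker would then just do `data_content = fetch_with_retry(download_file)` and write it out. Bounding the retries matters with 100 threads: an unbounded loop against a throttling server can leave every thread sleeping and retrying indefinitely.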