HTTPError: Service Temporarily Unavailable (multithreaded download of Wikipedia data dumps)



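I keep running into this error whenever I try to download the Wikipedia data dumps. Is it because I am making too many requests to download files? I am using 100 threads.

With code 1: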

import os
import urllib.request

def multithread_download_files_func(self, download_file):
    filename = download_file[download_file.rfind("/")+1:]
    save_file_w_submission_path = self.ptsf + filename
    if not os.path.exists(save_file_w_submission_path):
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        urllib.request.install_opener(opener)
        response = urllib.request.urlopen(download_file)
        data_content = response.read()
        # Write only when the file was actually downloaded in this call
        with open(save_file_w_submission_path, 'wb') as wf:
            wf.write(data_content)
    return filename

and even with code 2:

    request = urllib.request.Request(download_file)
    response = urllib.request.urlopen(request)
    data_content = response.read()

Threading:

p = ThreadPool(100)
results = p.map(self.multithread_download_files_func, matching_fnmatch_list)
for r in results:
    print(r)

The consistent error:

  File "D:\Users\Jonathan\Anaconda3\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
HTTPError: Service Temporarily Unavailable

URL:

https://dumps.wikimedia.org/other/pagecounts-raw/

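I don't know whether anyone has a better solution, but I found some code and adapted it to my needs. It keeps retrying the link, waiting longer after each failure, until it gives me a result.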

if not os.path.exists(save_file_w_submission_path):
    data_content = None
    try:
        request = urllib.request.Request(download_file)
        response = urllib.request.urlopen(request)
        data_content = response.read()
    except urllib.error.HTTPError:
        retries = 1
        success = False
        while not success:
            try:
                response = urllib.request.urlopen(download_file)
                # Read the body here; otherwise data_content stays None and nothing is saved
                data_content = response.read()
                success = True
            except Exception:
                # Wait 30 s longer after each failed attempt
                wait = retries * 30
                print('Error! Waiting %s secs and re-trying...' % wait + '\n')
                sys.stdout.flush()
                time.sleep(wait)
                retries += 1
    if data_content:
        with open(save_file_w_submission_path, 'wb') as wf:
            wf.write(data_content)
        print(filename)
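
The retry loop above can also be pulled into a small helper with a cap on the number of attempts, so a file that never becomes available does not block a thread forever. This is only a rough sketch, not a tested fix for the dump server: the names `fetch_with_retries` and `backoff_delay`, the 30-second base, the 300-second cap, and the choice to retry only on 429 and 503 are all my own assumptions. I also don't know whether the server actually sends a `Retry-After` header; the sketch just uses it if it is present as a whole number of seconds.

```python
import time
import urllib.error
import urllib.request


def backoff_delay(attempt, retry_after=None, base=30, cap=300):
    """Seconds to wait before the next try (hypothetical helper).

    attempt is 1-based. A numeric Retry-After value wins if present;
    otherwise the wait grows linearly (base * attempt), as in the loop above.
    """
    if retry_after is not None and retry_after.strip().isdigit():
        return min(int(retry_after.strip()), cap)
    return min(base * attempt, cap)


def fetch_with_retries(url, max_retries=5,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the response body, retrying on 429/503 (hypothetical helper).

    opener and sleep are parameters only so the function can be
    exercised without touching the network or actually waiting.
    """
    for attempt in range(1, max_retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Give up on other status codes, or once the attempts run out
            if err.code not in (429, 503) or attempt == max_retries:
                raise
            headers = err.headers
            retry_after = headers.get("Retry-After") if headers else None
            sleep(backoff_delay(attempt, retry_after))
```

Inside the download function this would replace the try/except block with `data_content = fetch_with_retries(download_file)`. Since the 503 most likely comes from hitting the server with too many simultaneous requests, it is probably also worth trying a much smaller pool than `ThreadPool(100)`.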
