I tried to multiprocess the URL-fetching step, because otherwise processing the 300k URLs I want to handle takes far too long. Somehow my code stops working after a random amount of time, and I don't know why. Can you help me? I have done some research on this but couldn't find anything that really helped. Usually I can process around 20k links, but then it freezes without any error: no further links are processed, yet the program is still running. Maybe all the workers are clogged with bad links? Is there any way to find out?
urls = list(datafull['SOURCEURL'])
# datafull['SOURCEURL'].apply(html_reader)
with futures.ThreadPoolExecutor(max_workers=50) as executor:
    pages = executor.map(html_reader, urls)
My html_reader script:
def html_reader(url):
    try:
        os.chdir('/Users/benni/PycharmProjects/Untitled Folder/HTML raw')
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
        r = requests.get(url, headers=headers)
        data = r.text
        url = str(url).replace('/', '').replace('http:', '').replace('https:', '')
        name = 'htmlreader_' + url + '.html'
        f = open(name, 'a')
        f.write(str(data))
        f.close()
        print(time.time(), ' ', url)
        return data
    except Exception:
        pass
Thanks a lot!
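One way to test the "clogged with bad links" suspicion (a sketch, not part of the original code: `probe`, the deadline value, and `max_workers=4` are illustrative) is to ask each future for its result with a timeout, so a hung task gets reported instead of silently blocking the iteration:

```python
import time
from concurrent import futures

def probe(fn, items, deadline, max_workers=4):
    """Apply fn to each item, flagging items whose result does not
    arrive within `deadline` seconds of being asked for."""
    done, stuck = [], []
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map each submitted future back to the item it was built from
        future_map = {executor.submit(fn, item): item for item in items}
        for future, item in future_map.items():
            try:
                future.result(timeout=deadline)
                done.append(item)
            except futures.TimeoutError:
                print('no result within deadline:', item)
                stuck.append(item)
    return done, stuck
```

Running a suspect slice of the 300k URLs through something like `probe(html_reader, urls_slice, 30)` would print the links that never come back. Note that `requests.get` blocks indefinitely on an unresponsive server unless you pass a `timeout` argument, which matches the described freeze.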
I have modified and cleaned up some of the code; you can try this. In the main method, you need to pass the number of cores available on your machine minus one as the value of the n_jobs parameter. I use the joblib library for multiprocessing, so you will need to have it installed on your machine.
import os
import requests as rq
from multiprocessing import cpu_count
from joblib import Parallel, delayed

os.chdir('/Users/benni/PycharmProjects/Untitled Folder/HTML raw')
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

def write_data(url, html_text):
    # one file per URL, as in the original html_reader; a single
    # fixed filename would be overwritten on every call
    name = 'htmlreader_' + str(url).replace('/', '').replace('http:', '').replace('https:', '') + '.html'
    with open(name, 'w') as fp:
        fp.write(html_text)

def get_html(url):
    try:
        # a timeout keeps one unresponsive server from stalling a worker forever
        resp = rq.get(url, headers=headers, timeout=10)
    except rq.RequestException as exc:
        print("Request failed for {0}: {1}".format(url, exc))
        return
    if resp.status_code == 200:
        write_data(url, resp.text)
    else:
        print("No data received for: {0}".format(url))

if __name__ == "__main__":
    urls = list(datafull['SOURCEURL'])
    # number of available cores minus one, as described above
    Parallel(n_jobs=cpu_count() - 1)(delayed(get_html)(link) for link in urls)
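If a full run still stalls partway through, processing the list in batches with a progress line between them narrows down roughly where it stops (a sketch; `chunked` and the batch size of 1000 are illustrative, not from the original code):

```python
def chunked(seq, size):
    """Yield consecutive slices of `seq` with at most `size` elements each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

# e.g. hand each batch to Parallel (or executor.map) in turn:
# for i, batch in enumerate(chunked(urls, 1000)):
#     Parallel(n_jobs=cpu_count() - 1)(delayed(get_html)(u) for u in batch)
#     print('finished batch', i)
```

The last batch number printed then brackets the problematic links to within one batch, instead of leaving you guessing inside a single 300k-item run.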