How can I do this (checking request status codes) concurrently or in parallel in Python?



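Here is what I am doing:

  • Read words from a text file, one word per line.
  • Wrap each word in http://www. and .com to build a URL.
  • Fetch the URL with requests.
  • Decide whether the domain is free (based on the status code and on connection errors/other errors).
  • Append the free domains to a text file.
  • Time the whole run.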

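So far I have it working, but it is very slow. The text file contains 350,000 words. How would I do this concurrently or in parallel? Also, which of the two is the better choice for this task?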

Here is my code:

import requests, time
start = time.time()
# Read one word per line.
with open('words1.txt', 'r') as f:
    words = []
    for item in f:
        words.append(item.strip())
# Fetch each URL in turn (sequential, hence slow).
for w in words:
    url = 'http://www.' + w + '.com'
    try:
        header = {'User-Agent': 'Mozilla/5.0'}
        r = requests.get(url, headers=header)
        codes = [200,201,202,203,204,205,206,300,301,302,303,307,308,400,401,402,403,404,405,406,500,501,502,503]
        if r.status_code in codes:
            print(url, ': Known Status Code > Unavailable')
        else:
            print(url, ': Unknown Status Code > Probably Free')
            with open('available.txt', 'a') as myfile:
                myfile.write(url + '\n')
    except requests.exceptions.ConnectionError:
        print(url, ' : Connection Error > Probably Free')
        with open('available.txt', 'a') as myfile:
            myfile.write(url + '\n')
    except requests.exceptions.HTTPError:
        print('http error')
    except requests.exceptions.Timeout:
        print('timeout error')
    except requests.exceptions.TooManyRedirects:
        print('too many redirects')
end = time.time()
print('\n')
print(end - start, 'seconds')
print((end - start) / 60, 'minutes')
print(((end - start) / 60) / 60, 'hours')

Thanks!

EDIT: I got it working. Thanks to Kendas and DeepSpace for the help! Here is a quick test:

100 words - 22 seconds

1000 words - 285 seconds

Not very fast, but much faster than my first attempt.

It seems gevent + socket is the way to go.

If you have any tips for making it better/faster, please let me know.

Here is the code:

import gevent, time
from gevent import socket
start = time.time()
words = []
with open('words1000.txt', 'r') as f:
    for item in f:
        words.append(item.strip())
urls = ['www.{}.com'.format(w) for w in words]
# One greenlet per DNS lookup; gevent's socket module yields while waiting.
jobs = [gevent.spawn(socket.gethostbyname, url) for url in urls]
gevent.joinall(jobs)
# job.value stays None when the lookup raised, i.e. the name did not resolve.
freeDomains = [url for (url, job) in zip(urls, jobs) if job.value is None]
# Open the output file once instead of once per domain.
with open('availableds.txt', 'a') as myFile:
    for url in freeDomains:
        myFile.write(url + '\n')
print(freeDomains)
end = time.time()
print(end - start, 'seconds')
print((end - start) / 60, 'minutes')
print((end - start) / 3600, 'hours')

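grequests (a concurrent version of requests) makes this very simple. It also helps to build the URLs with .format and to define header once instead of redefining it on every iteration.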

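import grequests
def exception_handler(request, exception):
    # Called for every request that raised instead of returning a response.
    print(exception)
with open('words1.txt', 'r') as f:
    words = []
    for item in f:
        words.append(item.strip())
urls = ['http://www.{}.com'.format(w) for w in words]
header = {'User-Agent': 'Mozilla/5.0'}  # defined once and passed to every request
reqs = [grequests.get(url, headers=header) for url in urls]
responses = grequests.map(reqs, exception_handler=exception_handler)
for resp in responses:
    # Failed requests come back as None. Compare explicitly, because a
    # response with a 4xx/5xx status is falsy.
    if resp is not None:
        print(resp.status_code)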
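
If installing gevent or grequests is not an option, the same DNS-lookup idea can be sketched with only the standard library, using a thread pool. This is a rough sketch, not a tuned implementation: the function names resolves and find_unresolved, the lookup parameter, and the max_workers value of 50 are assumptions made for illustration. Keep in mind that a failed lookup only suggests a name may be unregistered; it does not prove the domain is free.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolves(host, lookup=socket.gethostbyname):
    """Return True if `lookup` resolves `host`, False if it raises."""
    try:
        lookup(host)
        return True
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return False


def find_unresolved(words, lookup=socket.gethostbyname, max_workers=50):
    """Return the www.<word>.com hosts that did not resolve, in input order.

    `lookup` is injectable (hypothetical parameter) so the function can be
    exercised without touching the network.
    """
    hosts = ['www.{}.com'.format(w) for w in words]
    # DNS lookups are I/O-bound, so threads overlap the waiting time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flags = list(pool.map(lambda h: resolves(h, lookup), hosts))
    return [h for h, ok in zip(hosts, flags) if not ok]
```

Calling find_unresolved(words) with the list read from words1.txt returns the candidate hosts, which can then be written to the output file in one pass. The pool size caps how many lookups are in flight at once, which matters with 350,000 words.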
