Optimizing web scraping and requests

How should I optimize the time my requests take?

link = ['http://youtube.com/watch?v=JfLt7ia_mLg',
        'http://youtube.com/watch?v=RiYRxPWQnbE',
        'http://youtube.com/watch?v=tC7pBOPgqic',
        'http://youtube.com/watch?v=3EXl9xl8yOk',
        'http://youtube.com/watch?v=3vb1yIBXjlM',
        'http://youtube.com/watch?v=8UBY0N9fWtk',
        'http://youtube.com/watch?v=uRPf9uDplD8',
        'http://youtube.com/watch?v=Coattwt5iyg',
        'http://youtube.com/watch?v=WaprDDYFpjE',
        'http://youtube.com/watch?v=Pm5B-iRlZfI',
        'http://youtube.com/watch?v=op3hW7tSYCE',
        'http://youtube.com/watch?v=ogYN9bbU8bs',
        'http://youtube.com/watch?v=ObF8Wz4X4Jg',
        'http://youtube.com/watch?v=x1el0wiePt4',
        'http://youtube.com/watch?v=kkeMYeAIcXg',
        'http://youtube.com/watch?v=zUdfNvqmTOY',
        'http://youtube.com/watch?v=0ONtIsEaTGE',
        'http://youtube.com/watch?v=7QedW6FcHgQ',
        'http://youtube.com/watch?v=Sb33c9e1XbY']

I have a list of 15-20 links from the first page of YouTube search results. The task is to fetch the likes, dislikes, and view count from each video URL. Here is what I did:

import threading
import time

import bs4
import requests

def parse(url, i, arr):
    req = requests.get(url)
    soup = bs4.BeautifulSoup(req.text, "lxml")  # 'html5lib' also works
    try:
        likes = int(soup.find("button", attrs={"title": "I like this"}).getText().replace(",", ""))
    except (AttributeError, ValueError):
        likes = 0
    try:
        dislikes = int(soup.find("button", attrs={"title": "I dislike this"}).getText().replace(",", ""))
    except (AttributeError, ValueError):
        dislikes = 0
    try:
        view = int(soup.find("div", attrs={"class": "watch-view-count"}).getText().split()[0].replace(",", ""))
    except (AttributeError, ValueError):
        view = 0
    arr[i] = (likes, dislikes, view, url)
    time.sleep(0.3)

def parse_list(link):
    arr = len(link) * [0]
    threadarr = len(link) * [0]
    a = time.perf_counter()  # time.clock() was removed in Python 3.8
    for i in range(len(link)):
        threadarr[i] = threading.Thread(target=parse, args=(link[i], i, arr))
        threadarr[i].start()
    for i in range(len(link)):
        threadarr[i].join()
    print(time.perf_counter() - a)
    return arr

arr = parse_list(link)

Right now the result array is populated in about 6 seconds. Is there a faster way to get my array (arr), so that it takes less than 6 seconds?

The first 4 elements of my array look like this, to give you a rough idea:

[(105, 11, 2836, 'http://youtube.com/watch?v=JfLt7ia_mLg'),
(32, 18, 5420, 'http://youtube.com/watch?v=RiYRxPWQnbE'), 
(45, 3, 7988, 'http://youtube.com/watch?v=tC7pBOPgqic'),
(106, 38, 4968, 'http://youtube.com/watch?v=3EXl9xl8yOk')]
Thanks in advance :)

I would use a multiprocessing Pool object in this specific case.

import requests
import bs4
from multiprocessing import Pool, cpu_count

links = [
    'http://youtube.com/watch?v=JfLt7ia_mLg',
    'http://youtube.com/watch?v=RiYRxPWQnbE',
    'http://youtube.com/watch?v=tC7pBOPgqic',
    'http://youtube.com/watch?v=3EXl9xl8yOk'
]

def parse_url(url):
    req = requests.get(url)
    soup = bs4.BeautifulSoup(req.text, "lxml")  # 'html5lib' also works
    try:
        likes = int(soup.find("button", attrs={"title": "I like this"}).getText().replace(",", ""))
    except (AttributeError, ValueError):
        likes = 0
    try:
        dislikes = int(soup.find("button", attrs={"title": "I dislike this"}).getText().replace(",", ""))
    except (AttributeError, ValueError):
        dislikes = 0
    try:
        view = int(soup.find("div", attrs={"class": "watch-view-count"}).getText().split()[0].replace(",", ""))
    except (AttributeError, ValueError):
        view = 0
    return (likes, dislikes, view, url)

pool = Pool(cpu_count())   # one worker process per CPU core
data = pool.map(parse_url, links)   # this is where your results are

This is cleaner because you only have one function to write, and you end up with exactly the same result.
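Since the bottleneck here is network I/O rather than CPU work, a thread pool from `concurrent.futures` is a lighter-weight alternative to a process pool. This is a minimal sketch of the pattern, using a hypothetical stand-in `fetch` function in place of the real `requests`/`BeautifulSoup` parsing:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for the real requests.get() + BeautifulSoup parsing;
    # in practice this would return (likes, dislikes, views, url).
    return (0, 0, 0, url)

urls = [
    'http://youtube.com/watch?v=JfLt7ia_mLg',
    'http://youtube.com/watch?v=RiYRxPWQnbE',
]

# Threads avoid the process-spawn and pickling overhead of multiprocessing,
# which matters for short, I/O-bound jobs like these HTTP requests.
with ThreadPoolExecutor(max_workers=20) as executor:
    data = list(executor.map(fetch, urls))  # results keep the input order

print(data)
```

Like `Pool.map`, `executor.map` returns results in the same order as the input, so no index bookkeeping is needed.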

This is not a complete solution, but it can keep your script free of the try/except blocks, which certainly slow the operation down to some extent.

import requests
from bs4 import BeautifulSoup

for url in links:
    response = requests.get(url).text
    soup = BeautifulSoup(response, "html.parser")
    for item in soup.select("div#watch-header"):
        view = item.select("div.watch-view-count")[0].text
        likes = item.select("button[title~='like'] span.yt-uix-button-content")[0].text
        dislikes = item.select("button[title~='dislike'] span.yt-uix-button-content")[0].text
        print(view, likes, dislikes)
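The selector version above prints raw strings such as "2,836 views"; if integers are needed, as in the original tuples, a small helper can normalize them (the `to_int` name is illustrative, not part of the answer above):

```python
def to_int(text):
    # "2,836 views" -> 2836; falls back to 0 when no number is present,
    # mirroring the defaults in the original try/except version
    try:
        return int(text.split()[0].replace(",", ""))
    except (ValueError, IndexError):
        return 0

print(to_int("2,836 views"))  # → 2836
print(to_int("105"))          # → 105
```

This keeps the error handling in one place instead of three separate try/except blocks.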
