Python: scraping YouTube titles from URLs with html.render is too slow

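Hi, I have an Excel file with a list of YouTube URLs and I am trying to get their titles. Each list holds 1000 URLs, and there are 3 Excel files. I tried using Python, but it is too slow because I have to pass a sleep value to html.render. The code is below: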

import xlrd
import time
from bs4 import BeautifulSoup
import requests
from xlutils.copy import copy
from requests_html import HTMLSession

loc = ("testt.xls")
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
wb2 = copy(wb)
sheet.cell_value(0, 0)
for i in range(3, sheet.nrows):
    ytlink = (sheet.cell_value(i, 0))
    session = HTMLSession()
    response = session.get(ytlink)
    response.html.render(sleep=3)
    print(sheet.cell_value(i, 0))
    print(ytlink)
    element = BeautifulSoup(response.html.html, "lxml")
    media = element.select_one('#container > h1').text
    print(media)
    s2 = wb2.get_sheet(0)
    s2.write(i, 0, media)
    wb2.save("testt.xls")

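Is there any way to make this faster? I tried Selenium, but I think it was even slower. With html.render it seems I need the "sleep" timer, otherwise I get an error. I tried lower sleep values, but after a while it fails with an error. Any help is appreciated, please :(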

PS: the print calls are only there to check the output; they are not needed otherwise.

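With your current approach (or with Selenium), you are rendering the actual web page, which you do not need to do. I suggest using a Python library that handles this for you. Here is an example with YoutubeDL: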

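from youtube_dl import YoutubeDL  # assumed import; the original snippet omitted it

with YoutubeDL() as ydl:
    # download=False fetches only the metadata, not the video itself
    title = ydl.extract_info("https://www.youtube.com/watch?v=jNQXAC9IVRw", download=False).get("title", None)
    print(title)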

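Note that with the rate limits YouTube imposes, handling 1000 requests this way will still be slow. If you expect to make thousands of requests like this in the future, I recommend getting an API key.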
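For the API-key route, a minimal sketch might look like the following. It assumes the YouTube Data API v3 `videos.list` endpoint, which accepts up to 50 comma-separated video IDs per call, so 1000 URLs need roughly 20 requests. The helper names (`video_id`, `batches`, `request_url`, `titles_from_payload`, `fetch_titles`) are hypothetical, and `api_key` stands for a key you would have to create yourself.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse
from urllib.request import urlopen

API_ENDPOINT = "https://www.googleapis.com/youtube/v3/videos"
MAX_IDS_PER_CALL = 50  # videos.list accepts up to 50 comma-separated IDs


def video_id(url):
    """Return the video ID from a watch?v=... or youtu.be/... URL, or None."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/") or None
    return parse_qs(parsed.query).get("v", [None])[0]


def batches(items, size=MAX_IDS_PER_CALL):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def request_url(ids, api_key):
    """Build one videos.list URL asking for the snippet of each ID."""
    query = urlencode({"part": "snippet", "id": ",".join(ids), "key": api_key})
    return f"{API_ENDPOINT}?{query}"


def titles_from_payload(payload):
    """Map video ID to title for every item in a videos.list response."""
    return {item["id"]: item["snippet"]["title"] for item in payload.get("items", [])}


def fetch_titles(urls, api_key):
    """Fetch titles for all URLs, one HTTP request per batch of IDs."""
    ids = [vid for vid in map(video_id, urls) if vid]
    titles = {}
    for chunk in batches(ids):
        with urlopen(request_url(chunk, api_key)) as response:
            titles.update(titles_from_payload(json.load(response)))
    return titles
```

Calling `fetch_titles(urls, api_key)` would return a dict keyed by video ID; IDs the API does not return (deleted or private videos, for example) are simply absent from the result.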

With the asynchronous session from requests-html, you can complete 1000 requests in under a minute, like this:

import random
from time import perf_counter
from requests_html import AsyncHTMLSession

urls = ['https://www.youtube.com/watch?v=z9eoubnO-pE'] * 1000
asession = AsyncHTMLSession()
start = perf_counter()

async def fetch(url):
    r = await asession.get(url, cookies={'CONSENT': 'YES+cb.20210328-17-p0.en-GB+FX+{}'.format(random.randint(100, 999))})
    return r

all_responses = asession.run(*[lambda url=url: fetch(url) for url in urls])
all_titles = [r.html.find('title', first=True).text for r in all_responses]
print(all_titles)
print(perf_counter() - start)

This finished in 55 seconds on my laptop.

Note that you need to pass cookies={'CONSENT': 'YES+cb.20210328-17-p0.en-GB+FX+{}'.format(random.randint(100, 999))} to the request to avoid this issue.
