How to parse the transcript of a TED talk



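I am unable to parse the talk transcript from https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript

The response to my request does not contain the span class where the text actually lives. What is wrong?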

import requests
url = 'https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript'
page = requests.get(url)
print(page.content)

Is there a way to get at the transcript? Thank you! All I get is an "attribute not found" error.

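This is because the data is not loaded from the link you are using; it is loaded by a separate call to their GraphQL instance.

Using curl, you can fetch the data like this: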

curl 'https://www.ted.com/graphql?operationName=Transcript&variables=%7B%22id%22%3A%22alexis_nikole_nelson_a_flavorful_field_guide_to_foraging%22%2C%22language%22%3A%22en%22%7D&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%2218f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6%22%7D%7D' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate, br' -H 'Referer: https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript' -H 'content-type: application/json' -H 'client-id: Zenith production' -H 'x-operation-name: Transcript' --output - | gzip -d

Note that the URL is URL-encoded. In Python you can use from urllib.parse import quote and then call quote() to URL-encode a string.
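As a rough sketch of that encoding step, the query string can be assembled from plain Python dicts instead of being pasted in pre-encoded. The helper name build_transcript_url and the variable names are made up for illustration; the operation name, the variables and the sha256Hash are copied from the curl command above, and the hash is assumed to stay valid only as long as TED does not change the persisted query.

```python
import json
from urllib.parse import quote

BASE_URL = "https://www.ted.com/graphql"

# Hash copied verbatim from the curl command; assumed to identify the
# persisted "Transcript" query on the server side.
TRANSCRIPT_HASH = "18f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6"


def build_transcript_url(talk_id, language="en"):
    """Hypothetical helper: rebuild the URL-encoded GraphQL request URL."""
    variables = {"id": talk_id, "language": language}
    extensions = {"persistedQuery": {"version": 1, "sha256Hash": TRANSCRIPT_HASH}}

    def encode(obj):
        # Compact separators so the JSON has no spaces, then percent-encode it.
        return quote(json.dumps(obj, separators=(",", ":")), safe="")

    return (
        f"{BASE_URL}?operationName=Transcript"
        f"&variables={encode(variables)}"
        f"&extensions={encode(extensions)}"
    )


url = build_transcript_url("alexis_nikole_nelson_a_flavorful_field_guide_to_foraging")
print(url)
```

Building the URL this way makes it easier to swap in a different talk slug or language code, although whether other values are accepted is something to check against the live site.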

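So all that is left is to translate the curl command above into Python. There is no magic involved; you only need to set the right headers. If you are lazy, you can also use this online converter to turn a curl command into Python code.

This yields: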

import requests
from requests.structures import CaseInsensitiveDict
url = "https://www.ted.com/graphql?operationName=Transcript&variables=%7B%22id%22%3A%22alexis_nikole_nelson_a_flavorful_field_guide_to_foraging%22%2C%22language%22%3A%22en%22%7D&extensions=%7B%22persistedQuery%22%3A%7B%22version%22%3A1%2C%22sha256Hash%22%3A%2218f8e983b84c734317ae9388c83a13bc98702921b141c2124b3ce4aeb6c48ef6%22%7D%7D"
headers = CaseInsensitiveDict()
headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
headers["Accept"] = "*/*"
headers["Accept-Language"] = "en-US,en;q=0.5"
headers["Accept-Encoding"] = "gzip, deflate, br"
headers["Referer"] = "https://www.ted.com/talks/alexis_nikole_nelson_a_flavorful_field_guide_to_foraging/transcript"
headers["content-type"] = "application/json"
headers["client-id"] = "Zenith production"
headers["x-operation-name"] = "Transcript"
resp = requests.get(url, headers=headers)
print(resp.content)
Output:

b'{"data":{"translation":{"id":"209255","language" ...
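The body is JSON delivered as bytes, so it can be decoded with json.loads (or resp.json()) and navigated like a normal dict. The sketch below is only an illustration: extract_translation is a hypothetical helper, and sample is a cut-down stand-in containing just the id field visible in the truncated output above. The rest of the structure is not shown here, so inspect the keys of the real response before relying on any of them.

```python
import json


def extract_translation(raw):
    """Hypothetical helper: decode the response body and return data.translation."""
    payload = json.loads(raw)  # json.loads accepts bytes as well as str
    return payload["data"]["translation"]


# Stand-in for resp.content, reduced to the one field visible in the output above.
sample = b'{"data":{"translation":{"id":"209255"}}}'

translation = extract_translation(sample)
# Listing the keys is a quick way to discover what the real payload contains.
print(sorted(translation.keys()))
```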
