Extracting values after each "&": value="&videoId=139209&videoUrl=https://mp5.website.net

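I have a number of different links, each with different content, that I am trying to scrape data from.

I have gotten part of the way, but now I am stuck and looking for help to understand Beautiful Soup better.

The documentation did not help me much with this particular problem, and no Google search turned up anything useful either.

My script looks like this: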

r = requests.get(link)
raw = r.text
soup = BeautifulSoup(raw, features="html.parser")
inputTag = soup.find("input", {"id": "videoId"})
output = inputTag["value", "videoUrl"]
print(output)

What I can't seem to figure out is how to get specific values out of the long string (the parts after each "&"), for example:

<input type="text" style="display: none" id="videoId" value="&videoId=139209&videoUrl=https://mp5.website.net/storage1/M03/10/92/aPODC10sfP-AcFDnAGhUgdKc7iA667.mp4&videoImg=https://mp5.website.net/storage1/M03/10/97/aPODCl0sfP-ACNFjAABmn9NL64Q064.png&videoIntroduction=[{"content":"Everything in the world is a matrix","type":1,"userId":""}]userNickName=Califax'>

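If I leave it as output = inputTag["value"], I get the whole value, but how to parse out, for example, videoId= and videoUrl= has me confused.

I hope someone can point me in the right direction.

Edit: the JSON part.

Using your suggested code, I now get this error: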

Traceback (most recent call last):
  File "/run/media/anonymous/06bcf743-8b4d-409f-addc-520fc4e19299/PycharmProjects/learningcurve/video_moments.py", line 34, in <module>
    videoIntroduction = json.loads(output['videoIntroduction'][0])
  File "/usr/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.7/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 85 (char 84)

You can use urllib (format fix by @facelessuser):

import urllib.parse
import json
value = '&videoId=139209&videoUrl=https://mp5.website.net/storage1/M03/10/92/aPODC10sfP-AcFDnAGhUgdKc7iA667.mp4&videoImg=https://mp5.website.net/storage1/M03/10/97/aPODCl0sfP-ACNFjAABmn9NL64Q064.png&videoIntroduction=[{"content":"Everything in the world is a matrix","type":1,"userId":""}]userNickName=Califax'

Since this value is malformed, a basic fix can be applied first, like this:

fixed_value = value.replace(']user', ']&user')
output = urllib.parse.parse_qs(fixed_value)

which yields a dictionary:

{'videoId': ['139209'], 'videoUrl': ['https://mp5.website.net/storage1/M03/10/92/aPODC10sfP-AcFDnAGhUgdKc7iA667.mp4'], 'videoImg': ['https://mp5.website.net/storage1/M03/10/97/aPODCl0sfP-ACNFjAABmn9NL64Q064.png'], 'videoIntroduction': ['[{"content":"Everything in the world is a matrix","type":1,"userId":""}]'], 'userNickName': ['Califax']}
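
Every value in that dictionary is wrapped in a list because `parse_qs` allows a key to appear more than once. If each key is assumed to occur only once, as in this sample, `urllib.parse.parse_qsl` combined with `dict` gives a flat mapping instead. A minimal sketch, using a shortened, hypothetical value string rather than the real attribute:

```python
import urllib.parse

# Shortened stand-in for the attribute value (hypothetical sample data).
value = '&videoId=139209&videoUrl=https://mp5.website.net/a.mp4&userNickName=Califax'

# parse_qsl returns (key, value) pairs; dict() keeps the last value per key.
flat = dict(urllib.parse.parse_qsl(value))

print(flat['videoId'])
print(flat['videoUrl'])
```

With this form the values are plain strings, so no `[0]` index is needed. The trade-off is that repeated keys would silently collapse to the last one.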

So in your case, with the same fix applied to the attribute value, for example:

output = urllib.parse.parse_qs(inputTag["value"].replace(']user', ']&user'))

You can access the elements with dictionary and list indexing:

print(output['videoIntroduction'][0])
[{"content":"Everything in the world is a matrix","type":1,"userId":""}]

This is a JSON string, so decode it with json.loads (the result is a list containing one dictionary):

videoIntroduction = json.loads(output['videoIntroduction'][0])
print(videoIntroduction[0]["content"])
print(videoIntroduction[0]["type"])

which prints:

Everything in the world is a matrix
1
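
The `Extra data` error shown in the question's edit is what `json.loads` raises when valid JSON is followed by more characters, here the `userNickName=Califax` text that stays glued to the JSON when the `&` is missing. As an alternative to patching the string with `replace`, `json.JSONDecoder().raw_decode` parses the leading JSON document and reports where it stopped. A minimal sketch, assuming the unfixed string looks like the sample in the question:

```python
import json

# The videoIntroduction value as it comes out when the '&' is missing.
raw = '[{"content":"Everything in the world is a matrix","type":1,"userId":""}]userNickName=Califax'

# raw_decode returns the decoded object and the index where the JSON ended.
intro, end = json.JSONDecoder().raw_decode(raw)
leftover = raw[end:]

print(intro[0]["content"])
print(leftover)
```

Whatever follows the JSON is left in `leftover` and can be parsed separately if it is needed.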

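The posted tag seems to be slightly malformed, so I had to fix it to get it to parse, but with that said, I'll explain. The value appears to open with " but then close with '. I am also assuming that an & is missing before userNickName=Califax. I could be wrong, but the basis of the answer should still be relevant.

In your example, you find the input and assign it to inputTag. inputTag is an input element. When you use the notation inputTag['key'], it looks up the HTML attribute named key. In your case, you want to access value. The content of value is one very large string in which the key/value pairs are separated by &. BeautifulSoup does not know how arbitrary data is stored; it just returns the value of the requested attribute, which in your case is one very large string. We have to parse this data ourselves, because BeautifulSoup does not know how.

In this case, we can simply strip the first & and then split on &. Then we can split each returned item at its first =. This leaves us with a structure like [(key1, value1), (key2, value2), ...]. That is ideal for creating a dictionary, as it is exactly the format dict needs. So we can call dict and pass it our structure.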
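
The steps just described can be sketched one at a time to show the intermediate structures. The sample string below is a shortened, hypothetical stand-in for the real attribute value:

```python
# Shortened stand-in for inputTag["value"] (hypothetical sample data).
output = '&videoId=139209&videoUrl=https://example.invalid/v.mp4?x=1'

# 1. Drop the leading '&', then split into 'key=value' pieces.
pieces = output.lstrip('&').split('&')

# 2. Split each piece at the first '=' only, so any '=' inside a value survives.
pairs = [piece.split('=', 1) for piece in pieces]

# 3. dict() accepts any iterable of two-item sequences.
values = dict(pairs)

print(pieces)
print(pairs)
print(values['videoUrl'])
```

Note that `dict` raises a `ValueError` if any piece contains no `=` at all, so input that may contain such pieces would need to be filtered first.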

After that, we have a dictionary whose keys are the keys found in the HTML attribute value. We can simply access the key we want:

from bs4 import BeautifulSoup
html = """
<input type="text" style="display: none" id="videoId" value='&videoId=139209&videoUrl=https://mp5.website.net/storage1/M03/10/92/aPODC10sfP-AcFDnAGhUgdKc7iA667.mp4&videoImg=https://mp5.website.net/storage1/M03/10/97/aPODCl0sfP-ACNFjAABmn9NL64Q064.png&videoIntroduction=[{"content":"Everything in the world is a matrix","type":1,"userId":""}]&userNickName=Califax'>
"""
soup = BeautifulSoup(html, features="html.parser")
inputTag = soup.find("input", {"id": "videoId"})
output = inputTag["value"]
values = dict([x.split('=', 1) for x in output.lstrip('&').split('&')])
print('=== Values ===')
print(values)
print('=== Wanted videoUrl ===')
print(values['videoUrl'])

Output:

=== Values ===
{'videoId': '139209', 'videoUrl': 'https://mp5.website.net/storage1/M03/10/92/aPODC10sfP-AcFDnAGhUgdKc7iA667.mp4', 'videoImg': 'https://mp5.website.net/storage1/M03/10/97/aPODCl0sfP-ACNFjAABmn9NL64Q064.png', 'videoIntroduction': '[{"content":"Everything in the world is a matrix","type":1,"userId":""}]', 'userNickName': 'Califax'}
=== Wanted videoUrl ===
https://mp5.website.net/storage1/M03/10/92/aPODC10sfP-AcFDnAGhUgdKc7iA667.mp4
