How to find bs4 XML attribute values with BeautifulSoup by matching attribute names against a regular expression at any XML depth

I have the following bs4 element:

from bs4 import BeautifulSoup
html_doc = """
<l2 attribute2="Output"><s3><Cell cell_value2="384.01"/></s3></l2>, 
<l1><s3 attribute1="Cost"><s4><Cell cell_value1="2314.37"/></s4></s3></l1>
"""
soup = BeautifulSoup(html_doc, "html.parser")

I want to extract all the attribute values, like this:

["Output", "Cost"]

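My question is: how can I achieve this with the regular expression re.compile(r'^attribute[0-9]$'), given that attribute* may sit on the first tag (e.g. l1 or l2) or "deeper" (e.g. on s3 or at any other depth)?

I can do this when the attributes all have the same name, or when they have different names but sit at the same depth level, but not when both the name and the depth vary.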

import re
from bs4 import BeautifulSoup
html_doc = """
<l2 attribute2="Output"><s3><Cell cell_value2="384.01"/></s3></l2>, 
<l1><s3 attribute1="Cost"><s4><Cell cell_value1="2314.37"/></s4></s3></l1>
"""
soup = BeautifulSoup(html_doc, "html.parser")
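# Match attribute names such as attribute1, attribute2, ...
r = re.compile(r"^attribute\d+")
out = []
# Select every tag, at any depth, that has at least one matching attribute name
for tag in soup.find_all(lambda tag: any(r.search(a) for a in tag.attrs)):
    for attr, value in tag.attrs.items():
        if r.search(attr):
            out.append(value)
print(out)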

Prints:

['Output', 'Cost']
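
If the name has to match the asker's stricter pattern exactly (one digit and nothing after it), the same idea can be wrapped in a small helper that uses re.fullmatch instead of a prefix search. This is only a sketch: the function name attribute_values is hypothetical, and it assumes html.parser is acceptable for this markup, as in the snippets above.

```python
import re
from bs4 import BeautifulSoup


def attribute_values(markup, pattern):
    """Collect values of attributes whose names fully match `pattern`.

    Hypothetical helper for illustration; not part of bs4.
    """
    soup = BeautifulSoup(markup, "html.parser")
    name_re = re.compile(pattern)
    values = []
    # find_all(True) yields every tag at every depth, in document order
    for tag in soup.find_all(True):
        for name, value in tag.attrs.items():
            # fullmatch requires the whole attribute name to match,
            # so the ^ and $ anchors are not needed in the pattern
            if name_re.fullmatch(name):
                values.append(value)
    return values


html_doc = """
<l2 attribute2="Output"><s3><Cell cell_value2="384.01"/></s3></l2>,
<l1><s3 attribute1="Cost"><s4><Cell cell_value1="2314.37"/></s4></s3></l1>
"""

print(attribute_values(html_doc, r"attribute[0-9]"))
```

With fullmatch, a name such as attribute10 would be rejected by attribute[0-9], whereas the prefix search with \d+ above would accept it. Which behavior is wanted depends on the data.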
