I have a dataset of bug reports in XML format, from a repository outside my control.
<?xml version="1.0"?>
<short_desc>
<report id="756550">
<update> #first update
<when>1337336250</when>
<what>Alias is resolved to a bug number for a private bug</what>
</update>
<update> #latest update
<when>1344175272</when>
<what>Do not link a bug alias with its bug ID for bugs you cannot see</what>
</update>
</report>
</short_desc>
I would like to get the following output in a CSV file:
id description
756550 Alias is resolved to a bug number for a private bug
756550 Do not link a bug alias with its bug ID for bugs you cannot see
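I have tried using ElementTree in Python, but I can only retrieve the contents of the tags, without the corresponding report ID.
Can anyone help me solve this? Thanks.
You can try the code below.
It parses the XML with Python's xml package and finds the nodes of interest with XPath expressions.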
import xml.etree.ElementTree as ET
data = """<?xml version="1.0"?>
<short_desc>
<report id="756550">
<update> #first update
<when>1337336250</when>
<what>Alias is resolved to a bug number for a private bug</what>
</update>
<update> #latest update
<when>1344175272</when>
<what>Do not link a bug alias with its bug ID for bugs you cannot see</what>
</update>
</report>
</short_desc>"""
root = ET.fromstring(data)
# Walk each <report>, read its id attribute, then print every <what> under it.
for report in root.findall(".//report"):
    report_id = report.get("id")
    for what in report.findall(".//what"):
        print(report_id, what.text)
You can get the id by accessing the element's attribute by key, for example report['id'].
from bs4 import BeautifulSoup
from io import BytesIO
data = b'''
<?xml version="1.0"?>
<short_desc>
<report id="756550">
<update> #first update
<when>1337336250</when>
<what>Alias is resolved to a bug number for a private bug</what>
</update>
<update> #latest update
<when>1344175272</when>
<what>Do not link a bug alias with its bug ID for bugs you cannot see</what>
</update>
</report>
</short_desc>
'''
print("id description")
f = BytesIO(data)
soup = BeautifulSoup(f, 'html.parser')
for report in soup.select('report'):
    id_ = report['id']
    for what in report.select('what'):
        print(f"{id_:<7} {what.text}")
Build a list of tuples and do whatever you like with it.
import xml.etree.ElementTree as ET
data = """<?xml version="1.0"?>
<short_desc>
<report id="756550">
<update> #first update
<when>1337336250</when>
<what>Alias is resolved to a bug number for a private bug</what>
</update>
<update> #latest update
<when>1344175272</when>
<what>Do not link a bug alias with its bug ID for bugs you cannot see</what>
</update>
</report>
</short_desc>"""
root = ET.fromstring(data)
data = [(e.find('when').text,e.find('what').text) for e in root.findall('.//update')]
print(data)
Output:
[('1337336250', 'Alias is resolved to a bug number for a private bug'), ('1344175272', 'Do not link a bug alias with its bug ID for bugs you cannot see')]
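To get from a list of tuples to the CSV file asked for in the question, something like the following might work. It is only a sketch: the helper names report_rows and rows_to_csv are made up for illustration, and it assumes each <what> should be paired with the id attribute of its enclosing <report>.

```python
import csv
import io
import xml.etree.ElementTree as ET

data = """<?xml version="1.0"?>
<short_desc>
<report id="756550">
<update>
<when>1337336250</when>
<what>Alias is resolved to a bug number for a private bug</what>
</update>
<update>
<when>1344175272</when>
<what>Do not link a bug alias with its bug ID for bugs you cannot see</what>
</update>
</report>
</short_desc>"""


def report_rows(xml_text):
    """Return (report id, description) pairs, one per <what> element.

    Hypothetical helper: assumes the id lives on the enclosing <report>.
    """
    root = ET.fromstring(xml_text)
    return [
        (report.get("id"), what.text)
        for report in root.iter("report")
        for what in report.iter("what")
    ]


def rows_to_csv(rows):
    """Serialize the rows to CSV text with an 'id,description' header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "description"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = report_rows(data)
print(rows_to_csv(rows), end="")
```

To write an actual file instead of a string, pass a file object opened with newline="" to csv.writer, for example open("bugs.csv", "w", newline=""), where bugs.csv is just a placeholder name.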