Parsing an XML bug report dataset into CSV with Python when tags are repeated



I have a bug report dataset in XML format that comes from a repository outside my control.

<?xml version="1.0"?>
<short_desc>
<report id="756550">
<update> #first update
<when>1337336250</when>
<what>Alias is resolved to a bug number for a private bug</what>
</update>
<update> #latest update
<when>1344175272</when>
<what>Do not link a bug alias with its bug ID for bugs you cannot see</what>
</update>
</report>
</short_desc>

I would like to get the following output in a CSV file:

id       description
756550   Alias is resolved to a bug number for a private bug
756550   Do not link a bug alias with its bug ID for bugs you cannot see

I have tried ElementTree in Python, but I can only retrieve the contents of the tags, without the corresponding report id.

Can anyone help me with this? Thanks.

You can try the code below.

It parses the XML with Python's xml package and finds the nodes of interest with XPath expressions.

import xml.etree.ElementTree as ET

data = """<?xml version="1.0"?>
<short_desc>
<report id="756550">
<update> #first update
<when>1337336250</when>
<what>Alias is resolved to a bug number for a private bug</what>
</update>
<update> #latest update
<when>1344175272</when>
<what>Do not link a bug alias with its bug ID for bugs you cannot see</what>
</update>
</report>
</short_desc>"""

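root = ET.fromstring(data)
# Walk each <report> so that every <what> stays paired with its report id
# (zipping <when> with <what> would print timestamps, not the id).
for report in root.findall(".//report"):
    report_id = report.get("id")
    for what in report.findall(".//what"):
        print(report_id, what.text)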
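Since the question asks for CSV output, here is a minimal sketch that pairs each description with its report id and writes the rows with the standard csv module. The function name reports_to_csv and the variable names are my own, and it writes to an in-memory buffer only so the result is easy to inspect.

```python
import csv
import io
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0"?>
<short_desc>
<report id="756550">
<update>
<when>1337336250</when>
<what>Alias is resolved to a bug number for a private bug</what>
</update>
<update>
<when>1344175272</when>
<what>Do not link a bug alias with its bug ID for bugs you cannot see</what>
</update>
</report>
</short_desc>"""


def reports_to_csv(xml_string, out):
    """Write one (id, description) row per <what> element to the file-like `out`."""
    root = ET.fromstring(xml_string)
    writer = csv.writer(out)
    writer.writerow(["id", "description"])
    for report in root.iter("report"):
        report_id = report.get("id")  # attribute of the enclosing <report>
        for what in report.iter("what"):
            writer.writerow([report_id, what.text])


buffer = io.StringIO()
reports_to_csv(xml_text, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, the same function should work with something like open("reports.csv", "w", newline="", encoding="utf-8"), where the file name is only a placeholder.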

You can get the id by accessing the element's attribute by key, for example report['id'] (this is BeautifulSoup syntax):

from bs4 import BeautifulSoup
from io import BytesIO
data = b'''
<?xml version="1.0"?>
<short_desc>
<report id="756550">
<update> #first update
<when>1337336250</when>
<what>Alias is resolved to a bug number for a private bug</what>
</update>
<update> #latest update
<when>1344175272</when>
<what>Do not link a bug alias with its bug ID for bugs you cannot see</what>
</update>
</report>
</short_desc>
'''
print("id      description")
f = BytesIO(data)
soup = BeautifulSoup(f, 'html.parser')
# Each <report> tag carries the id; its nested <what> tags carry the descriptions.
for report in soup.select('report'):
    id_ = report['id']
    for what in report.select('what'):
        print(f"{id_:<7} {what.text}")
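If you would rather stay with ElementTree, the key-style lookup has a close equivalent there. The sketch below uses a cut-down one-line report with placeholder text, and the "severity" attribute is made up purely to show the default value.

```python
import xml.etree.ElementTree as ET

report = ET.fromstring(
    '<report id="756550"><update><what>example text</what></update></report>'
)

# Element.get returns None (or the supplied default) when the attribute is absent.
report_id = report.get("id")
missing = report.get("severity", "unknown")  # "severity" is a hypothetical attribute

# Element.attrib is a plain dict of attributes; indexing raises KeyError if absent.
same_id = report.attrib["id"]

print(report_id, same_id, missing)
```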

Create a list of tuples and do whatever you like with it.

import xml.etree.ElementTree as ET

data = """<?xml version="1.0"?>
<short_desc>
<report id="756550">
<update> #first update
<when>1337336250</when>
<what>Alias is resolved to a bug number for a private bug</what>
</update>
<update> #latest update
<when>1344175272</when>
<what>Do not link a bug alias with its bug ID for bugs you cannot see</what>
</update>
</report>
</short_desc>"""

root = ET.fromstring(data)
# One (when, what) tuple per <update> element.
data = [(e.find('when').text, e.find('what').text) for e in root.findall('.//update')]
print(data)

Output
[('1337336250', 'Alias is resolved to a bug number for a private bug'), ('1344175272', 'Do not link a bug alias with its bug ID for bugs you cannot see')]
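The tuples above hold the timestamp and description but not the report id the question asks for. A possible variation, sketched under the assumption that every update has both a when and a what element, iterates over reports first so the id can be carried into each tuple. The second report in the sample (id 100001) is invented only to show that several reports are handled.

```python
import xml.etree.ElementTree as ET

# Sample data: the first report is from the question, the second is hypothetical.
sample = """<short_desc>
<report id="756550">
<update><when>1337336250</when><what>Alias is resolved to a bug number for a private bug</what></update>
<update><when>1344175272</when><what>Do not link a bug alias with its bug ID for bugs you cannot see</what></update>
</report>
<report id="100001">
<update><when>1300000000</when><what>Placeholder description</what></update>
</report>
</short_desc>"""

root = ET.fromstring(sample)

# One (id, when, what) tuple per <update>; the id comes from the enclosing <report>.
# int() would fail on a missing <when>, so real data may need a guard here.
rows = [
    (report.get("id"), int(update.findtext("when")), update.findtext("what"))
    for report in root.iter("report")
    for update in report.iter("update")
]

for row in rows:
    print(row)
```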
