转换嵌套的XML



我目前正在将嵌套的XML解析为pandas数据表,这样我就可以生成一个CSV,每列都是元素名称,其值是元素文本,但我在解析信息时遇到了一些问题。下面是嵌套XML的一个示例以及我尝试过的内容。

下面的XML可能相当大,包含数百个不同的记录。这就是我尝试的:

##Import modules
import xml.etree.ElementTree as ET
import pandas as pd
from lxml import etree
tree = ET.parse("File.xml")
root = tree.getroot()
for subelement in root:
for subsub in subelement:
print(subsub.tag,",", subsub.text, subsub.attrib, subsub.items())
for subelement in root:
for subsub in subelement:
for subsubsub in subsub:
print(subsubsub.tag,",", subsubsub.text, subsubsub.attrib)
<?xml version="1.0" encoding="utf-16"?>
<test1 xmlns="test.xsd">
<test2 ID="123123123" test3="123123">
<test3>Separate</test3>
<test4>AA</test4>
<Comments>BB</Comments>
<test5>
<test6 ID="123123">
<test3>today</test3>
<test7>123 street</test7>
</test6>
</test5>
<test8>
<test10 ID="434234">
<test3>type of work</test3>
<test9>test work</test9>
</test10>
</test8>
<test11>
<test12 ID="234234234">
<test3>Social</test3>
<test14>test</test14>
</test12>
<test12 ID="123123">
<test3>Something Here</test3>
<test13>Some date</test13>
<test14>123123124433</test14>
</test12>
</test11>
<test15>
<test16 ID="6456456456">
<test3>Something Something</test3>
<test14>746745636</test14>
</test16>
</test15>
</test2>
<test2 ID="353453245" test3="list of something">
<test3>Somewhere</test3>
<test4>Someone</test4>
<Comments>Some comment</Comments>
<test5>
<test6 ID="567456756">
<test3>Not today</test3>
<test7>5634643643</test7>
<test17>Some Info</test17>
<test19>Somewhere</test19>
<test18>63243333</test18>
</test6>
</test5>
<test11>
<test12 ID="456436346">
<test3>Pattern</test3>
<test14>436346346</test14>
</test12>
<test12 ID="4364356">
<test3> ID</test3>
<test14>5674567457</test14>
</test12>
<test12 ID="123123123443">
<test3>Other ID</test3>
<test13>54234532452345</test13>
<test14>231423532452345</test14>
</test12>
</test11>
<test15>
<test16 ID="34252345">
<test3>None test</test3>
<test14>456436436346</test14>
</test16>
</test15>
</test2>
</test1>

更新那么完整的代码会是这样的吗?

###TEST USING EXAMPLE HOTLIST
with open("file.csv", "w", newline='') as fout:
header = ['test3','test4','test7','test9','test13','test14','test17','test18','test19','Comments']
csvout = csv.DictWriter(fout, fieldnames=header)
csvout.writeheader()
row = {}
for _, elem in ET.iterparse('file.xml'):
# strip the namespace from the element tag name; e.g. {Test.xsd}test14 > test14
tag = re.sub("^{.*?}", "", elem.tag)
if tag == 'test2':
if len(row) != 0:
print(row)
csvout.writerow(row)
row = {}
if len(elem) == 0:
text = elem.text
old = row.get(tag)
if old is None:
# first occurrence of the tag
row[tag] = text
elif isinstance(old, str):
# second occurrence of the tag
row[tag] = [old, text]
else:
# already a list
old.append(text)

对于嵌套的XML,您可以使用iterparse((函数迭代XML中的所有元素。然后,您需要有逻辑来处理元素,这取决于它要添加到字典对象中作为行导出的标记。

for _, elem in ET.iterparse('file.xml'):
if len(elem) == 0:
print(f'{elem.tag} {elem.attrib} text={elem.text}')
else:
print(f'{elem.tag} {elem.attrib}')

要从元素文本在CSV文件中创建一行,可以执行以下操作。例如,如果";test2";标记一个新记录的开始,然后该记录可用于将该记录写入新行并清除下一个记录的字典。

如果想要输出所有或某些属性,则需要为此添加几行代码。如果属性名称与元素名称具有相同的名称,或者多个元素具有相同的属性(例如ID(,则需要在代码中对此进行处理。

import xml.etree.ElementTree as ET
import re
import csv
with open("out.csv", "w", newline='') as fout:
header = ['test3','test4','test7','test9','test13','test14','test17','test18','test19','Comments']
csvout = csv.DictWriter(fout, fieldnames=header)
csvout.writeheader()
row = {}
for _, elem in ET.iterparse('test.xml'):
# strip the namespace from the element tag name; e.g. {Test.xsd}test14 > test14
tag = re.sub("^{.*?}", "", elem.tag)
if tag == 'test2':
if len(row) != 0:
print(row)
csvout.writerow(row)
row = {}
if len(elem) == 0:
row[tag] = elem.text

输出:

{'test3': 'Something Something', 'test4': 'AA', 'Comments': 'BB', 'test7': '123 street', 'test9': 'test work', 'test14': '746745636', 'test13': 'Some date'}
{'test3': 'None test', 'test4': 'Someone', 'Comments': 'Some comment', 'test7': '5634643643', 'test17': 'Some Info', 'test19': 'Somewhere', 'test18': '63243333', 'test14': '456436436346', 'test13': '54234532452345'}

CSV输出:

test3,test4,test7,test9,test13,test14,test17,test18,test19,Comments
Something Something,AA,123 street,test work,Some date,746745636,,,,BB
None test,Someone,5634643643,,54234532452345,456436436346,Some Info,63243333,Somewhere,Some comment

更新:

如果想要处理重复的标签并创建一个值列表,那么可以尝试以下操作:

if len(elem) == 0:
text = elem.text
old = row.get(tag)
if old is None:
# first occurrence
row[tag] = text
elif isinstance(old, str):
# second occurrence > create list
row[tag] = [old, text]
else:
old.append(text)

最新更新