Sort XML by attribute in Python, keeping all children of each parent node



I have an XML file that I want to sort by an attribute value. Here is the XML file:

<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
<name>imglab dataset</name>
<comment>Created by imglab tool.</comment>
<images>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
<box top="175" left="59" width="73" height="29">
<label>groundpainting_hotstar</label>
</box>
<box top="174" left="205" width="56" height="24">
<label>groundpainting_yesbank</label>
</box>
<box top="170" left="141" width="44" height="32">
<label>groundpainting_vodafone</label>
</box>
</image>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
<box top="198" left="17" width="32" height="10">
<label>sightscreen_pepsi</label>
</box>
</image>
</images>
</dataset>

The desired output is:

<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
<name>imglab dataset</name>
<comment>Created by imglab tool.</comment>
<images>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
<box top="198" left="17" width="32" height="10">
<label>sightscreen_pepsi</label>
</box>
</image>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
<box top="175" left="59" width="73" height="29">
<label>groundpainting_hotstar</label>
</box>
<box top="174" left="205" width="56" height="24">
<label>groundpainting_yesbank</label>
</box>
<box top="170" left="141" width="44" height="32">
<label>groundpainting_vodafone</label>
</box>
</image>
</images>
</dataset>

I tried the following two options:

import xml.etree.ElementTree as ET

tree = ET.parse("finalxml.xml")
container = tree.find("images")
data = []
for elem in container:
    key = elem.findtext("image")
    data.append((key, elem))
data.sort()
container[:] = [item[-1] for item in data]
tree.write("new-data.xml")

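This code only rearranges the box attributes, not the image file attributes, which is not what I want. The following is something I found on SO, but it did not do anything.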

# =======================================================================
# Monkey patch ElementTree
import xml.etree.ElementTree as ET

def _serialize_xml(write, elem, encoding, qnames, namespaces):
    tag = elem.tag
    text = elem.text
    if tag is ET.Comment:
        write("<!--%s-->" % ET._encode(text, encoding))
    elif tag is ET.ProcessingInstruction:
        write("<?%s?>" % ET._encode(text, encoding))
    else:
        tag = qnames[tag]
        if tag is None:
            if text:
                write(ET._escape_cdata(text, encoding))
            for e in elem:
                _serialize_xml(write, e, encoding, qnames, None)
        else:
            write("<" + tag)
            items = elem.items()
            if items or namespaces:
                if namespaces:
                    for v, k in sorted(namespaces.items(),
                                       key=lambda x: x[1]):  # sort on prefix
                        if k:
                            k = ":" + k
                        write(" xmlns%s=\"%s\"" % (
                            k.encode(encoding),
                            ET._escape_attrib(v, encoding)
                            ))
                #for k, v in sorted(items):  # lexical order
                for k, v in items: # Monkey patch
                    if isinstance(k, ET.QName):
                        k = k.text
                    if isinstance(v, ET.QName):
                        v = qnames[v.text]
                    else:
                        v = ET._escape_attrib(v, encoding)
                    write(" %s=\"%s\"" % (qnames[k], v))
            if text or len(elem):
                write(">")
                if text:
                    write(ET._escape_cdata(text, encoding))
                for e in elem:
                    _serialize_xml(write, e, encoding, qnames, None)
                write("</" + tag + ">")
            else:
                write(" />")
    if elem.tail:
        write(ET._escape_cdata(elem.tail, encoding))

ET._serialize_xml = _serialize_xml

from collections import OrderedDict

class OrderedXMLTreeBuilder(ET.XMLTreeBuilder):
    def _start_list(self, tag, attrib_in):
        fixname = self._fixname
        tag = fixname(tag)
        attrib = OrderedDict()
        if attrib_in:
            for i in range(0, len(attrib_in), 2):
                attrib[fixname(attrib_in[i])] = self._fixtext(attrib_in[i+1])
        return self._target.start(tag, attrib)

tree = ET.parse("example1.xml", OrderedXMLTreeBuilder())
tree.write("new-data.xml")

How can I sort the XML?

Use the key named argument of list.sort, with the file attribute of each <image> tag as the sort key:

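key specifies a function of one argument that is used to extract a comparison key from each list element (for example, key=str.lower). The key corresponding to each item in the list is calculated once and then used for the entire sorting process. The default value of None means that list items are sorted directly without calculating a separate key value.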

import xml.etree.ElementTree
xml_string = r'''<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
<name>imglab dataset</name>
<comment>Created by imglab tool.</comment>
<images>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
<box top="175" left="59" width="73" height="29">
<label>groundpainting_hotstar</label>
</box>
<box top="174" left="205" width="56" height="24">
<label>groundpainting_yesbank</label>
</box>
<box top="170" left="141" width="44" height="32">
<label>groundpainting_vodafone</label>
</box>
</image>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
<box top="198" left="17" width="32" height="10">
<label>sightscreen_pepsi</label>
</box>
</image>
</images>
</dataset>'''
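root = xml.etree.ElementTree.fromstring(xml_string)
images_root = root.find('images')
# Collect the <image> children and sort them by their file attribute.
images = images_root.findall('image')
images.sort(key=lambda x: x.attrib['file'])
# Slice assignment replaces all children of <images> with the sorted list.
images_root[:] = images
# tostring() returns bytes on Python 3, so decode before printing.
print(xml.etree.ElementTree.tostring(root).decode())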
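
Since the question reads from and writes to files rather than a string, the same idea can be sketched as a small file-to-file helper. This is only a minimal sketch: the function name sort_images_by_file is made up for illustration, and it assumes every <image> sits directly under <images>. Like the code above, it drops any non-<image> children of <images>.

```python
import xml.etree.ElementTree as ET


def sort_images_by_file(src_path, dst_path):
    """Sort the <image> children of <images> by their file attribute.

    Hypothetical helper name; not part of ElementTree.
    """
    tree = ET.parse(src_path)
    images_root = tree.getroot().find("images")
    # sorted() builds a new list; slice assignment then replaces the
    # children of <images> with that list. Each <image> keeps its own
    # <box> children because whole elements are moved, not copied.
    images_root[:] = sorted(images_root.findall("image"),
                            key=lambda image: image.get("file", ""))
    tree.write(dst_path)
```

With the file names from the question, the call would be sort_images_by_file("finalxml.xml", "new-data.xml"). Note that the default ElementTree parser skips processing instructions, so the xml-stylesheet line is not expected to appear in the written file.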

An alternative solution using lxml, based on this answer, which states that lxml serializes attributes in the order they were set (unlike xml):

import lxml.etree
xml_string = r'''<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
<name>imglab dataset</name>
<comment>Created by imglab tool.</comment>
<images>
<text>lol</text>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00003.jpg">
<box top="175" left="59" width="73" height="29">
<label>groundpainting_hotstar</label>
</box>
<box top="174" left="205" width="56" height="24">
<label>groundpainting_yesbank</label>
</box>
<box top="170" left="141" width="44" height="32">
<label>groundpainting_vodafone</label>
</box>
</image>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00001.jpg"/>
<image file="/home/iris/Documents/SONY_MAX-20150408-200026-210358-00002.jpg">
<box top="198" left="17" width="32" height="10">
<label>sightscreen_pepsi</label>
</box>
</image>
</images>
</dataset>'''
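root = lxml.etree.fromstring(xml_string)
images_root = root.find('images')
# Collect the <image> children and sort them by their file attribute.
images = images_root.findall('image')
images.sort(key=lambda x: x.attrib['file'])
# Slice assignment replaces all children of <images>, so <text> is dropped.
images_root[:] = images
# tostring() returns bytes, so decode before printing.
print(lxml.etree.tostring(root).decode())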

Note: this removes any non-<image> children (direct descendants) of <images>.
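
If those other children need to survive, one possible variation is to sort only the positions that <image> elements already occupy and leave everything else where it is. The sketch below uses the standard-library ElementTree; the helper name sort_images_in_place and the shortened file names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def sort_images_in_place(images_root):
    """Sort <image> children by file, leaving other children where they are.

    Hypothetical helper name; not part of ElementTree or lxml.
    """
    children = list(images_root)
    # Positions currently occupied by <image> elements.
    slots = [i for i, child in enumerate(children) if child.tag == "image"]
    ordered = sorted((children[i] for i in slots),
                     key=lambda image: image.get("file", ""))
    # Put the sorted <image> elements back into those same positions.
    for slot, image in zip(slots, ordered):
        children[slot] = image
    images_root[:] = children


root = ET.fromstring(
    '<dataset><images>'
    '<text>lol</text>'
    '<image file="00003.jpg"><box top="175"/></image>'
    '<image file="00001.jpg"/>'
    '<image file="00002.jpg"/>'
    '</images></dataset>'
)
sort_images_in_place(root.find("images"))
# <text> stays first; the <image> elements follow in file order.
print([(child.tag, child.get("file")) for child in root.find("images")])
```

The <text> element keeps its original position, and each <image> still carries its own <box> children because whole elements are moved.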
