Parsing a large XML file with lxml SAX



I have a huge XML file that looks like this:

<environment>
    <category name='category1'>
        <peoples>
            <people>
                <name>Mary</name>
                <city>NY</city>
                <age>10</age>
            </people>
            <people>
                <name>Jane</name>
                <city>NY</city>
                <age>19</age>
            </people>
            <people>
                <name>John</name>
                <city>NY</city>
                <age>20</age>
            </people>
            <people>
                <name>Carl</name>
                <city>DC</city>
                <age>11</age>
            </people>
            ...
        </people>
    </category>
    <category name='category2'>
        ...
    </category
</environment>

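I want to parse the XML file and output a dictionary whose keys are the category names (category1 and category2 in the example) and whose values are dictionaries that may differ for each category. For now I am only interested in category1, where I want to build a dictionary whose keys are names and whose values are ages, containing only the people who live in city = NY.

So the final output would look something like this:

{ 'category1': { 'Mary': 10, 'Jane': 19, 'John': 20 }, 'category2': {} }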

I first tried iterparse, but I got a MemoryError:

from lxml import etree

def parse():
    result = {}
    for _, element in etree.iterparse('file.xml', tag='category'):
        result[element.get('name')] = {}
        if element.get('name') == 'category1':
            persons = {}
            for person in element.findall('peoples/people'):
                name, city, age = person.getchildren()
                if city.text == 'NY':
                    persons[name.text] = age.text
            result[element.get('name')] = persons
        element.clear()
    return result

So my second attempt was to use SAX, but I am not familiar with it. I started a script from here, but I could not find a way to associate the name with a person's city and age:

import lxml.etree

class CategoryParser(object):
    def __init__(self, d):
        self.d = d
    def start(self, tag, attrib):
        if tag == 'category':
            self.group = self.d[attrib['name']] = {}
        elif tag == 'people':
            # Don't know how to access name, city and age for this person
            pass
    def close(self):
        pass

result = {}
parser = lxml.etree.XMLParser(target=CategoryParser(result))
lxml.etree.parse('file.xml', parser)

What is the best way to achieve the expected result? I am open to other approaches.

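Your lxml approach looks very close, but I am not sure why it gives a MemoryError. You can do this quite easily with the built-in xml.etree.ElementTree, though.

Using this XML (slightly modified from your example):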

xml = '''<environment>
    <category name='category1'>
        <peoples>
            <people>
                <name>Mary</name>
                <city>NY</city>
                <age>10</age>
            </people>
            <people>
                <name>Jane</name>
                <city>NY</city>
                <age>19</age>
            </people>
            <people>
                <name>John</name>
                <city>NY</city>
                <age>20</age>
            </people>
            <people>
                <name>Carl</name>
                <city>DC</city>
                <age>11</age>
            </people>
        </peoples>
    </category>
    <category name='category2'>
        <peoples>
            <people>
                <name>Mike</name>
                <city>NY</city>
                <age>200</age>
            </people>
            <people>
                <name>Jimmy</name>
                <city>HW</city>
                <age>94</age>
            </people>
        </peoples>
    </category>
</environment>'''

I do this:

import xml.etree.ElementTree as ET

root = ET.fromstring(xml)
x = dict()
# Iterate all "category" nodes
for c in root.findall('./category'):
    # Store "name" attribute
    name = c.attrib['name']
    # Insert empty dictionary for current category
    x[name] = {}
    # Iterate all people nodes contained in this category that have
    # a child "city" node matching "NY"
    for p in c.findall('./peoples/people[city="NY"]'):
        # Get text of "name" child node
        # (accessed by iterating parent node)
        # i.e. "list(p)" -> [<Element 'name' at 0x04BB2750>, <Element 'city' at 0x04BB2900>, <Element 'age' at 0x04BB2A50>])
        person_name = next(e for e in p if e.tag == 'name').text
        # Same for "age" node, and convert to int
        person_age = int(next(e for e in p if e.tag == 'age').text)
        # Add entry to current category dictionary
        x[name][person_name] = person_age

This gives me the following dictionary:

{'category1': {'Mary': 10, 'Jane': 19, 'John': 20}, 'category2': {'Mike': 200}}

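Also, a few notes on the sample XML (probably just copy/paste artifacts, but just in case):

  • Your closing </peoples> tag is missing the "s".
  • Your last closing </category> tag is missing its closing ">".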
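Since the file in the question is too large to load in one go, the same filter can also be sketched as a streaming pass. This is a minimal sketch, assuming the standard library's ET.iterparse and the element layout from the sample; the function name ny_ages_by_category and the small sample document are made up for illustration. Each <people> element is detached from its parent once it has been read, so finished elements do not pile up in memory.

```python
import io
import xml.etree.ElementTree as ET


def ny_ages_by_category(source, city="NY"):
    """Stream `source` and return {category name: {person name: age}}."""
    result = {}
    current = None  # dict of the <category> currently being read
    parent = None   # the enclosing <peoples> element
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == "category":
                # Attributes are already available on the "start" event.
                current = result.setdefault(elem.get("name"), {})
            elif elem.tag == "peoples":
                parent = elem
        elif elem.tag == "people":  # "end" event: the element is complete
            if current is not None and elem.findtext("city") == city:
                current[elem.findtext("name")] = int(elem.findtext("age"))
            # Detach the finished element so it can be garbage collected.
            if parent is not None:
                parent.remove(elem)
    return result


# Hypothetical miniature of the sample document, for demonstration only.
sample = b"""<environment>
  <category name='category1'><peoples>
    <people><name>Mary</name><city>NY</city><age>10</age></people>
    <people><name>Carl</name><city>DC</city><age>11</age></people>
  </peoples></category>
  <category name='category2'><peoples/></category>
</environment>"""

print(ny_ages_by_category(io.BytesIO(sample)))
```

For the real file, passing the path ('file.xml') instead of the BytesIO object should work the same way, since iterparse accepts either a filename or a file object.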

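Since you are using lxml and said you are open to other approaches, consider XSLT, a special-purpose language designed to transform XML documents into various formats (including text files).

Specifically, walk down your tree and build the required braces and quotes from the node values. And since the desired dictionary can be valid JSON, export the XSLT result as a .json file!

XSLT (save as an .xsl file, which is a special kind of .xml file)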

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes" method="text"/>
    <xsl:strip-space elements="*"/>

    <xsl:variable name="pst">&apos;</xsl:variable>

    <xsl:template match="/environment">
        <xsl:text>{&#xa;</xsl:text>
        <xsl:apply-templates select="category"/>
        <xsl:text>&#xa;}</xsl:text>
    </xsl:template>

    <xsl:template match="category">
        <xsl:value-of select="concat('  ', $pst, @name, $pst, ': {')"/>
        <xsl:apply-templates select="peoples/people[city='NY']"/>
        <xsl:text>}</xsl:text>
        <xsl:if test="position() != last()">
            <xsl:text>,&#xa;</xsl:text>
        </xsl:if>
    </xsl:template>

    <xsl:template match="people">
        <xsl:value-of select="concat($pst, name, $pst, ': ', age)"/>
        <xsl:if test="position() != last()">
            <xsl:text>, </xsl:text>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

Python (no for loops, if logic, or def definitions)

import ast
import lxml.etree as et

# LOAD XML AND XSL
xml = et.parse('Input.xml')
xsl = et.parse('Script.xsl')

# TRANSFORM INPUT
transformer = et.XSLT(xsl)
output_str = transformer(xml)

# BUILD DICT LITERALLY
new_dict = ast.literal_eval(str(output_str))
print(new_dict)
# {'category1': {'Mary': 10, 'Jane': 19, 'John': 20}}

# OUTPUT JSON (str() returns the serialised text, so open the file in text mode)
with open('Output.json', 'w') as f:
    f.write(str(output_str))
# {
#   'category1': {'Mary': 10, 'Jane': 19, 'John': 20}
# }
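One caveat worth illustrating: the stylesheet's $pst variable holds an apostrophe, so the transformed text is a Python dict literal with single-quoted keys, whereas strict JSON requires double-quoted strings. A minimal sketch of one way around this, assuming the transformed text has the shape shown in the comments above (the text variable below is a stand-in for str(output_str), not real transform output), is to parse the literal and re-serialise it with the json module. Switching $pst to &quot; in the stylesheet would be another option.

```python
import ast
import json

# Stand-in for str(output_str): apostrophe-quoted, as the stylesheet emits it.
text = "{\n  'category1': {'Mary': 10, 'Jane': 19, 'John': 20}\n}"

# literal_eval accepts Python-style single quotes.
parsed = ast.literal_eval(text)

# json.dumps re-serialises the dict with double-quoted keys.
strict_json = json.dumps(parsed, indent=2)

# The result round-trips through a strict JSON parser.
assert json.loads(strict_json) == parsed
print(strict_json)
```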

Online demo (with extended nodes for demonstration)
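Returning to the parser-target attempt in the question: a person's name, city and age can be tied together by buffering the text of each child between the start and end of a <people> element. The following is a minimal sketch, not a definitive implementation. It uses the standard library's xml.etree.ElementTree.XMLParser so that it runs without extra dependencies; lxml.etree.XMLParser accepts a target with the same start/end/data/close methods. The class name CategoryTarget, the helper parse_file and the sample bytes are hypothetical.

```python
import xml.etree.ElementTree as ET


class CategoryTarget:
    """Parser target collecting {category: {name: age}} for one city."""

    def __init__(self, city="NY"):
        self.city = city
        self.result = {}
        self.group = None   # dict for the <category> being read
        self.person = None  # fields of the <people> being read
        self.field = None   # tag of the child element currently open

    def start(self, tag, attrib):
        if tag == "category":
            self.group = self.result.setdefault(attrib["name"], {})
        elif tag == "people":
            self.person = {}
        elif self.person is not None:
            self.field = tag
            self.person[tag] = ""

    def data(self, text):
        # Text may arrive in several chunks, so append rather than assign.
        if self.person is not None and self.field is not None:
            self.person[self.field] += text

    def end(self, tag):
        if tag == "people":
            if self.group is not None and self.person.get("city") == self.city:
                self.group[self.person["name"]] = int(self.person["age"])
            self.person = None
        elif tag == self.field:
            self.field = None

    def close(self):
        return self.result


def parse_file(path, chunk_size=1 << 16):
    """Feed the file in chunks so it is never held in memory as a whole."""
    parser = ET.XMLParser(target=CategoryTarget())
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            parser.feed(chunk)
    return parser.close()


# Hypothetical miniature of the sample document, for demonstration only.
sample = b"""<environment>
  <category name='category1'><peoples>
    <people><name>Mary</name><city>NY</city><age>10</age></people>
    <people><name>Carl</name><city>DC</city><age>11</age></people>
  </peoples></category>
  <category name='category2'><peoples/></category>
</environment>"""

parser = ET.XMLParser(target=CategoryTarget())
parser.feed(sample)
result = parser.close()  # returns whatever the target's close() returns
print(result)
```

Because no element tree is built, memory use stays roughly proportional to the size of the result dictionary rather than to the size of the file.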
