Parsing XML attributes with namespaces



Given the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom">
<id>1</id>
<title>Example XML</title>
<published>2021-12-15T00:00:00Z</published>
<updated>2022-01-06T12:44:47Z</updated>
<content type="application/xml">
<articleDoc xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" schemaVersion="1.8" xml:lang="en">
<articleDocHead>
<itemInfo/>
</articleDocHead>
</articleDoc>
</content>
</entry>

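How can I get the value of the xml:lang attribute on entry/content/articleDoc? I have checked the Python documentation, but unfortunately it does not cover attributes with namespaces. The one solution I found, manually writing the namespace URI in front of the attribute name as the dictionary key, seems wrong. I am using Python 3.9.9.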

This is my code so far:

import xml.etree.ElementTree as tree  # xml.etree.cElementTree was removed in Python 3.9
xml = """<?xml version="1.0" encoding="UTF-8"?>
<entry xmlns="http://www.w3.org/2005/Atom">
<id>1</id>
<title>Example XML</title>
<published>2021-12-15T00:00:00Z</published>
<updated>2022-01-06T12:44:47Z</updated>
<content type="application/xml">
<articleDoc xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" schemaVersion="1.8" xml:lang="en">
<articleDocHead>
<itemInfo/>
</articleDocHead>
</articleDoc>
</content>
</entry>"""
ns = {'nitf': 'http://iptc.org/std/NITF/2006-10-18/',
'w3': 'http://www.w3.org/2005/Atom',
'xml': 'http://www.w3.org/XML/1998/namespace'}
root = tree.fromstring(xml)
id = root.find("w3:id", ns).text # works
print(id)
type_attribute = root.find("w3:content", ns).attrib['type'] # works
print(type_attribute)
#language = root.find("w3:content/articleDoc/articleDocHeader[xml:lang']", ns) # doesn't work
language = root.find("w3:content/articleDoc", ns).attrib['{http://www.w3.org/XML/1998/namespace}lang'] # works, but seems wrong
print(language)

Thanks for your help. Much appreciated!

Here is a quick guide to navigating an XML document with lxml.etree:

In [2]: import lxml.etree as etree
In [3]: xml = """
...:     <entry xmlns="http://www.w3.org/2005/Atom" xmlns:demo="http://www.whatever.com">
...:       <id>1</id>
...:       <demo:demo_child>some namespace entry</demo:demo_child>
...:       <title>Example XML</title>
...:       <published>2021-12-15T00:00:00Z</published>
...:       <updated>2022-01-06T12:44:47Z</updated>
...:       <content type="application/xml">
...:         <articleDoc xmlns="" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" schemaVersion="1.8" xml:lang="en">
...:           <articleDocHead>
...:             <itemInfo/>
...:           </articleDocHead>
...:         </articleDoc>
...:       </content>
...:     </entry>"""
In [4]: tree = etree.fromstring(xml)
In [5]: tree
Out[5]: <Element {http://www.w3.org/2005/Atom}entry at 0x7d010c153800>
In [6]: list(tree.iterchildren())  # get children of current element
Out[6]: 
[<Element {http://www.w3.org/2005/Atom}id at 0x7d010c1b06c0>,
<Element {http://www.whatever.com}demo_child at 0x7d010c9c54c0>,
<Element {http://www.w3.org/2005/Atom}title at 0x7d010c9c5180>,
<Element {http://www.w3.org/2005/Atom}published at 0x7d01233d6cc0>,
<Element {http://www.w3.org/2005/Atom}updated at 0x7d010c0d4580>,
<Element {http://www.w3.org/2005/Atom}content at 0x7d010c0d46c0>]
In [7]: print([el.tag for el in tree.iterchildren()])    # get children of current element (human readable)
['{http://www.w3.org/2005/Atom}id', '{http://www.whatever.com}demo_child', '{http://www.w3.org/2005/Atom}title', '{http://www.w3.org/2005/Atom}published', '{http://www.w3.org/2005/Atom}updated', '{http://www.w3.org/2005/Atom}content']
In [8]: print(tree[0] == next(tree.iterchildren()))  # children can also be accessed by index: tree[index]
True
In [9]: tree.find('id')  # FAILS (returns None): the default namespace is not applied to the path
In [10]: tree.find('{http://www.w3.org/2005/Atom}id')  # now it works
Out[10]: <Element {http://www.w3.org/2005/Atom}id at 0x7d010c1b06c0>
In [11]: tree.find('{http://www.w3.org/2005/Atom}demo_child')  # FAILS (returns None): demo_child is not in the default namespace
In [12]: tree.find('{http://www.whatever.com}demo_child')  # take proper namespace
Out[12]: <Element {http://www.whatever.com}demo_child at 0x7d010c9c54c0>
In [13]: tree.find(f'{{{tree.nsmap["demo"]}}}demo_child')  # do not spell out full namespace
Out[13]: <Element {http://www.whatever.com}demo_child at 0x7d010c9c54c0>
In [14]: tree.find('{http://www.w3.org/2005/Atom}content').find('articleDoc')  # follow path of elements
Out[14]: <Element articleDoc at 0x7d010c13d9c0>
In [15]: tree.xpath('//tmp_ns:id', namespaces={'tmp_ns': tree.nsmap[None]})  # use xpath, handling default namespace is tedious here
Out[15]: [<Element {http://www.w3.org/2005/Atom}id at 0x7d010c1b06c0>]
In [16]: tree.xpath('//articleDoc')  # find elements not being a direct child
Out[16]: [<Element articleDoc at 0x7d010c13d9c0>]
In [17]: tree.xpath('//@type')  # search for attribute
Out[17]: ['application/xml']
In [18]: tree.xpath('//@xml:lang')  # search for other attribute
Out[18]: ['en']
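
For the standard library, the key with the namespace URI in braces is not a hack: xml.etree.ElementTree stores namespaced attribute names in exactly that `{uri}local` form, so looking the attribute up this way is the expected approach. The sketch below is a minimal illustration on a trimmed copy of the XML from the question. The helper `qname` and the constants `XML_NS` and `ATOM_NS` are hypothetical names introduced here, not part of any library.

```python
import xml.etree.ElementTree as ET

# The "xml" prefix is reserved and always bound to this URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"
ATOM_NS = "http://www.w3.org/2005/Atom"

# Trimmed copy of the document from the question.
doc = """<entry xmlns="http://www.w3.org/2005/Atom">
<content type="application/xml">
<articleDoc xmlns="" schemaVersion="1.8" xml:lang="en"/>
</content>
</entry>"""


def qname(uri, local):
    """Hypothetical helper: build the '{uri}local' key ElementTree uses."""
    return f"{{{uri}}}{local}"


root = ET.fromstring(doc)

# articleDoc resets the default namespace (xmlns=""), so it needs no prefix.
article = root.find("w3:content/articleDoc", {"w3": ATOM_NS})

# Namespaced attributes are stored under their '{uri}local' name.
language = article.get(qname(XML_NS, "lang"))
print(language)
```

Keeping the URI in a named constant avoids spelling it out at every lookup, which is usually what makes the original one-liner feel wrong.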
