Python: parsing XML with an unusual tag name (atom:link)

I am trying to parse the href out of the XML below. There are multiple workspace tags; I show only one here.

<workspaces>
  <workspace>
    <name>practice</name>
    <atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="alternate" href="https://www.my-geoserver.com/geoserver/rest/workspaces/practice.xml" type="application/xml"/>
  </workspace>
</workspaces>

The above comes from a requests.get call using the requests library:

import requests

myUrl = 'https://www.my-geoserver.com/geoserver/rest/workspaces'
headers = {'Accept': 'text/xml'}
resp = requests.get(myUrl, auth=('admin', 'password'), headers=headers)

If I search for 'workspace', I get objects back:

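import xml.etree.ElementTree as ET
tree = ET.fromstring(resp.text)  # assumed; the post does not show how tree was built
lst = tree.findall('workspace')
print(lst)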

This results in:

[<Element 'workspace' at 0x039E70F0>, <Element 'workspace' at 0x039E71B0>, <Element 'workspace' at 0x039E7240>]

OK, but how do I get the href text out of it? I have tried:

lst = tree.findall('atom')
lst = tree.findall('atom:link')
lst = tree.findall('workspace/atom:link')

But none of them isolates the tag; in fact, the last one raises an error:

SyntaxError: prefix 'atom' not found in prefix map

How can I get all the href values from tags with this kind of name?
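
The error message suggests that the search path uses a prefix the parser has no mapping for. A minimal sketch of one possible fix, assuming the tree was parsed with the standard library's xml.etree.ElementTree (the post does not say so explicitly), is to pass a prefix-to-URI mapping to findall. The names XML_TEXT and NAMESPACES are made up for this illustration; the sample XML is the one from the question.

```python
import xml.etree.ElementTree as ET

# Sample document copied from the question (one workspace shown).
XML_TEXT = """<workspaces>
  <workspace>
    <name>practice</name>
    <atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="alternate" href="https://www.my-geoserver.com/geoserver/rest/workspaces/practice.xml" type="application/xml"/>
  </workspace>
</workspaces>"""

# Hypothetical helper mapping: the prefix used in the search path
# points at the namespace URI declared in the XML itself.
NAMESPACES = {"atom": "http://www.w3.org/2005/Atom"}

root = ET.fromstring(XML_TEXT)

# With the mapping supplied, the prefixed path can be resolved.
links = root.findall("workspace/atom:link", NAMESPACES)

# Read the href attribute from each matched element.
hrefs = [link.get("href") for link in links]
print(hrefs)
```

In the real script, XML_TEXT would presumably be replaced by resp.text from the requests call.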

A simple solution I found:

>>> y=BeautifulSoup(x)
>>> y
<workspaces>
<workspace>
<name>practice</name>
<atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="alternate" href="https://www.my-geoserver.com/geoserver/rest/workspaces/practice.xml" type="application/xml">
</atom:link></workspace>
</workspaces>
>>> c = y.workspaces.workspace.findAll("atom:link")
>>> c
[<atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="alternate" href="https://www.my-geoserver.com/geoserver/rest/workspaces/practice.xml" type="application/xml">
</atom:link>]
>>>

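For anyone else who finds this question: the part before the colon (atom in this case) is a namespace prefix, and that is what causes the problem here. The solution is quite simple: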

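import requests
from bs4 import BeautifulSoup

myUrl = 'https://www.my-geoserver.com/geoserver/rest/workspaces'
headers = {'Accept': 'text/xml'}
resp = requests.get(myUrl, auth=('admin', 'my_password'), headers=headers)
stuff = resp.text
# "xml" selects BeautifulSoup's XML parser (it needs lxml installed)
to_parse = BeautifulSoup(stuff, "xml")
for item in to_parse.find_all("atom:link"):
    print(item)  # item["href"] gives just the URL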

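Thanks to Saket Mittal for pointing me to the BeautifulSoup library. The key is to pass "xml" as the parser argument to BeautifulSoup. With "lxml", which selects the HTML parser, the namespaces were not parsed properly and were simply ignored.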
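
If installing BeautifulSoup is not an option, the standard library can show why the plain 'atom:link' search comes back empty. This is only a sketch, assuming xml.etree.ElementTree is the parser in use: it stores a namespaced tag under its expanded name, "{namespace-uri}localname", so searching with that expanded name needs no prefix at all. ATOM_NS and xml_text are names invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:atom declaration in the question's XML.
ATOM_NS = "http://www.w3.org/2005/Atom"

# Same sample document as in the question, written on fewer lines.
xml_text = (
    '<workspaces><workspace><name>practice</name>'
    '<atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="alternate" '
    'href="https://www.my-geoserver.com/geoserver/rest/workspaces/practice.xml" '
    'type="application/xml"/></workspace></workspaces>'
)

root = ET.fromstring(xml_text)

# List every tag name as ElementTree stores it; the link element
# appears under its expanded name rather than as "atom:link".
tags = [elem.tag for elem in root.iter()]
print(tags)

# Search by the expanded name and collect the href attributes.
hrefs = [link.get("href") for link in root.iter(f"{{{ATOM_NS}}}link")]
print(hrefs)
```

The first print makes the stored tag names visible, which is a quick way to check what to search for when a namespaced document behaves unexpectedly.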
