Extracting specific XML values with BS4 and writing them to a DataFrame



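I'm new to Python and doing my best to scrape some XML data. So far I have been using the find and get methods for "normal" XPaths and attributes, but I'm struggling with the last part.

Here is a sample section of the XML: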

<root>
<job>
<othernodes>text</othernodes>
<advertiser>INPUT I WANT
<node2>text</node2>
<node3>text</node3>
</advertiser>
<othernodes>text</othernodes>
</job>

Here is part of my script:

from bs4 import BeautifulSoup
import pandas as pd
import requests

url = "sample url"
xml_data = requests.get(url).content
soup = BeautifulSoup(xml_data, "xml")

# Find the tag/child
child = soup.find("job")
Company = []
while True:
    try:
        Company.append(" ".join(child.find('advertiser')))
    except:
        Company.append(" ")
    try:
        # Next sibling of child, here: job
        child = child.find_next_sibling('job')
    except:
        break

data = []
data = pd.DataFrame({
    "advertiser": Company,
})

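When I print the result, it does not return the value of the advertiser node. I have tried to fix this but could not find a solution. Thanks.

Here is the code you need: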

import xml.etree.ElementTree as ET
XML = '''<root>
<job>
<othernodes>text</othernodes>
<advertiser>add1
<node2>text</node2>
<node3>text</node3>
</advertiser>
<othernodes>text</othernodes>
</job>
<job>
<othernodes>text</othernodes>
<advertiser>add2
<node2>text</node2>
<node3>text</node3>
</advertiser>
<othernodes>text</othernodes>
</job>
</root>'''
root = ET.fromstring(XML)
data = [a.text.strip() for a in root.findall('.//advertiser')]
print(data)

Output:

['add1', 'add2']
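The list comprehension above assumes every `<advertiser>` has text and silently skips jobs that have no `<advertiser>` at all, which would misalign rows once other columns are added. A minimal sketch of a per-job loop that keeps one row per `<job>` follows; the sample XML (including the second job without an advertiser) and the `rows` name are assumptions made for illustration, not part of the original data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second <job> deliberately has no <advertiser>.
xml = """<root>
<job>
<advertiser>add1
<node2>text</node2>
</advertiser>
</job>
<job>
<othernodes>text</othernodes>
</job>
</root>"""

root = ET.fromstring(xml)

rows = []
for job in root.findall("job"):
    # findtext returns the text that precedes the first child element,
    # or the default when the job has no <advertiser>.
    advertiser = job.findtext("advertiser", default="")
    rows.append({"advertiser": advertiser.strip()})

print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` should then give a one-column frame with one row per job, with an empty string where the advertiser is missing.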

You don't need ElementTree for this; BeautifulSoup can do it.

Try calling the .find_next() method, which returns the first match:

from bs4 import BeautifulSoup
xml = """<root>
<job>
<othernodes>text</othernodes>
<advertiser>INPUT I WANT
<node2>text</node2>
<node3>text</node3>
</advertiser>
<othernodes>text</othernodes>
</job>"""
soup = BeautifulSoup(xml, "html.parser")
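# string=True matches the first text node after the opening <advertiser> tag
# (the older text=True spelling is deprecated in recent BeautifulSoup releases)
print([soup.find("advertiser").find_next(string=True).strip()])
# Or using `find_all()`
# print([tag.find_next(string=True).strip() for tag in soup.find_all("advertiser")])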

Output:

['INPUT I WANT']
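As for why the script in the question returns blanks: iterating over a `Tag` yields its children, and those include the `<node2>` and `<node3>` tags, so `" ".join(...)` most likely raises a `TypeError` that the bare `except` swallows. A sketch of a loop over every `<job>` that collects only the advertiser's own text is shown below; the `own_text` helper and the sample XML are hypothetical, and `"html.parser"` is used here only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second <job> deliberately has no <advertiser>.
xml = """<root>
<job>
<advertiser>add1
<node2>text</node2>
</advertiser>
</job>
<job>
<othernodes>text</othernodes>
</job>
</root>"""

soup = BeautifulSoup(xml, "html.parser")


def own_text(tag):
    """Return the text directly inside `tag`, ignoring text of child tags."""
    if tag is None:
        return ""
    # recursive=False limits the search to direct children;
    # string=True keeps only text nodes, so child tags are skipped.
    parts = tag.find_all(string=True, recursive=False)
    return " ".join(p.strip() for p in parts if p.strip())


# One entry per <job>, so the list stays aligned with any other columns.
company = [own_text(job.find("advertiser")) for job in soup.find_all("job")]
print(company)
```

The resulting list can be passed to `pd.DataFrame({"advertiser": company})` in the same way as the `Company` list in the question.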
