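I'm new to Python and am trying to scrape some XML data. So far I've used the find and get methods for "normal" XPath lookups and attributes, but I'm stuck on the last part.
Here is a sample portion of the XML: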
<root>
<job>
<othernodes>text</othernodes>
<advertiser>INPUT I WANT
<node2>text</node2>
<node3>text</node3>
</advertiser>
<othernodes>text</othernodes>
</job>
Here is part of my script:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "sample url"
xml_data = requests.get(url).content
soup = BeautifulSoup(xml_data, "xml")
#Find the tag/child
child = soup.find("job")
Company = []
while True:
    try:
        Company.append(" ".join(child.find('advertiser')))
    except:
        Company.append(" ")
    try:
        # Next sibling of child, here: job
        child = child.find_next_sibling('job')
    except:
        break
data = []
data = pd.DataFrame({
    "advertiser":Company,
})
When I print the result, it does not return the value of the advertiser node. I've tried to solve this but can't find a solution. Thanks.
Here is the code you need:
import xml.etree.ElementTree as ET
XML = '''<root>
<job>
<othernodes>text</othernodes>
<advertiser>add1
<node2>text</node2>
<node3>text</node3>
</advertiser>
<othernodes>text</othernodes>
</job>
<job>
<othernodes>text</othernodes>
<advertiser>add2
<node2>text</node2>
<node3>text</node3>
</advertiser>
<othernodes>text</othernodes>
</job>
</root>'''
root = ET.fromstring(XML)
data = [a.text.strip() for a in root.findall('.//advertiser')]
print(data)
Output:
['add1', 'add2']
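If some job elements might not contain an advertiser at all, the list comprehension above would simply skip them, so the result would no longer line up one-to-one with the jobs. A minimal sketch of one way to handle that, assuming the same XML shape as in the question (the helper name advertiser_names and the second, advertiser-less job are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second <job> deliberately has no <advertiser>.
XML = """<root>
<job>
<advertiser>add1
<node2>text</node2>
</advertiser>
</job>
<job>
<othernodes>no advertiser here</othernodes>
</job>
</root>"""

def advertiser_names(xml_text):
    """Return one string per <job>; "" when the job has no <advertiser>."""
    root = ET.fromstring(xml_text)
    names = []
    for job in root.findall("job"):
        # findtext() gives the text that precedes the first child element,
        # or the default when no <advertiser> exists under this job.
        text = job.findtext("advertiser", default="")
        names.append(text.strip())
    return names

names = advertiser_names(XML)
print(names)  # ['add1', '']
```

Because there is exactly one entry per job, a list built this way can be placed next to other per-job columns in a DataFrame without the rows drifting apart.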
You don't need ElementTree; you can use BeautifulSoup.
Try calling the .find_next() method, which returns the first match:
from bs4 import BeautifulSoup
xml = """<root>
<job>
<othernodes>text</othernodes>
<advertiser>INPUT I WANT
<node2>text</node2>
<node3>text</node3>
</advertiser>
<othernodes>text</othernodes>
</job>"""
soup = BeautifulSoup(xml, "html.parser")
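# `string=True` matches text nodes; in Beautiful Soup versions before 4.4.0 this argument was called `text`
print([soup.find("advertiser").find_next(string=True).strip()])
# Or using `find_all()`
# print([tag.find_next(string=True).strip() for tag in soup.find_all("advertiser")])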
Output:
['INPUT I WANT']
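As for why the loop in the question only ever collects " ": iterating over a Tag yields its children, and those include other Tag objects, so " ".join(...) most likely raises a TypeError that the bare except then hides. A rough sketch of the same per-job loop written with find_all(), assuming the XML shape from the question (the second job without an advertiser and the variable names are only illustrative):

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second <job> deliberately has no <advertiser>.
xml = """<root>
<job>
<advertiser>add1
<node2>text</node2>
</advertiser>
</job>
<job>
<othernodes>text</othernodes>
</job>
</root>"""

soup = BeautifulSoup(xml, "html.parser")

company = []
for job in soup.find_all("job"):
    advertiser = job.find("advertiser")
    if advertiser is None:
        # Keep a placeholder so the list stays aligned with the jobs.
        company.append("")
        continue
    # Look only at direct children and take the first text node,
    # which skips the text inside <node2>, <node3>, and so on.
    first_text = advertiser.find(string=True, recursive=False)
    company.append(first_text.strip() if first_text else "")

print(company)  # ['add1', '']
```

Checking for None explicitly, instead of wrapping everything in try/except, keeps real errors visible while still tolerating jobs that lack the node.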