Parsing an XML file with lxml XPath



I am using lxml XPath to parse the following XML file:

<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>
      https://www.reuters.com/article/us-campbellsoup-thirdpoint/campbell-soup-nears-deal-with-third-point-to-end-board-challenge-sources-idUSKCN1NU11I
    </loc>
    <image:image>
      <image:loc>
        https://www.reuters.com/resources/r/?m=02&d=20181126&t=2&i=1328589868&w=&fh=&fw=&ll=460&pl=300&r=LYNXNPEEAO0WM
      </image:loc>
    </image:image>
    <news:news>
      <news:publication>
        <news:name>Reuters</news:name>
        <news:language>eng</news:language>
      </news:publication>
      <news:publication_date>2018-11-26T02:55:00+00:00</news:publication_date>
      <news:title>
        Campbell Soup nears deal with Third Point to end board challenge: sources
      </news:title>
      <news:keywords>Headlines,Business, Industry</news:keywords>
      <news:stock_tickers>NYSE:CPB</news:stock_tickers>
    </news:news>
  </url>
</urlset>

Python code example:

import lxml.etree
import lxml.html
import requests


def main():
    r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
    namespace = "http://www.google.com/schemas/sitemap-news/0.9"
    root = lxml.etree.fromstring(r.content)

    records = root.xpath('//news:title', namespaces={"news": "http://www.google.com/schemas/sitemap-news/0.9"})
    for record in records:
        print(record.text)

    records = root.xpath('//sitemap:loc', namespaces={"sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9"})
    for record in records:
        print(record.text)


if __name__ == "__main__":
    main()

Currently I use XPath to get all the URL titles, but that is not what I want, because I don't know which URL belongs to which title. My question is: how do I get each <url> element, then loop over each <url> as an item and pull out its corresponding <loc>, <news:keywords>, etc.? Thanks!

Edit: expected output

foreach <url>
get <loc>
get <news:publication_date>
get <news:title>

Use a relative XPath to get from each title to its associated URL:

ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1"
}
r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
root = lxml.etree.fromstring(r.content)
for title in root.xpath('//news:title', namespaces=ns):
    print(title.text)
    loc = title.xpath('ancestor::sitemap:url/sitemap:loc', namespaces=ns)
    print(loc[0].text)

Exercise: rewrite this to go the other way, getting the associated title from each URL instead.
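A minimal sketch of one way to do that exercise, assuming the same ns dict and root as in the snippet above: iterate over the sitemap:loc elements and walk back up to the enclosing url, then down into news:news/news:title.

# One possible solution to the exercise (assumes ns and root from the snippet above):
# start from each <sitemap:loc> and use a relative XPath to reach its title.
for loc in root.xpath('//sitemap:loc', namespaces=ns):
    print(loc.text)
    title = loc.xpath('ancestor::sitemap:url/news:news/news:title', namespaces=ns)
    print(title[0].text)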

Note: the titles (and possibly the URLs) appear to be HTML-escaped. Use the unescape() function

from html import unescape

to unescape them.
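For instance, a minimal illustration of what unescape() does (the sample string below is made up, not taken from the sitemap):

from html import unescape

# Made-up example string, purely to show the effect of unescape():
print(unescape("Board challenge &amp; buyout talks"))  # prints: Board challenge & buyout talks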

The answer is:

from datetime import datetime
from html import unescape
from lxml import etree
import requests

r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
root = etree.fromstring(r.content)
ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1"
}
for url in root.iterfind("sitemap:url", namespaces=ns):
    loc = url.findtext("sitemap:loc", namespaces=ns)
    print(loc)
    title = unescape(url.findtext("news:news/news:title", namespaces=ns))
    print(title)
    date = unescape(url.findtext("news:news/news:publication_date", namespaces=ns))
    date = datetime.strptime(date, '%Y-%m-%dT%H:%M:%S+00:00')
    print(date)

The rules of thumb are:

Try to avoid xpath. Instead of xpath, use find, findall, or iterfind. xpath is a more complex engine than find, findall, or iterfind, and it needs more time and resources.

Prefer iterfind over findall, because iterfind yields the matching elements one at a time instead of building a full list, so it uses less memory.

If you only need the text, use findtext.

The more general rule is to read the official documentation.

First, let's create 3 functions for the loop and compare them.

def for1():
    for url in root.iterfind("sitemap:url", namespaces=ns):
        pass

def for2():
    for url in root.findall("sitemap:url", namespaces=ns):
        pass

def for3():
    for url in root.xpath("sitemap:url", namespaces=ns):
        pass
Function         Time
root.iterfind    70.5 µs ± 543 ns
root.findall     72.3 µs ± 839 ns
root.xpath       84.8 µs ± 567 ns
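The table above presumably comes from a micro-benchmark; a minimal sketch of how similar per-call timings could be reproduced with the standard timeit module follows (the repeat and number values are arbitrary assumptions, not from the original measurement):

import timeit

# Rough reproduction of the benchmark; loop counts are arbitrary choices.
for name, fn in [("root.iterfind", for1), ("root.findall", for2), ("root.xpath", for3)]:
    per_call = min(timeit.repeat(fn, repeat=5, number=1000)) / 1000
    print(f"{name}: {per_call * 1e6:.1f} µs per loop")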
