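I'm teaching myself how to get information from websites, and I'm confused about how to actually use lxml to do it. Suppose I want to print the title of the contents box on this Wikipedia page. I start with: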
site=requests.get('https://en.wikipedia.org/wiki/Hamiltonian_mechanics')
tree=html.fromstring(site.content)
But now I don't know which XPath is the right one. I naively highlighted the contents block on the page and simply used
contents=tree.xpath('//*[@id="toc"]/div/h2')
This of course doesn't give me what I want (I get an empty array). What should I do?
from lxml import html
import requests
site=requests.get('https://en.wikipedia.org/wiki/Hamiltonian_mechanics')
tree=html.fromstring(site.content)
contents=tree.xpath('//*[@id="toc"]/div/h2/text()')[0]
print(contents)
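You can test the XPath in Chrome. Open "https://en.wikipedia.org/wiki/Hamiltonian_mechanics" in Chrome and press F12. In the console, enter $x('//*[@id="toc"]/div/h2'), which outputs the h2 element. If you want the text of the h2, the XPath should be $x('//*[@id="toc"]/div/h2/text()'), and the result is an array of the text nodes.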
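The same difference between selecting an element and selecting its text can be checked offline in Python. What follows is only a minimal sketch: the HTML snippet is a hypothetical, simplified stand-in for the contents box (the live page's markup may differ), and `first_text` is an illustrative helper, not part of lxml. It also guards against an empty result, because indexing `[0]` on an empty list raises an IndexError.

```python
from lxml import html

# Hypothetical, simplified stand-in for the contents box markup (an assumption).
SNIPPET = """
<html><body>
<div id="toc">
  <div class="toctitle"><h2>Contents</h2></div>
</div>
</body></html>
"""

tree = html.fromstring(SNIPPET)

# Without text(): the result is a list of element objects.
elements = tree.xpath('//*[@id="toc"]/div/h2')

# With text(): the result is a list of strings.
texts = tree.xpath('//*[@id="toc"]/div/h2/text()')


def first_text(node, path, default=None):
    """Return the first string matched by `path`, or `default` if nothing matches."""
    matches = node.xpath(path)
    return str(matches[0]) if matches else default


print(elements[0].tag)
print(texts)
print(first_text(tree, '//*[@id="toc"]/div/h2/text()'))
print(first_text(tree, '//*[@id="missing"]/text()'))
```

If the last call prints None, the XPath matched nothing, which is the same situation as getting an empty array in the Chrome console.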
If I understand correctly, you want the parent headings. Analyzing the page structure gives this path:
//div[@id="toc"]/ul/li/a/span[@class="toctext"]
This path reaches every parent (top-level) heading, so the code to retrieve them all would be:
from lxml import html
import requests
site=requests.get('https://en.wikipedia.org/wiki/Hamiltonian_mechanics')
tree=html.fromstring(site.content)
contents=tree.xpath('//div[@id="toc"]/ul/li/a/span[@class="toctext"]/text()')
print(contents)
The output is:
['Overview', "Deriving Hamilton's equations", 'As a reformulation of Lagrangian mechanics', 'Geometry of Hamiltonian systems', 'Generalization to quantum mechanics through Poisson bracket', 'Mathematical formalism', 'Riemannian manifolds', 'Sub-Riemannian manifolds', 'Poisson algebras', 'Charged particle in an electromagnetic field', 'Relativistic charged particle in an electromagnetic field', 'See also', 'References', 'External links']
But if you also want the subheadings, you can get all the li elements and iterate over them:
import requests
import json
from lxml import html
site=requests.get('https://en.wikipedia.org/wiki/Hamiltonian_mechanics')
tree=html.fromstring(site.content)
contents=tree.xpath('//div[@id="toc"]/ul/li')
title_dic = {}
for content in contents:
    subcontents = content.xpath('ul/li/a/span[@class="toctext"]/text()')
    title_dic[content.xpath('a/span[@class="toctext"]/text()')[0]] = subcontents
print(json.dumps(title_dic, indent = 4))
The output is:
{
    "Overview": [
        "Basic physical interpretation",
        "Calculating a Hamiltonian from a Lagrangian"
    ],
    "Deriving Hamilton's equations": [],
    "As a reformulation of Lagrangian mechanics": [],
    "Geometry of Hamiltonian systems": [],
    "Generalization to quantum mechanics through Poisson bracket": [],
    "Mathematical formalism": [],
    "Riemannian manifolds": [],
    "Sub-Riemannian manifolds": [],
    "Poisson algebras": [],
    "Charged particle in an electromagnetic field": [],
    "Relativistic charged particle in an electromagnetic field": [],
    "See also": [],
    "References": [
        "Footnotes",
        "Sources"
    ],
    "External links": []
}
This gives you the parent headings as the dictionary's keys, with the subheadings (if any) as the values.
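The loop relies on relative XPath: a path with no leading slash, such as ul/li/a/span[@class="toctext"], is evaluated from the element it is called on, so each li returns only its own subheadings. Here is a minimal offline sketch of that behaviour. The snippet is an assumed, simplified table-of-contents markup rather than the live page, and `toc_to_dict` is a hypothetical name chosen for illustration.

```python
from lxml import html

# Assumed, simplified table-of-contents markup (not copied from the live page).
SNIPPET = """
<html><body>
<div id="toc">
  <ul>
    <li><a href="#Overview"><span class="toctext">Overview</span></a>
      <ul>
        <li><a href="#Basic"><span class="toctext">Basic physical interpretation</span></a></li>
      </ul>
    </li>
    <li><a href="#See_also"><span class="toctext">See also</span></a></li>
  </ul>
</div>
</body></html>
"""


def toc_to_dict(tree):
    """Map each top-level heading to the list of its direct subheadings."""
    result = {}
    for item in tree.xpath('//div[@id="toc"]/ul/li'):
        # Relative path: searched from this <li>, not from the document root.
        titles = item.xpath('a/span[@class="toctext"]/text()')
        if not titles:
            continue  # skip items without a heading instead of raising IndexError
        subtitles = item.xpath('ul/li/a/span[@class="toctext"]/text()')
        # lxml returns str subclasses; convert them to plain str.
        result[str(titles[0])] = [str(s) for s in subtitles]
    return result


toc = toc_to_dict(html.fromstring(SNIPPET))
print(toc)
```

Writing the inner path as //ul/li/... instead would search the whole document from every li, so every key would receive all subheadings on the page.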