lxml xpath() returns an empty list instead of the matching elements



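So here is my problem. I am trying to scrape a website with lxml and pull out some information, but when I call var.xpath it cannot find the element that holds the information. The page itself is fetched, but after running xpath it finds nothing.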

import requests
from lxml import html


def main():
    result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')
    # the root of the tracker website
    page = html.fromstring(result.content)
    print('its getting the element from here', page)

    threesRank = page.xpath('//*[@id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tbody/tr[*]/td[3]/div/div[2]/div[1]/div')
    print('the 3s rank is: ', threesRank)


if __name__ == "__main__":
    main()
OUTPUT:
"D:\Python projects\venv\Scripts\python.exe" "D:/Python projects/main.py"
its getting the element from here <Element html at 0x20eb01006d0>
the 3s rank is:  []
Process finished with exit code 0

The output next to "the 3s rank is:" should look like this:

[<Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>]

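Because the XPath string does not match anything, page.xpath(..) returns an empty result set. It is hard to say exactly what you are looking for, but given the name "threesRank" I assume you want all the table values, i.e. the ranks and so on.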
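One way to find out where a long path stops matching is to test it one step at a time. The following is only a minimal sketch under assumptions: the markup is a made-up stand-in for the real page, `longest_matching_prefix` is a hypothetical helper, and it uses the standard library's xml.etree.ElementTree (whose `findall` accepts a small XPath subset) so that it runs without extra packages. The same idea should carry over to lxml's `xpath()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the fetched page (NOT the real site markup).
SAMPLE = """
<html><body>
  <div id="app">
    <table>
      <tr><td>Ranked Standard 3v3</td><td>Champion I</td></tr>
    </table>
  </div>
</body></html>
""".strip()


def longest_matching_prefix(root, steps):
    """Return (matched_steps, first_failing_step); the second item is None if all match."""
    matched = []
    for step in steps:
        candidate = matched + [step]
        # An empty list from findall means this step is where the path breaks.
        if not root.findall("./" + "/".join(candidate)):
            return matched, step
        matched = candidate
    return matched, None


root = ET.fromstring(SAMPLE)  # root is the <html> element
steps = ["body", "div[@id='app']", "table", "tbody", "tr", "td"]
matched, failing = longest_matching_prefix(root, steps)
print("matched:", "/".join(matched))
print("first failing step:", failing)
```

With this sample, the path matches up to the table and the first failing step is `tbody`, which narrows down which part of the expression to fix.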

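You can get a more accurate and self-explanatory XPath with the Chrome add-on "XPath Helper". Usage: go to the site and activate the extension. Hold down the Shift key and hover over the element you are interested in.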

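The HTML used by tracker.network is built dynamically with JavaScript via BootstrapVue (and Moment/Typeahead/jQuery), so there is a considerable risk that the dynamic rendering will sometimes produce different results.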

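Instead of scraping the rendered HTML, I suggest using the structured data that the rendering relies on, which in this case is stored as JSON in a JavaScript variable called __INITIAL_STATE__: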

import json
import re
from contextlib import suppress

import requests

# get page
result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')

# Extract everything needed to render the current page. Data is stored as JSON in the
# JavaScript variable: window.__INITIAL_STATE__={"route":{"path":"\u0 ... }};
json_string = re.search(r"window\.__INITIAL_STATE__\s?=\s?(\{.*?\});", result.text).group(1)

# convert text string to structured JSON data
rocketleague = json.loads(json_string)

# Save structured JSON data to a text file that helps you orient yourself and pick
# the parts you are interested in.
with open('rocketleague_json_data.txt', 'w') as outfile:
    outfile.write(json.dumps(rocketleague, indent=4, sort_keys=True))

# Access members using names
print(rocketleague['titles']['currentTitle']['platforms'][0]['name'])

# To avoid a KeyError (missing key) or IndexError (index out of range), use "with suppress"
# as in the example below: since there is no platform no. 99, the variable "platform99"
# stays unassigned and no exception is raised.
with suppress(KeyError, IndexError):
    platform1 = rocketleague['titles']['currentTitle']['platforms'][0]['name']
    platform99 = rocketleague['titles']['currentTitle']['platforms'][99]['name']

# print platforms used by currentTitle
for platform in rocketleague['titles']['currentTitle']['platforms']:
    print(platform['name'])

# print all titles with corresponding platforms
for title in rocketleague['titles']['titles']:
    print(f"\nTitle: {title['name']}")
    for platform in title['platforms']:
        print(f"\tPlatform: {platform['name']}")
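A non-greedy regex like the one above can cut the JSON short if the text "});" happens to occur inside a string value. As a possible alternative, here is a small offline sketch that lets the JSON parser decide where the object ends, using `json.JSONDecoder().raw_decode`. The sample string, the platform name "Xbox" and the helpers `extract_initial_state` and `dig` are all made up for illustration; the real page's structure may differ.

```python
import json

MARKER = "window.__INITIAL_STATE__"


def extract_initial_state(page_text):
    """Parse the JSON object assigned to window.__INITIAL_STATE__, or return None."""
    start = page_text.find(MARKER)
    if start == -1:
        return None
    brace = page_text.find("{", start)
    if brace == -1:
        return None
    # raw_decode parses exactly one JSON value starting at `brace` and ignores the rest.
    obj, _end = json.JSONDecoder().raw_decode(page_text, brace)
    return obj


def dig(data, *path, default=None):
    """Follow keys/indexes in `path`; return `default` if any step is missing."""
    for key in path:
        try:
            data = data[key]
        except (KeyError, IndexError, TypeError):
            return default
    return data


# Hypothetical page fragment; note the "});" inside a string value.
sample = (
    '<script>window.__INITIAL_STATE__={"titles":{"currentTitle":'
    '{"platforms":[{"name":"Xbox"}]}},"note":"a});b"};</script>'
)
state = extract_initial_state(sample)
print(dig(state, "titles", "currentTitle", "platforms", 0, "name"))
print(dig(state, "titles", "currentTitle", "platforms", 99, "name", default="n/a"))
```

`dig` plays the same role as "with suppress" but returns a default value, so the variable is always assigned. If the embedded text is not valid JSON, `raw_decode` raises `json.JSONDecodeError`, which you may want to catch.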

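The "tbody" in your path is most likely the problem: browsers insert a "tbody" element into tables automatically, but lxml does not add one, so if it is not in the downloaded HTML the path cannot match. Change your xpath to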

'//*[@id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tr[*]/td[3]/div/div[2]/div[1]/div'
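To illustrate the effect of the "tbody" step, here is a small sketch on made-up markup. It uses the standard library's xml.etree.ElementTree rather than lxml so it runs anywhere, and the helper `third_cells` and the cell values are hypothetical. It also shows a path with a descendant step (`//tr`) that matches whether or not a "tbody" is present.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the table as served (no tbody) and as a browser's DOM shows it.
AS_SERVED = "<table><tr><td>a</td><td>b</td><td>Champion I</td></tr></table>"
IN_BROWSER = (
    "<table><tbody><tr><td>a</td><td>b</td><td>Champion I</td></tr></tbody></table>"
)


def third_cells(markup, path):
    """Return the text of every element matched by `path` inside a wrapper <div>."""
    root = ET.fromstring("<div>" + markup + "</div>")
    return [td.text for td in root.findall(path)]


# The path copied from the browser includes tbody and finds nothing in the served markup.
print(third_cells(AS_SERVED, "./table/tbody/tr/td[3]"))
# Dropping tbody matches the served markup.
print(third_cells(AS_SERVED, "./table/tr/td[3]"))
# A descendant step matches both variants.
print(third_cells(AS_SERVED, ".//table//tr/td[3]"))
print(third_cells(IN_BROWSER, ".//table//tr/td[3]"))
```

In lxml the corresponding expression would be something like `//table//tr/td[3]`; whether that is specific enough for the real page is something you would have to check against the downloaded HTML.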
