tree.xpath returns an empty list

I'm trying to write a program that can scrape a given website. So far I have this:

from lxml import html
import requests
page = requests.get('https://www.cruiseplum.com/search#{"numPax":2,"geo":"US","portsMatchAll":true,"numOptionsShown":20,"ppdIncludesTaxTips":true,"uiVersion":"split","sortTableByField":"dd","sortTableOrderDesc":false,"filter":null}')
tree = html.fromstring(page.content)
date = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[1]/text()')
ship = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[2]/text()')
length = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[4]/text()')
meta = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[6]/text()')
price = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[7]/text()')
print('Date: ', date)
print('Ship: ', ship)
print('Length: ', length)
print('Meta: ', meta)
print('Price: ', price)

When I run this, the lists come back empty.

I'm very new to Python and to coding in general, and would really appreciate any help you can offer!

Thanks

First, the link you're using isn't correct; this is the right one (after you click the "Yes" button, the site downloads the data and returns it at this link):

https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}
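The only difference between the two links is that every `"` in the JSON fragment after `#` is percent-encoded as `%22`. If you want to build the working link programmatically rather than paste it, the standard library's `urllib.parse.quote` can reproduce it; a minimal sketch (the `safe` characters chosen here are just what this particular fragment needs left unencoded):

```python
from urllib.parse import quote

# The fragment after '#' is a JSON object; in the working link every '"'
# is percent-encoded as %22, while braces, colons and commas stay as-is.
fragment = ('{"numPax":2,"geo":"US","portsMatchAll":true,"numOptionsShown":20,'
            '"ppdIncludesTaxTips":true,"uiVersion":"split","sortTableByField":"dd",'
            '"sortTableOrderDesc":false,"filter":null}')
url = 'https://www.cruiseplum.com/search#' + quote(fragment, safe='{}:,')
print(url)
```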

Second, when you fetch the response object with requests, the content data hidden in the table is not returned:

from lxml import html
import requests
u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'
r = requests.get(u)
t = html.fromstring(r.content)
for i in t.xpath('//tr//text()'):
    print(i)

This returns:

Recent update: new computer-optimized interface and new filters
Want to track your favorite cruises?
Login or sign up to get started.
Login / Sign Up
Loading...
Email status
Unverified
My favorites & alerts
Log out
Want to track your favorite cruises?
Login or sign up to get started.
Login / Sign Up
Loading...
Email status
Unverified
My favorites & alerts
Log out
Date Colors:
(vs. selected)
Lowest Price
Lower Price
Same Price
Higher Price
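The empty lists the asker saw are exactly what lxml produces in this situation: the table container is present in the static HTML, but the `<tr>` rows are injected by JavaScript after the page loads. A minimal, self-contained sketch (the HTML string below is a stand-in for the static page source, not the actual page):

```python
from lxml import html

# Stand-in for the static HTML: the table container exists,
# but its rows are only added later by JavaScript, so it is empty.
static_page = '<html><body><table id="listingsTableSplit"></table></body></html>'
tree = html.fromstring(static_page)

rows = tree.xpath('//*[@id="listingsTableSplit"]/tr[2]/td[1]/text()')
print(rows)  # -> [] : the XPath is fine, the data just isn't in the response
```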

Even with requests_html, the content is still hidden:

from requests_html import HTMLSession
session = HTMLSession()
r = session.get(u)

You need to use selenium to access the hidden HTML content:

from lxml import html
from selenium import webdriver
import time
u = 'https://www.cruiseplum.com/search#{%22numPax%22:2,%22geo%22:%22US%22,%22portsMatchAll%22:true,%22numOptionsShown%22:20,%22ppdIncludesTaxTips%22:true,%22uiVersion%22:%22split%22,%22sortTableByField%22:%22dd%22,%22sortTableOrderDesc%22:false,%22filter%22:null}'
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe")
driver.get(u)
time.sleep(2)
driver.find_element_by_id('restoreSettingsYesEncl').click()
time.sleep(10)  # wait until the website downloads the data; without this we can't move on
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("innerHTML")
t = html.fromstring(source_code)
for i in t.xpath('//td[@class="dc-table-column _1"]/text()'):
    print(i.strip())
driver.quit()

This returns the first column (ship names):

Costa Luminosa
Navigator Of The Seas
Navigator Of The Seas
Carnival Ecstasy
Carnival Ecstasy
Carnival Ecstasy
Carnival Victory
Carnival Victory
Carnival Victory
Costa Favolosa
Costa Favolosa
Costa Favolosa
Costa Smeralda
Carnival Inspiration
Carnival Inspiration
Carnival Inspiration
Costa Smeralda
Costa Smeralda
Disney Dream
Disney Dream

As you can see, the table content is now accessible using selenium's get_attribute("innerHTML").

The next step is to scrape the rows (ship, itinerary, dates, region, ...) and store them in a CSV file (or any other format), then do the same for all 4051 pages.
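Storing the rows can be done with the standard library's csv module; a minimal sketch with placeholder data (the column names and values below are made up for illustration, not the site's actual fields):

```python
import csv

# Placeholder rows standing in for scraped (date, ship, length, meta, price) values
rows = [
    ('2020-03-01', 'Costa Luminosa', '7', 'Interior', '499'),
    ('2020-03-08', 'Disney Dream', '4', 'Balcony', '899'),
]

with open('cruises.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'ship', 'length', 'meta', 'price'])  # header row
    writer.writerows(rows)
```

To cover all pages, you would collect the rows from each page into `rows` before writing, or open the file once and append as you go.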

The problem seems to be the URL you're navigating to. Opening that URL in a browser brings up a prompt asking whether you want to restore a bookmarked search.

I can't think of a simple way around this. Clicking "Yes" triggers a JavaScript action rather than an actual redirect with different parameters.

I'd suggest using something like Selenium to accomplish this.
