BeautifulSoup does not return all the text on a web page



I'm trying to scrape a website, but BeautifulSoup does not return all of the visible text on the page. See the code below:

import requests
from bs4 import BeautifulSoup

url = "https://www.hiltongrandvacations.com/en/resorts-and-destinations"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html5lib')
# write the parsed document to a file for inspection
with open("data.txt", "w", encoding="utf-8") as f:
    f.write(str(soup))

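For example, the following text is visible on the page but is not returned by BeautifulSoup (that is, it is not written to the text file): Grand Pacific Palisades Vacation Resort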

I tried different parsers (html, lxml) but still did not get it. Also, the text does not appear to be generated by JavaScript, though I may be wrong.

The data you see is loaded dynamically via JavaScript. You can use this example to load the data:

import json
import requests

payload = {"locations":[],"amenities":[],"vacationTypes":[],"page":1,"pageSize":9}
api_url = 'https://www.hiltongrandvacations.com/sitecore/api/ssc/apps/PropertySearch'
data = requests.put(api_url, json=payload).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
# print some info on screen:
for card in data['Cards']:
    print(card['Title'])
    print(card['Description'])
    print('-' * 80)

Prints:

Sunrise Lodge, a Hilton Grand Vacations Club
Revel in the peak of adventure
--------------------------------------------------------------------------------
The District by Hilton Club
A capital experience in the capital city
--------------------------------------------------------------------------------
The Central at 5th by Hilton Club
At the heart of city life
--------------------------------------------------------------------------------
The Hilton Club – New York
Make a break for the Big Apple.
--------------------------------------------------------------------------------
The Residences by Hilton Club
Wake up in the city that never sleeps.
--------------------------------------------------------------------------------
Grand Pacific Palisades Vacation Resort
A window to the Pacific Ocean. 
--------------------------------------------------------------------------------
Carlsbad Seapointe Resort
A quintessentially Californian vacation
--------------------------------------------------------------------------------
Hilton Grand Vacations Chicago Downtown/Magnificent Mile
A sky-high sanctuary amidst the big-city bustle
--------------------------------------------------------------------------------
Hilton Grand Vacations Club at Trump International Hotel Las Vegas
--------------------------------------------------------------------------------
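
The payload above asks for page 1 with a page size of 9, so only the first nine cards are printed. A minimal sketch of walking through further pages follows. It assumes (this is not confirmed by the answer) that the endpoint honors the "page" field and returns an empty 'Cards' list once the results run out. The helper names iter_cards and fetch_from_api are hypothetical, and max_pages is only a safety cap in case that assumption is wrong.

```python
def iter_cards(fetch_page, page_size=9, max_pages=50):
    """Yield cards page by page until a page comes back empty."""
    for page in range(1, max_pages + 1):
        payload = {"locations": [], "amenities": [], "vacationTypes": [],
                   "page": page, "pageSize": page_size}
        # Assumption: a page past the end has no (or an empty) 'Cards' list.
        cards = fetch_page(payload).get("Cards") or []
        if not cards:
            break
        yield from cards


def fetch_from_api(payload):
    """Send one search request to the endpoint used in the answer above."""
    import requests  # imported lazily so iter_cards can be exercised offline

    api_url = "https://www.hiltongrandvacations.com/sitecore/api/ssc/apps/PropertySearch"
    response = requests.put(api_url, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()


# Usage (performs real HTTP requests):
# for card in iter_cards(fetch_from_api):
#     print(card["Title"])
```

Passing the fetch function in as an argument keeps the paging logic separate from the network call, so it can be tried against canned responses first.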

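Here is an example of parsing this page with Selenium. It lets you simulate user behavior: wait for the page to load, scroll down to Locations, open the Locations dropdown, select one of the locations (Utah in this example), click it, wait for the new page to load, and extract some information from it.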

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
chrome_options = webdriver.ChromeOptions()
#chrome_options.add_argument('--headless')
#chrome_options.add_argument('--no-sandbox')
# Selenium 4 syntax; Selenium 3 used webdriver.Chrome('<PATH_TO_CHROME_DRIVER>', chrome_options=chrome_options)
wd = webdriver.Chrome(service=Service('<PATH_TO_CHROME_DRIVER>'), options=chrome_options)
# delay (how long selenium waits for element to be loaded)
DELAY = 30
# maximize browser window
wd.maximize_window()
# load page via selenium
wd.get("https://www.hiltongrandvacations.com/en/resorts-and-destinations")
# wait until the results table has loaded
results = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.XPATH, '//span[contains(text(), "Results")]')))
# find locations button, scroll down to it, click it
locations_button = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.XPATH, '//button[contains(text(), "Locations")]')))
wd.execute_script("arguments[0].scrollIntoView();", locations_button)
wd.execute_script("arguments[0].click();", locations_button)
# find utah checkbox, scroll down to it, click it
utah_checkbox = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.XPATH, '//span[contains(text(), "Utah")]')))
wd.execute_script("arguments[0].scrollIntoView();", utah_checkbox)
wd.execute_script("arguments[0].click();", utah_checkbox)
# find link to utah
utah_link = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.XPATH, '//a[@title="Sunrise Lodge, a Hilton Grand Vacations Club Park City, Utah, Revel in the peak of adventure"]')))
wd.execute_script("arguments[0].scrollIntoView();", utah_link)
wd.execute_script("arguments[0].click();", utah_link)
# find description
description = WebDriverWait(wd, DELAY).until(EC.presence_of_element_located((By.CLASS_NAME, 'image-and-intro__description')))
print(description.text)

If Selenium alone is not enough, you can also combine it with BeautifulSoup.
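
One way to combine the two, sketched here under assumptions, is to hand the rendered DOM from Selenium (wd.page_source) to BeautifulSoup and query it there. The helper extract_texts and the sample markup are hypothetical. The class name is the one used in the Selenium example above.

```python
from bs4 import BeautifulSoup


def extract_texts(page_source, css_class):
    """Return the stripped text of every element carrying css_class."""
    soup = BeautifulSoup(page_source, "html.parser")
    return [el.get_text(strip=True) for el in soup.find_all(class_=css_class)]


# With a live driver, pass the rendered DOM instead:
#     extract_texts(wd.page_source, "image-and-intro__description")
# The string below is a hypothetical stand-in for wd.page_source.
sample_source = """
<html><body>
  <p class="image-and-intro__description">Revel in the peak of adventure</p>
</body></html>
"""
print(extract_texts(sample_source, "image-and-intro__description"))
```

Selenium handles the JavaScript execution and the clicks, while BeautifulSoup is only used to search the resulting HTML.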

You can try:

soup = BeautifulSoup(response.content, 'html.parser')
print(soup)

This should return everything on the web page.
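
Whether a plain request really contains the text can be checked before trying more parsers. If the string is absent from the raw response, no parser will recover it, which points to JavaScript loading. The sketch below is a rough check only. The helper visible_in_static_html and both sample strings are hypothetical stand-ins for response.text.

```python
import html


def visible_in_static_html(raw_html, needle):
    """Return True if needle occurs in the raw (un-rendered) HTML source."""
    # Decode entities such as &amp; and ignore case before searching.
    return needle.casefold() in html.unescape(raw_html).casefold()


# Hypothetical stand-ins for response.text from a plain requests.get(url):
static_page = "<div id='app'></div><script src='bundle.js'></script>"
rendered_page = "<h3>Grand Pacific Palisades Vacation Resort</h3>"

print(visible_in_static_html(static_page, "Grand Pacific Palisades"))    # False
print(visible_in_static_html(rendered_page, "Grand Pacific Palisades"))  # True
```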
