Web scraping with Python BeautifulSoup returns HTML different from what is expected



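I'm trying to use BeautifulSoup to do some web scraping on a surf report website, but the HTML that comes back doesn't match the HTML I see when I view the page in the browser, so I can't scrape the data I'm looking for. I'm trying to scrape the "quiver-surf-height" class, which contains the local wave height estimate, from the following site: https://www.surfline.com/surf-report/paradise-beach/584204214e65fad6a7709cc1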

import requests
from bs4 import BeautifulSoup
url = "https://www.surfline.com/surf-report/paradise-beach/584204214e65fad6a7709cc1"
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
print(soup.select(".quiver-surf-height"))

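The print statement returns an empty list. Reading through the returned HTML, I found the message "Please turn on JavaScript and reload the page". I'm following the steps laid out in a course, so I don't know how to handle this response. Any input is welcome!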

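As mentioned in the comments, the data you're after is generated dynamically by JavaScript, so it is not present in the HTML that requests receives. However, you can query the API to get the data you want.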
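A quick way to confirm that the data is missing from the raw response is to check whether the class appears in the HTML at all. The sketch below uses only the standard library's html.parser so it runs without extra packages. Both HTML strings, including the "2-3 ft" value, are made-up stand-ins for the server response and the browser-rendered page, and `ClassFinder` and `has_class` are hypothetical helper names.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any start tag carries the given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # The class attribute may hold several space-separated classes.
            if name == "class" and value and self.class_name in value.split():
                self.found = True


def has_class(html, class_name):
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.found


# Hypothetical stand-in for what requests receives from the server.
static_html = (
    "<html><body><noscript>Please turn on JavaScript and reload the page"
    "</noscript></body></html>"
)
# Hypothetical stand-in for the DOM after the browser has run the scripts.
rendered_html = '<span class="quiver-surf-height">2-3 ft</span>'

print(has_class(static_html, "quiver-surf-height"))
print(has_class(rendered_html, "quiver-surf-height"))
```

If the check fails on `res.text` from the question, no selector will find the element, whichever parser is used.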

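All you need is the surf spot id and the number of days of data you want. By default, it comes at 1-hour intervals for the last 16 days, but you can change those parameters too.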
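As a minimal sketch of how those parameters fit together, the query string can be built with the standard library instead of by hand. The endpoint and the parameter names `spotId`, `days`, and `intervalHours` are the ones used in this answer. The helper name `build_wave_url` and its default values are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Endpoint taken from this answer.
BASE_URL = "https://services.surfline.com/kbyg/spots/forecasts/wave"


def build_wave_url(spot_id, days=2, interval_hours=1):
    """Return the wave endpoint URL for one spot (hypothetical helper)."""
    query = urlencode(
        {"spotId": spot_id, "days": days, "intervalHours": interval_hours}
    )
    return f"{BASE_URL}?{query}"


url = build_wave_url("584204214e65fad6a7709cc1", days=2)
print(url)
```

With requests, the same dictionary can be passed as `params=` to `requests.get`, which encodes it in the same way.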

For example, this gets the last two days of wave height data at hourly intervals.

import datetime

import requests

surf_spot_id = "584204214e65fad6a7709cc1"
days = "2"
api_url = f"https://services.surfline.com/kbyg/spots/forecasts/wave?spotId={surf_spot_id}&days={days}&intervalHours=1"
# Send a browser-like User-Agent with the request.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
}
data = requests.get(api_url, headers=headers).json()
# Each item is one hourly entry with a Unix timestamp and a "surf" dict.
for day in data["data"]["wave"]:
    # Convert the Unix timestamp to a readable local time.
    _time = (
        datetime
        .datetime
        .fromtimestamp(day['timestamp'])
        .strftime('%Y-%m-%d %H:%M:%S')
    )
    print(f"{_time}")
    surf = day["surf"]
    print(f"Surf: {surf['min']} - {surf['max']}")
    print(f"{surf['humanRelation']}")

Output:

2022-09-25 06:00:00
Surf: 0.9 - 1.4
Waist to shoulder
2022-09-25 07:00:00
Surf: 0.9 - 1.4
Waist to shoulder
2022-09-25 08:00:00
Surf: 0.9 - 1.4
Waist to shoulder
2022-09-25 09:00:00
Surf: 0.9 - 1.2
Waist to chest
2022-09-25 10:00:00
Surf: 0.9 - 1.2
Waist to chest
2022-09-25 11:00:00
Surf: 0.9 - 1.2
Waist to chest
2022-09-25 12:00:00
Surf: 0.9 - 1.2
Waist to chest
2022-09-25 13:00:00
Surf: 0.9 - 1.2
Waist to chest
2022-09-25 14:00:00
Surf: 0.6 - 1.1
Thigh to stomach
2022-09-25 15:00:00
Surf: 0.6 - 1.1
Thigh to stomach
2022-09-25 16:00:00
and more ...
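If you want the values in a structure rather than printed, the loop above can be turned into a small function. This is a sketch that assumes the response has the shape used in this answer (`data` → `wave` → entries with `timestamp` and `surf`). The name `summarize_waves` is hypothetical, and `sample_payload` is a hand-written stand-in modeled on the output above, not a real API response. Unlike the original loop, it formats times in UTC so the result does not depend on the machine's time zone.

```python
from datetime import datetime, timezone


def summarize_waves(payload):
    """Return (time, min, max, description) tuples from a wave payload."""
    rows = []
    for entry in payload["data"]["wave"]:
        # Interpret the Unix timestamp as UTC for a reproducible result.
        when = datetime.fromtimestamp(entry["timestamp"], tz=timezone.utc)
        surf = entry["surf"]
        rows.append(
            (
                when.strftime("%Y-%m-%d %H:%M"),
                surf["min"],
                surf["max"],
                surf["humanRelation"],
            )
        )
    return rows


# Hand-written sample shaped like the response used in this answer.
sample_payload = {
    "data": {
        "wave": [
            {
                "timestamp": 1664085600,
                "surf": {"min": 0.9, "max": 1.4, "humanRelation": "Waist to shoulder"},
            },
            {
                "timestamp": 1664089200,
                "surf": {"min": 0.9, "max": 1.4, "humanRelation": "Waist to shoulder"},
            },
        ]
    }
}

rows = summarize_waves(sample_payload)
for row in rows:
    print(row)
```

In real use, pass it the `data` dictionary returned by `requests.get(api_url, headers=headers).json()`.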
