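The way the classes are set up on the site means I can't target the pieces of text I want to save.
URL: https://queue-times.com/parks/6/queue_times
I'm using Python. In an ideal world, I'd save the data like this: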
name = soup.find('h1', class_="ride-name").text.strip()
queue = soup.find('span', class_="wait-time").text.strip()
reservation = soup.find('span', class_="reservation-time").text.strip()
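(I made these class names up.)
But I don't know how to use the site's actual classes to get what I want: the ride name, the queue time, and the available reservation slot.
This is what I tried, but it didn't work.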
import requests
from bs4 import BeautifulSoup
import csv

url = "https://queue-times.com/parks/6/queue_times"
html = requests.get(url).text
soup = BeautifulSoup(html, "lxml")
rides = soup.find_all(class_="has-text-weight-normal")

output = []
for element in rides:
    output.append([element.get_text().strip()])

with open('input.csv', 'w', encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(output)

import pandas as pd
pd.read_csv('input.csv', header=None).T.to_csv('output.csv', header=False, index=False)
The output looks like this:
["A Pirate's Adventure ~ Treasures of the Seven Seas"]
['Jungle Cruise']
['↳ No reservation slots currently available']
['Pirates of the Caribbean']
['↳ Reservation slots available for 20:45']
['Swiss Family Treehouse']
['The Magic Carpets of Aladdin']
['↳ Reservation slots available for 20:20']
In the end, my goal is something like this:
Ride | Queue Time | Reservation Time |
---|---|---|
Jungle Cruise | x mins | 0 |
Pirates of the Caribbean | y mins | 00:00 |
Here is one option:
import re
import requests
from collections import defaultdict
from bs4 import BeautifulSoup
import pandas as pd

url = "https://queue-times.com/parks/6/queue_times"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

data = defaultdict(list)
# Collect the name, queue time and reservation text from every panel-block link.
for ride in soup.find_all("a", {"class": "panel-block"}):
    rn = ride.find("span", {"class": "has-text-weight-normal"}).text.strip()
    qt_tag = ride.find("span", {"class": re.compile("has-text-dark-(.*)")})
    qt = qt_tag.text.strip() if qt_tag else None
    rt_tag = ride.find("span", {"class": "has-text-grey"})
    rt = rt_tag.text.strip() if rt_tag else None
    data["Ride"].append(rn)
    data["Queue_Time"].append(qt)
    data["Reservation_Time"].append(rt)

df = (
    pd.DataFrame(data)
    # Keep only the trailing HH:MM, then move it up one row so it lines up
    # with the ride it belongs to (the reservation row follows the ride row).
    .assign(Reservation_Time=lambda x: x["Reservation_Time"]
            .str.extract(r"(\d{2}:\d{2})$", expand=False).shift(-1))
    # Drop the rows that carry no queue time.
    .dropna(subset=["Queue_Time"])
    .query("Queue_Time.str.contains('min')")
    .reset_index(drop=True)
)
Output:
print(df)
                                          Ride  Queue_Time  Reservation_Time
0                                Jungle Cruise     70 mins               NaN
1                     Pirates of the Caribbean     30 mins             21:45
2                       Swiss Family Treehouse      5 mins              None
..                                         ...         ...               ...
30                       Tomorrowland Speedway     25 mins               NaN
31  Tomorrowland Transit Authority PeopleMover     20 mins              None
32          Walt Disney's Carousel of Progress      5 mins              None
[42 rows x 3 columns]
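The pairing step (attaching each "↳" reservation line to the ride listed just before it) can also be sketched without pandas or a network call. This is only an illustration that works on the flat list of strings from the question's output; the helper name `pair_reservations` is made up for this example, and the sketch assumes a reservation line always directly follows its ride. Queue times are not in that list, so they are left out here.

```python
import re

# Flat list of strings, copied from the question's output.
lines = [
    "A Pirate's Adventure ~ Treasures of the Seven Seas",
    "Jungle Cruise",
    "↳ No reservation slots currently available",
    "Pirates of the Caribbean",
    "↳ Reservation slots available for 20:45",
    "Swiss Family Treehouse",
    "The Magic Carpets of Aladdin",
    "↳ Reservation slots available for 20:20",
]

# Matches an HH:MM time at the very end of a string.
TIME_AT_END = re.compile(r"(\d{2}:\d{2})$")


def pair_reservations(lines):
    """Hypothetical helper: attach each '↳' line to the ride before it."""
    rows = []
    for text in lines:
        if text.startswith("↳"):
            if rows:  # skip a stray reservation line with no ride before it
                match = TIME_AT_END.search(text)
                rows[-1]["Reservation_Time"] = match.group(1) if match else None
        else:
            rows.append({"Ride": text, "Reservation_Time": None})
    return rows


rows = pair_reservations(lines)
for row in rows:
    print(row)
```

Rides with no reservation line, or with a line that contains no time, end up with `None`, which plays the same role as the missing values in the DataFrame version.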