Web scraping: repeated class names mean I can't pick out the data I need from the website



The class names used on the website mean I can't single out the pieces of text I want to save.

URL: https://queue-times.com/parks/6/queue_times

I'm using Python. In an ideal world, I would extract the data like this:

name = soup.find('h1', class_="ride-name").text.strip()
queue = soup.find('span', class_="wait-time").text.strip()
reservation = soup.find('span', class_="reservation-time").text.strip()

(I made these class names up.)

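But I don't know how to use the classes the site actually has to get what I want, namely the ride name, the queue time, and the available reservation slot.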

This is what I tried, without success:

import requests
from bs4 import BeautifulSoup
import csv

url = "https://queue-times.com/parks/6/queue_times"
html = requests.get(url).text
soup = BeautifulSoup(html, "lxml")
rides = soup.find_all(class_="has-text-weight-normal")
output = []
for element in rides:
    output.append([element.get_text().strip()])

with open('input.csv', 'w', encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(output)

import pandas as pd
pd.read_csv('input.csv', header=None).T.to_csv('output.csv', header=False, index=False)

The output looks like this:

["A Pirate's Adventure ~ Treasures of the Seven Seas"]
['Jungle Cruise']
['↳ No reservation slots currently available']
['Pirates of the Caribbean']
['↳ Reservation slots available for 20:45']
['Swiss Family Treehouse']
['The Magic Carpets of Aladdin']
['↳ Reservation slots available for 20:20']

Ultimately, this is what I'm aiming for:

Ride                      Queue time  Reservation time
Jungle Cruise             x minutes   0
Pirates of the Caribbean  y minutes   00:00

Here is one option:

import re
from collections import defaultdict

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://queue-times.com/parks/6/queue_times"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

data = defaultdict(list)
# Search inside each "panel-block" link so the values found belong to the same entry.
for ride in soup.find_all("a", {"class": "panel-block"}):
    rn = ride.find("span", {"class": "has-text-weight-normal"}).text.strip()
    qt_tag = ride.find("span", {"class": re.compile("has-text-dark-(.*)")})
    qt = qt_tag.text.strip() if qt_tag else None
    rt_tag = ride.find("span", {"class": "has-text-grey"})
    rt = rt_tag.text.strip() if rt_tag else None

    data["Ride"].append(rn)
    data["Queue_Time"].append(qt)
    data["Reservation_Time"].append(rt)

df = (
    pd.DataFrame(data)
    # Keep only a trailing HH:MM, then shift(-1) moves each value up one row.
    .assign(Reservation_Time=lambda x: x["Reservation_Time"]
            .str.extract(r"(\d{2}:\d{2})$", expand=False).shift(-1))
    # Keep only the rows whose queue time contains "min".
    .dropna(subset=["Queue_Time"])
    .query("Queue_Time.str.contains('min')", engine="python")
    .reset_index(drop=True)
)

Output:

print(df)
                                          Ride Queue Time Reservation Time
0                                Jungle Cruise    70 mins              NaN
1                     Pirates of the Caribbean    30 mins            21:45
2                       Swiss Family Treehouse     5 mins             None
..                                         ...        ...              ...
30                       Tomorrowland Speedway    25 mins              NaN
31  Tomorrowland Transit Authority PeopleMover    20 mins             None
32          Walt Disney's Carousel of Progress     5 mins             None
[42 rows x 3 columns]
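
The time-extraction and row-pairing idea can also be tried offline, without pandas or a network request. The sketch below works on the flat list of texts from the question's sample output. It assumes that a line starting with "↳" always belongs to the ride directly above it, which is an assumption about the page rather than something the site guarantees. Queue times are not part of that flat list, so they are left out here. The helper names `pair_rides_with_reservations` and `rows_to_csv` are made up for illustration.

```python
import csv
import io
import re

# Flat list of texts, copied from the sample output in the question.
texts = [
    "A Pirate's Adventure ~ Treasures of the Seven Seas",
    "Jungle Cruise",
    "↳ No reservation slots currently available",
    "Pirates of the Caribbean",
    "↳ Reservation slots available for 20:45",
    "Swiss Family Treehouse",
    "The Magic Carpets of Aladdin",
    "↳ Reservation slots available for 20:20",
]

# Matches an HH:MM time at the very end of a string.
TIME_AT_END = re.compile(r"(\d{2}:\d{2})$")


def pair_rides_with_reservations(lines):
    """Attach each '↳' line to the ride that precedes it (assumed layout)."""
    rows = []
    for line in lines:
        if line.startswith("↳"):
            if not rows:
                continue  # a reservation note with no ride above it: skip
            match = TIME_AT_END.search(line)
            rows[-1]["Reservation_Time"] = match.group(1) if match else None
        else:
            rows.append({"Ride": line, "Reservation_Time": None})
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text; None becomes an empty cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Ride", "Reservation_Time"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = pair_rides_with_reservations(texts)
print(rows_to_csv(rows))
```

Each ride ends up as one row with its reservation time in the same row, or an empty cell when no time was found, which is the shape asked for in the question.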
