How to retrieve web pages from URLs and convert each page into a BeautifulSoup object



So I'm scraping a website. Thanks to Andrej Kesely I was able to get all the information, and I was also able to build the URLs for the first 50 pages. Now I want to retrieve the web pages from those URLs, convert each one into a BeautifulSoup object, and also retrieve all the information plus the URL (href) that leads to the detailed car information. I'm new to Python and web scraping, so I really don't know where to start, but here is the code that builds the first 50 pages of the site:

from bs4 import BeautifulSoup
import requests

urls = []
prices = []
makes = []

for i in range(1, 50):
    # download page i and save it locally
    response = requests.get(f"https://jammer.ie/used-cars?page={i}&per-page=12")
    with open(f"example{i}.html", "w", encoding="utf-8") as fp:
        fp.write(response.text)

    # re-read the saved page and parse it
    with open(f"example{i}.html", "r", encoding="utf-8") as fp:
        webpage = fp.read()
    soup = BeautifulSoup(webpage, "html.parser")
    tables = soup.find_all('div', {"class": "span-9 right-col"})

    for it in tables[0].contents[1:]:
        if it == "\n":
            continue
        for jt in it.find_all('div', class_="col-lg-4 col-md-12 car-listing"):
            price = jt.find('p', class_="price").text
            make = jt.find('h6', class_="car-make").text
            url = f"https://jammer.ie/used-cars?page={i}&per-page=12"
            urls.append(url)
            prices.append(price)
            makes.append(make)

I know I have to make a BeautifulSoup object, but I really don't know how to go about it. If you could explain what to do, that would be great. Thanks.

Where I'm stuck is: retrieve the web pages based on those URLs, convert each page into a BeautifulSoup object, and retrieve the car make, year, engine, price, dealer information (if available), and the URL (href) that leads to the detailed car information.
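For the first step (turning one of those URLs into a BeautifulSoup object and finding the detail-page links), here is a minimal sketch of what I think it should look like; the `a[href*=vehicle]` selector is an assumption about the site's markup:

import requests
from bs4 import BeautifulSoup

# fetch one listing page and parse it into a BeautifulSoup object
response = requests.get("https://jammer.ie/used-cars?page=1&per-page=12")
soup = BeautifulSoup(response.text, "html.parser")

# links whose href contains "vehicle" should point at the car detail pages
# (this selector is an assumption about the site's markup)
detail_links = ["https://jammer.ie" + a["href"] for a in soup.select("a[href*=vehicle]")]
print(detail_links[:3])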

To iterate over multiple pages, you can do the following:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://jammer.ie/used-cars?page={}&per-page=12"

all_data = []
for page in range(1, 3):  # <-- increase number of pages here
    soup = BeautifulSoup(requests.get(url.format(page)).text, "html.parser")

    for car in soup.select(".car"):
        info = car.select_one(".top-info").get_text(strip=True, separator="|")
        make, model, year, price = info.split("|")
        dealer_name = car.select_one(".dealer-name h6").get_text(
            strip=True, separator=" "
        )
        address = car.select_one(".address").get_text(strip=True)

        features = {}
        for feature in car.select(".car--features li"):
            k = feature.img["src"].split("/")[-1].split(".")[0]
            v = feature.span.text
            features[f"feature_{k}"] = v

        all_data.append(
            {
                "make": make,
                "model": model,
                "year": year,
                "price": price,
                "dealer_name": dealer_name,
                "address": address,
                "url": "https://jammer.ie"
                + car.select_one("a[href*=vehicle]")["href"],
                **features,
            }
        )

df = pd.DataFrame(all_data)

# prints sample data to screen:
print(df.tail().to_markdown(index=False))

# saves all data to CSV
df.to_csv("data.csv", index=False)

This prints:

| make    | model   | year | url                                                 | feature_speed |
|---------|---------|------|-----------------------------------------------------|---------------|
| Skoda   | Fabia   | 2014 | https://jammer.ie/vehicle/165691-skoda-fabia-2014   | 128627 miles  |
| Ford    | Kuga    | 2016 | https://jammer.ie/vehicle/165690-ford-kuga-2016     | 99000 miles   |
| Hyundai | i40     | 2015 | https://jammer.ie/vehicle/165689-hyundai-i40-2015   | 98000 miles   |
| Dacia   | Sandero | 2016 | https://jammer.ie/vehicle/165688-dacia-sandero-2016 | 43000 miles   |
| Ford    | Fiesta  | 2016 | https://jammer.ie/vehicle/165687-ford-fiesta-2016   | 45000 miles   |

(plus the price, dealer_name, address, and feature_engine columns)
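Note that `df.to_markdown()` needs the optional `tabulate` package; if it is not installed, `print(df.tail())` works just as well.

The code above only reads the listing pages. If you also want to retrieve each car's detail page and turn it into its own BeautifulSoup object, a minimal sketch building on the `df` from above could look like this (what you then extract from the detail soups depends on the detail pages' markup, which I have not inspected):

import requests
from bs4 import BeautifulSoup

# follow every detail-page URL collected above and parse each page into a soup
detail_soups = {}
for detail_url in df["url"]:
    response = requests.get(detail_url)
    detail_soups[detail_url] = BeautifulSoup(response.text, "html.parser")

# each soup can now be queried for extra fields, e.g. the page <title>
for detail_url, detail_soup in list(detail_soups.items())[:3]:
    print(detail_url, "->", detail_soup.title.get_text(strip=True))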
