How to retrieve web pages from URLs and convert each page into a BeautifulSoup object



So I'm scraping a website. Thanks to Andrej Kesely I was able to get all the information, and I was also able to build the URLs for the first 50 pages. Now I want to retrieve the web pages from those URLs, convert each one into a BeautifulSoup object, and also retrieve all the information plus the URL (href) that leads to the detailed car information. I'm new to Python and web scraping, so I really don't know where to start, but here is the code that builds the first 50 pages of the site:

from bs4 import BeautifulSoup
import requests

urls = []
prices = []
makes = []

for i in range(1, 50):
    # download page i and save it locally
    response = requests.get(f"https://jammer.ie/used-cars?page={i}&per-page=12")
    with open(f"example{i}.html", "w", encoding="utf-8") as fp:
        fp.write(response.text)

    # re-read the saved page and parse it
    with open(f"example{i}.html", "r", encoding="utf-8") as fp:
        webpage = fp.read()
    soup = BeautifulSoup(webpage, "html.parser")
    tables = soup.find_all('div', {"class": "span-9 right-col"})

    for it in tables[0].contents[1:]:
        if it == "\n":
            continue
        for jt in it.find_all('div', class_="col-lg-4 col-md-12 car-listing"):
            price = jt.find('p', class_="price").text
            make = jt.find('h6', class_="car-make").text
            url = f"https://jammer.ie/used-cars?page={i}&per-page=12"
            urls.append(url)
            prices.append(price)
            makes.append(make)

I know I have to make a BeautifulSoup object, but I really don't know how to go about it. If you could explain what to do, that would be great. Thanks.

Where I'm stuck is: retrieve the web pages based on those URLs, convert each page into a BeautifulSoup object, and retrieve the car make, year, engine, price, dealer information (if available), and the URL (href) that leads to the detailed car information.
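For the first step (turning one of those URLs into a BeautifulSoup object and finding the detail-page links), here is a minimal sketch of what I think it should look like; the `a[href*=vehicle]` selector is an assumption about the site's markup:

import requests
from bs4 import BeautifulSoup

# fetch one listing page and parse it into a BeautifulSoup object
response = requests.get("https://jammer.ie/used-cars?page=1&per-page=12")
soup = BeautifulSoup(response.text, "html.parser")

# links whose href contains "vehicle" should point at the car detail pages
# (this selector is an assumption about the site's markup)
detail_links = ["https://jammer.ie" + a["href"] for a in soup.select("a[href*=vehicle]")]
print(detail_links[:3])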

To iterate over multiple pages, you can do the following:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://jammer.ie/used-cars?page={}&per-page=12"

all_data = []
for page in range(1, 3):  # <-- increase number of pages here
    soup = BeautifulSoup(requests.get(url.format(page)).text, "html.parser")

    for car in soup.select(".car"):
        info = car.select_one(".top-info").get_text(strip=True, separator="|")
        make, model, year, price = info.split("|")
        dealer_name = car.select_one(".dealer-name h6").get_text(
            strip=True, separator=" "
        )
        address = car.select_one(".address").get_text(strip=True)

        features = {}
        for feature in car.select(".car--features li"):
            k = feature.img["src"].split("/")[-1].split(".")[0]
            v = feature.span.text
            features[f"feature_{k}"] = v

        all_data.append(
            {
                "make": make,
                "model": model,
                "year": year,
                "price": price,
                "dealer_name": dealer_name,
                "address": address,
                "url": "https://jammer.ie"
                + car.select_one("a[href*=vehicle]")["href"],
                **features,
            }
        )

df = pd.DataFrame(all_data)

# prints sample data to screen:
print(df.tail().to_markdown(index=False))

# saves all data to CSV
df.to_csv("data.csv", index=False)

This prints:

| make    | model   | year | url                                                 | feature_speed |
|---------|---------|------|-----------------------------------------------------|---------------|
| Skoda   | Fabia   | 2014 | https://jammer.ie/vehicle/165691-skoda-fabia-2014   | 128627 miles  |
| Ford    | Kuga    | 2016 | https://jammer.ie/vehicle/165690-ford-kuga-2016     | 99000 miles   |
| Hyundai | i40     | 2015 | https://jammer.ie/vehicle/165689-hyundai-i40-2015   | 98000 miles   |
| Dacia   | Sandero | 2016 | https://jammer.ie/vehicle/165688-dacia-sandero-2016 | 43000 miles   |
| Ford    | Fiesta  | 2016 | https://jammer.ie/vehicle/165687-ford-fiesta-2016   | 45000 miles   |

(plus the price, dealer_name, address, and feature_engine columns)
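Note that `df.to_markdown()` needs the optional `tabulate` package; if it is not installed, `print(df.tail())` works just as well.

The code above only reads the listing pages. If you also want to retrieve each car's detail page and turn it into its own BeautifulSoup object, a minimal sketch building on the `df` from above could look like this (what you then extract from the detail soups depends on the detail pages' markup, which I have not inspected):

import requests
from bs4 import BeautifulSoup

# follow every detail-page URL collected above and parse each page into a soup
detail_soups = {}
for detail_url in df["url"]:
    response = requests.get(detail_url)
    detail_soups[detail_url] = BeautifulSoup(response.text, "html.parser")

# each soup can now be queried for extra fields, e.g. the page <title>
for detail_url, detail_soup in list(detail_soups.items())[:3]:
    print(detail_url, "->", detail_soup.title.get_text(strip=True))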
