So I'm scraping a website. Thanks to Andrej Kesely I was able to get all the information, and I was also able to synthesize the URLs for the first 50 pages and download them. But now I want to retrieve the webpages from those URLs, turn each one into a BeautifulSoup object, and retrieve all the information plus the URL (href) that leads to the detailed car information. I'm new to Python and web scraping, so I really don't know where to start, but here is the code that synthesizes the first 50 pages of the website:
from bs4 import BeautifulSoup
import requests
import os
# note: range(1, 50) downloads pages 1-49; use range(1, 51) for 50 pages
for i in range(1, 50):
    response = requests.get(f"https://jammer.ie/used-cars?page={i}&per-page=12")
    with open(f"example{i}.html", "w", encoding="utf-8") as fp:
        fp.write(response.text)

urls = []
prices = []
makes = []

# loop over the saved pages, indexed by i
for i in range(1, 50):
    with open(f"example{i}.html", "r", encoding="utf-8") as fp:
        webpage = fp.read()

    soup = BeautifulSoup(webpage, "html.parser")
    tables = soup.find_all('div', {"class": "span-9 right-col"})
    for it in tables[0].contents[1:]:
        # skip bare newline text nodes between the listing divs
        if it == "\n":
            continue
        # find_all, not findall
        for jt in it.find_all('div', class_="col-lg-4 col-md-12 car-listing"):
            price = jt.find('p', class_="price").text
            make = jt.find('h6', class_="car-make").text
            url = f"https://jammer.ie/used-cars?page={i}&per-page=12"
            urls.append(url)
            prices.append(price)
            makes.append(make)
I know I have to make a BeautifulSoup object, but I really don't know what to do. If you could explain what to do, that would be great, thanks.

To restate where I'm stuck: I want to retrieve the webpages from these URLs, turn each page into a BeautifulSoup object, and retrieve the car make, year, engine, price, dealer information (if available), and the URL (href) that leads to the detailed car information.
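As background for the answer below, the core pattern is always the same: get the HTML text, parse it into a `BeautifulSoup` object, then pull out tags by class and read their text or `href`. A minimal self-contained sketch on a static snippet (the markup is illustrative, reusing the `car-make`/`price` class names from the question's own code, not the real jammer.ie structure):

```python
from bs4 import BeautifulSoup

# Illustrative HTML standing in for one downloaded page; the real
# jammer.ie markup may differ.
html = """
<div class="car-listing">
  <a href="/vehicle/12345-example-car">
    <h6 class="car-make">Example Make</h6>
    <p class="price">€1,000</p>
  </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")          # the "soup object"
listing = soup.find("div", class_="car-listing")   # locate one listing
make = listing.find("h6", class_="car-make").get_text(strip=True)
price = listing.find("p", class_="price").get_text(strip=True)
href = listing.find("a")["href"]                   # relative detail-page link
print(make, price, href)
```

Joining the relative `href` onto `https://jammer.ie` gives the absolute URL of the detail page, which you can then fetch with `requests.get` and parse the same way.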
To iterate over multiple pages, you can do the following:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://jammer.ie/used-cars?page={}&per-page=12"
all_data = []
all_data = []
for page in range(1, 3):  # <-- increase number of pages here
    soup = BeautifulSoup(requests.get(url.format(page)).text, "html.parser")

    for car in soup.select(".car"):
        info = car.select_one(".top-info").get_text(strip=True, separator="|")
        make, model, year, price = info.split("|")

        dealer_name = car.select_one(".dealer-name h6").get_text(
            strip=True, separator=" "
        )
        address = car.select_one(".address").get_text(strip=True)

        features = {}
        for feature in car.select(".car--features li"):
            k = feature.img["src"].split("/")[-1].split(".")[0]
            v = feature.span.text
            features[f"feature_{k}"] = v

        all_data.append(
            {
                "make": make,
                "model": model,
                "year": year,
                "price": price,
                "dealer_name": dealer_name,
                "address": address,
                "url": "https://jammer.ie"
                + car.select_one("a[href*=vehicle]")["href"],
                **features,
            }
        )

df = pd.DataFrame(all_data)

# prints sample data to screen:
print(df.tail().to_markdown(index=False))

# saves all data to CSV
df.to_csv("data.csv", index=False)
Prints (sample rows; several columns were garbled when this output was captured and are omitted here):

| make | url | feature_speed | feature_engine |
|---|---|---|---|
| Skoda | https://jammer.ie/vehicle/165691-skoda-fabia-2014 | 128627 miles | 1.2 litre |
| Ford | https://jammer.ie/vehicle/165690-ford-kuga-2016 | 99000 miles | 2.0 litre |
| Hyundai | https://jammer.ie/vehicle/165689-hyundai-i40-2015 | 98000 miles | 1.7 litre |
| Dacia | https://jammer.ie/vehicle/165688-dacia-sandero-2016 | 43000 miles | |
| Ford | https://jammer.ie/vehicle/165687-ford-fiesta-2016 | 45000 miles | 1.0 litre |
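The one non-obvious trick in the answer is `get_text(strip=True, separator="|")` followed by `split("|")`: with a separator, `get_text` joins the stripped text of every child node, which turns one `.top-info` block into a single delimited string. A self-contained illustration (the markup below is an assumption about what `.top-info` might look like, not the real jammer.ie structure):

```python
from bs4 import BeautifulSoup

# Hypothetical .top-info block; on the real site the tags may differ,
# but the joining behaviour of get_text is the same.
html = """
<div class="top-info">
  <span>Ford</span> <span>Fiesta</span>
  <span>2016</span> <span>€5,000</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# strip=True drops the whitespace-only text nodes between the spans,
# so only the four real values get joined with "|"
info = soup.select_one(".top-info").get_text(strip=True, separator="|")
print(info)  # Ford|Fiesta|2016|€5,000
make, model, year, price = info.split("|")
```

If the site ever renders a listing with a different number of fields in `.top-info`, the four-way unpacking will raise a `ValueError`, so `info.split("|")` into a list first is a safer variant.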