Noob: merged headers in a web scraper, need to split into parent/child



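I've just scraped a list of phone brands, models, prices, etc. from a telco website. Through trial and error I'm doing OK (for a noob), but the headers come out as a single merged string, apparently because the HTML is built from individual elements rather than parent/child elements or attributes per se?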

Any help or guidance would be appreciated…

Script:

import pandas
import requests
from bs4 import BeautifulSoup
url = 'https://www.vodafone.com.au/mobile-phones'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
headers = soup.find_all('h2')
divs = soup.find_all('div')
model = list(map(lambda h: h.text.strip(), headers))
print(model)

Result:

[
"AppleiPhone 14 Pro Max",
"AppleiPhone 14 Pro",
"AppleiPhone 14 Plus",
"AppleiPhone 14",
"SamsungSamsung Galaxy Z Fold4 5G",
"SamsungSamsung Galaxy Z Flip4 5G",
"SamsungSamsung Galaxy S22 Ultra 5G",
"AppleiPhone 13",
"GoogleGoogle Pixel 7 Pro",
"GoogleGoogle Pixel 7",
"AppleiPhone 12",
"AppleiPhone 11",
"SamsungSamsung Galaxy S22 5G",
"SamsungSamsung Galaxy A13 5G",
"SamsungSamsung Galaxy A13 4G",
"GoogleGoogle Pixel 6 Pro",
"GoogleGoogle Pixel 6a",
"OPPOOPPO Find X5 Pro 5G",
"OPPOOPPO A57 4G",
"OPPOOPPO Reno8 5G",
"SamsungSamsung Galaxy A53 5G",
"SamsungSamsung Galaxy A33 5G",
"SamsungSamsung Galaxy Z Fold3 5G",
"SamsungSamsung Galaxy S21+ 5G",
"SamsungSamsung Galaxy S21 Ultra 5G",
"SamsungSamsung Galaxy S21 FE 5G",
"AppleiPhone SE (3rd gen)",
"TCLTCL 20 Pro 5G",
"MotorolaMotorola moto g62 5G",
"SamsungSamsung Galaxy A73 5G",
"OPPOOPPO Find X5 5G",
"OPPOOPPO Find X5 Lite 5G",
"MotorolaMotorola moto e22i 4G",
"MotorolaMotorola edge 30 pro 5G",
"MotorolaMotorola edge 30 5G",
"Why choose Vodafone?"
]

~Ideal result:

Apple; iPhone 14 Pro Max, Apple; iPhone 14 Pro, Apple; iPhone 14 Plus, Apple; iPhone 14

Within each of the headers you can select the manufacturer and the device separately, since each sits in its own div:

import requests
from bs4 import BeautifulSoup

url = 'https://www.vodafone.com.au/mobile-phones'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
headers = soup.find_all('h2')
brand_device = []
for header in headers:
    # Look inside the h2 for the divs whose class contains these substrings
    manufacturer_div = header.select('div[class*="__Manufacturer-"]')
    device_div = header.select('div[class*="__Name-"]')
    # Skip headings that lack either div, e.g. "Why choose Vodafone?"
    if len(manufacturer_div) > 0 and len(device_div) > 0:
        brand_device.append(
            f'{manufacturer_div[0].text.strip()};{device_div[0].text.strip()}')
print(brand_device)

Format the result as needed.
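
As one possible way to do that formatting, here is a minimal sketch that splits each "brand;device" string into a pair and writes the pairs out as CSV text. The sample list is hard-coded so it runs without hitting the site; the helper names split_rows and to_csv and the column names brand and model are made up for illustration and are not part of the scraper above.

```python
import csv
import io

# Sample values in the "brand;device" shape produced by the loop above
# (hard-coded here so the sketch runs without a network request).
brand_device = [
    "Apple;iPhone 14 Pro Max",
    "Samsung;Samsung Galaxy Z Fold4 5G",
    "Google;Google Pixel 7",
]


def split_rows(items, sep=";"):
    """Split each 'brand;device' string into a (brand, device) tuple."""
    rows = []
    for item in items:
        # partition splits on the first separator only
        brand, _, device = item.partition(sep)
        rows.append((brand.strip(), device.strip()))
    return rows


def to_csv(rows):
    """Render the (brand, device) tuples as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["brand", "model"])  # hypothetical column names
    writer.writerows(rows)
    return buffer.getvalue()


rows = split_rows(brand_device)
print(to_csv(rows))
```

If you would rather have a table to work with, the same list of tuples can be passed to pandas.DataFrame with columns=['brand', 'model'].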
