Parsing raw data from a script tag into CSV with Python



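Basically, I am scraping data that is available in a script tag on a web page, but I cannot get it into the right layout. Here is the raw data from my script tag: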

{
    "@context": "https://schema.org/",
    "@type": "Product",
    "name": "I Got Toddler Problems Tee",
    "url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136",
    "sku": "BMRSUQNGGS",
    "image": [
        "https://cdn.inspireuplift.com/uploads/images/seller_product_variant_images/i-got-toddler-problems-tee-3136/1629196991_Toddlerproblemsmauv.png",
        "https://cdn.inspireuplift.com/uploads/images/seller_product_variant_images/i-got-toddler-problems-tee-3136/1629196991_Toddlerproblemsltgray.png",
        "https://cdn.inspireuplift.com/uploads/images/seller_product_variant_images/i-got-toddler-problems-tee-3136/1629196991_Toddlerproblemsblk.png",
        "https://cdn.inspireuplift.com/uploads/images/seller_product_variant_images/i-got-toddler-problems-tee-3136/1629196991_Toddlerproblemspk.png",
    ],
    "description": "BMRSUQNGGS",
    "brand": {"@type": "Thing", "name": "InspireUplift"},
    "aggregateRating": {"@type": "AggregateRating", "ratingValue": 0, "reviewCount": 0},
    "offers": {
        "@type": "AggregateOffer",
        "highPrice": 32.97,
        "lowPrice": 29.97,
        "offerCount": 24,
        "priceCurrency": "USD",
        "offers": [
            {
                "@type": "Offer",
                "url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136?variant=37621",
                "priceCurrency": "USD",
                "sku": "BMRSUQNGGS-1",
                "alternateName": "I Got Toddler Problems Tee - Mauve/S",
                "price": 29.97,
                "priceValidUntil": "2022-01-10",
                "availability": "https://schema.org/InStock",
                "seller": {"@type": "Organization", "name": "InspireUplift"},
            },
            {
                "@type": "Offer",
                "url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136?variant=37622",
                "priceCurrency": "USD",
                "sku": "BMRSUQNGGS-2",
                "alternateName": "I Got Toddler Problems Tee - Mauve/M",
                "price": 29.97,
                "priceValidUntil": "2022-01-10",
                "availability": "https://schema.org/InStock",
                "seller": {"@type": "Organization", "name": "InspireUplift"},
            },
            {
                "@type": "Offer",
                "url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136?variant=37623",
                "priceCurrency": "USD",
                "sku": "BMRSUQNGGS-3",
                "alternateName": "I Got Toddler Problems Tee - Mauve/L",
                "price": 29.97,
                "priceValidUntil": "2022-01-10",
                "availability": "https://schema.org/InStock",
                "seller": {"@type": "Organization", "name": "InspireUplift"},
            },
        ],
        "shippingDetails": {
            "@type": "OfferShippingDetails",
            "shippingRate": {
                "@type": "MonetaryAmount",
                "value": "0",
                "currency": "USD",
            },
        },
    },
}

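I want to extract all the variant names, image URLs, sizes, and colors by going through the variant URLs, and get the data back as a table of those fields. Can anyone help me out? I am still learning Python. Here is my code: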

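import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

# link and headers are defined elsewhere in the script
newlist = []
r = requests.get(link, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
scripts = soup.find('script', type='application/ld+json').string
data = json.loads(scripts)
image = data["image"]  # the full list of image URLs for the product
try:
    altname = data["offers"]["offers"]
except KeyError:
    print("not found")
    altname = []
for item in altname:
    area = item["alternateName"]
    # every row receives the whole image list, not one image per variant
    detail = {"image": image, "name": area}
    print(detail)
    newlist.append(detail)
print("saving")
df = pd.DataFrame(newlist)
df.to_csv("first_list.csv")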

What I get back has all the images in a single cell, even though they belong to different colors.

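This solution is based on the one JSON sample (a single product) that was provided. The two screenshots you uploaded are the same. It is better to use data.get('key') instead of data['key'], because .get returns None for a missing key rather than raising a KeyError.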
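As a small illustration of that difference, here is a sketch using a made-up `product` dict (a stand-in for the parsed data, not the scraped page). The `"brand"` and `"shipping"` lookups are deliberately missing keys chosen for the example.

```python
# Hypothetical sample standing in for the parsed JSON-LD data
product = {"name": "I Got Toddler Problems Tee", "offers": {"offers": []}}

# Indexing a missing key raises KeyError
try:
    product["brand"]
    raised = False
except KeyError:
    raised = True

# .get returns None (or a default you supply) instead of raising
brand = product.get("brand")
brand_or_default = product.get("brand", "unknown")

# Chained lookups still need care: .get on a missing parent returns None,
# and None has no .get method, so fall back to an empty dict first
offers = (product.get("shipping") or {}).get("offers", [])

print(raised, brand, brand_or_default, offers)
```

Note that .get only protects the lookup it is called on, which is why the chained case falls back to an empty dict.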

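[data.get("name")] + [""] * (len(offer) - 1) pads the product name column to the same length as the other columns. Without it, creating the DataFrame raises an error, because the product name appears only once, in the first cell.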
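The padding can be checked in isolation. The sketch below uses hard-coded `colors` and `sizes` lists as assumed sample values; it first shows pandas rejecting columns of unequal length, then builds the frame with the padded name column.

```python
import pandas as pd

# Assumed sample values, matching the three offers shown in the question
colors = ["Mauve", "Mauve", "Mauve"]
sizes = ["S", "M", "L"]

# Columns of unequal length: pandas refuses to build the frame
try:
    pd.DataFrame({"product_name": ["I Got Toddler Problems Tee"], "color": colors})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

# Pad the name column with empty strings so every column has len(colors) entries
names = ["I Got Toddler Problems Tee"] + [""] * (len(colors) - 1)
df = pd.DataFrame({"product_name": names, "color": colors, "size": sizes})
print(df)
```

Passing the name as a plain string instead of a one-item list would make pandas repeat it on every row, which is an alternative if blank cells are not required.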

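import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

# link and headers are assumed to be defined earlier, as in the question
r = requests.get(link, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
scripts = soup.find('script', type='application/ld+json').string
# json.loads turns the script text into a dict; json.loads(json.dumps(scripts))
# would only hand back the original string
data = json.loads(scripts)
size, color, url = [], [], []
# fall back to an empty dict so a missing "offers" key does not raise
offer = (data.get("offers") or {}).get("offers")
if offer:
    # product name in the first row only, blanks below, so all columns match in length
    product_name = [data.get("name")] + [""] * (len(offer) - 1)
    for item in offer:
        # "I Got Toddler Problems Tee - Mauve/S" -> ["Mauve", "S"]
        size_color_list = item["alternateName"].split(" - ")[1].split("/")
        url.append(item["url"])
        color.append(size_color_list[0])
        size.append(size_color_list[1])
    detail = {
        "product_name": product_name,
        "variant_color_name": color,
        "variant_size": size,
        "variant_image": url,  # holds the variant page URL
    }
    df = pd.DataFrame(detail)
    df.index += 1  # number the rows from 1
    # df.to_csv('first_list.csv')
    df.to_excel("first_list.xlsx")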
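For trying the parsing step without a network request, the following is a minimal offline sketch using only the standard library. It embeds a trimmed copy of the JSON from the question (two offers, most fields dropped). The helper name `variant_rows` and the column name `variant_url` are made up for this example, and it repeats the product name on every row instead of padding with blanks, which is a different choice from the code above.

```python
import csv
import io
import json

# Trimmed copy of the JSON-LD from the question (two offers only)
raw = """
{
  "name": "I Got Toddler Problems Tee",
  "offers": {
    "offers": [
      {"url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136?variant=37621",
       "alternateName": "I Got Toddler Problems Tee - Mauve/S"},
      {"url": "https://www.inspireuplift.com/I-Got-Toddler-Problems-Tee/iu/3136?variant=37622",
       "alternateName": "I Got Toddler Problems Tee - Mauve/M"}
    ]
  }
}
"""


def variant_rows(data):
    """Return one dict per offer, with color and size split out of alternateName."""
    rows = []
    for item in (data.get("offers") or {}).get("offers") or []:
        # rpartition splits on the last " - ", so a dash inside the product name is harmless
        _, _, variant = item.get("alternateName", "").rpartition(" - ")
        color, _, size = variant.partition("/")
        rows.append({
            "product_name": data.get("name"),
            "variant_color_name": color,
            "variant_size": size,
            "variant_url": item.get("url"),
        })
    return rows


rows = variant_rows(json.loads(raw))

# Write to an in-memory buffer; use open("first_list.csv", "w", newline="") for a real file
fieldnames = ["product_name", "variant_color_name", "variant_size", "variant_url"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The sketch assumes every alternateName follows the "name - color/size" pattern seen in the sample; a name without a slash would leave the size column empty rather than raise.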
