I am trying to scrape https://store.fabspy.com/collections/new-arrivals-beauty for the Sapphire Eye Pencil product and return the information associated with its product ID. So far I have:
from bs4 import BeautifulSoup
import urllib2
url = 'https://store.fabspy.com/collections/new-arrivals-beauty'
page = BeautifulSoup(url.read())
soup = BeautifulSoup((page))
tag = 'div class="product-content"'
if row in soup.html.body.findAll(tag):
    data = row.findAll('id')
    if data and 'sapphire' in data[0].text:
        print data[4].text
The information I want to retrieve is the following:
<div class="product-content">
<div class="pc-inner">
<div data-handle="clematis-dewdrop-sparkling-eye-pencil-g7454c-sapphire"
data-target="#quick-shop-popup"
class="quick_shop quick-shop-button"
data-toggle="modal"
title="Quick View">
<span>+ Quick View</span>
<span class="json hide">
{
"id":8779050374,
"title":"Clematis - Dewdrop Sparkling Gel Eye Liner Pencil # G7454C**Sapphire**",
"handle":"clematis-dewdrop-sparkling-eye-pencil-g7454c-sapphire",
"description":"\u003cdiv\u003e\r\n\r\nGel Formula, Rich Colour, Matte Finish, Long-Wearing, Safe for Waterline\r\n\r\n\u003cbr\u003e\n\u003c/div\u003e\u003cdiv\u003e\u003cbr\u003e\u003c/div\u003e \u003cimg alt=\"\" src=\"//i.imgur.com/adW5MKl.jpg\"\u003e",
"published_at":"2016-10-17T20:15:40+08:00",
"created_at":"2016-10-17T20:15:40+08:00",
"vendor":"Clematis",
"type":"Latest,Beauty,New,Makeup,Best, Clematis, Eyes",
"tags":["Beauty","Best","Clematis","Eyes","Latest","Makeup","New"],
"price":4900,
"price_min":4900,
"price_max":4900,
"available":true,
"price_varies":false,
"compare_at_price":7900,
"compare_at_price_min":7900,
"compare_at_price_max":7900,
"compare_at_price_varies":false,
"variants":[{"id":31447937030, "title":"N/A"}]
}
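Specifically, I want the last `id`. Please tell me which tag my script should target to retrieve this information, and how to search by keyword for the `sapphire` colour and its `id`. Thanks!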
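There are a few errors in your existing code. I would suggest using `requests` instead of `urllib2`. I am also using the `re` and `json` libraries. So here is what I would do in your case (read the comments in the code).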
from bs4 import BeautifulSoup
import requests
import re
import json
# URL to scrape
url = 'https://store.fabspy.com/collections/new-arrivals-beauty'
# HTML data of the page
# You can add checks for 404 errors
soup = BeautifulSoup(requests.get(url).text, "lxml")
# Get a list of all elements having `sapphire` in the `data-handle` attribute
sapphire = soup.findAll(attrs={'data-handle': re.compile(r".*sapphire.*")})
# Take first element of this list (I checked, there is just one element)
sapphire = sapphire[0]
# Find class inside this element having JSON data. Taking just first element's text
json_text = sapphire.findAll(attrs={'class': "json"})[0].text
# Converting it to a dictionary
data = json.loads(json_text)
print(data["id"])
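The script above prints the top-level product `id`. If "the last `id`" in the question means the variant `id` inside `variants`, the same parsed dictionary already contains it. The following is only a minimal sketch under stated assumptions: `SAMPLE_JSON` is a trimmed copy of the JSON from the question (with the stray quote after the variant id assumed to be a typo), and `extract_ids` is a hypothetical helper name, not part of any library. It uses only the standard `json` module so it can be run without network access.

```python
import json

# Trimmed sample of the JSON shown in the question (assumption: the real
# payload has more keys, and the variant id is the plain number 31447937030).
SAMPLE_JSON = """
{
  "id": 8779050374,
  "handle": "clematis-dewdrop-sparkling-eye-pencil-g7454c-sapphire",
  "variants": [{"id": 31447937030, "title": "N/A"}]
}
"""


def extract_ids(json_text, keyword):
    """Hypothetical helper: return (product_id, [variant ids]) when
    `keyword` occurs in the product handle, otherwise None."""
    data = json.loads(json_text)
    # Case-insensitive keyword match against the handle
    if keyword.lower() not in data.get("handle", "").lower():
        return None
    variant_ids = [variant["id"] for variant in data.get("variants", [])]
    return data["id"], variant_ids


result = extract_ids(SAMPLE_JSON, "sapphire")
if result is not None:
    product_id, variant_ids = result
    print(product_id)       # top-level product id
    print(variant_ids[-1])  # id of the last variant
```

In the scraper itself, `json_text` from the `class="json"` span could be passed to such a helper in place of `SAMPLE_JSON`.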