I am trying to scrape the website www.gall.nl in order to build a database of all the wines sold on that platform. I have the following code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.gall.nl/wijn/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
mydivs = soup.find_all("div", {"class": "c-product-tile"})
print(len(mydivs))
first_wijn = mydivs[0]
print(first_wijn)
result = first_wijn.find()
This gives 12 results, which is correct.
Printing the first result gives the following:
<div class="c-product-tile" data-product='{"name":"Faustino V Rioja Reserva","id":"143561","currencyCode":"EUR","price":13.99,"discount":0,"brand":"Faustino","category":"Wijn","variant":"75CL","list":"productoverzicht","position":1,"dimension13":"2","dimension37":"Ja"}' itemprop="item" itemscope="" itemtype="https://schema.org/Product" js-hook-product-tile="">
<meta content="143561" itemprop="sku">
<meta content="8410441412065" itemprop="gtin8">
<meta content="Faustino" itemprop="brand">
<div class="product-tile__header">
<div class="product-tile__category-label">
<div class="m-product-taste-tooltip">
<span aria-label="Classic Red" class="a-tooltip-trigger" data-content="Stevig & Ferm" data-placement="bottom-start" js-hook-tooltip="">
<div class="tooltip-trigger__icon product-taste-tooltip__icon u-taste-profile-icon classic-red-red
....
<input class="add-to-cart-url" type="hidden" value="/on/demandware.store/Sites-gall-nl-Site/nl_NL/Cart-AddProduct"/>
</div>
</meta></meta></meta></div>
I am interested in getting the data from the first line:
<div class="c-product-tile" data-product='{"name":"Faustino V Rioja Reserva","id":"143561","currencyCode":"EUR","price":13.99,"discount":0,"brand":"Faustino","category":"Wijn","variant":"75CL","list":"productoverzicht","position":1,"dimension13":"2","dimension37":"Ja"}' itemprop="item" itemscope="" itemtype="https://schema.org/Product" js-hook-product-tile="">
so that I can get the name, price and brand.
Can anybody help me retrieve this data?
Use BeautifulSoup's .attrs.get to read the data-product attribute from each div, then parse that string as JSON to read the values you need.
import json
import requests
from bs4 import BeautifulSoup

URL = 'https://www.gall.nl/wijn/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

# Get all products
mydivs = soup.find_all("div", {"class": "c-product-tile"})

# Loop through each product
for div in mydivs:
    # Get data-product (None if the attribute is missing)
    product = div.attrs.get("data-product", None)
    if product is None:
        continue

    # Convert string to json
    # Note: the encode/decode step drops every non-ASCII character
    jsonProduct = json.loads(product.encode('utf-8').decode('ascii', 'ignore'))

    # Show name - brand - price
    print('{0:<40} {1:<20} {2:>10}'.format(
        jsonProduct['name'],
        jsonProduct['brand'],
        jsonProduct['price']
    ))
Using format() to create 3 columns, the code above produces the following output:
Faustino V Rioja Reserva                 Faustino                  13.99
Mucho Ms Tinto                           Mucho Mas                  5.99
Cantina di Verona Valpolicella Ripasso   Terre Di Verona           11.99
Villa Jeantel                            Villa Jeantel              8.99
Ondarre Rioja Reserva                    Ondarre                   13.59
Valdivieso Chardonnay                    Valdivieso                 5.99
Domaine Lamourie Ros                     Domaine Lamourie           7.99
Oveja Negra Chardonnay Viognier          Oveja Negra                6.59
La Palma Merlot                          La Palma                   6.59
Alamos Chardonnay                        Alamos                     8.99
Les Hautes Pentes ros                    Les Hautes Pentes          7.99
Piccini Memoro Rosso                     Piccini                    7.29
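The JSON step can also be tried offline, without requesting the page. The sketch below is only an illustration under a few assumptions: parse_product and format_row are hypothetical helper names, the first string is the data-product value copied from the tile in the question, and the second product is invented to show an accented name. It passes the attribute string straight to json.loads, without the ASCII encode/decode step. That step is probably why some names in the output above look clipped (for example "Mucho Ms Tinto"), since Python 3 strings can hold accented characters as they are.

```python
import json

# data-product value copied from the first tile shown in the question
RAW = ('{"name":"Faustino V Rioja Reserva","id":"143561","currencyCode":"EUR",'
       '"price":13.99,"discount":0,"brand":"Faustino","category":"Wijn",'
       '"variant":"75CL","list":"productoverzicht","position":1,'
       '"dimension13":"2","dimension37":"Ja"}')


def parse_product(raw):
    """Return (name, brand, price), or None if the attribute is missing or not valid JSON."""
    if not raw:
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data.get("name"), data.get("brand"), data.get("price")


def format_row(name, brand, price):
    """Same three-column layout as the answer: 40 left, 20 left, 10 right."""
    return '{0:<40} {1:<20} {2:>10}'.format(name, brand, price)


print(format_row(*parse_product(RAW)))

# Hypothetical product (not from the site) to show that accents survive
accented = parse_product('{"name":"Château Exemple Rosé","brand":"Exemple","price":9.5}')
print(format_row(*accented))
```

In the scraping loop, the same idea would amount to calling json.loads(product) directly and skipping any tile where the attribute is missing.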