Data retrieved from a website with BeautifulSoup differs from the data displayed on the website



I am using the following code to scrape shoe information from https://www.adidas.com/us/men-shoes:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import requests
uri = "men-shoes"
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
html_page = requests.get("https://www.adidas.com/us/" + str(uri), headers=hdr, timeout=15)
soup = BeautifulSoup(html_page.content, 'html.parser')
results = soup.find_all('div', attrs={'class': 'gl-product-card color-variations__fixed-size glass-product-card___17N3p'})

Here is an example of the data I get when scraping one particular shoe:

<div class="gl-product-card color-variations__fixed-size glass-product-card___17N3p">
<div class="gl-product-card__assets">
<a class="gl-product-card__assets-link" data-auto-id="glass-hockeycard-link" href="/us/superstar-shoes/FV2820.html">
<img
alt="Originals Black Superstar Shoes"
class="img_with_fallback___2aHBu gl-product-card__image"
data-auto-id="image"
src="https://assets.adidas.com/images/w_385,h_385,f_auto,q_auto:sensitive,fl_lossy/3c086bf61062470aa54cab8700b26add_9366/superstar-shoes.jpg"
title="Superstar Shoes"
/>
<img
alt="Originals Black Superstar Shoes"
class="img_with_fallback___2aHBu gl-product-card__image-hover"
data-auto-id="image"
src="https://assets.adidas.com/images/w_385,h_385,f_auto,q_auto:sensitive,fl_lossy/d04a49435d094fcfa8dfab960070f1a9_9366/superstar-shoes.jpg"
title="Superstar Shoes"
/>
</a>
<div class="gl-product-card__wishlist">
<div class="toggle_wishlist_button___1dG52" data-auto-id="wishlist-icon-container">
<svg class="gl-icon" data-auto-id="wishlist-icon"><use xlink:href="#wishlist-inactive"></use></svg>
</div>
</div>
</div>
<div class="gl-product-card__carousel">
<div class="product-carousel" data-auto-id="glass-mock-carousel">
<div class="wrapper___3wqg4">
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
<div class="item_wrapper___2toNm"></div>
</div>
</div>
</div>
<div class="gl-product-card__details">
<a class="gl-product-card__details-link" href="/us/superstar-shoes/FV2820.html">
<div class="gl-product-card__details-top">
<div class="gl-product-card__category" title="shoes">Originals</div>
<div class="gl-product-card__details-icons"></div>
</div>
<div class="gl-product-card__details-main">
<span class="gl-label gl-label--m gl-label--condensed gl-product-card__name" title="Superstar Shoes">Superstar Shoes</span>
<div class="gl-price gl-price--s gl-price__inline___-VD1g notranslate"></div>
</div>
<div class="gl-product-card__details-bottom"><div class="gl-product-card__color">18 colors</div></div>
</a>
</div>
</div>

Here is the data I get when I copy a shoe's markup directly from the website:

<div class="gl-product-card color-variations__fixed-size glass-product-card___17N3p">
<div class="gl-product-card__assets">
<a data-auto-id="glass-hockeycard-link" href="/us/zx-2k-4d-shoes/FW2003.html" class="gl-product-card__assets-link" data-di-id="di-id-93927325-c0dba7aa">
<img
data-auto-id="image"
title="ZX 2K 4D Shoes"
src="https://assets.adidas.com/images/w_385,h_385,f_auto,q_auto:sensitive,fl_lossy/d704fc8256204415b713ab6600f76418_9366/zx-2k-4d-shoes.jpg"
alt="Originals White ZX 2K 4D Shoes"
class="img_with_fallback___2aHBu gl-product-card__image performance-item"
data-inject_ssr_performance_instrument=""
onload="SSR_PERFORMANCE_MEASUREMENT(this)"
/>
<img
data-auto-id="image"
title="ZX 2K 4D Shoes"
src="https://assets.adidas.com/images/w_385,h_385,f_auto,q_auto:sensitive,fl_lossy/e169facc2c554c21b9d1ab880150342a_9366/zx-2k-4d-shoes.jpg"
alt="Originals White ZX 2K 4D Shoes"
class="img_with_fallback___2aHBu gl-product-card__image-hover"
/>
</a>
<div class="gl-product-card__wishlist">
<div class="toggle_wishlist_button___1dG52" data-auto-id="wishlist-icon-container">
<svg class="gl-icon" data-auto-id="wishlist-icon" data-di-res-id="d45e29bb-1d8adc35" data-di-rand="1596955983257"><use xlink:href="#wishlist-inactive"></use></svg>
</div>
</div>
<div class="gl-badge gl-badge--small gl-badge--semi-urgent">New</div>
</div>
<div class="gl-product-card__details">
<a href="/us/zx-2k-4d-shoes/FW2003.html" class="gl-product-card__details-link" data-di-id="di-id-93927325-c0dba7aa">
<div class="gl-product-card__details-top">
<div class="gl-product-card__category" title="shoes">Originals</div>
<div class="gl-product-card__details-icons"></div>
</div>
<div class="gl-product-card__details-main">
<span class="gl-label gl-label--m gl-label--condensed gl-product-card__name" title="ZX 2K 4D Shoes">ZX 2K 4D Shoes</span>
<div class="gl-price gl-price--s gl-price__inline___-VD1g notranslate"><span class="gl-price__value">$200</span></div>
</div>
</a>
</div>
</div>

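As you can see, the data copied from the website includes the price, $200 in this case, while the scraped data has an empty price element. How can I get my code to return the shoe's price?

Please include time-optimized code, since I have to scrape more than a hundred shoes.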

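For this particular site, you will probably have a better time simply accessing the JSON API it uses for its data. That way you don't have to scrape anything; you just parse the JSON and read the values.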

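Look at the site's requests in your browser's network inspector:

  • e.g. https://www.adidas.com/api/plp/content-engine?sitePath=us&query=men-shoes seems to return the list of items on each page
  • e.g. https://www.adidas.com/api/search/product/FW2003?sitePath=us seems to return the information for a given product, including its price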
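As a rough sketch of how the second endpoint might be used: the helpers below build the per-product URL and pick the `price` and `salePrice` fields out of the decoded JSON. Those two field names come from the sample response shown further down in this thread; the names `product_url`, `extract_prices`, and `fetch_prices`, and the shortened User-Agent value, are made up for illustration. The sketch uses only the standard library, and the endpoint may change or start rejecting such requests at any time.

```python
import json
from urllib.request import Request, urlopen

API_ROOT = "https://www.adidas.com/api/search/product"  # endpoint listed above
HEADERS = {"User-Agent": "Mozilla/5.0"}  # assumption: a browser-like UA is expected


def product_url(product_id, site_path="us"):
    """Build the per-product API URL, e.g. for product FW2003."""
    return "{}/{}?sitePath={}".format(API_ROOT, product_id, site_path)


def extract_prices(payload):
    """Pick the price fields out of an already-decoded JSON response."""
    return {"price": payload.get("price"), "salePrice": payload.get("salePrice")}


def fetch_prices(product_id):
    """Request one product from the API and return its price fields."""
    req = Request(product_url(product_id), headers=HEADERS)
    with urlopen(req, timeout=15) as resp:
        return extract_prices(json.load(resp))


# Example call (needs network access):
# print(fetch_prices("FW2003"))
```

Keeping the URL building and the field extraction separate from the network call makes both easy to check offline.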

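The nice thing about this website is that all the products on a page are stored in a JavaScript variable called DATA_STORE. If you can get the value of that variable, you have the basic information for every product. The price, however, depends on discounts and time periods, so the site fetches it with AJAX calls each time you scroll down the page, much like lazy loading.

Once you have the productId values, you have to make one AJAX call per productId to get the price information.

The script below takes all the product data from the page, fetches the price information, and stores it as JSON. Once you have the JSON, it is easy to parse.

Note: building JSON from the JavaScript variable is a bit tricky. You need to encode and decode the text to remove the backslashes that the script uses as escapes.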
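To illustrate the encode/decode step in isolation, here is a minimal sketch run on a made-up, heavily shortened stand-in for the script text. The exact shape of the real `<script>` content (a `JSON.parse("...")` call ending in `");`) is an assumption inferred from the slicing used in the script below, not something confirmed here.

```python
import json

# Hypothetical, shortened stand-in for the text of the real <script> tag.
script = r'window.DATA_STORE = JSON.parse("{\"plp\":{\"itemList\":{\"items\":[{\"productId\":\"FW2003\"}]}}}");'

# Keep everything from the first "{" and drop the trailing '");' (3 characters).
escaped = script[script.index("{"):-3]

# Interpret the backslash escapes so that \" becomes a plain ".
unescaped = escaped.encode().decode("unicode_escape")

# The result is now ordinary JSON text.
store = json.loads(unescaped)
product_ids = [item["productId"] for item in store["plp"]["itemList"]["items"]]
print(product_ids)
```

One caveat: the `unicode_escape` codec treats the input bytes as Latin-1, so non-ASCII characters in product names could come out mangled. That does not matter for reading product IDs, but it is worth keeping in mind if you use the other fields.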

import json
import sys
import time

import requests
from bs4 import BeautifulSoup

uri = "men-shoes"
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
html_page = requests.get("https://www.adidas.com/us/" + str(uri), headers=hdr, timeout=15)
soup = BeautifulSoup(html_page.content, 'html.parser')

# Find the <script> tag that defines the DATA_STORE variable.
script = None
for i in soup.find_all("script"):
    if "DATA_STORE" in i.text.strip():
        script = i.text.strip()
        break
if script is None:
    print("no data found")
    sys.exit(1)

# Cut out the JSON text and unescape the backslashes left by the script.
all_items = json.loads(script[script.index("{"):-3].encode().decode('unicode_escape'))

# Request the price information for each product ID.
data = {}
for item in all_items['plp']['itemList']['items']:
    print(item["productId"])
    res = requests.get("https://www.adidas.com/api/search/product/{}?sitePath=us".format(item["productId"]), headers=hdr, timeout=15)
    data[item["productId"]] = res.json()
    time.sleep(1)  # pause between requests
print(data)

# Save everything so it can be parsed later without hitting the site again.
with open("data.json", "w") as f:
    json.dump(data, f)

Output:

{'FW2003': {'price': 200, 'badgeStyle': '', 'badgeText': '', 'cached': False, 'salePrice': 200, 'image': {'src': 'https://assets.adidas.com/images/w_280,h_280,f_auto,q_auto:sensitive/d704fc8256204415b713ab6600f76418_9366/zx-2k-4d-shoes.jpg', 'cloudinary': True}, 'secondImage': {'src': 'https://assets.adidas.com/images/w_280,h_280,f_auto,q_auto:sensitive/e169facc2c554c21b9d1ab880150342a_9366/zx-2k-4d-shoes.jpg', 'cloudinary': True}, 'color': 'Cloud White / Core Black / Signal Pink', 'modelId': 'KYU21', 'orderable': True, 'validFrom': {'default@adidas-PE': '2020-07-01T03:00:00.000Z', 'default@adidas-CL': '2020-07-15T04:00:00.000Z', 'default@adidas-MX': '2020-10-01T17:00:00.000Z', 'default@adidas-CO': '2020-07-15T03:00:00.000Z', 'default@adidas-US': '2020-07-10T07:00:00.000Z', 'default@adidas-BR': '2020-07-01T03:00:00.000Z', 'default@adidas-AR': '2020-07-15T03:00:00.000Z'}, 'previewTo': '2020-07-13T07:00:00.000Z', 'isFlash': False, 'isFinalSale': False, 'isSpecialLaunch': False, 'id': 'FW2003', 'link': '/us/zx-2k-4d-shoes/FW2003.html'}, 'FX7847': {'price': 140, 'badgeStyle': '', 'badgeText': '', 'cached': False, 'salePrice': 140, 'image': {'src': 'https://assets.adidas.com/images/w_280,h_280,f_auto,q_auto:sensitive/5fba111ccaab411a9171ab57000ec9e8_9366/climacool-vento-shoes.jpg', 'cloudinary': True}, 'secondImage': {'src': 'https://assets.adidas.com/images/w_280,h_280,f_auto,q_auto:sensitive/77e6c85bc38d40e6a912ab52017d1197_9366/climacool-vento-shoes.jpg', 'cloudinary': True}, 'color': 'Signal Cyan / Orbit Grey / Signal Pink', 'modelId': 'LDT02', 'orderable': True, 'validFrom': {'default@adidas-CA': '2020-06-01T05:00:00.000Z', 'default@adidas-US': '2020-05-31T07:00:00.000Z'}, 'previewTo': '2020-06-01T07:00:00.000Z', 'isFlash': False, 'isFinalSale': False, 'isSpecialLaunch': False, 'id': 'FX7847', 'link': '/us/climacool-vento-shoes/FX7847.html'}, 'B42200': {'price': 140, 'badgeStyle': '', 'badgeText': '', 'cached': False, 'salePrice': 140, 'image': {'src': 
'https://assets.adidas.com/images/w_280,h_280,f_auto,q_auto:sensitive/de7d57ddae474f139736a8ba00fcbfb8_9366/nmd_r1-shoes.jpg', 'cloudinary': True}, 'secondImage': {'src': 'https://assets.adidas.com/images/w_280,h_280,f_auto,q_auto:sensitive/ad3348fcdb5f40a0b410a8ba00fc6427_9366/nmd_r1-shoes.jpg', 'cloudinary': True}, 'color': 'Black / Black / Gum', 'modelId': 'BSZ68', 'orderable': True, 'validFrom': {'default': '2018-07-01T04:00:00.000Z', 'default@adidas-PE': '2018-07-01T03:00:00.000Z', 'default@adidas-CL': '2018-07-01T04:00:00.000Z', 'default@adidas-MX': '2018-07-01T05:00:00.000Z', 'default@adidas-CO': '2018-07-01T03:00:00.000Z', 'default@adidas-CA': '2018-06-01T05:00:00.000Z', 'default@adidas-US': '2018-06-01T07:00:00.000Z', 'default@adidas-BR': '2018-07-01T03:00:00.000Z', 'default@adidas-AR': '2019-04-01T03:00:00.000Z'}, 'previewTo': '2012-12-11T22:00:00.000Z', 'isFlash': False, 'isFinalSale': False, 'isSpecialLaunch': False, 'id': 'B42200', 'link': '/us/nmd_r1-shoes/B42200.html'}, 'EF1042': {'price': 180, 'badgeStyle': '', 'badgeText': '', 'cached': False, 'salePrice': 180, 'image': {'src': 'https://assets.adidas.com/images/w_280,h_280,f_auto,q_auto:sensitive/1e74db8746cd492b9814aafc0106ac2d_9366/ultraboost-20-shoes.jpg', 'cloudinary': True}, 'secondImage': {'src': 'https://assets.adidas.com/images/w_280,h_280,f_auto,q_auto:sensitive/c6c5c2caafc8405b8e4baaff00e21e50_9366/ultraboost-20-shoes.jpg', 'cloudinary': True}, 'color': 'Cloud White / Cloud White / Core Black', 'modelId': 'DVF21', 'orderable': True, 'validFrom': {'default@adidas-CA': '2020-01-01T06:00:00.000Z', 'default@adidas-US': '2020-01-01T08:00:00.000Z'}, 'previewTo': '2012-12-11T22:00:00.000Z', 'isFlash': False, 'isFinalSale': False, 'isSpecialLaunch': False, 'id': 'EF1042', 'link': '/us/ultraboost-20-shoes/EF1042.html'}, 'M20324': {'price': 80, 'badgeStyle': '', 'badgeText': '', 'cached': False, 'salePrice': 80, 'image': {'src': 
'https://assets.adidas.com/images/w_280,h_280,f_auto,q_auto:sensitive/25c70a990dd74210aa47a59900ebfe5d_9366/stan-smith-shoes.jpg', 'cloudinary': True}, 'secondImage': {'src': 'https://assets.adidas.com/images/w_280,h_280,f_auto,q_auto:sensitive/f7f13f58f83e46698f15aacb01622c54_9366/stan-smith-shoes.jpg', 'cloudinary': True}, 'color': 'Cloud White / Core White / Green', 'modelId': 'ION05', 'orderable': True, 'validFrom': {'default': '2017-08-08T03:00:00.000Z', 'default@adidas-PE': '2017-05-01T03:00:00.000Z', 'default@adidas-CL': '2016-07-05T04:00:00.000Z', 'default@adidas-MX': '2017-01-01T06:00:00.000Z', 'default@adidas-CO': '2016-01-15T02:00:00.000Z', 'default@adidas-CA': '2015-01-27T06:00:00.000Z', 'default@adidas-US': '2014-01-15T08:00:00.000Z', 'default@adidas-BR': '2020-02-11T03:00:00.000Z', 'default@adidas-AR': '2017-08-08T03:00:00.000Z'}, 'previewTo': '2012-12-11T22:00:00.000Z', 'isFlash': False, 'isFinalSale': False, 'isSpecialLaunch': False, 'id': 'M20324', 'link': '/us/stan-smith-shoes/M20324.html'}}
...
...
...

To get the prices:

# `data` is the dictionary of API responses keyed by product ID.
for k, v in data.items():
    print(f"ProductId - {k}")
    print("Price - {}".format(v["price"]))
    print("Sale Price - {}".format(v["salePrice"]))
    print("---" * 20)

Output:

ProductId - FW2003
Price - 200
Sale Price - 200
------------------------------------------------------------
ProductId - FX7847
Price - 140
Sale Price - 140
------------------------------------------------------------
ProductId - B42200
Price - 140
Sale Price - 140
------------------------------------------------------------
ProductId - EF1042
Price - 180
Sale Price - 180
------------------------------------------------------------
ProductId - M20324
Price - 80
Sale Price - 80
------------------------------------------------------------
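Since the question asks for time-optimized code, one option is to overlap the per-product requests instead of issuing them one at a time. The following is only a sketch: `fetch_all`, `fake_fetch`, and the pool size of 8 are illustrative choices, and `fake_fetch` stands in for the real HTTP call so that the example runs offline. In practice `fetch_one` would wrap the request to the product endpoint, and the site may throttle or block clients that send many requests in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(product_ids, fetch_one, max_workers=8):
    """Call fetch_one(product_id) for every ID using a small thread pool.

    Returns a dict mapping each product ID to whatever fetch_one returned.
    """
    product_ids = list(product_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each ID with its result.
        results = pool.map(fetch_one, product_ids)
        return dict(zip(product_ids, results))


def fake_fetch(product_id):
    """Stand-in for the real HTTP request so the sketch runs offline."""
    return {"id": product_id, "price": 100}


data = fetch_all(["FW2003", "FX7847", "B42200"], fake_fetch)
print(data["FW2003"]["price"])
```

Threads suit this job because the work is dominated by waiting on the network rather than by computation. Keeping `max_workers` small is a way to stay polite to the server.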

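The web page you are scraping loads its prices dynamically, using JavaScript, after the page itself has loaded. BeautifulSoup only processes the HTML it is given, so the simple way around this is to load the page linked from each card and take the shoe's price from there, since the individual product pages do not load their prices dynamically. A more lightweight approach is to use the site's JSON API.

In that case, you can read each product's ID from the product list and then request its information from the site's JSON API, at https://www.adidas.com/api/search/product/FW2003 for product FW2003. From that JSON, you can build a dictionary using Python's requests module.

For example: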

import requests
from bs4 import BeautifulSoup

uri = "men-shoes"
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
html_page = requests.get("https://www.adidas.com/us/" + str(uri), headers=hdr, timeout=15)
soup = BeautifulSoup(html_page.content, 'html.parser')

# Each grid item carries its product ID in the data-grid-id attribute.
results = soup.find_all('div', attrs={'class': 'grid-item___eaXVb'})
ids = []
for res in results:
    ids.append(res.get('data-grid-id'))  # Tag.get is a method, so call it with ()

# Ask the JSON API for each product and read its price.
for product_id in ids:
    url = "https://www.adidas.com/api/search/product/" + product_id
    res = requests.get(url, headers=hdr, timeout=15)
    price = res.json()['price']
    ...
