BeautifulSoup fails to load all elements on a dynamically loaded page with no "load more" button



I'm trying to get all the products from the page https://www.doce34.cl/g-shock?order=OrderByNameASC&page=. The problem is that my script only finds 8 per page, while each page actually has 24. I've tried several approaches without success. Can anyone help me? Thanks in advance.

import requests
from bs4 import BeautifulSoup as bs
import time

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/71.0.3578.80 Chrome/71.0.3578.80 Safari/537.36",
}
filename1 = time.strftime("%Y%m%d")
hora = time.strftime("%Y%m%d-%H%M")
print(hora)
url = "https://www.doce34.cl/g-shock?order=OrderByNameASC&page="
download_delay = 3
for page in range(1, 40):
    req = requests.get(url + str(page))
    soup = bs(req.text, 'html.parser')
    contenedor_de_productos = soup.find(class_="vtex-search-result-3-x-gallery flex flex-row flex-wrap items-stretch bn ph1 na4 pl9-l")
    lista_de_productos = soup.find_all('div', class_='vtex-search-result-3-x-galleryItem vtex-search-result-3-x-galleryItem--normal pa4')
    contador_producto = 1
    for producto in lista_de_productos:
        print("producto numero " + str(contador_producto))
        texto_producto = producto.find(class_="vtex-product-summary-2-x-productNameContainer mv0 vtex-product-summary-2-x-nameWrapper overflow-hidden c-on-base f5").text
        texto_producto = texto_producto.replace('\n', '').replace('\t', '').replace(',', '').replace('"', '').strip()
        texto_producto_link = producto.find(class_="vtex-product-summary-2-x-container vtex-product-summary-2-x-container--product-summary vtex-product-summary-2-x-containerNormal vtex-product-summary-2-x-containerNormal--product-summary overflow-hidden br3 h-100 w-100 flex flex-column justify-between center tc").a['href']
        texto_producto_precio = producto.find(class_="vtex-product-price-1-x-currencyInteger").text
        contador_producto = contador_producto + 1
        print(texto_producto)

As I mentioned in the comments, with

lista_de_productos = soup.find_all('div', class_='vtex-search-result-3-x-galleryItem vtex-search-result-3-x-galleryItem--normal pa4')
print(len(lista_de_productos))
>> 8

you only get 8 products per page because the HTML response contains just 8 elements with the target class that holds an individual product's details.

You're right that each page contains 24 products; however, only 8 of them are available as those div elements, since the rest are rendered dynamically by JavaScript.

The good news is that the details of all 24 products are embedded as JSON data in a script tag.
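To illustrate the idea (a minimal, self-contained sketch using a made-up HTML snippet, not the live page), this is how BeautifulSoup plus the json module can recover every item from an `application/ld+json` script tag even when the product divs are missing:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical miniature of such a page: the full product list lives in a
# JSON-LD <script> tag, not in the server-rendered product <div>s.
html = """
<html><body>
<script type="application/ld+json">
{"itemListElement": [
  {"item": {"name": "G-SHOCK A"}},
  {"item": {"name": "G-SHOCK B"}},
  {"item": {"name": "G-SHOCK C"}}
]}
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("script", attrs={"type": "application/ld+json"})
data = json.loads(tag.string)  # parse the embedded JSON-LD payload
names = [p["item"]["name"] for p in data["itemListElement"]]
print(names)
```

The same pattern (find the script tag, `json.loads` its text, walk `itemListElement`) is what the full solution below applies to the real page.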

Here is a complete solution:

import json
import time
import requests
from bs4 import BeautifulSoup as bs

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/71.0.3578.80 Chrome/71.0.3578.80 Safari/537.36",
}
filename1 = time.strftime("%Y%m%d")
hora = time.strftime("%Y%m%d-%H%M")
print(hora)
url = "https://www.doce34.cl/g-shock?order=OrderByNameASC&page="
contador_producto = 1
# a list that will contain all individual products from all the pages
all_product = []
for page in range(1, 5):
    req = requests.get(url + str(page), headers=headers)
    soup = bs(req.text, 'html.parser')
    # the last application/ld+json script tag holds the product list
    json_script_tag = soup.find_all('script', attrs={"type": "application/ld+json"})[-1]
    lista_de_productos = json.loads(json_script_tag.text)
    print(f"Number of products on page {page}: {len(lista_de_productos['itemListElement'])}")
    for producto in lista_de_productos['itemListElement']:
        print("producto numero " + str(contador_producto))
        product_dict = {
            "product_name": producto['item']['name'],
            "product_brand": producto['item']['brand']['name'],
            "product_image": producto['item']['image'],
            "product_description": producto['item']['description'],
            "product_link": producto['item']['@id'],
            "product_price": producto['item']['offers']['lowPrice']
        }
        # a dictionary containing individual product information
        print(product_dict)
        all_product.append(product_dict)
        contador_producto += 1
    time.sleep(1)
print(all_product)
# check the length of all_product; e.g. it will be 96 (24*4) for 4 pages
# print(len(all_product))
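Since `filename1` is computed above but never used, one natural follow-up (a sketch, assuming you want a timestamped CSV export; the `save_products` helper is mine, not part of the original script) is to persist `all_product` to disk:

```python
import csv

def save_products(products, path):
    """Write a list of product dicts to a CSV file with a header row."""
    if not products:
        return  # nothing to write
    fieldnames = list(products[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(products)

# e.g. save_products(all_product, filename1 + ".csv")
```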

Hope this helps.