Data is only fetched correctly from the website intermittently (inconsistent results)



I am trying to scrape data from a website. Here is the code I have so far.

These are the modules I import:

import bs4
import pandas as pd
import numpy as np
import random
import requests
from lxml import etree
import time
from tqdm.notebook import tqdm
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementNotInteractableException
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager

The following collects the URL of each target product:

urls = []  # collected product links
driver = webdriver.Chrome(ChromeDriverManager().install())
for page in tqdm(range(5, 10)):
    driver.get("https://shopee.ph/Makeup-Fragrances-cat.11021036?facet=100664&page="+str(page)+"&sortBy=pop")

    skincare = driver.find_elements(By.XPATH, '//div[@class="col-xs-2-4 shopee-search-item-result__item"]//a[@data-sqe="link"]')
    for _skincare in tqdm(skincare):
        urls.append({"url":_skincare.get_attribute('href')})
driver.quit()

That part works and the URLs are retrieved successfully. Here is what I do next:

data_final = pd.DataFrame(urls)
driver = webdriver.Chrome(ChromeDriverManager().install())
skincares = []
for product in tqdm(data_final["url"]):
    driver.get(product)
    try:
        company = driver.find_element(By.XPATH,"//div[@class='CKGyuW']//div[@class='_1Yaflp page-product__shop']//div[@class='_1YY3XU']//div[@class='zYQ1eS']//div[@class='_3LoNDM']").text
    except:
        company = 'none'
    try:
        product_name = driver.find_element(By.XPATH,"//div[@class='flex flex-auto eTjGTe']//div[@class='flex-auto flex-column  _1Kkkb-']//div[@class='_2rQP1z']//span").text
    except:
        product_name = 'none'
    try:
        rating = driver.find_element(By.XPATH,"//div[@class='flex _3tkSsu']//div[@class='flex _3T9OoL']//div[@class='_3y5XOB _14izon']").text
    except:
        rating = 'none'
    try:
        number_of_ratings = driver.find_element(By.XPATH,"//div[@class='flex _3tkSsu']//div[@class='flex _3T9OoL']//div[@class='_3y5XOB']").text
    except:
        number_of_ratings = 'none'
    try:
        sold = driver.find_element(By.XPATH,"//div[@class='flex _3tkSsu']//div[@class='flex _3EOMd6']//div[@class='HmRxgn']").text
    except:
        sold = 'none'
    try:
        price = driver.find_element(By.XPATH,"//div[@class='_2Shl1j']").text
    except:
        price = 'none'
    try:
        description = driver.find_element(By.XPATH,"//div[@class='_1MqcWX']//p[@class='_2jrvqA']").text
    except:
        description = 'none'

    skincares.append({
        "url": product,
        "company": company,
        "product name": product_name,
        "rating": rating,
        "number of ratings": number_of_ratings,
        "sold": sold,
        "price": price,
        "description": description,
    })
    time.sleep(5)

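I added time.sleep(x) to avoid being blocked, and I tried x = 1, 1.5, 2, 5, and 15. The results from the code above are inconsistent. Displaying the resulting DataFrame: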

skincares_data = pd.DataFrame(skincares)
skincares_data


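The output contains many values that are blank or were not fetched correctly. If I rerun the code, I get a different result: some fields that were blank now have data, while some that were fetched correctly are now blank. Running it yet again shows the same problem.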

I suspect I am being "blocked" (I only added time.sleep() to guard against that).
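To compare runs more precisely, the blanks can be counted per field. The following is only a minimal sketch: the helper name `count_missing` and the sample records are hypothetical, and it assumes the records have the same shape as the `skincares` list, with 'none' as the fallback value.

```python
from collections import Counter

PLACEHOLDER = "none"  # the fallback value assigned in the except branches


def count_missing(records, placeholder=PLACEHOLDER):
    """Return {field: count of records where the field is blank or the placeholder}."""
    missing = Counter()
    for record in records:
        for field, value in record.items():
            if value is None or str(value).strip() in ("", placeholder):
                missing[field] += 1
    return dict(missing)


# Hypothetical sample shaped like the `skincares` list; the values are made up.
sample = [
    {"url": "https://example.com/a", "price": "₱120", "sold": "none"},
    {"url": "https://example.com/b", "price": "", "sold": "none"},
]
print(count_missing(sample))
```

Running this on the real list after each attempt would show whether the same fields fail every time or whether the failures move around between runs.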

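Any thoughts?

To summarize: I am trying to scrape data from a website. I get the product URLs successfully, but the details of each product are not fetched reliably. Many fields come back blank; on any given run a field is either blank or fetched correctly.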

The page loads its content dynamically as you scroll down. The following code should solve your problem:

[..]
wait = WebDriverWait(driver, 15)
url='https://shopee.ph/Makeup-Fragrances-cat.11021036?facet=100664&page=1&sortBy=pop'
driver.get(url)
# wait until the result tiles exist, then scroll each one into view so it renders
rows= wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[contains(@class, "shopee-search-item-result__item")]')))
for r in rows:
    r.location_once_scrolled_into_view
# `t` is presumably the time module from the elided setup (e.g. `import time as t`)
t.sleep(5)
products = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@data-sqe="item"]')))
for p in products:
    name = p.find_element(By.XPATH, './/div[@data-sqe="name"]').text.strip()
    some_id = p.find_element(By.XPATH, './/a[@data-sqe="link"]').get_attribute('href').split('?sp_atk=')[0].split('-i.')[1]
    print(name, some_id)

All items will be printed in the terminal:

ORIG M.Q. Cosmetics MACAROON LIP THERAPY LIPBALM WITH SPATULA | MQ
wholesale 10092844.9115684791
Magic Lip Therapy Balm in 10g jar (FREE Spatula) Rebranding NO STICKER! 286498185.11511633880
BIOAQUA COLLAGEN Nourish Lips Membrane Moisturizing Lip Mask moisture nourishing skin care soft 295464315.8585504678
Lip therapy Cosmetic Potion lipbalm
₱5 off
Free Gift 11055729.11663828134
VASELINE Rosy Lip Stick 4.8g 92328166.8130605004
Collagen Crystal lip mask lips plump gel personal care hydrating lip whitening a smacker wrinkle gel 386726777.2925165359
blk cosmetics fresh lip scrub coco crush 62677292.5532509493
[...]
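The chained split() calls used for some_id raise an IndexError when a link does not contain '-i.'. A slightly more defensive version might look like the sketch below. The helper name `parse_ids` and the example link are hypothetical, and the URL shape (a slug followed by `-i.` and two dot-separated numbers, presumably a shop ID and an item ID) is an assumption inferred from the split() calls and the printed output.

```python
import re
from urllib.parse import urlsplit

# Assumed URL shape, inferred from the split() calls in the answer:
#   https://shopee.ph/<slug>-i.<first_id>.<second_id>?sp_atk=...
_ID_PATTERN = re.compile(r"-i\.(\d+)\.(\d+)$")


def parse_ids(href):
    """Return the two numeric IDs as a tuple of strings, or None if the path does not match."""
    path = urlsplit(href).path  # drops the query string, e.g. ?sp_atk=...
    match = _ID_PATTERN.search(path)
    return match.groups() if match else None


# Hypothetical link for illustration; the slug and the sp_atk value are made up.
example = "https://shopee.ph/Some-Lip-Balm-i.10092844.9115684791?sp_atk=abc123"
print(parse_ids(example))
```

Returning None instead of raising lets the loop skip or log odd links without stopping the whole scrape.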

See the Selenium documentation for more details.
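The same idea (wait for the content instead of reading it immediately) can be applied to the individual product pages from the question. Below is a rough, library-independent sketch of that polling logic. The helper name `poll_until` and the stand-in reader are hypothetical. With a real driver, Selenium's own WebDriverWait is probably the better tool, since it already does this kind of polling.

```python
import time


def poll_until(read, timeout=15.0, interval=0.5, default="none"):
    """Call read() repeatedly until it returns non-empty text or the timeout expires.

    Exceptions raised by read() (for example, the element is not in the DOM yet)
    are treated as "not ready" and lead to another attempt.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            text = read()
            if text and text.strip():
                return text.strip()
        except Exception:
            pass  # not rendered yet; retry until the deadline
        if time.monotonic() >= deadline:
            return default
        time.sleep(interval)


# Intended use with the question's code (not executed here):
#   price = poll_until(lambda: driver.find_element(By.XPATH, "//div[@class='_2Shl1j']").text)

# Stand-in reader that fails twice and then succeeds, imitating a slow page.
attempts = {"n": 0}


def fake_read():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise LookupError("element not rendered yet")
    return " ₱120 "


print(poll_until(fake_read, timeout=2, interval=0.01))
```

Unlike a fixed time.sleep(5), this returns as soon as the text is available and only falls back to 'none' after the full timeout, which should make the blanks less dependent on how fast each page happens to load.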
