Python Selenium scrape crashes: can I find elements for only part of the web page?

I am trying to scrape some data from a website. The site has a "load more products" button. I am using:

driver.find_element_by_xpath("""//*[@id="showmoreresult"]""").click()

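to click the button, and this loops for a set number of iterations.

The problem I am running into is that once those iterations have finished, I want to extract the text from the web page using: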

posts = driver.find_elements_by_class_name("hotProductDetails")

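However, this seems to crash Chrome, so I cannot get any data out. What I would like to do is populate posts with the new products that have loaded after each iteration.

After clicking "load more", I would like to grab the text from the 50 products that have just loaded, append it to my list, and continue.

I can run the line posts = driver.find_elements_by_class_name("hotProductDetails") within each iteration, but it grabs every element on the page each time and really slows the process down.

Is there any way to achieve this in Selenium, or am I limited by using this library?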

This is the full script:

import csv
import time
from selenium import webdriver
import pandas as pd
def CeXScrape():
    print('Loading Chrome...')
    chromepath = r"C:\Users\leonK\Documents\Python Scripts\chromedriver.exe"
    driver = webdriver.Chrome(chromepath)
    driver.get(url)
    print('Prepping Webpage...')
    time.sleep(2)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    y = 0
    BreakClause = ExceptCheck = False
    while y < 1000 and BreakClause == False:
        y += 1
        time.sleep(0.5)
        try:
            driver.find_element_by_xpath("""//*[@id="showmoreresult"]""").click()
            ExceptCheck = False
            print('Load Count', y, '...')
        except:
            if ExceptCheck: BreakClause = True
            else: ExceptCheck = True
            print('Load Count', y, '...Lag...')
            time.sleep(2)
            continue
    print('Grabbing Elements...')
    posts = driver.find_elements_by_class_name("hotProductDetails")
    cats = driver.find_elements_by_class_name("superCatLink")
    print('Generating lists...')
    catlist = []
    postlist = []
    for cat in cats: catlist.append(cat.text)
    print('Categories Complete...')
    for post in posts: postlist.append(post.text)
    print('Products Complete...')
    return postlist, catlist
prods, cats = CeXScrape()
print('Extracting Lists...')
cat = []
subcat = []
prodname = []
sellprice = []
buycash = []
buyvoucher = []
for c in cats:
    cat.append(c.split('/')[0])
    subcat.append(c.split('/')[1])
for p in prods:
    prodname.append(p.split('\n')[0])
    sellprice.append(p.split('\n')[2])
    if 'WeBuy' in p:
        buycash.append(p.split('\n')[4])
        buyvoucher.append(p.split('\n')[6])
    else:
        buycash.append('NaN')
        buyvoucher.append('NaN')
print('Generating Dataframe...')
df = pd.DataFrame(
    {'Category' : cat,
     'Sub Category' : subcat,
     'Product Name' : prodname,
     'Sell Price' : sellprice,
     'Cash Buy Price' : buycash,
     'Voucher Buy Price' : buyvoucher})
print('Writing to csv...')
df.to_csv('Data.csv', sep=',', encoding='utf-8')
print('Completed!')

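Use XPath and limit the products you fetch. If you get 50 products each time, use something like this:

"(//div[@class='hotProductDetails'])[position() > {} and position() <= {}]".format((page - 1) * 50, page * 50)

This gives you 50 products at a time, and you increment page to get the next batch. Doing it all in one go will crash anyway.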
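To make the batching concrete, here is a minimal sketch of how that XPath could be wrapped up. The helper names batch_xpath and scrape_batch are made up for illustration; it assumes the product containers are div elements whose class attribute is exactly hotProductDetails, and that driver is a Selenium WebDriver exposing find_elements(by, value). The string "xpath" is the value of Selenium's By.XPATH constant, used directly here so the sketch needs no imports.

```python
def batch_xpath(page, batch_size=50):
    """Build an XPath that selects only the page-th batch of products.

    page is 1-based: page 1 covers positions 1..50, page 2 covers 51..100.
    """
    if page < 1:
        raise ValueError("page must be >= 1")
    start = (page - 1) * batch_size  # exclusive lower bound
    end = page * batch_size          # inclusive upper bound
    return ("(//div[@class='hotProductDetails'])"
            "[position() > {} and position() <= {}]".format(start, end))


def scrape_batch(driver, page, batch_size=50):
    """Return the text of the products in one batch only (hypothetical helper)."""
    # "xpath" is the value of selenium.webdriver.common.by.By.XPATH.
    elements = driver.find_elements("xpath", batch_xpath(page, batch_size))
    return [element.text for element in elements]
```

Inside the click loop from the question, something like postlist.extend(scrape_batch(driver, y)) after each successful click would then collect only the newly loaded products. That assumes each click adds exactly 50 products and that the batch number lines up with the loop counter; if the page already shows a first batch before any click, the numbering needs to be offset by one.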
