How can I use this script to scrape several pages?

I have a list of URLs and I wrote a loop whose goal is to visit each of these links and scrape some data across several pages.

Maybe it's because I'm mixing Selenium and BeautifulSoup, but it isn't working properly: my script gives me a CSV file with the wrong output.

If I tell the script to go through 2 pages, the output is a CSV file that contains the data from the first page, but twice. Something like this:

[screenshot: output CSV]

As you can see, the comments are duplicated instead of coming from the two pages scrolled through with Selenium.

Here is my script:

import re
import json
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime
import time
import random
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys

PATH = r"driver\chromedriver.exe"
options = webdriver.ChromeOptions()
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1200,900")
options.add_argument('enable-logging')
driver = webdriver.Chrome(options=options, executable_path=PATH)
driver.get('https://www.tripadvisor.ca/')
driver.maximize_window()
time.sleep(2)

j = 2  # number of pages

for url in linksfinal:
    driver.get(url)
    results = requests.get(url)
    comms = []
    notes = []
    dates = []

    soup = BeautifulSoup(results.text, "html.parser")
    name = soup.find('h1', class_='_1mTlpMC3').text.strip()
    commentary = soup.find_all('div', class_='_2wrUUKlw _3hFEdNs8')

    for k in range(j):  # iterate over n pages
        for container in commentary:
            comm = container.find('q', class_='IRsGHoPm').text.strip()
            comms.append(comm)
            comm1 = str(container.find("div", class_="nf9vGX55").find('span'))
            rat = re.findall(r'\d+', str(comm1))
            rat1 = (str(rat))[2]
            notes.append(rat1)
        time.sleep(3)

        next = driver.find_element_by_xpath('//a[@class="ui_button nav next primary "]')
        next.click()

    data = pd.DataFrame({
        'comms': comms,
        'notes': notes,
        # 'dates': dates
    })
    data.to_csv(f"{name}.csv", sep=';', index=False)
    time.sleep(3)

I suppose it must be my indentation, but I can't see where?

Well, you're trying to iterate over n pages in the `for k in range(j):` loop, but you're actually still iterating over the containers, which are members of `commentary`; `commentary` was taken from `soup`, and `soup` was built from `results = requests.get(url)`.
In other words, you may well be clicking the next-page button with `next.click()`, but you're still iterating over the data you collected at the very beginning, with the `requests.get(url)` call.
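
To make that concrete, here is a minimal self-contained illustration of why the rows come out duplicated; the one-review HTML string is hypothetical and stands in for `results.text`:

from bs4 import BeautifulSoup

# Hypothetical one-review page standing in for results.text (page 1)
page_one = '<div class="review"><q>great place</q></div>'
soup = BeautifulSoup(page_one, "html.parser")
commentary = soup.find_all('div', class_='review')  # frozen snapshot of page 1

# Even if a browser navigated to page 2 in the meantime,
# this list never changes, so "two pages" means "page 1, twice":
for k in range(2):
    for container in commentary:
        print(container.find('q').text)  # prints "great place" twice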
UPD
I'm not sure whether this will work on the pages you provided, but I think you should do something like this:


driver.get('https://www.tripadvisor.ca/')
driver.maximize_window()
time.sleep(2)

j = 2  # number of pages

for url in linksfinal:
    driver.get(url)
    results = requests.get(url)
    comms = []
    notes = []
    dates = []

    soup = BeautifulSoup(results.text, "html.parser")
    name = soup.find('h1', class_='_1mTlpMC3').text.strip()
    commentary = soup.find_all('div', class_='_2wrUUKlw _3hFEdNs8')

    for k in range(j):  # iterate over n pages
        for container in commentary:
            comm = container.find('q', class_='IRsGHoPm').text.strip()
            comms.append(comm)
            comm1 = str(container.find("div", class_="nf9vGX55").find('span'))
            rat = re.findall(r'\d+', str(comm1))
            rat1 = (str(rat))[2]
            notes.append(rat1)
        time.sleep(3)

        next = driver.find_element_by_xpath('//a[@class="ui_button nav next primary "]')
        next.click()

        # re-parse the page the browser is showing *now*, so the next
        # iteration works on fresh data instead of the initial snapshot
        soup = BeautifulSoup(driver.page_source, "html.parser")
        commentary = soup.find_all('div', class_='_2wrUUKlw _3hFEdNs8')

    data = pd.DataFrame({
        'comms': comms,
        'notes': notes,
        # 'dates': dates
    })
    data.to_csv(f"{name}.csv", sep=';', index=False)
    time.sleep(3)
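
Going a step further, here is a sketch under assumptions, not a tested implementation: since Selenium already loads every page, you could drop `requests` entirely and parse `driver.page_source` on each iteration. The obfuscated class names are copied from your question and may well have changed by now, the relaxed `contains()` XPath for the next button is my assumption, and `scrape_reviews` is just a name I made up:

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_reviews(driver, url, n_pages):
    """Collect review texts across n_pages, re-parsing the live DOM after every click."""
    comms = []
    driver.get(url)
    for _ in range(n_pages):
        # Parse what the browser is showing *now*, not a one-off requests.get() snapshot
        soup = BeautifulSoup(driver.page_source, "html.parser")
        for container in soup.find_all('div', class_='_2wrUUKlw _3hFEdNs8'):
            q = container.find('q', class_='IRsGHoPm')
            if q:
                comms.append(q.text.strip())
        try:
            # Wait until the next-page button is clickable; bail out on the last page
            next_btn = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, '//a[contains(@class, "nav next")]'))
            )
        except Exception:
            break
        next_btn.click()
    return comms

Waiting for the button instead of a fixed `time.sleep(3)` also makes the loop less fragile on slow pages.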
