Scraping data from multiple sites with Selenium WebDriver



I am trying to scrape data from multiple URLs and save it to a CSV. With the code I have I can open all three sites, but only the data from the last link (including all of its pages) gets saved.

data = {}
for j, url in enumerate(urls):
    driver.get(url)
    for page in range(100):
        data = driver.find_elements_by_class_name("gs_ai_t")
        with open('pages.csv', 'a', newline='') as s:
            csv_writer = writer(s)
            for i in range(len(data)):
                nombre = driver.find_elements_by_class_name("gs_ai_name")
                n = nombre[i].text.replace(',', '')
                csv_writer.writerow([n])

        button_link = wait.until(EC.element_to_be_clickable((By.XPATH, button_locators)))
        button_link.click()

I have rearranged your code and made some small changes. See the notes at the bottom of the answer.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as W
from selenium.webdriver.support import expected_conditions as EC
from selenium.common import exceptions as SE
from selenium import webdriver
import time
from csv import writer

#driver = webdriver.Chrome(executable_path=" ")
chrome_path = r"C:\Users\gvste\Desktop\proyecto\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)

urls = ['https://scholar.google.com/citations?view_op=view_org&hl=en&authuser=2&org=17388732461633852730',
        'https://scholar.google.com/citations?view_op=view_org&hl=en&authuser=2&org=8337597745079551909',
        'https://scholar.google.com/citations?view_op=view_org&hl=en&authuser=2&org=6030355530770144394']

# this is the xpath for the NEXT button when it is *ENABLED*!
button_locators = "//button[@class='gs_btnPR gs_in_ib gs_btn_half gs_btn_lsb gs_btn_srt gsc_pgn_pnx']"
wait_time = 3
wait = W(driver, wait_time)

# prepare the csv file
with open('pages.csv', 'w', newline='') as s:
    csv_writer = writer(s)
    headers = ['Nombre', 'Universidad', 'Mail', 'Citas', 'Tags']
    csv_writer.writerow(headers)

for url in urls:
    data = {}
    driver.get(url)
    button_link = wait.until(EC.element_to_be_clickable((By.XPATH, button_locators)))
    # while the ENABLED button exists...
    while button_link:
        try:
            # wait for the data parent element to load
            wait.until(EC.visibility_of_element_located((By.ID, 'gsc_sa_ccl')))
            data = driver.find_elements_by_class_name("gs_ai_t")
            with open('pages.csv', 'a', newline='') as s:
                csv_writer = writer(s)
                for i in range(len(data)):
                    nombre = driver.find_elements_by_class_name("gs_ai_name")
                    universidad = driver.find_elements_by_class_name("gs_ai_aff")
                    mail = driver.find_elements_by_class_name("gs_ai_eml")
                    citas = driver.find_elements_by_class_name("gs_ai_cby")
                    tags = driver.find_elements_by_class_name("gs_ai_int")
                    link = driver.find_elements_by_class_name('gs_ai_pho')
                    n = nombre[i].text.replace(',', '')
                    u = universidad[i].text.replace(',', '')
                    m = mail[i].text.replace(',', '')
                    c = citas[i].text.replace(',', '')
                    t = tags[i].text.replace(',', '')
                    l = link[i].get_attribute('href')
                    csv_writer.writerow([n, u, m, c, t, l])
            button_link = wait.until(EC.element_to_be_clickable((By.XPATH, button_locators)))
            button_link.click()
        # when the ENABLED button no longer exists on the page, Selenium's
        # WebDriverWait throws a TimeoutException; in that case break the loop
        # and move on to the next url
        except SE.TimeoutException:
            print(f'Last page parsed for url {url}')
            break

driver.quit()

Notes:

  • Avoid relying on the full xpath. Look at the updated button_locators, and keep in mind that you will have a special case once you reach the last page of each url (see the sketch after this list).
  • For each url, create a fresh data dictionary and collect the details for as long as there is a "Next" button matching the button_locators xpath.
  • Do this inside a try-except block, because the button will not exist on the last page.
  • The .csv appending code is unchanged.
  • Note the Selenium exceptions import (SE) required for the except block.
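
As an alternative to letting the wait time out, you could inspect the button's state directly. This is a minimal sketch, assuming the last-page "Next" button stays in the DOM but carries the HTML disabled attribute (worth confirming against the live page; the gsc_pgn_pnx class is taken from the class list in button_locators):

from selenium.webdriver.common.by import By

def click_next_if_enabled(driver):
    # Hypothetical helper: click the "Next" button if it is enabled.
    # Returns False when the button is missing or disabled (i.e. last page).
    buttons = driver.find_elements(By.CLASS_NAME, "gsc_pgn_pnx")
    if not buttons or buttons[0].get_attribute("disabled"):
        return False
    buttons[0].click()
    return True

With a helper like this the while-loop can end cleanly instead of relying on SE.TimeoutException.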

No data is being collected while the urls are fetched:

data = {}
for j, url in enumerate(urls):
    driver.get(url)

You still end up with only the last page, because the driver is already sitting on the last page by the time you start parsing:

data = driver.find_elements_by_class_name("gs_ai_t")

Solution:
Instantiate a web driver for each url, have it fetch that url, and build a dictionary from them.

data = {i : webdriver.Chrome(chrome_path).get(url) for i, url in enumerate(urls)}
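
One caveat with the one-liner above: webdriver.get() returns None, so the dictionary values will not hold the drivers. To keep a handle on each browser, store the driver itself; a minimal sketch, reusing the chrome_path defined earlier:

# one browser per url; data[i] is the driver itself, not None
data = {i: webdriver.Chrome(chrome_path) for i in range(len(urls))}
for i, url in enumerate(urls):
    data[i].get(url)  # load each url in its own browser

# ...parse each data[i], then close the browsers
for d in data.values():
    d.quit()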

Another solution:
Start parsing the data as soon as the web driver has fetched each url:

for j, url in enumerate(urls):
    driver.get(url)
    wait = W(driver, wait_time)
    time.sleep(4)
    for page in range(100):
        data = driver.find_elements_by_class_name("gs_ai_t")
        ...
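
The fixed time.sleep(4) either wastes time or can still be too short on a slow connection. An explicit wait on the results container (the gsc_sa_ccl id already used in the first answer) is usually more reliable, for example:

# wait for the results container instead of sleeping a fixed 4 seconds
wait = W(driver, wait_time)
wait.until(EC.visibility_of_element_located((By.ID, 'gsc_sa_ccl')))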
