Problem scraping the next page with Selenium



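I am trying to scrape basic information from Google. The code I am using is below. Unfortunately, it does not move on to the next page, and I cannot work out why. I am using Selenium with Google Chrome as the browser (not Firefox). Can you tell me what is wrong with my code?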

driver.get('https://www.google.com/advanced_search?q=google&tbs=cdr:1,cd_min:3/4/2020,cd_max:3/4/2020&hl=en')
search = driver.find_element_by_name('q')
search.send_keys('tea')
search.submit()

soup = BeautifulSoup(driver.page_source,'lxml')
result_div = soup.find_all('div', attrs={'class': 'g'})
titles = []

while True:
    next_page_btn =driver.find_elements_by_xpath("//a[@id='pnnext']")
    for r in result_div:
        if len(next_page_btn) <1:
            print("no more pages left")
            break
        else:
            try:
                title = None
                title = r.find('h3')
                if isinstance(title,Tag):
                    title = title.get_text()
                    print(title)
                    if title != '' :
                        titles.append(title)
            except:
                continue
    element =WebDriverWait(driver,5).until(expected_conditions.element_to_be_clickable((By.ID,'pnnext')))
    driver.execute_script("return arguments[0].scrollIntoView();", element)
    element.click()

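I set q in the query string to an empty string and used as_q instead of q as the name of the search box. I also reordered the code, and I added a page limit so that it does not run forever.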

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions

driver = webdriver.Chrome()
# Leave q empty in the URL; the query is typed into the form instead.
driver.get('https://www.google.com/advanced_search?q=&tbs=cdr:1,cd_min:3/4/2020,cd_max:3/4/2020&hl=en')
# On the advanced search form the search box is named as_q, not q.
# find_element(By.NAME, ...) is used because the find_element_by_* helpers
# were removed in recent Selenium 4 releases.
search = driver.find_element(By.NAME, 'as_q')
search.send_keys('tea')
search.submit()

titles = []
page_limit = 5  # maximum number of result pages to read
page = 0

while True:
    # Re-parse the page on every iteration so each new page is actually read.
    soup = BeautifulSoup(driver.page_source, 'lxml')
    result_div = soup.find_all('div', attrs={'class': 'g'})
    for r in result_div:
        for title in r.find_all('h3'):
            title = title.get_text()
            print(title)
            titles.append(title)
    page = page + 1
    # Stop when there is no "Next" link or the page limit has been reached.
    next_page_btn = driver.find_elements(By.ID, 'pnnext')
    if len(next_page_btn) == 0 or page >= page_limit:
        break
    element = WebDriverWait(driver, 5).until(expected_conditions.element_to_be_clickable((By.ID, 'pnnext')))
    driver.execute_script("return arguments[0].scrollIntoView();", element)
    element.click()

driver.quit()
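
The reordering matters because the code in the question parses the results once, before the loop starts, so every iteration looks at the same first page. The sketch below shows the same loop shape without a browser. It is only an illustration: `collect_titles` and `FakePager` are hypothetical names that stand in for the Selenium calls, and the pages are made-up lists of strings.

```python
from typing import Callable, List


def collect_titles(
    read_page: Callable[[], List[str]],
    go_next: Callable[[], bool],
    page_limit: int = 5,
) -> List[str]:
    """Collect titles page by page, re-reading the page on every iteration."""
    titles: List[str] = []
    pages_read = 0
    while True:
        # Read the current page inside the loop, so each move to the
        # next page is reflected in what gets collected.
        titles.extend(read_page())
        pages_read += 1
        # Stop at the limit, or when there is no next page to move to.
        if pages_read >= page_limit or not go_next():
            break
    return titles


class FakePager:
    """Hypothetical in-memory stand-in for the browser (illustration only)."""

    def __init__(self, pages: List[List[str]]) -> None:
        self.pages = pages
        self.index = 0

    def read_page(self) -> List[str]:
        # Plays the role of parsing driver.page_source.
        return list(self.pages[self.index])

    def go_next(self) -> bool:
        # Plays the role of finding and clicking the "Next" link.
        if self.index + 1 >= len(self.pages):
            return False
        self.index += 1
        return True


pager = FakePager([["tea 1", "tea 2"], ["tea 3"], ["tea 4"]])
print(collect_titles(pager.read_page, pager.go_next, page_limit=2))
# prints ['tea 1', 'tea 2', 'tea 3']
```

Keeping the read step and the "go to next page" step as separate callables makes the stopping rule easy to check on its own before wiring it to a real driver.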
