Python Selenium: how to go to a Google search result URL without the page showing "not found", "access forbidden" or "permission denied"



I've just started learning web scraping with Selenium, and I'm trying to run a Google search and then iterate my code over each of the first 5 URLs the search returns.

My Google search loads fine, but when I go to any of the result URLs, the page shows "not found", "access forbidden" or "permission denied". This also happens if I paste the URL in manually. How do I get around this?

Or am I navigating to the next URL incorrectly? I'm currently just calling driver.get with the new URL.

from bs4 import BeautifulSoup
from selenium import webdriver
import requests
import re

search = '5 most popular dog breeds'
driver = webdriver.Chrome()
driver.get('https://www.google.co.in/#q=' + search)
b = driver.current_url
page = requests.get(b)
soup = BeautifulSoup(page.content, features="lxml")
links = soup.findAll("a")
urlList = []
# Put the first 5 URLs of the search into urlList.
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    urlList.append(re.split(":(?=http)", link["href"].replace("/url?q=", "")))
    if len(urlList) == 5:
        break
driver.get(urlList[0][0])
url = driver.current_url
page = requests.get(url)
pgsource = driver.page_source

You are opening the page correctly. The problem is that when you take the href of each a element, you are extracting extra query parameters along with the target URL.
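For example (a minimal sketch; the href value below is made up for illustration), a result link's href typically looks like /url?q=<target>&sa=...&ved=..., and the standard library's urllib.parse can pull out just the q parameter:

from urllib.parse import urlparse, parse_qs

# Hypothetical href as it might appear on a Google result page; everything
# after the first & belongs to Google's tracking parameters, not the target.
href = "/url?q=https://en.wikipedia.org/wiki/Dog&sa=U&ved=2ahUKEwi..."

# parse_qs splits the query string into a dict, so "q" holds only the target.
target = parse_qs(urlparse(href).query)["q"][0]
print(target)  # https://en.wikipedia.org/wiki/Dog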

I modified your code to keep only the part of each href that matches the regex pattern 'https?://[a-zA-Z0-9./-]+' (hyphen placed last so it is taken literally inside the character class), and to take just one link per web element (in your case there were sometimes two).

# Put the first 5 URLs from the search into urlList.
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    # findall can return two URLs for one element; keep only the first.
    r = re.findall(pattern=re.compile(r'https?://[a-zA-Z0-9./-]+'), string=link['href'])[0]
    urlList.append(r)
    if len(urlList) == 5:
        break

print(urlList[0])
driver.get(urlList[0])
pgsource = driver.page_source
print(pgsource)
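Since you mentioned wanting to iterate your code over each of the first 5 results, a minimal sketch of that loop would be (what you do per page is up to you):

for url in urlList:
    driver.get(url)
    pgsource = driver.page_source
    # Run your per-page scraping here, e.g. BeautifulSoup(pgsource, "lxml").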

You can also achieve the same with Selenium alone, without Beautiful Soup; it would look like this:

from selenium import webdriver

search = '5 most popular dog breeds'
driver = webdriver.Chrome()
driver.get('https://www.google.co.in/#q=' + search)

# Using XPath to filter the desired elements instead of a regex:
links = driver.find_elements_by_xpath("//a[@href!='' and contains(@ping,'/url?sa')]")
urls = []
for link in links[1:6]:
    urls.append(link.get_attribute('href'))

print(urls[0])
driver.get(urls[0])
pgsource = driver.page_source
print(pgsource)
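If the results have not finished rendering by the time find_elements runs, the list can come back empty. A sketch using Selenium's explicit waits (the 10-second timeout is an arbitrary choice) would be:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait until at least one result link is present before collecting them.
wait = WebDriverWait(driver, 10)
links = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, "//a[@href!='' and contains(@ping,'/url?sa')]")))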

This worked for me. I hope it helps, and good luck.
