Python Selenium: how to go to a Google search result URL without the page showing "not found", "access forbidden" or "permission denied"



I've just started learning web scraping with Selenium, and I'm trying to run a Google search and then iterate my code over each of the first 5 URLs the search returns.

My Google search loads fine, but when I go to any of the result URLs, the page shows "not found", "access forbidden" or "permission denied". This also happens if I paste the URL in manually. How do I get around this?

Or am I navigating to the next URL incorrectly? I'm currently just calling driver.get with the new URL.

from bs4 import BeautifulSoup
from selenium import webdriver
import requests
import re

search = '5 most popular dog breeds'
driver = webdriver.Chrome()
driver.get('https://www.google.co.in/#q=' + search)
b = driver.current_url
page = requests.get(b)
soup = BeautifulSoup(page.content, features="lxml")
links = soup.findAll("a")
urlList = []
# Put the first 5 URLs of the search into urlList.
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    urlList.append(re.split(":(?=http)", link["href"].replace("/url?q=", "")))
    if len(urlList) == 5:
        break
driver.get(urlList[0][0])
url = driver.current_url
page = requests.get(url)
pgsource = driver.page_source

You are opening the page correctly. The problem is that when you take the href of each a element, you are extracting extra query parameters along with the target URL.
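For example (a minimal sketch; the href value below is made up for illustration), a result link's href typically looks like /url?q=<target>&sa=...&ved=..., and the standard library's urllib.parse can pull out just the q parameter:

from urllib.parse import urlparse, parse_qs

# Hypothetical href as it might appear on a Google result page; everything
# after the first & belongs to Google's tracking parameters, not the target.
href = "/url?q=https://en.wikipedia.org/wiki/Dog&sa=U&ved=2ahUKEwi..."

# parse_qs splits the query string into a dict, so "q" holds only the target.
target = parse_qs(urlparse(href).query)["q"][0]
print(target)  # https://en.wikipedia.org/wiki/Dog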

I modified your code to keep only the part of each href that matches the regex pattern 'https?://[a-zA-Z0-9./-]+' (hyphen placed last so it is taken literally inside the character class), and to take just one link per web element (in your case there were sometimes two).

# Put the first 5 URLs from the search into urlList.
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    # findall can return two URLs for one element; keep only the first.
    r = re.findall(pattern=re.compile(r'https?://[a-zA-Z0-9./-]+'), string=link['href'])[0]
    urlList.append(r)
    if len(urlList) == 5:
        break

print(urlList[0])
driver.get(urlList[0])
pgsource = driver.page_source
print(pgsource)
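Since you mentioned wanting to iterate your code over each of the first 5 results, a minimal sketch of that loop would be (what you do per page is up to you):

for url in urlList:
    driver.get(url)
    pgsource = driver.page_source
    # Run your per-page scraping here, e.g. BeautifulSoup(pgsource, "lxml").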

You can also achieve the same with Selenium alone, without Beautiful Soup; it would look like this:

from selenium import webdriver

search = '5 most popular dog breeds'
driver = webdriver.Chrome()
driver.get('https://www.google.co.in/#q=' + search)

# Using XPath to filter the desired elements instead of a regex:
links = driver.find_elements_by_xpath("//a[@href!='' and contains(@ping,'/url?sa')]")
urls = []
for link in links[1:6]:
    urls.append(link.get_attribute('href'))

print(urls[0])
driver.get(urls[0])
pgsource = driver.page_source
print(pgsource)
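If the results have not finished rendering by the time find_elements runs, the list can come back empty. A sketch using Selenium's explicit waits (the 10-second timeout is an arbitrary choice) would be:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait until at least one result link is present before collecting them.
wait = WebDriverWait(driver, 10)
links = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, "//a[@href!='' and contains(@ping,'/url?sa')]")))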

This worked for me. I hope it helps, and good luck.
