Web crawler cannot retrieve results from a Google search



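I'm building a simple web crawler, and I want it to scrape the result pages of a Google search query such as "Donald Trump". I've written the following code: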

# import requests
from urllib.request import urlopen as uReq
import urllib.request
from bs4 import BeautifulSoup as soup
paging_url = "https://www.google.gr/search?ei=fvtMW8KMI4vdwQLS67yICA&q=donald+trump&oq=donald+trump&gs_l=psy-ab.3..35i39k1j0i131k1j0i203k1j0j0i203k1j0l3j0i203k1l2.4578.6491.0.6763.12.9.0.0.0.0.447.879.4-2.2.0....0...1c.1.64.psy-ab..10.2.878....0.aB3Y8R5B0U8"
req = urllib.request.Request("https://www.google.gr/search?ei=fvtMW8KMI4vdwQLS67yICA&q=donald+trump&oq=donald+trump&gs_l=psy-ab.3..35i39k1j0i131k1j0i203k1j0j0i203k1j0l3j0i203k1l2.4578.6491.0.6763.12.9.0.0.0.0.447.879.4-2.2.0....0...1c.1.64.psy-ab..10.2.878....0.aB3Y8R5B0U8", headers={'User-Agent': "Magic Browser"})
UClient = uReq(req)  # downloading the url
page_html = UClient.read()
UClient.close()
page_soup = soup(page_html, "html.parser")
results = page_soup.findAll("div", {"class": "srg"})
print(len(results))

A bit of explanation about my approach and what I noticed about the structure of Google's page:

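I'm trying to get only the search results, not the recommended videos or images that Google also shows. When recommended videos or images appear, the nine results sit under two "div" tags with the class "srg". Between those two "div" tags, another "div" tag containing the recommended videos/images is inserted.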

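My problem is that the "div" tags with the class "srg" can't be "seen" by my code. I don't know why BeautifulSoup ignores them. The same thing happens with the "div" tags of class "rc". Does anyone know why this happens?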

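I ran into some problems when building a web crawler with PhantomJS to extract Google search data. Sometimes I could browse a few pages and then the crawler would get lost. In some cases the returned page said that I appeared to be performing an illegal operation and that I should use the paid "Custom Search JSON API". The solution I found was to build the crawler against Yahoo instead. In my case, the results were satisfactory.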

The Google API lets you make 100 free searches per day. Depending on what the application is for, this may be a less troublesome solution.
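As a rough sketch of what using the Custom Search JSON API could look like: the endpoint and the `key`, `cx`, and `q` parameters follow Google's documentation, while the helper names `build_request_url` and `extract_results`, the placeholder credentials, and the hand-written sample response are assumptions made purely for illustration. You would need your own API key and search engine ID.

```python
from urllib.parse import urlencode

# Endpoint of the Custom Search JSON API.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_request_url(api_key, engine_id, query):
    """Build the GET URL for a single search (hypothetical helper)."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    return API_ENDPOINT + "?" + urlencode(params)


def extract_results(response_json):
    """Return (title, link, snippet) tuples from a decoded JSON response."""
    return [
        (item.get("title"), item.get("link"), item.get("snippet"))
        for item in response_json.get("items", [])
    ]


# Hand-written sample used only to show the shape of the data; not real output.
sample_response = {
    "items": [
        {
            "title": "Example",
            "link": "https://example.com",
            "snippet": "An example result.",
        }
    ]
}

url = build_request_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "donald trump")
results = extract_results(sample_response)
print(url)
print(results)
```

In a real script you would fetch the URL (for example with `requests.get(url).json()`) and pass the decoded response to `extract_results` instead of the sample dictionary.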

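To get only the search results, you can use the SelectorGadget Chrome extension to pick CSS selectors visually, and then use the bs4 methods select() (which you can iterate over) or select_one() (which grabs a single element).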

for result in soup.select('CSS_SELECTOR'):
    ...  # process each matched element

soup.select_one('CSS_SELECTOR')
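To make the difference between the two methods concrete, here is a minimal sketch that runs against a small hand-written HTML string rather than a live Google page. The markup and the class names `result` and `title` are invented for the example and have nothing to do with Google's real markup.

```python
from bs4 import BeautifulSoup

# Invented markup standing in for a results page.
html = """
<div class="result"><h3 class="title">First</h3><a href="https://example.com/1">link</a></div>
<div class="result"><h3 class="title">Second</h3><a href="https://example.com/2">link</a></div>
"""
page = BeautifulSoup(html, "html.parser")

# select() returns a list of every matching element, so it can be iterated.
titles = [result.select_one(".title").text for result in page.select(".result")]

# select_one() returns only the first matching element.
first_link = page.select_one(".result a")["href"]

print(titles)
print(first_link)
```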

Code to scrape the title, link, displayed link, and snippet, with an example in the online IDE:

import requests, lxml  # lxml is only needed as the parser backend for BeautifulSoup
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {'q': 'Donald Trump'}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf').a['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text
    snippet = result.select_one('.VwiC3b.yXK7lf.MUxGbd.yDYNvb.lyLwlc').text

    print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n')
# part of the output:
'''
Donald Trump | TheHill
https://thehill.com/people/donald-trump
https://thehill.com › people › donald-trump
12 hours ago — Donald Trump. Donald Trump. Getty Images. 0 Tweet Share More. Occupation: President of the United States, 2017 - 2021. Political Affiliation: Republican.
'''
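One thing to keep in mind with this approach: select_one() returns None when nothing matches, so calling .text on the result raises an AttributeError for any result that lacks, say, a snippet. A minimal sketch of one way to guard against that is shown here; the helper name `text_or_none` and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, css_selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(css_selector)
    return node.get_text(strip=True) if node is not None else None


# Invented markup: the second item deliberately has no description.
html = """
<div class="item"><h3 class="name">Has snippet</h3><p class="desc">Some text</p></div>
<div class="item"><h3 class="name">No snippet</h3></div>
"""
page = BeautifulSoup(html, "html.parser")

rows = [
    (text_or_none(item, ".name"), text_or_none(item, ".desc"))
    for item in page.select(".item")
]
print(rows)
```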

Alternatively, you can use the Google Search Engine Results API from SerpApi to do the same thing. It's a paid API with a free trial of 5,000 searches.

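Essentially, the difference is that you don't have to figure out how to scrape things or bypass blocks; that's already done for the end user. Check out the playground.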

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),  # API key read from the environment
    "engine": "google",
    "q": "Donald Trump",
}
search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    title = result['title']
    link = result['link']
    displayed_link = result['displayed_link']
    snippet = result['snippet']
    print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n')
# part of the output:
'''
Donald Trump - Wikipedia
https://en.wikipedia.org/wiki/Donald_Trump
https://en.wikipedia.org › wiki › Donald_Trump
Donald John Trump (born June 14, 1946) is an American media personality and businessman who served as the 45th president of the United States from 2017 ...
'''

Disclaimer: I work for SerpApi.
