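I'm building a simple web crawler, and I want it to scrape the results page of a Google search query (for example, "Donald Trump"). I have written the following code: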
# import requests
from urllib.request import urlopen as uReq
import urllib.request
from bs4 import BeautifulSoup as soup
paging_url = "https://www.google.gr/search?ei=fvtMW8KMI4vdwQLS67yICA&q=donald+trump&oq=donald+trump&gs_l=psy-ab.3..35i39k1j0i131k1j0i203k1j0j0i203k1j0l3j0i203k1l2.4578.6491.0.6763.12.9.0.0.0.0.447.879.4-2.2.0....0...1c.1.64.psy-ab..10.2.878....0.aB3Y8R5B0U8"
req = urllib.request.Request(paging_url, headers={'User-Agent': "Magic Browser"})
UClient = uReq(req) # downloading the url
page_html = UClient.read()
UClient.close()
page_soup = soup(page_html, "html.parser")
results = page_soup.findAll("div", {"class": "srg"})
print(len(results))
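A short explanation of my approach and of what I noticed about the structure of Google's page:
I'm trying to get only the search results, not the suggested videos or images that Google also shows. When suggested videos or images appear, the nine results sit under two "div" tags with the class "srg". Between those two "div" tags there is another "div" tag holding the suggested videos/images.
My problem is that my code cannot "see" the "div" tags with the class "srg". I don't know why BeautifulSoup ignores them. The same thing happens with the "div" tags that have the class "rc". Does anyone know why this happens?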
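I ran into some problems when I used PhantomJS to build a web crawler for extracting Google search data. Sometimes I could browse a few pages and then the crawler would get lost. In some cases the returned page said that I appeared to be doing something not allowed and that I should use the paid "Custom Search JSON API" instead. The solution I found was to build the crawler against Yahoo. In my case the results were satisfactory.
The Google API gives you 100 free searches per day. Depending on what the application is for, this may be the less troublesome solution.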
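As a rough illustration of the API route, the sketch below builds a request URL for the Custom Search JSON API and pulls the title, link, and snippet out of a decoded response. The endpoint and the `key`, `cx`, and `q` parameters are taken from Google's public documentation as I understand it. `YOUR_API_KEY` and `YOUR_ENGINE_ID` are placeholders, `build_cse_url` and `extract_items` are hypothetical helper names, and the sample response is a hand-written stand-in rather than real output.

```python
from urllib.parse import urlencode

# Endpoint of the Custom Search JSON API (assumed from Google's public docs).
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_cse_url(api_key, engine_id, query):
    """Return the GET URL for one search request."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    return f"{CSE_ENDPOINT}?{urlencode(params)}"


def extract_items(response):
    """Pull (title, link, snippet) tuples out of a decoded JSON response."""
    return [
        (item.get("title"), item.get("link"), item.get("snippet"))
        for item in response.get("items", [])
    ]


# Placeholders: substitute your own credentials.
url = build_cse_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "Donald Trump")

# Hand-written stand-in for a decoded response, not real API output.
sample_response = {
    "items": [
        {
            "title": "Example title",
            "link": "https://example.com",
            "snippet": "Example snippet",
        }
    ]
}
rows = extract_items(sample_response)
print(url)
print(rows)
```

To send the request for real, something like `requests.get(url).json()` should give a dictionary that can be passed to `extract_items`.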
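To get only the search results, you can use the SelectorGadget Chrome extension to pick CSS selectors visually, and then pass them to the bs4 methods select() (which returns a list you can iterate over) or select_one() (which grabs only one element).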
for result in soup.select('CSS_SELECTOR'):
    ...
soup.select_one('CSS_SELECTOR')
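To make the difference between the two methods concrete, here is a small sketch run against a hand-written HTML fragment. The `.result` and `.title` class names are made up for illustration and are not Google's markup.

```python
from bs4 import BeautifulSoup

# Hand-written fragment; the class names are hypothetical.
html = """
<div class="result"><h3 class="title">First</h3></div>
<div class="result"><h3 class="title">Second</h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every match, so it can be iterated over.
titles = [tag.text for tag in soup.select(".result .title")]

# select_one() returns only the first match, or None when nothing matches.
first = soup.select_one(".result .title")
missing = soup.select_one(".no-such-class")

print(titles)
print(first.text)
print(missing)
```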
Code, with a full example in the online IDE, to scrape the title, link, displayed link, and snippet:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {'q': 'Donald Trump'}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')
# container with all needed data
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf').a['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text
    snippet = result.select_one('.VwiC3b.yXK7lf.MUxGbd.yDYNvb.lyLwlc').text
    print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n')
# part of the output:
'''
Donald Trump | TheHill
https://thehill.com/people/donald-trump
https://thehill.com › people › donald-trump
12 hours ago — Donald Trump. Donald Trump. Getty Images. 0 Tweet Share More. Occupation: President of the United States, 2017 - 2021. Political Affiliation: Republican.
'''
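Class names such as `.tF2Cxc` are generated and may change, and `select_one()` returns `None` when a selector no longer matches, in which case calling `.text` on the result raises `AttributeError`. A minimal sketch of a more defensive version follows. The `text_or_none` helper is hypothetical, and the HTML is a hand-written stand-in that reuses some of the class names above, not a real Google response.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


# Hand-written stand-in markup; the second result has no snippet.
html = """
<div class="tF2Cxc">
  <div class="yuRUbf"><a href="https://example.com/a"><h3 class="DKV0Md">Title A</h3></a></div>
  <div class="VwiC3b">Snippet A</div>
</div>
<div class="tF2Cxc">
  <div class="yuRUbf"><a href="https://example.com/b"><h3 class="DKV0Md">Title B</h3></a></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for result in soup.select(".tF2Cxc"):
    anchor = result.select_one(".yuRUbf a")
    rows.append({
        "title": text_or_none(result, ".DKV0Md"),
        # Guard the attribute lookup as well as the text lookups.
        "link": anchor["href"] if anchor is not None else None,
        "snippet": text_or_none(result, ".VwiC3b"),
    })

for row in rows:
    print(row)
```

A missing field then shows up as `None` in the row instead of stopping the loop, which makes it easier to notice when a selector has gone stale.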
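Alternatively, you can do the same thing with the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Essentially, the difference is that you don't have to figure out how to scrape things or how to bypass blocks, because that is already done for the end user. Check out the playground.
Code to integrate: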
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "Donald Trump",
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    title = result['title']
    link = result['link']
    displayed_link = result['displayed_link']
    snippet = result['snippet']
    print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n')
# part of the output:
'''
Donald Trump - Wikipedia
https://en.wikipedia.org/wiki/Donald_Trump
https://en.wikipedia.org › wiki › Donald_Trump
Donald John Trump (born June 14, 1946) is an American media personality and businessman who served as the 45th president of the United States from 2017 ...
'''
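The loop above assumes that every result carries all four keys, which may not hold for every query. A hedged variant that tolerates missing fields is sketched below. `sample_results` is a hand-written stand-in for the dictionary returned by `search.get_dict()`, not real API output, and `format_result` is a hypothetical helper.

```python
def format_result(result):
    """Join the fields that are present, one per line."""
    keys = ("title", "link", "displayed_link", "snippet")
    return "\n".join(result[k] for k in keys if result.get(k))


# Hand-written stand-in for search.get_dict(); the second entry lacks a snippet.
sample_results = {
    "organic_results": [
        {
            "title": "A",
            "link": "https://example.com/a",
            "displayed_link": "https://example.com › a",
            "snippet": "Snippet A",
        },
        {
            "title": "B",
            "link": "https://example.com/b",
            "displayed_link": "https://example.com › b",
        },
    ]
}

# .get() with a default avoids a KeyError when the key is absent.
formatted = [format_result(r) for r in sample_results.get("organic_results", [])]
print("\n\n".join(formatted))
```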
Disclaimer: I work for SerpApi.