Unable to get Google search results in Python



I am building a script to scrape Google search results. This is what I have so far.

import urllib
keyword = "google"
print urllib.urlopen("https://www.google.co.in/search?q=" + keyword).read()

But it gives me the following response:

<!DOCTYPE html><html lang=en><meta charset=utf-8><meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width"><title>Error 403 (Forbidden)!!1</title><style>*{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/errors/logo_sm_2.png) no-repeat}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/errors/logo_sm_2_hr.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/errors/logo_sm_2_hr.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/errors/logo_sm_2_hr.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:55px;width:150px}</style><a href=//www.google.com/><span id=logo aria-label=Google></span></a><p><b>403.</b> <ins>That’s an error.</ins><p>Your client does not have permission to get URL <code>/search?q=google</code> from this server.  (Client IP address: 117.196.168.89)<br><br>
Please see Google's Terms of Service posted at http://www.google.com/terms_of_service.html
<BR><BR><P>If you believe that you have received this response in error, please <A HREF="http://www.google.com/support/bin/request.py?contact_type=user&hl=en">report</A> your problem. However, please make sure to take a look at our Terms of Service (http://www.google.com/terms_of_service.html). In your email, please send us the <b>entire</b> code displayed below.  Please also send us any information you may know about how you are performing your Google searches-- for example, "I'm using the Opera browser on Linux to do searches from home.  My Internet access is through a dial-up account I have with the FooCorp ISP." or "I'm using the Konqueror browser on Linux to search from my job at myFoo.com.  My machine's IP address is 10.20.30.40, but all of myFoo's web traffic goes through some kind of proxy server whose IP address is 10.11.12.13."  (If you don't know any information like this, that's OK.  But this kind of information can help us track down problems, so please tell us what you can.)</P><P>We will use all this information to diagnose the problem, and we'll hopefully have you back up and searching with Google again quickly!</P>
<P>Please note that although we read all the email we receive, we are not always able to send a personal response to each and every email.  So don't despair if you don't hear back from us!</P>
<P>Also note that if you do not send us the <b>entire</b> code below, <i>we will not be able to help you</i>.</P><P>Best wishes,<BR>The Google Team</BR></P><BLOCKQUOTE>/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/<BR>
aef0-l8vNw3cWys_OWGKrv6VYDewUx0bhWxSeo2Mk4vGTZSoh<BR>
MdeNZki3vp-kzRGjrBTseg6uGBypibuTNGSeJoPRkDPCOFkyA<BR>
YBVgssaJaqSibV7khohBnsUVRVZqALwIe2lD6pdddMQIZ-Zg2<BR>
WEE-rO-ZackE5L2gwlmHZHP2oWML3ZlGgUL6CAbMbFmzVda38<BR>
ZYYVZLKBcjY1gSLk-FSzBc7QQnp0vrhkY6LnrALX94oK7Yrml<BR>
bKX-5KmpyhsI7aW3da5Rt5nt0K9PVPbKvpZ1LN-hdRqg749K6<BR>
T4v8mGfuH6BHSQUAPW1Byx_Wy1TGsyhZJQ02jrz7K0RBg4r0i<BR>
9O6Rs7-FFRzESkiyzRQaExUdpBpl3Mmguh1JXR_yxDJre9R7u<BR>
3AWKfCkt8BxKuv37oAIslM2Caor4QBXSNrq1F7zUetx8HxmaW<BR>
pX_6KsXyjs3-Pfq5NKOuzNCjatrhXdKC74NmNHztTPJU-4MzV<BR>
kUPuUehnDYgcgGAVYLLGiWvG4Scm8G2Gq2UnacMQsZ5BB7rgY<BR>
DXJnZwbMbVX53-llhCMeQfBTteOWIfWQR2FOyc-tuaRHX6c3N<BR>
rzpNDX9ZufFfOXRNkaORCZxkSEoX1xDBq0VGdkkCfwlUdG9Jq<BR>
prYBPnpRyhjxjC3c4n68AuEYHtMTVmbK-fyMtcWLMTVXzIrYS<BR>
EjACpMTnHRavhYza4ZJgs4SViS4FrsmJ0P3CdyLLayR0xMFM6<BR>
m7rxy-zaABo7iof_re5PKcFP6EYqD0Wm-ZlLksUh2a1LVaAsq<BR>
sSqnPPqq5qCu0z8wQe5jeGCRCY2vrT5HWmYNJbhyCyN_HiHGR<BR>
bHDb8f3_OcgAHsT7zv1a4FOG4B0JztqskzYmssBb-ezvErkp6<BR>
uZtwiKJc30F30RpQhKEb_rPjhpwc5dr3MUsTuki2j2tBSQl_O<BR>
kjFef_Jvl3u8TPQY5c6dqUSQv--p0N95Jv-WehS32lvyUbeEB<BR>
mN7ZC8oCFj06BRn5NaU9P8p1d7fmYyxyta2dZ21UfaRMhX8TZ<BR>
VgKiSDVyMO2GZ09bUEFGW4KvvTJDyQT_UMkCsahrv2MP_yI-D<BR>
fwEArSXvPIpyESHeyPXfFN-Z9_OuVwGDU2riHFIWgw5IPwtER<BR>
e0Ukzrn2iwGHHL8j2JdSNbunrifS-RqkK2hgQl16-TfqN11NL<BR>
Lgwtt-Kp3XL86K61Qq7lU-NxB8BOO_i-QOQszn6uRmb3VR__Q<BR>
T_0E9FULbsR9kgTyXDKQmOQ-3qeaFlz4in9V9PJ<BR>
+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+/+<BR></BLOCKQUOTE>

Doesn't Google allow its pages to be scraped?

Effectively, no: Google blocks bots. However, you can use mechanize to imitate a browser and get the results.

    import mechanize
    chrome = mechanize.Browser()
    # Ignore robots.txt and present a browser-like user-agent
    chrome.set_handle_robots(False)
    chrome.addheaders = [('User-agent',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36')]
    base_url = 'https://www.google.co.in/search?q='
    search_url = base_url + keyword.replace(' ', '+')
    htmltext = chrome.open(search_url).read()

Try this. I hope it helps.
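The `keyword.replace(' ', '+')` step above only handles spaces. As a minimal sketch, assuming Python 3, the standard library's `urllib.parse.quote_plus` also escapes reserved characters such as `&` or `#`. The `build_search_url` helper name is hypothetical and not part of the original answer.

```python
from urllib.parse import quote_plus

# Same base URL as in the answer above
BASE_URL = "https://www.google.co.in/search?q="

def build_search_url(keyword):
    """Return the search URL with the keyword percent-encoded."""
    # quote_plus turns spaces into '+' and escapes reserved characters like '&'
    return BASE_URL + quote_plus(keyword)

print(build_search_url("pizza & pasta"))
```

The resulting URL can be passed to `chrome.open(...)` in place of `search_url`.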

You can also fake the headers in urllib2 to get the results, like this:

import urllib2
keyword = "google"
url = "https://www.google.co.in/search?q=" + keyword
# Build an opener
opener = urllib2.build_opener()
# If you are behind a proxy, build the opener with a ProxyHandler instead
#opener = urllib2.build_opener(urllib2.ProxyHandler(proxies={"http": "http://proxy.corp.ads:8080"}))
# Fake a browser user-agent
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print opener.open(url).read()
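The snippet above targets Python 2, where `urllib2` exists. A rough Python 3 equivalent, assuming only the standard library, might look like the sketch below. The `make_request` and `fetch` helper names are illustrative, and Google may still reject the request depending on the headers and the request rate.

```python
import urllib.request
from urllib.parse import urlencode

def make_request(keyword):
    """Build (but do not send) a search request with a browser-like User-Agent."""
    url = "https://www.google.co.in/search?" + urlencode({"q": keyword})
    return urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

def fetch(keyword):
    """Send the request and return the decoded response body."""
    with urllib.request.urlopen(make_request(keyword)) as response:
        return response.read().decode("utf-8", errors="replace")

# Example (performs a real network call):
# print(fetch("google"))
```

Separating request construction from sending makes it possible to inspect the URL and headers without touching the network.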

Google treats your script differently depending on its user-agent (if you are using requests, the default is python-requests).

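All you need to do is specify a browser user-agent (Chrome, Mozilla, Edge, IE, Safari...) so that Google treats the request as coming from a "user", in other words, fake a real browser visit.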
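To see why an unmodified script stands out, you can inspect the User-Agent that Python sends by default. This is a small sketch assuming Python 3's `urllib.request`; it makes no network call.

```python
import urllib.request

# A fresh opener carries the library's default headers
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The default value identifies the client as Python-urllib/<version>
default_user_agent = default_headers["User-agent"]
print(default_user_agent)
```

A value like this tells the server immediately that the client is a script rather than a browser.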

If you are using the requests library, you can specify it this way:

import requests
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get(
  'https://www.google.com/search?q=pizza is awesome', headers=headers).text

I have answered a separate question about how to scrape the titles, snippets, and links from Google search results, with example code.


Alternatively, you can use a third-party Google search results API, such as the Google Organic Results API from SerpApi. It is a paid API with a free trial.

Check out the playground to test it and see the output.

Code to get the raw HTML response:

import os, urllib.request
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "london",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
html = results['search_metadata']['raw_html_file']
print(urllib.request.urlopen(html).read())

Disclaimer: I work for SerpApi.
