尝试使用谷歌搜索搜索图像,错误400



我一直收到此错误:urllib.error.HTTPError:HTTP 错误 400:错误请求

我相信这可能与链接有关,因为当我将它们放入(并替换 {}(时,我收到相同的错误,但我不知道哪些链接是正确的/(Python 3.6, Anaconda(

import os
import urllib.request as ulib
from bs4 import BeautifulSoup as Soup
import json
url_a = 'https://www.google.com/search?ei=1m7NWePfFYaGmQG51q7IBg&hl=en&q={}'
url_b = '&tbm=isch&ved=0ahUKEwjjovnD7sjWAhUGQyYKHTmrC2kQuT0I7gEoAQ&start={}'
url_c = '&yv=2&vet=10ahUKEwjjovnD7sjWAhUGQyYKHTmrC2kQuT0I7gEoAQ.1m7NWePfFYaGmQG51q7IBg'
url_d = '.i&ijn=1&asearch=ichunk&async=_id:rg_s,_pms:s'
url_base = ''.join((url_a, url_b, url_c, url_d))
headers = {'User-Agent': 'Chrome/69.0.3497.100'}
def get_links(search_name):
search_name = search_name.replace(' ', '+')
url = url_base.format(search_name, 0)
request = ulib.Request(url, data=None, headers=headers)
json_string = ulib.urlopen(request).read()
page = json.loads(json_string)
new_soup = Soup(page[1][1], 'lxml')
images = new_soup.find_all('img')
links = [image['src'] for image in images]
return links
if __name__ == '__main__':
search_name = 'Thumbs up'
links = get_links(search_name)
for link in links:
print(link)

我认为你有一堆不需要的参数

试试这个更简单的URL进行图像搜索:

https://www.google.com/search?q={KEY_WORD}&tbm=isch

例如:

https://www.google.com/search?q=apples&tbm=isch

我认为问题出在不能与search一起使用asearch=ichunk&async=_id:rg_s,_pms:s,如果我删除它们,它可以工作:

import os
import urllib.request as ulib
from bs4 import BeautifulSoup as Soup
import json
url_a = 'https://www.google.com/search?ei=1m7NWePfFYaGmQG51q7IBg&hl=en&q=a+mouse'
url_b = '&tbm=isch&ved=0ahUKEwjjovnD7sjWAhUGQyYKHTmrC2kQuT0I7gEoAQ&start={}'
url_c = '&yv=2&vet=10ahUKEwjjovnD7sjWAhUGQyYKHTmrC2kQuT0I7gEoAQ.1m7NWePfFYaGmQG51q7IBg'
url_d = '.i&ijn=1'
url_base = ''.join((url_a, url_b, url_c, url_d))
print(url_base);
headers = {'User-Agent': 'Chrome/69.0.3497.100'}
def get_links(search_name):
search_name = search_name.replace(' ', '+')
url = url_base.format(search_name, 0)
request = ulib.Request(url, data=None, headers=headers)
json_string = ulib.urlopen(request).read()
print(json_string)
page = json.loads(json_string)
new_soup = Soup(page[1][1], 'lxml')
images = new_soup.find_all('img')
links = [image['src'] for image in images]
return links
if __name__ == '__main__':
search_name = 'Thumbs up'
links = get_links(search_name)
for link in links:
print(link)

我不太确定您通过beautifulsoup抓取 JSON 数据来尝试做什么,因为它无法做到这一点。相反,您可以通过模块re发布可能包含 JSON 数据的<script>标签,然后迭代解析的 JSON 字符串。

看看requsets图书馆。您可以通过仅在params(dict(变量中添加所需的查询参数(LeKhan9已经提到(来获得更易于阅读的代码,然后将其传递到request.get()就像您对headers所做的那样:

params = {
"q": "minecraft lasagna skin",
"tbm": "isch",
"ijn": "0", # batch of 100 images
}
request.get(URL, params=params)

在线 IDE 中的代码和完整示例,在顶部也抓取建议的搜索结果(尝试逐步阅读,这非常简单(:

import requests, lxml, re, json
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "minecraft lasagna skin",
"tbm": "isch",
"ijn": "0",
}
html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
print('nGoogle Images Metadata:')
for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
source = google_image.select_one('.fxgdke').text
link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
print(f'{title}n{source}n{link}n')
# this steps could be refactored to a more compact
all_script_tags = soup.select('script')
# # https://regex101.com/r/48UZhY/4
matched_images_data = ''.join(re.findall(r"AF_initDataCallback(([^<]+));", str(all_script_tags)))
# https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
# if you try to json.loads() without json.dumps it will throw an error:
# "Expecting property name enclosed in double quotes"
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
# https://regex101.com/r/pdZOnW/3
matched_google_image_data = re.findall(r'["GRID_STATE0",null,[[1,[0,".*?",(.*),"All",', matched_images_data_json)
# https://regex101.com/r/NnRg27/1
matched_google_images_thumbnails = ', '.join(
re.findall(r'["(https://encrypted-tbn0.gstatic.com/images?.*?)",d+,d+]',
str(matched_google_image_data))).split(', ')
print('Google Image Thumbnails:')  # in order
for fixed_google_image_thumbnail in matched_google_images_thumbnails:
# https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')
# after first decoding, Unicode characters are still present. After the second iteration, they were decoded.
google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
print(google_image_thumbnail)
# removing previously matched thumbnails for easier full resolution image matches.
removed_matched_google_images_thumbnails = re.sub(
r'["(https://encrypted-tbn0.gstatic.com/images?.*?)",d+,d+]', '', str(matched_google_image_data))
# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(r"(?:'|,),["(https:|http.*?)",d+,d+]",
removed_matched_google_images_thumbnails)
print('nGoogle Full Resolution Images:')  # in order
for fixed_full_res_image in matched_google_full_resolution_images:
# https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
print(original_size_img)

----------------
'''
Google Images Metadata:
Lasagna Minecraft Skins | Planet Minecraft Community
planetminecraft.com
https://www.planetminecraft.com/skins/tag/lasagna/
...
Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSPttXb_7ClNBirfv2Beh4aOBjlc-7Jw_kY8pZ4DrkbAavZcJEtz8djo_9iqdnatiG6Krw&usqp=CAU
...
Google Full Resolution Images:
https://static.planetminecraft.com/files/resource_media/preview/skinLasagnaman_minecraft_skin-6204972.jpg
...
'''

或者,您可以使用SerpApi的Google Images API来实现此目的。这是一个带有免费计划的付费 API。

最大和明显的区别是,您只需要使用已解析的数据迭代结构化 JSON,而无需弄清楚为什么某些内容无法正确解析。看看游乐场。

要集成的代码:

import os, json # json for pretty output
from serpapi import GoogleSearch
params = {
"api_key": os.getenv("API_KEY"),
"engine": "google",
"q": "minecraft shaders 8k photo",
"tbm": "isch"
}
search = GoogleSearch(params)
results = search.get_dict()
print(json.dumps(results['suggested_searches'], indent=2, ensure_ascii=False))
print(json.dumps(results['images_results'], indent=2, ensure_ascii=False)
-----------
# same output as above but in JSON format

我写了一篇关于如何以更详细的方式抓取谷歌图片的博客文章。

Dislaimer,我为SerpApi工作。

最新更新