Parsing JSON data from a scraped div class using Python Requests/Beautiful Soup



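I'm trying to scrape some images from Google search results using requests and Beautiful Soup. There is code around the web that uses urllib2, and it works (for me, about half the time), but I'm trying to use requests with Beautiful Soup, and I'm having trouble parsing the JSON part. I'm interested in getting the "ou" value, which is a link. I'm not sure what I'm doing wrong.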

import requests
from bs4 import BeautifulSoup
import json
url =  'https://www.google.com/search?site=&tbm=isch&source=hp&biw=1873&bih=990&'
payload = {'q': 'Blue Sky'}
response = requests.get(url, params = payload)
print (response.url)
images =[]
soup = BeautifulSoup(response.content, 'html.parser')
results2 =soup.find_all(("div",{"class":"rg_meta notranslate"}))
#checking results2, It seems I am indeed extracting the div portion. 

for re in results2:
    link, Type = json.loads((re.text))["ou"] , json.loads((re.text))["ity"]
    images.append(link)

This is what the div class looks like:

<div class="rg_meta notranslate">
{"clt":"n",
"id":"tO9o23RfxP9tlM:",
"isu":"myrabridgforth.com",
"itg":0,
"ity":"jpg",
"oh":742,
"ou":"http://myrabridgforth.com/wp-content/uploads/blue-   sky-clouds.jpg","ow":1268,"pt":"Myra Bridgforth, Counselor » Blog Archive Ten Ways to Use a Blue ...","rid":"jjIitG_NjwFNSM","rmt":0,"rt":0,"ru":"http://myrabridgforth.com/2015/06/ten-ways-to-use-a-blue-sky-hour-at-a-coffee-shop/","s":"Ten Ways to Use a Blue Sky Hour at a Coffee Shop","st":"Myra Bridgforth, Counselor","th":172,"tu":"https://encrypted-tbn0.gstatic.com/images?q\u003dtbn:ANd9GcTLhBlZEL6ljsKInKzx1V4GX-lXeksntKy6B4UkmVrOB_2uNoTbcQ","tw":294}
</div>

When I run the JSON line, I end up with this error:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Here is what roughly the first 15% of the results2 result set looks like:

[<div id="gbar"><nobr><a class="gb1" href="https://www.google.com/search?tab=iw">Search</a> <b class="gb1">Images</b> <a class="gb1" href="https://maps.google.com/maps?hl=en&amp;tab=il">Maps</a> <a class="gb1" href="https://play.google.com/?hl=en&amp;tab=i8">Play</a> <a class="gb1" href="https://www.youtube.com/results?tab=i1">YouTube</a> <a class="gb1" href="https://news.google.com/nwshp?hl=en&amp;tab=in">News</a> <a class="gb1" href="https://mail.google.com/mail/?tab=im">Gmail</a> <a class="gb1" href="https://drive.google.com/?tab=io">Drive</a> <a class="gb1" href="https://www.google.com/intl/en/options/" style="text-decoration:none"><u>More</u> »</a></nobr></div>,
<div id="guser" width="100%"><nobr><span class="gbi" id="gbn"></span><span class="gbf" id="gbf"></span><span id="gbe"></span><a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a> | <a class="gb4" href="/preferences?hl=en">Settings</a> | <a class="gb4" href="https://accounts.google.com/ServiceLogin?hl=en&amp;passive=true&amp;continue=https://www.google.com/search%3Fsite%3D%26tbm%3Disch%26source%3Dhp%26biw%3D1873%26bih%3D990%26q%3DBlue%2BSky" id="gb_70" target="_top">Sign in</a></nobr></div>,
<div class="gbh" style="left:0"></div>,
<div class="gbh" style="right:0"></div>,
<div id="logocont"><h1><a href="/webhp?hl=en" id="logo" style="background:url(/images/nav_logo229.png) no-repeat 0 -41px;height:37px;width:95px;display:block" title="Go to Google Home"></a></h1></div>,
<div class="lst-a"><table cellpadding="0" cellspacing="0"><tr><td class="lst-td" valign="bottom" width="555"><div style="position:relative;zoom:1"><input autocomplete="off" class="lst" id="sbhost" maxlength="2048" name="q" title="Search" type="text" value="Blue Sky"/></div></td></tr></table></div>,
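
The result set above starts with Google's navigation divs, whose text is not JSON, so json.loads() fails on the very first element with "Expecting value". A likely cause is the extra pair of parentheses in find_all((...)), which passes one tuple as the tag name, so the class filter is probably not applied; calling soup.find_all("div", {"class": "rg_meta notranslate"}) without them may behave as intended. The sketch below uses only the standard library and made-up sample strings (the helper name extract_links is hypothetical) to show a defensive way to skip text that is not JSON:

```python
import json

# Made-up texts standing in for the .text of each matched div.
div_texts = [
    "Search Images Maps Play YouTube News Gmail Drive More",    # navigation bar, not JSON
    '{"ity": "jpg", "ou": "http://example.com/blue-sky.jpg"}',  # rg_meta-style JSON
]

def extract_links(texts):
    """Return (link, type) pairs from texts that parse as JSON objects with an "ou" key."""
    links = []
    for text in texts:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # plain text such as the navigation bar
        if isinstance(data, dict) and "ou" in data:
            links.append((data["ou"], data.get("ity")))
    return links

links = extract_links(div_texts)
print(links)
```

Parsing each text once and reading both keys from the same dict also avoids calling json.loads() twice per element.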

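My code is based on code by rishabhr0y, which seems to have worked (according to the comments) with Beautiful Soup and urllib2.

Python - Download images from Google Image search?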

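To scrape full-resolution image URLs using requests and beautifulsoup, you need to extract the data from the page source code via regex.

Find all <script> tags:

all_script_tags = soup.select('script')

Match the image data via regex:

matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))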
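
As a rough illustration of this matching step, the sketch below runs the same kind of pattern against a made-up script string (sample_scripts is an assumption standing in for str(all_script_tags); real pages are far larger and their layout may differ). The escaped \( and \) match the literal parentheses of the callback, and the capture group stops before the first "<":

```python
import re

# Made-up stand-in for str(all_script_tags).
sample_scripts = (
    '<script>AF_initDataCallback({key: "ds:1", data: ["https://example.com/a.jpg",800,600]});</script>'
    '<script>var unrelated = 1;</script>'
)

# Capture everything between "AF_initDataCallback(" and the closing ");".
pattern = r"AF_initDataCallback\(([^<]+)\);"
matched = "".join(re.findall(pattern, sample_scripts))
print(matched)
```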

Match the desired images (full-resolution size) via regex in the JSON string:

matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                   matched_images_data)
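
To make the shape of this pattern concrete, here is a small sketch on made-up data (sample_data and its URLs are assumptions, not real Google output). The pattern only matches a ["url",number,number] triple that is preceded by ",," or "'," so the sample mimics text in which the thumbnail entries have already been stripped out, as the complete script further down does:

```python
import re

# Made-up data; the ",," before each bracket mimics a removed thumbnail entry.
sample_data = (
    '[1,[0,"id1",,["https://example.com/photos/cat.jpg",1200,800],null],'
    '[1,[0,"id2",,["http://example.org/img/dog.png",640,480],null]'
)

# Escaped brackets match literally; the group captures the URL between the quotes.
pattern = r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]"
full_res_urls = re.findall(pattern, sample_data)
print(full_res_urls)
```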

Extract and decode them using bytes() and decode():

for fixed_full_res_image in matched_google_full_resolution_images:
    original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
    original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
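
The reason for decoding twice can be seen on a small, made-up URL (raw_url is an assumption for illustration) in which characters such as "=" and "&" are escaped with a doubled backslash. The first pass turns each doubled backslash into a single one, and the second pass resolves the remaining \uXXXX escapes:

```python
# Made-up URL containing doubled backslashes before each unicode escape.
raw_url = r"https://example.com/img?id\\u003d1\\u0026w\\u003d500"

# First pass: "\\u003d" becomes "\u003d" (still an escape sequence, not "=").
decoded_once = bytes(raw_url, "ascii").decode("unicode-escape")

# Second pass: "\u003d" becomes "=" and "\u0026" becomes "&".
decoded_twice = bytes(decoded_once, "ascii").decode("unicode-escape")

print(decoded_once)
print(decoded_twice)
```

Note that bytes(..., 'ascii') raises UnicodeEncodeError if the string contains non-ASCII characters, so this approach assumes ASCII-only input.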

If you need to save them, there are two simple options: urllib.request.urlretrieve or requests.

To save images via urllib.request.urlretrieve(url, filename) (more in-depth):

import urllib.request
# often times it will throw 404 error, to avoid it we need to pass user-agent
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(original_size_img, f'LOCAL_FOLDER_NAME/YOUR_IMAGE_NAME.jpg') # you can skip folder path and it will save them in current working directory

To save images via requests (code taken from this answer):

import requests
url = "YOUR_IMG.jpg"
response = requests.get(url)
if response.status_code == 200:
    with open("/YOUR/PATH/TO_IMAGE/sample_img.jpg", 'wb') as f:
        f.write(response.content)

Code for scraping and downloading full-resolution images, and a full example in the online IDE:

import requests, lxml, re, json, urllib.request
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "pexels cat",
    "tbm": "isch",
    "hl": "en",
    "ijn": "0",
}
html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')

def get_images_data():
    print('\nGoogle Images Metadata:')
    for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
        title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
        source = google_image.select_one('.fxgdke').text
        link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
        print(f'{title}\n{source}\n{link}\n')

    # these steps could be refactored to be more compact
    all_script_tags = soup.select('script')
    # https://regex101.com/r/48UZhY/4
    matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

    # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
    # if you try to json.loads() without json.dumps it will throw an error:
    # "Expecting property name enclosed in double quotes"
    # note: dumps() followed by loads() returns the same string, so the data
    # stays plain text and is searched with regex below
    matched_images_data_fix = json.dumps(matched_images_data)
    matched_images_data_json = json.loads(matched_images_data_fix)
    # https://regex101.com/r/pdZOnW/3
    matched_google_image_data = re.findall(r'\["GRID_STATE0",null,\[\[1,\[0,".*?",(.*),"All",', matched_images_data_json)
    # https://regex101.com/r/NnRg27/1
    matched_google_images_thumbnails = ', '.join(
        re.findall(r'\["(https://encrypted-tbn0\.gstatic\.com/images\?.*?)",\d+,\d+\]',
                   str(matched_google_image_data))).split(', ')

    print('Google Image Thumbnails:')  # in order
    for fixed_google_image_thumbnail in matched_google_images_thumbnails:
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')
        # after the first decoding, Unicode escapes are still present; the second pass decodes them.
        google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
        print(google_image_thumbnail)

    # removing previously matched thumbnails for easier full resolution image matches.
    removed_matched_google_images_thumbnails = re.sub(
        r'\["(https://encrypted-tbn0\.gstatic\.com/images\?.*?)",\d+,\d+\]', '', str(matched_google_image_data))
    # https://regex101.com/r/fXjfb1/4
    # https://stackoverflow.com/a/19821774/15164646
    matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                       removed_matched_google_images_thumbnails)

    print('\nDownloading Google Full Resolution Images:')  # in order
    for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
        # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
        original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
        original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        print(original_size_img)
        # ------------------------------------------------
        # Download original images
        # print(f'Downloading {index} image...')

        opener=urllib.request.build_opener()
        opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(original_size_img, f'Images/original_size_img_{index}.jpg')

get_images_data()

-------------
'''
Google Images Metadata:
9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
pexels.com
https://www.pexels.com/search/cat/
...
Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR2cZsuRkkLWXOIsl9BZzbeaCcI0qav7nenDvvqi-YSm4nVJZYyljRsJZv6N5vS8hMNU_w&usqp=CAU
...
Full Resolution Images:
https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?cs=srgb&dl=pexels-evg-culture-1170986.jpg&fm=jpg
https://images.pexels.com/photos/3777622/pexels-photo-3777622.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
...
'''

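Alternatively, you can use the Google Images API from SerpApi to achieve the same thing. It's a paid API with a free plan.

The difference is that you don't have to deal with regex to match and extract the needed data from the page source. Instead, you only need to iterate over structured JSON and pick out what you want.

Code to integrate: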

import os, urllib.request, json # json for pretty output
from serpapi import GoogleSearch

def get_google_images():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google",
        "q": "pexels cat",
        "tbm": "isch"
    }
    search = GoogleSearch(params)
    results = search.get_dict()
    # print(json.dumps(results['suggested_searches'], indent=2, ensure_ascii=False))
    print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))
    # -----------------------
    # Downloading images
    for index, image in enumerate(results['images_results']):
        # print(f'Downloading {index} image...')

        opener=urllib.request.build_opener()
        opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(image['original'], f'SerpApi_Images/original_size_img_{index}.jpg')

get_google_images()
---------------
'''
[
...
{
"position": 100, # img number
"thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
"source": "pexels.com",
"title": "Close-up of Cat · Free Stock Photo",
"link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
"original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
"is_product": false
}
]
'''
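
Given results shaped like the output above, picking out the full-resolution URLs is plain dictionary access. The sketch below uses a hypothetical, trimmed-down list in place of results['images_results'] (the URLs are made up), and it assumes an entry might lack an "original" key, which is a defensive guess rather than documented behavior:

```python
# Hypothetical stand-in for results['images_results'].
images_results = [
    {"position": 1, "source": "pexels.com", "original": "https://images.example.com/cat-1.jpeg"},
    {"position": 2, "source": "pexels.com"},  # assumed case: no "original" key
    {"position": 3, "source": "example.org", "original": "https://images.example.com/cat-3.jpeg"},
]

# Collect (position, url) pairs, skipping entries without an "original" URL.
originals = [
    (item["position"], item["original"])
    for item in images_results
    if "original" in item
]

for position, url in originals:
    print(position, url)
```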

P.S. - I wrote a more in-depth blog post about how to scrape Google Images.

Disclaimer: I work for SerpApi.
