如何使用python(最好是BS4)从Google Images(或bing)中找到图像的url



我尝试使用谷歌chrome上的inspect元素来查找图像链接,我看到了以下内容:

<img class="mimg rms_img" style="color: rgb(192, 54, 11);" height="185" width="297" alt="Image result for pizza" id="emb48403F0A" src="https://th.bing.com/th/id/OIP.uEcdCNY9nFhqWqbz4B0mFQHaEo?w=297&amp;h=185&amp;c=7&amp;o=5&amp;dpr=1.5&amp;pid=1.7" data-thhnrepbd="1" data-bm="186">

当我尝试使用:汤.findAll("img",{"class":"mimg-rms_img"}(搜索这个元素和其他类似元素时,我一无所获。

我是做错了什么,还是这不是最好的方法?

要提取缩略图原始大小URL的图像URL,您需要:

  1. 查找所有<script>标记
  2. 通过regex匹配和提取图像URL
  3. 遍历找到的匹配项并对其进行解码

在线IDE中的代码和完整示例(很简单,请慢慢阅读(:

import requests, lxml, re, json
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "pexels cat",
"tbm": "isch", 
"hl": "en",
"ijn": "0",
}
html = requests.get("https://www.google.com/search", params=params, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')

def get_images_data():
# this steps could be refactored to a more compact
all_script_tags = soup.select('script')
# # https://regex101.com/r/48UZhY/4
matched_images_data = ''.join(re.findall(r"AF_initDataCallback(([^<]+));", str(all_script_tags)))

# https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
# if you try to json.loads() without json.dumps it will throw an error:
# "Expecting property name enclosed in double quotes"
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
# https://regex101.com/r/pdZOnW/3
matched_google_image_data = re.findall(r'["GRID_STATE0",null,[[1,[0,".*?",(.*),"All",', matched_images_data_json)
# https://regex101.com/r/NnRg27/1
matched_google_images_thumbnails = ', '.join(
re.findall(r'["(https://encrypted-tbn0.gstatic.com/images?.*?)",d+,d+]',
str(matched_google_image_data))).split(', ')
print('Google Image Thumbnails:')  # in order
for fixed_google_image_thumbnail in matched_google_images_thumbnails:
# https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')
# after first decoding, Unicode characters are still present. After the second iteration, they were decoded.
google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
print(google_image_thumbnail)
# removing previously matched thumbnails for easier full resolution image matches.
removed_matched_google_images_thumbnails = re.sub(
r'["(https://encrypted-tbn0.gstatic.com/images?.*?)",d+,d+]', '', str(matched_google_image_data))
# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(r"(?:'|,),["(https:|http.*?)",d+,d+]",
removed_matched_google_images_thumbnails)
print('nGoogle Full Resolution Images:')  # in order
for fixed_full_res_image in matched_google_full_resolution_images:
# https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
print(original_size_img)
get_images_data()
--------------
'''
Google Image Thumbnails:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSb48h3zks_bf6y7HnZGyGPn3s2TAHKKm_7kzxufi5nzbouJcQderHqoEoOZ4SpOuPDjfw&usqp=CAU
...
Google Full Resolution Images:
https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
...
'''

或者,您也可以使用SerpApi的Google Images API来完成。这是一个付费的API免费计划。

从本质上讲,主要的区别之一是,您不需要深入页面的源代码并通过regex提取内容,所需要做的就是迭代结构化JSON字符串并获得所需的数据。看看操场。

要集成的代码:

import os, json # json for pretty output
from serpapi import GoogleSearch

def get_google_images():
params = {
"api_key": os.getenv("API_KEY"),
"engine": "google",
"q": "pexels cat",
"tbm": "isch"
}
search = GoogleSearch(params)
results = search.get_dict()
print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))
get_google_images()
----------
'''
...
{
"position": 60, # img number
"thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRt-tXSZMBNLLX8MhavbBNkKmjJ7wNXxtdr5Q&usqp=CAU",
"source": "pexels.com",
"title": "1,000+ Best Cats Videos · 100% Free Download · Pexels Stock Videos",
"link": "https://www.pexels.com/search/videos/cats/",
"original": "https://images.pexels.com/videos/855282/free-video-855282.jpg?auto=compress&cs=tinysrgb&dpr=1&w=500",
"is_product": false
}
...
'''

p.S.-我写了一篇更深入的博客文章,里面有关于如何抓取谷歌图片的GIF和截图。

免责声明,我为SerpApi工作。

以下是如何在另一个搜索引擎DuckDuckGo:中做到这一点

search_query = 'what you want to find'
num_images = 10
driver_location = '/put/location/of/your/driver/here'
# setting up the driver
ser = Service(driver_location)
op = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=ser, options=op)
# searching the query
driver.get(f'https://duckduckgo.com/?q={search_query}&kl=us-en&ia=web')
# going to Images Section
ba = driver.find_element(By.XPATH, "//a[@class='zcm__link  js-zci-link  js-zci-link--images']")
ba.click()
# getting the images URLs
for result in driver.find_elements(By.CSS_SELECTOR, '.js-images-link')[0:0+num_images]:
imageURL = result.get_attribute('data-id')
print(f'{imageURL}n')
driver.quit()

最新更新