我想使用谷歌图片搜索下载批量图片。
我的第一种方法; 将页面源代码下载到文件,然后用open()
打开它工作正常,但我希望能够通过运行脚本和更改关键字来获取图像 url。
第一种方法:转到图像搜索(https://www.google.no/search?q=tower&client=opera&hs=UNl&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiM5fnf4_zKAhWIJJoKHYUdBg4Q_AUIBygB&biw=1920&bih=982(。在浏览器中查看页面源代码并将其保存到 html 文件中。然后,当我使用脚本open()
该 html 文件时,脚本按预期工作,并且我在搜索页面上获得了所有图像 url 的整洁列表。这就是脚本的第 6 行所做的(取消注释以进行测试(。
但是,如果我使用 requests.get()
函数来解析网页,如脚本的第 7 行所示,它会获取一个不同的 html 文档,该文档不包含图像的完整 URL,因此我无法提取它们。
请帮助我提取图像的正确网址。
编辑:链接到塔.html,我正在使用:https://www.dropbox.com/s/yy39w1oc8sjkp3u/tower.html?dl=0
这是我到目前为止编写的代码:
import requests
from bs4 import BeautifulSoup
# define the url to be scraped
url = 'https://www.google.no/search?q=tower&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'
# top line is using the attached "tower.html" as source, bottom line is using the url. The html file contains the source of the above url.
#page = open('tower.html', 'r').read()
page = requests.get(url).text
# parse the text as html
soup = BeautifulSoup(page, 'html.parser')
# iterate on all "a" elements.
for raw_link in soup.find_all('a'):
link = raw_link.get('href')
# if the link is a string and contain "imgurl" (there are other links on the page, that are not interesting...
if type(link) == str and 'imgurl' in link:
# print the part of the link that is between "=" and "&" (which is the actual url of the image,
print(link.split('=')[1].split('&')[0])
您可以使用正则表达式提取Google图片,因为您需要的数据会动态呈现,但我们可以在内联JSON中找到它。它的方法比使用浏览器自动化更快。
为此,我们可以在页面源代码中搜索第一个图像标题( Ctrl+U
(以找到我们需要的匹配项,如果<script>>
元素中有任何匹配项,那么它很可能是内联 JSON。从那里我们可以提取数据。
要找到原始图像,我们首先需要找到缩略图。之后,我们需要减去部分解析的内联JSON,这将提供一种更简单的方法来解析原始分辨率图像:
# https://regex101.com/r/SxwJsW/1
matched_google_images_thumbnails = ", ".join(
re.findall(r'["(https://encrypted-tbn0.gstatic.com/images?.*?)",d+,d+]',
str(matched_google_image_data))).split(", ")
thumbnails = [bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for thumbnail in matched_google_images_thumbnails]
# removing previously matched thumbnails for easier full resolution image matches.
removed_matched_google_images_thumbnails = re.sub(
r'["(https://encrypted-tbn0.gstatic.com/images?.*?)",d+,d+]', "", str(matched_google_image_data))
# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(r"(?:'|,),["(https:|http.*?)",d+,d+]", removed_matched_google_images_thumbnails)
不幸的是,这种方法无法找到绝对的图片,因为它们是使用滚动添加到页面中的。如果您需要收集绝对所有图片,则需要使用浏览器自动化,例如selenium
或playwright
,如果您不想对其进行逆向工程。
有一个"ijn" URL parameter
定义要获取的页码(大于或等于 0(。它与也位于内联 JSON 中的分页令牌结合使用。
检查在线 IDE 中的代码。
import requests, re, json, lxml
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
}
google_images = []
params = {
"q": "tower", # search query
"tbm": "isch", # image results
"hl": "en", # language of the search
"gl": "us" # country where search comes fro
}
html = requests.get("https://google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
all_script_tags = soup.select("script")
# https://regex101.com/r/RPIbXK/1
matched_images_data = "".join(re.findall(r"AF_initDataCallback(([^<]+));", str(all_script_tags)))
matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
# https://regex101.com/r/NRKEmV/1
matched_google_image_data = re.findall(r'"b-GRID_STATE0"(.*)sideChannel:s?{}}', matched_images_data_json)
# https://regex101.com/r/SxwJsW/1
matched_google_images_thumbnails = ", ".join(
re.findall(r'["(https://encrypted-tbn0.gstatic.com/images?.*?)",d+,d+]',
str(matched_google_image_data))).split(", ")
thumbnails = [bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for thumbnail in matched_google_images_thumbnails]
# removing previously matched thumbnails for easier full resolution image matches.
removed_matched_google_images_thumbnails = re.sub(
r'["(https://encrypted-tbn0.gstatic.com/images?.*?)",d+,d+]', "", str(matched_google_image_data))
# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(r"(?:'|,),["(https:|http.*?)",d+,d+]", removed_matched_google_images_thumbnails)
full_res_images = [
bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in matched_google_full_resolution_images
]
for index, (metadata, thumbnail, original) in enumerate(zip(soup.select('.isv-r.PNCib.MSM1fd.BUooTd'), thumbnails, full_res_images), start=1):
google_images.append({
"title": metadata.select_one(".VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb")["title"],
"link": metadata.select_one(".VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb")["href"],
"source": metadata.select_one(".fxgdke").text,
"thumbnail": thumbnail,
"original": original
})
print(json.dumps(google_images, indent=2, ensure_ascii=False))
示例输出:
[
{
"title": "Eiffel Tower - Wikipedia",
"link": "https://en.wikipedia.org/wiki/Eiffel_Tower",
"source": "Wikipedia",
"thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTsuYzf9os1Qb1ssPO6fWn-5Jm6ASDXAxUFYG6eJfvmehywH-tJEXDW0t7XLR3-i8cNd-0&usqp=CAU",
"original": "https://upload.wikimedia.org/wikipedia/commons/thumb/8/85/Tour_Eiffel_Wikimedia_Commons_%28cropped%29.jpg/640px-Tour_Eiffel_Wikimedia_Commons_%28cropped%29.jpg"
},
{
"title": "tower | architecture | Britannica",
"link": "https://www.britannica.com/technology/tower",
"source": "Encyclopedia Britannica",
"thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR8EsWofNiFTe6alwRlwXVR64RdWTG2fuBQ0z1FX4tg3HbL7Mxxvz6GnG1rGZQA8glVNA4&usqp=CAU",
"original": "https://cdn.britannica.com/51/94351-050-86B70FE1/Leaning-Tower-of-Pisa-Italy.jpg"
},
{
"title": "Tower - Wikipedia",
"link": "https://en.wikipedia.org/wiki/Tower",
"source": "Wikipedia",
"thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT3L9LA0VamqmevhCtkrHZvM9MlBf9EjtTT7KhyzRP3zi3BmuCOmn0QFQG42xFfWljcsho&usqp=CAU",
"original": "https://upload.wikimedia.org/wikipedia/commons/3/3e/Tokyo_Sky_Tree_2012.JPG"
},
# ...
]
<小时 />您也可以使用SerpApi的Google Images API。这是一个带有免费计划的付费 API。不同之处在于它将绕过来自Google的块(包括CAPTCHA(,无需创建解析器并维护它。
简单的代码示例:
from serpapi import GoogleSearch
import os, json
image_results = []
# search query parameters
params = {
"engine": "google", # search engine. Google, Bing, Yahoo, Naver, Baidu...
"q": "tower", # search query
"tbm": "isch", # image results
"num": "100", # number of images per page
"ijn": 0, # page number: 0 -> first page, 1 -> second...
"api_key": os.getenv("API_KEY") # your serpapi api key
# other query parameters: hl (lang), gl (country), etc
}
search = GoogleSearch(params) # where data extraction happens
images_is_present = True
while images_is_present:
results = search.get_dict() # JSON -> Python dictionary
# checks for "Google hasn't returned any results for this query."
if "error" not in results:
for image in results["images_results"]:
if image["original"] not in image_results:
image_results.append(image["original"])
# update to the next page
params["ijn"] += 1
else:
images_is_present = False
print(results["error"])
print(json.dumps(image_results, indent=2))
输出:
[
"https://cdn.rt.emap.com/wp-content/uploads/sites/4/2022/08/10084135/shutterstock-woods-bagot-rough-site-for-leadenhall-tower.jpg",
"https://dynamic-media-cdn.tripadvisor.com/media/photo-o/1c/60/ff/c5/ambuluwawa-tower-is-the.jpg?w=1200&h=-1&s=1",
"https://cdn11.bigcommerce.com/s-bf3bb/product_images/uploaded_images/find-your-nearest-cell-tower-in-five-minutes-or-less.jpeg",
"https://s3.amazonaws.com/reuniontower/Reunion-Tower-Exterior-Skyline.jpg",
"https://assets2.rockpapershotgun.com/minecraft-avengers-tower.jpg/BROK/resize/1920x1920%3E/format/jpg/quality/80/minecraft-avengers-tower.jpg",
"https://images.adsttc.com/media/images/52ab/5834/e8e4/4e0f/3700/002e/large_jpg/PERTAMINA_1_Tower_from_Roundabout.jpg?1386960835",
"https://awoiaf.westeros.org/images/7/78/The_tower_of_joy_by_henning.jpg",
"https://eu-assets.simpleview-europe.com/plymouth2016/imageresizer/?image=%2Fdmsimgs%2Fsmeatontower3_606363908.PNG&action=ProductDetailNew",
# ...
]
有一个 刮擦 并下载谷歌图片与 Python 博客文章,如果你需要更多的代码解释。
只是为了让你知道:
# http://www.google.com/robots.txt
User-agent: *
Disallow: /search
我想在回答之前说,谷歌严重依赖脚本。 您很可能会得到不同的结果,因为您通过reqeusts
请求的页面不会对页面上提供的script
执行任何操作,而在 Web 浏览器中加载页面会。
这是我请求您提供的网址时得到的
我从requests.get(url).text
那里得到的文本在任何地方都不包含'imgurl'
。 您的脚本正在寻找它作为其标准的一部分,但它不存在。
但是,我确实看到一堆<img>
标签,其中src
属性设置为图像 url。 如果这就是你所追求的,请尝试以下脚本:
import requests
from bs4 import BeautifulSoup
url = 'https://www.google.no/search?q=tower&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'
# page = open('tower.html', 'r').read()
page = requests.get(url).text
soup = BeautifulSoup(page, 'html.parser')
for raw_img in soup.find_all('img'):
link = raw_img.get('src')
if link:
print(link)
这将返回以下结果:
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQyxRHrFw0NM-ZcygiHoVhY6B6dWwhwT4va727380n_IekkU9sC1XSddAg
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRfuhcCcOnC8DmOfweuWMKj3cTKXHS74XFh9GYAPhpD0OhGiCB7Z-gidkVk
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSOBZ9iFTXR8sGYkjWwPG41EO5Wlcv2rix0S9Ue1HFcts4VcWMrHkD5y10
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTEAZM3UoqqDCgcn48n8RlhBotSqvDLcE1z11y9n0yFYw4MrUFucPTbQ0Ma
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcSJvthsICJuYCKfS1PaKGkhfjETL22gfaPxqUm0C2-LIH9HP58tNap7bwc
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQGNtqD1NOwCaEWXZgcY1pPxQsdB8Z2uLGmiIcLLou6F_1c55zylpMWvSo
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSdRxvQjm4KWaxhAnJx2GNwTybrtUYCcb_sPoQLyAde2KMBUhR-65cm55I
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQLVqQ7HLzD7C-mZYQyrwBIUjBRl8okRDcDoeQE-AZ2FR0zCPUfZwQ8Q20
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQHNByVCZzjSuMXMd-OV7RZI0Pj7fk93jVKSVs7YYgc_MsQqKu2v0EP1M0
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcS_RUkfpGZ1xJ2_7DCGPommRiIZOcXRi-63KIE70BHOb6uRk232TZJdGzc
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSxv4ckWM6eg_BtQlSkFP9hjRB6yPNn1pRyThz3D8MMaLVoPbryrqiMBvlZ
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQWv_dHMr5ZQzOj8Ort1gItvLgVKLvgm9qaSOi4Uomy13-gWZNcfk8UNO8
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRRwzRc9BJpBQyqLNwR6HZ_oPfU1xKDh63mdfZZKV2lo1JWcztBluOrkt_o
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQdGCT2h_O16OptH7OofZHNvtUhDdGxOHz2n8mRp78Xk-Oy3rndZ88r7ZA
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRnmn9diX3Q08e_wpwOwn0N7L1QpnBep1DbUFXq0PbnkYXfO0wBy6fkpZY
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSaP9Ok5n6dL5K1yKXw0TtPd14taoQ0r3HDEwU5F9mOEGdvcIB0ajyqXGE
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTcyaCvbXLYRtFspKBe18Yy5WZ_1tzzeYD8Obb-r4x9Yi6YZw83SfdOF5fm
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTnS1qCjeYrbUtDSUNcRhkdO3fc3LTtN8KaQm-rFnbj_JagQEPJRGM-DnY0
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSiX_elwJQXGlToaEhFD5j2dBkP70PYDmA5stig29DC5maNhbfG76aDOyGh
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQb3ughdUcPUgWAF6SkPFnyiJhe9Eb-NLbEZl_r7Pvt4B3mZN1SVGv0J-s
您可以使用"data-src"或"src"属性找到属性。
REQUEST_HEADER = {
'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
def get_images_new(self, prod_id, name, header, **kw):
i=1
man_code = "apple" #anything you want to search for
url = "https://www.google.com.au/search?q=%s&source=lnms&tbm=isch" % man_code
_logger.info("Subitemsyyyyyyyyyyyyyy: %s" %url)
response = urlopen(Request(url, headers={
'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}))
html = response.read().decode('utf-8')
soup = BeautifulSoup(html, "html.parser")
image_elements = soup.find_all("img", {"class": "rg_i Q4LuWd"})
for img in image_elements:
#temp1 = img.get('src')
#_logger.info("11111[%s]" % (temp1))
temp = img.get('data-src')
if temp and i < 7:
image = temp
#_logger.error("11111[%s]" % (image))
filename = str(i)
if filename:
path = "/your/directory/" + str(prod_id) # your filename
if not os.path.exists(path):
os.mkdir(path)
_logger.error("ath.existath.existath.exist[%s]" % (image))
imagefile = open(path + "/" + filename + ".png", 'wb+')
req = Request(image, headers=REQUEST_HEADER)
resp = urlopen(req)
imagefile.write(resp.read())
imagefile.close()
i += 1