Getting the image data-src with Beautiful Soup when the image URL has no extension



I'm trying to get the image URLs for all the books on the https://www.nb.co.za/en/books/0-6-years page with Beautiful Soup.

Here is my code:

from bs4 import BeautifulSoup
import requests

baseurl = "https://www.nb.co.za/"
productlinks = []

r = requests.get('https://www.nb.co.za/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_="book-slider-frame")

def my_filter(tag):
    return (tag.name == 'a' and
            tag.parent.name == 'div' and
            'img-container' in tag.parent['class'])

for item in productlist:
    for link in item.find_all(my_filter, href=True):
        productlinks.append(baseurl + link['href'])

cover = soup.find_all('div', class_="img-container")
print(cover)

This is my output:

<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>

This is what I want to get:

https://www.nb.co.za/en/helper/ReadImage/25929.jpg

My questions are:

  1. How do I get only the data-src?

  2. How do I get the image's extension?

1: How do I get only the data-src?

You can access the data-src by calling element['data-src']:

cover = baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
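As a self-contained illustration (using the HTML snippet from the question), subscripting the `<img>` tag returns the raw data-src value:

```python
from bs4 import BeautifulSoup

# the HTML snippet from the question's output
html = """<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")
src = soup.img["data-src"]  # dictionary-style attribute access
print(src)  # /en/helper/ReadImage/25929
```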

2: How do I get the image's extension?

You can get the file's extension as diggusbickus demonstrated above (nice approach), but it will not help you here: if you try to request a file like https://www.nb.co.za/en/helper/ReadImage/25929.jpg, it results in a 404 error.

The images are loaded/served dynamically. For additional information, see https://stackoverflow.com/a/5110673/14460824
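Since the cover URL has no extension, one option (a sketch, assuming the server sends a correct Content-Type header on the image response) is to map that header to an extension with the stdlib mimetypes module:

```python
import mimetypes

def ext_from_content_type(content_type):
    """Map an HTTP Content-Type header value (e.g. 'image/jpeg') to a file extension."""
    # strip any parameters such as '; charset=...' before looking up the type
    return mimetypes.guess_extension(content_type.split(";")[0].strip())

# In practice you would read the header from the response, e.g.:
#   resp = requests.get(base_url + img_src)
#   ext = ext_from_content_type(resp.headers.get("Content-Type", ""))
print(ext_from_content_type("image/jpeg"))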

Example

baseurl = "https://www.nb.co.za/"
nocover = '/Content/images/no-cover.jpg'
data = []

for item in soup.select('.book-slider-frame'):
    data.append({
        'link': baseurl + item.a['href'],
        'cover': baseurl + item.img['data-src'] if item.img['src'] != nocover else baseurl + nocover
    })

data

Output

[{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182539',
  'cover': 'https://www.nb.co.za//en/helper/ReadImage/25929'},
 {'link': 'https://www.nb.co.za//en/view-book/?id=9780798182546',
  'cover': 'https://www.nb.co.za//en/helper/ReadImage/25931'},
 {'link': 'https://www.nb.co.za//en/view-book/?id=9780798182553',
  'cover': 'https://www.nb.co.za//en/helper/ReadImage/25925'}, ...]

I'll show you how to do it for this small example and let you handle the rest. Just use the imghdr module:

import imghdr
import requests
from bs4 import BeautifulSoup

data = """<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>"""

soup = BeautifulSoup(data, 'lxml')
base_url = "https://www.nb.co.za"
img_src = soup.select_one('img')['data-src']
img_name = img_src.split('/')[-1]

data = requests.get(base_url + img_src)
with open(img_name, 'wb') as f:
    f.write(data.content)

print(imghdr.what(img_name))
>>> jpeg
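Note that imghdr was deprecated in Python 3.11 and removed in 3.13; a minimal replacement sketch that checks the leading magic bytes for the common formats could look like this:

```python
def sniff_image_type(data):
    """Guess an image format from its magic bytes (a small subset of what imghdr covered)."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    return None

# e.g. call it on the downloaded bytes instead of imghdr.what(img_name):
print(sniff_image_type(b"\xff\xd8\xff\xe0" + b"\x00" * 16))  # jpeg
```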

To wait until everything has loaded, you can tell requests to use the timeout parameter, or set timeout=None, in which case requests will wait forever for a response if the page loads slowly.

The reason you get .gif at the end of the image result is that the image hasn't loaded yet; the gif is the loading placeholder shown in the meantime.

You access the data-src attribute the same way you access a dictionary: element['attribute'].
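For example (a small sketch using the snippet's `<img>` tag): subscripting raises KeyError for a missing attribute, while .get() returns None instead:

```python
from bs4 import BeautifulSoup

html = '<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>'
img = BeautifulSoup(html, "html.parser").img

print(img["data-src"])           # /en/helper/ReadImage/25929
print(img.get("data-original"))  # None -- .get() avoids a KeyError when the attribute is absent
```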


If you want to save the images locally, you can use urllib.request.urlretrieve:

import urllib.request

urllib.request.urlretrieve("BOOK_COVER_URL", "file_name.jpg")  # will save in the current directory
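An equivalent with requests (a sketch; "BOOK_COVER_URL" and the filename are placeholders, as above) is to write the response bytes yourself:

```python
from pathlib import Path

def save_bytes(data, path):
    """Write raw image bytes to disk; returns the number of bytes written."""
    return Path(path).write_bytes(data)

# Usage (fill in a real cover URL):
#   import requests
#   content = requests.get("BOOK_COVER_URL", timeout=30).content
#   save_bytes(content, "file_name.jpg")
```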

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

response = requests.get('https://www.nb.co.za/en/books/0-6-years', timeout=None)
soup = BeautifulSoup(response.text, 'lxml')

for result in soup.select(".img-container"):
    link = f'https://www.nb.co.za{result.select_one("a")["href"]}'

    # try/except to handle the error when there's no image on the website (last 3 results)
    try:
        image = f'https://www.nb.co.za{result.select_one("a img")["data-src"]}'
    except (TypeError, KeyError):
        image = None

    print(link, image, sep="\n")

# part of the output:
'''
# first result (Step by Step: Counting to 50)
https://www.nb.co.za/en/view-book/?id=9780798182539
https://www.nb.co.za/en/helper/ReadImage/25929
# last result WITH image preview (Dinosourusse - Feite en geite: Daar’s ’n trikeratops op die trampoline)
https://www.nb.co.za/en/view-book/?id=9780624035480
https://www.nb.co.za/en/helper/ReadImage/10853
# last result (Uhambo lukamusa (isiZulu)) WITH NO image preview on the website, so it returned None
https://www.nb.co.za/en/view-book/?id=9780624043003
None
'''