Filter images by name before scraping with BeautifulSoup



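I'm trying to write a program that scrapes images from a website using bs4. The site has two kinds of images: low-quality and high-quality. Each high-quality file has the same name as its low-quality version, but with the word "Website" added at the end, just before .png. Here is what I tried.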

from bs4 import BeautifulSoup
import requests
URL = "https://www.ssbwiki.com/Category:Head_icons_(SSBU)"
getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')
images = soup.find_all('img')
resolvedURLs = []
for image in images:
    src = image.get('src')
    resolvedURLs.append(requests.compat.urljoin(URL, src))
for image in resolvedURLs:
    if not image.endswith("Website.png"):
        continue
        if image.endswith("Website.png"):
            webs = requests.get(image)
            open('scraped_images/' + image.split('/')[-1], 'wb').write(webs.content)

I don't get any error messages, but no files are downloaded. Any suggestions?

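You only check whether the URL ends with "Website.png" after you have already determined that it does not, so the download code is never reached. It is simpler to drop the negative check and keep only the positive one: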

from bs4 import BeautifulSoup
import requests
URL = "https://www.ssbwiki.com/Category:Head_icons_(SSBU)"
getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')
images = soup.find_all('img')
resolvedURLs = []
for image in images:
    src = image.get('src')
    resolvedURLs.append(requests.compat.urljoin(URL, src))
for image in resolvedURLs:
    if image.endswith("Website.png"):
        webs = requests.get(image)
        open('scraped_images/' + image.split('/')[-1], 'wb').write(webs.content)
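
The difference between the two loops can be checked offline, without touching the network. The sketch below assumes the second `if` in the question was indented under the first (which is what the diagnosis above implies); the URLs and the function names `buggy_filter` and `fixed_filter` are made up for illustration.

```python
# Hypothetical URLs, invented for illustration; not taken from the real page.
urls = [
    "https://example.com/images/MarioHeadSSBU.png",
    "https://example.com/images/MarioHeadSSBUWebsite.png",
]

def buggy_filter(urls):
    """Mimic the question's loop: the positive check sits under the negative one."""
    kept = []
    for url in urls:
        if not url.endswith("Website.png"):
            continue
            # Unreachable: `continue` has already moved on to the next URL.
            if url.endswith("Website.png"):
                kept.append(url)
    return kept

def fixed_filter(urls):
    """Keep only the single positive check, as in the corrected loop."""
    return [url for url in urls if url.endswith("Website.png")]

print(buggy_filter(urls))  # nothing is ever kept
print(fixed_filter(urls))  # only the "...Website.png" URL is kept
```

In the buggy version, URLs that do match skip the whole `if not` block, and URLs that do not match hit `continue` first, so the download branch can never run and no error is raised either.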

In fact, using list comprehensions makes the code less procedural and helps prevent this kind of bug in the future:

from bs4 import BeautifulSoup
import requests
from requests.compat import urljoin
URL = "https://www.ssbwiki.com/Category:Head_icons_(SSBU)"
getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')

image_urls = [urljoin(URL,image.get('src')) for image in soup.find_all('img')]
# let's make this one a generator so we don't keep too many downloaded
# images in memory
responses = (requests.get(url) for url in image_urls if url.endswith("Website.png"))
for response in responses:
    # each item is a requests Response, so take the file name from response.url
    # use the context manager so the files are closed after write
    with open('scraped_images/' + response.url.split('/')[-1], 'wb') as f:
        f.write(response.content)
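
One more thing to watch once the downloads actually run: `open()` raises `FileNotFoundError` if the `scraped_images` directory does not exist, and `split('/')[-1]` keeps any query string that follows the file name. A minimal sketch of a more defensive save step follows, using only the standard library; `local_path` and `save_bytes` are hypothetical helper names, not part of `requests` or `bs4`.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit

def local_path(url, out_dir="scraped_images"):
    """Build the target path from the last segment of the URL's path.

    urlsplit() separates the query string and fragment from the path,
    and unquote() turns escapes such as %20 back into plain characters.
    """
    name = unquote(PurePosixPath(urlsplit(url).path).name)
    return Path(out_dir) / name

def save_bytes(url, content, out_dir="scraped_images"):
    """Write `content` to the path derived from `url`, creating the folder if needed."""
    target = local_path(url, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return target
```

Given a `requests` response, a call such as `save_bytes(response.url, response.content)` could take the place of the `open(...)` line in the download loop.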
