How to pause the process only on "No Internet" or "Network Error" when downloading images with requests

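I have written a script that downloads images from given URLs and saves them in a directory. It uses requests to access the URLs given in a DataFrame (CSV file) and uses Pillow to save the images in my directory. The name of each image is the index number of its URL in my CSV. If a URL is bad and cannot be reached, the script just increments the counter and moves on to the next index. Each time I run the script, it starts downloading from the current maximum index and continues up to the desired end. My code works fine. Here it is: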

import pandas as pd
import os
from os import listdir
from os.path import isfile, join
import sys
from PIL import Image
import requests
from io import BytesIO
import argparse

arg_parser = argparse.ArgumentParser(allow_abbrev=True, description='Download images from url in a directory',)
arg_parser.add_argument('-d','--DIR',required=True,
                        help='Directory name where images will be saved')
arg_parser.add_argument('-c','--CSV',required=True,
                        help='CSV file name which contains the URLs')
arg_parser.add_argument('-i','--index',type=int,
                        help='Index number of column which contain the urls')
arg_parser.add_argument('-e','--end',type=int,
                        help='How many images to download')
args = vars(arg_parser.parse_args())

def load_save_image_from_url(url,OUT_DIR,img_name):
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    img_format = url.split('.')[-1]
    img_name = img_name+'.'+img_format
    img.save(os.path.join(OUT_DIR, img_name)) # join so OUT_DIR need not end with a separator
    return None

csv = args['CSV']
DIR = args['DIR']
ind = 0
if args.get('index'):
    ind = args['index']
df = pd.read_csv(csv) # read csv
indices = [int(f.split('.')[0]) for f in listdir(DIR) if isfile(join(DIR, f))] # get existing images
total_images_already = len(indices)
print(f'There are already {len(indices)} images present in the directory -{DIR}-\n')
start = 0
if len(indices):
    start = max(indices)+1 # set starting index

end = 5000 # next n numbers of images to download
if args.get('end'):
    end = args['end']
print(f'Downloaded a total of {total_images_already} images upto index: {start-1}. Downloading the next {end} images from -{csv}-\n')
count = 0
for i in range(start, start+end):
    if count%250==0:
        print(f"Total {total_images_already+count} images downloaded in directory. {end-count} remaining from the current defined\n")
    url = df.iloc[i,ind]
    try:
        load_save_image_from_url(url,DIR,str(i))
        count+=1
    except (KeyboardInterrupt, SystemExit):
        sys.exit("Forced exit prompted by User: Quitting....")
    except Exception as e:
        print(f"Error at index {i}: {e}\n")

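I want to add a feature so that when something like No Internet or a connection error occurs, the script pauses for 5 minutes instead of moving on. After 5 attempts, i.e. 25 minutes, if the problem persists, it should exit the program instead of incrementing the counter. I want this because if the internet drops for even 2 minutes and then comes back, the loop keeps running through the indices in the meantime and downloading resumes from whatever index it has reached. The next time I run the program, it will treat the skipped URLs as bad, when in fact there was just no internet connection.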

How can I do this?

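Note: Obviously I am thinking of using time.sleep(), but I want to know which error in requests directly reflects No Internet or a Connection Error. One candidate is from requests.exceptions import ConnectionError. If I have to use it, how do I keep retrying on the same i counter for up to 5 attempts, exit the program if they all fail, and resume the regular loop once the connection succeeds?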
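One way to get the behaviour described in the question is sketched below. It is only an illustration under a few assumptions: the helper name fetch_with_retries and its parameters (max_waits, wait_seconds, get, sleep) are made up for this example, "5 attempts, 25 minutes" is read as "at most five 5-minute pauses", and the 30-second timeout is an arbitrary choice. It catches requests.exceptions.ConnectionError and requests.exceptions.Timeout, retries the same URL after each pause, and exits once the pauses are used up.

```python
import sys
import time

import requests


def fetch_with_retries(url, max_waits=5, wait_seconds=300,
                       get=requests.get, sleep=time.sleep):
    """Fetch `url`, pausing and retrying only on connection-level failures.

    `get` and `sleep` are injectable (hypothetical parameters) so the
    function can be exercised without a network or real waiting.
    """
    waits = 0
    while True:
        try:
            # The timeout keeps a dead connection from hanging forever.
            return get(url, timeout=30)
        except (requests.exceptions.ConnectionError,
                requests.exceptions.Timeout) as err:
            if waits == max_waits:
                # All pauses used up: stop the whole program.
                sys.exit(f"Still no connection after {waits} pauses: quitting.")
            waits += 1
            print(f"Connection problem ({err}); pause {waits}/{max_waits}")
            sleep(wait_seconds)
```

In the script above, load_save_image_from_url could call fetch_with_retries(url) in place of requests.get(url); because the retry happens inside the call, the loop stays on the same i until the download succeeds or the program exits. One caveat: requests also raises ConnectionError when a hostname does not exist, so a URL with a dead domain looks the same as having no internet and would trigger the pauses too.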

Using exponential backoff is better than a fixed sleep.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry_strategy = Retry(
    total=3,
    backoff_factor=1, # seconds; the default of 0 means no delay between retries
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["HEAD", "GET", "OPTIONS"] # named method_whitelist before urllib3 1.26
)
adapter = HTTPAdapter(max_retries=retry_strategy)
http = requests.Session()
http.mount("https://", adapter)
http.mount("http://", adapter)
response = http.get(url) # url: the image URL to download

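Here you can configure the parameters as follows:

total=3 - the total number of retry attempts to make
backoff_factor - controls how long the process sleeps between failed requests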

The formula for the backoff delay is: {backoff factor} * (2 ** ({number of total retries} - 1))

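So with a backoff factor of 10 seconds, plugging retry numbers 1, 2, 3, ... into this formula gives 10, 20, 40, 80, 160, 320, 640, 1280, 2560 seconds as the sleep times between successive requests. In practice urllib3 caps each delay (120 seconds by default), so the larger values are never actually reached.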
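To make the arithmetic concrete, here is a small sketch that simply evaluates the formula quoted above for each retry number. It does not call urllib3; the function name backoff_delays and its cap argument are my own additions for illustration, with cap standing in for an upper limit on the delay.

```python
def backoff_delays(backoff_factor, retries, cap=None):
    """Evaluate backoff_factor * 2 ** (n - 1) for n = 1..retries."""
    delays = []
    for n in range(1, retries + 1):
        delay = backoff_factor * (2 ** (n - 1))
        if cap is not None:
            # Optional upper limit on any single delay (illustrative).
            delay = min(delay, cap)
        delays.append(delay)
    return delays


print(backoff_delays(10, 5))           # [10, 20, 40, 80, 160]
print(backoff_delays(10, 5, cap=120))  # [10, 20, 40, 80, 120]
```

The delay doubles with every retry, which is why a small backoff_factor is usually enough: even a factor of 1 reaches 16 seconds by the fifth retry.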

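I have worked with the Google API, and occasionally ran into no internet, error 423, or something similar.
I kept the whole code in a try block and called .sleep() for X seconds in the except block.
That way I did not have to look up the error type.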

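Note that before doing this, you should make sure your code has no other kinds of errors and will run smoothly unless it hits "No Internet" or a "Network Error".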

This is how I handled the problem:

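import time
# import libs, basic operations

X_SECONDS = 60 # placeholder: how long to wait, pick your own value

try:
    pass # block 1
    # block 2
    # block where the error occurs
    # block 3
except Exception: # catches any error, not only network errors
    print("Network error")
    time.sleep(X_SECONDS)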

I hope this helps you as well. Let me know if this approach does not suit your purpose.
