How to randomly split an image dataset into multiple subsets when using ImageDataGenerator



Suppose you have a directory-structured image training set for a classification problem, like this:

main_directory/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg

I want to randomly split the training set into several subsets (preferably of different sizes) to feed into multiple deep learning models for a bagging ensemble. The library used to read the dataset is keras.preprocessing.image.ImageDataGenerator.

I know that flow_from_directory() can split the training set into the desired training and validation subsets by setting the validation_split argument of ImageDataGenerator (a fraction between 0 and 1, not a boolean) and passing subset='training' or subset='validation'. However, this is not enough: I need to split my training set randomly into several subsets.
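For reference, that built-in two-way split looks roughly like this (a minimal sketch; validation_split reserves a fraction of the images, and subset selects which partition each generator yields):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# reserve 20% of the images for validation; subset picks the partition
datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)
train_flow = datagen.flow_from_directory('main_directory', target_size=(224, 224),
                                         class_mode='categorical', subset='training')
val_flow = datagen.flow_from_directory('main_directory', target_size=(224, 224),
                                       class_mode='categorical', subset='validation')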


Note: one approach I can think of is to manually shuffle the contents of each subdirectory, divide the images into several separate directories, and then call flow_from_directory() on each directory, as sketched below. However, I am not sure whether this is a practical solution.
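A hedged sketch of that manual approach (the split_into_subsets helper and the subset_0/, subset_1/, ... output directories are hypothetical; the fractions need not be equal, so the subsets can differ in size):

import os
import random
import shutil

def split_into_subsets(src_dir, dst_root, fractions, seed=123):
    # hypothetical helper: randomly copy each class's images into
    # len(fractions) subset directories of (possibly) different sizes
    random.seed(seed)
    for klass in os.listdir(src_dir):
        files = os.listdir(os.path.join(src_dir, klass))
        random.shuffle(files)
        start = 0
        for i, frac in enumerate(fractions):
            count = int(len(files) * frac)
            subset_dir = os.path.join(dst_root, 'subset_' + str(i), klass)
            os.makedirs(subset_dir, exist_ok=True)
            for f in files[start:start + count]:
                shutil.copy(os.path.join(src_dir, klass, f), subset_dir)
            start += count

split_into_subsets('main_directory', 'subsets', fractions=[0.5, 0.3, 0.2])

Each subset_i directory could then be fed to its own flow_from_directory() call.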

You can use ImageDataGenerator.flow_from_dataframe to accomplish this. It is cumbersome, but it can be done. The code below reads the image data and produces a train_df, a test_df, and a valid_df. You can call the function in a loop, setting tr_split and random_seed to different values on each call, to create distinct datasets. You can then build generators for each unique set of dataframes and use them as inputs to model.fit.

import os
import pandas as pd
from sklearn.model_selection import train_test_split

def preprocess(sdir, trsplit, random_seed):
    # assumes sdir contains train/ and test/ subdirectories,
    # each holding one folder per class
    for category in ['train', 'test']:
        filepaths = []
        labels = []
        catpath = os.path.join(sdir, category)
        classlist = os.listdir(catpath)
        for klass in classlist:
            classpath = os.path.join(catpath, klass)
            flist = os.listdir(classpath)
            for f in flist:
                fpath = os.path.join(classpath, f)
                filepaths.append(fpath)
                labels.append(klass)
        Fseries = pd.Series(filepaths, name='filepaths')
        Lseries = pd.Series(labels, name='labels')
        if category == 'train':
            df = pd.concat([Fseries, Lseries], axis=1)
        else:
            test_df = pd.concat([Fseries, Lseries], axis=1)
    # split df into train_df and valid_df, stratified so class proportions are preserved
    strat = df['labels']
    train_df, valid_df = train_test_split(df, train_size=trsplit, shuffle=True,
                                          random_state=random_seed, stratify=strat)
    print('train_df length: ', len(train_df), '  test_df length: ', len(test_df),
          '  valid_df length: ', len(valid_df))
    print(train_df['labels'].value_counts())
    return train_df, test_df, valid_df

The code below shows an example of calling the function:

sdir = r'C:\Temp\malig'
random_seed = 123
tr_split = .8
train_df, test_df, valid_df = preprocess(sdir, tr_split, random_seed)

The resulting output will be:

train_df length:  2109   test_df length:  660   valid_df length:  528
benign       1152
malignant     957
Name: labels, dtype: int64

You can now call this function as many times as you like, changing the values of random_seed and/or tr_split each time to create different train and valid dataframes, as in the loop sketched below.
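For instance, a bagging loop over the function might look like this hypothetical sketch (the list of splits and seeds is illustrative):

splits_and_seeds = [(.8, 123), (.7, 456), (.75, 789)]  # illustrative values
datasets = []
for tr_split, random_seed in splits_and_seeds:
    train_df, test_df, valid_df = preprocess(sdir, tr_split, random_seed)
    datasets.append((train_df, valid_df))  # one train/valid pair per ensemble member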

Then use the code below to create the generators:
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

img_size = (224, 224)
channels = 3
batch_size = 30
img_shape = (img_size[0], img_size[1], channels)
length = len(test_df)
# pick the largest batch size <= 80 that divides the test set evenly,
# so every test image is used exactly once
test_batch_size = sorted([int(length/n) for n in range(1, length+1)
                          if length % n == 0 and length/n <= 80], reverse=True)[0]
test_steps = int(length/test_batch_size)
print('test batch size: ', test_batch_size, '  test steps: ', test_steps)

def scalar(img):
    # EfficientNet expects pixels in the range 0 to 255, so no scaling is required
    return img

trgen = ImageDataGenerator(preprocessing_function=scalar, horizontal_flip=True)
tvgen = ImageDataGenerator(preprocessing_function=scalar)
train_gen = trgen.flow_from_dataframe(train_df, x_col='filepaths', y_col='labels',
                                      target_size=img_size, class_mode='categorical',
                                      color_mode='rgb', shuffle=True, batch_size=batch_size)
test_gen = tvgen.flow_from_dataframe(test_df, x_col='filepaths', y_col='labels',
                                     target_size=img_size, class_mode='categorical',
                                     color_mode='rgb', shuffle=False, batch_size=test_batch_size)
valid_gen = tvgen.flow_from_dataframe(valid_df, x_col='filepaths', y_col='labels',
                                      target_size=img_size, class_mode='categorical',
                                      color_mode='rgb', shuffle=True, batch_size=batch_size)
classes = list(train_gen.class_indices.keys())
class_count = len(classes)
train_steps = int(np.ceil(len(train_gen.labels)/batch_size))
The resulting output should show:
test batch size:  66   test steps:  10
Found 2109 validated image filenames belonging to 2 classes.
Found 660 validated image filenames belonging to 2 classes.
Found 528 validated image filenames belonging to 2 classes.

Now build the model, then run model.fit using these generators.
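A minimal sketch of that last step, assuming a hypothetical EfficientNetB0-based classifier (the architecture and epoch count are illustrative, not part of the answer; tf.keras EfficientNet models expect unscaled 0-255 inputs, which matches the scalar() function above):

import tensorflow as tf
from tensorflow.keras import layers

# hypothetical model: EfficientNetB0 backbone plus a small classification head
base = tf.keras.applications.EfficientNetB0(include_top=False, weights='imagenet',
                                            input_shape=img_shape, pooling='max')
model = tf.keras.Sequential([base, layers.Dense(class_count, activation='softmax')])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(train_gen, validation_data=valid_gen, epochs=10,
                    steps_per_epoch=train_steps)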
