Why can't I split my image dataset into 8:1:1?



I am trying to split my dataset 8:1:1 (train/validation/test). All of my images are in a single directory. First I tried the following code:

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    dir,
    validation_split=0.1,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

But it doesn't do the job: it only splits my directory into two subsets, with no separate test set. After that I tried the following code:

from tensorflow.keras.preprocessing.image import ImageDataGenerator
# create a data generator
datagen = ImageDataGenerator()
# load and iterate training dataset
train_it = datagen.flow_from_directory(dir, target_size=(32, 32), color_mode='grayscale',
                                       class_mode='binary', batch_size=32, shuffle=True,
                                       follow_links=False, subset=None, interpolation='nearest')
# load and iterate validation dataset
val_it = datagen.flow_from_directory(dir, target_size=(32, 32), color_mode='grayscale',
                                     class_mode='binary', batch_size=32, shuffle=True,
                                     follow_links=False, subset=None, interpolation='nearest')
# load and iterate test dataset
test_it = datagen.flow_from_directory(dir, target_size=(32, 32), color_mode='grayscale',
                                      class_mode='binary', batch_size=32, shuffle=True,
                                      follow_links=False, subset=None, interpolation='nearest')

This code also causes a problem for my model: when I use it, my model summary looks like this:

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
rescaling_1 (Rescaling)      (None, None, None, None)  0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, None, None, 32)    320       
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, None, None, 32)    0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, None, None, 32)    9248      
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, None, None, 32)    0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, None, None, 32)    9248      
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, None, None, 32)    0         
_________________________________________________________________
dropout_1 (Dropout)          (None, None, None, 32)    0         
_________________________________________________________________
flatten_1 (Flatten)          (None, None)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               16512     
_________________________________________________________________
dense_3 (Dense)              (None, 26)                3354      
=================================================================
Total params: 38,682
Trainable params: 38,682
Non-trainable params: 0
_________________________________________________________________

Here is my model:

num_classes = 26
model = tf.keras.Sequential([
    tf.keras.layers.experimental.preprocessing.Rescaling(1./255),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_classes)
])
model.compile(
    optimizer='adam',
    loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

So how can I split my data without running into these problems?

You can do the following:

import glob  # To get the whole path for all the images
import tensorflow as tf
'''
Let's say your images lie inside 2 folders - 'a' and 'b' - which are inside your 'dir' folder.
To get the paths to each of those images you can use the code below.
'''
image_paths_a = glob.glob('./dir/a/*.jpg')  # .jpg if the files end with jpg
image_paths_b = glob.glob('./dir/b/*.jpg')  # to get images from b
images_total = image_paths_a + image_paths_b
# In case you have other folders you can also do this
# to get all images inside all folders in the 'dir' folder.
images_total = glob.glob('./dir/*/*.jpg')
# Now get the labels corresponding to these images.
# If the folder names are your labels you can do it like this:
image_labels = [i.split('/')[-2] for i in images_total]
# Map the string folder names to integer class indices so the labels
# work with SparseCategoricalCrossentropy.
class_names = sorted(set(image_labels))
label_to_index = {name: idx for idx, name in enumerate(class_names)}
image_labels = [label_to_index[name] for name in image_labels]
'''
Now you have 2 lists -> 1) image paths 2) corresponding labels, and you can use
'sklearn.model_selection.train_test_split' to get your splits.
'''
from sklearn.model_selection import train_test_split
# Keep 80% for training and get the remaining 20% for a further split
xtrain, xtest, ytrain, ytest = train_test_split(images_total,
                                                image_labels,
                                                stratify=image_labels,
                                                random_state=1234,
                                                test_size=0.2)
# Split that 20% in half -> 10%-10% of the original data
xvalid, xtest, yvalid, ytest = train_test_split(xtest,
                                                ytest,
                                                stratify=ytest,
                                                random_state=1234,
                                                test_size=0.5)
'''
Now you can create the datasets, but first write a function that reads images from the image paths.
'''
def read_img(path, label):
    file = tf.io.read_file(path)
    # The paths above end in .jpg; use tf.image.decode_png (or tf.io.decode_image) for other formats
    img = tf.image.decode_jpeg(file)
    # dim1 and dim2 are your desired dimensions
    img = tf.image.resize(img, (dim1, dim2))
    return img, label

train_dataset = tf.data.Dataset.from_tensor_slices((xtrain, ytrain))
train_dataset = train_dataset.map(read_img).batch(batch_size)
valid_dataset = tf.data.Dataset.from_tensor_slices((xvalid, yvalid))
valid_dataset = valid_dataset.map(read_img).batch(batch_size)
test_dataset = tf.data.Dataset.from_tensor_slices((xtest, ytest))
test_dataset = test_dataset.map(read_img).batch(batch_size)
# Now you just need to train your model
model.fit(train_dataset, epochs=5, validation_data=valid_dataset)
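
As a quick sanity check, you can print the sizes of the three splits to confirm they are roughly 8:1:1. The snippet below is a minimal sketch that assumes the `images_total`, `xtrain`, `xvalid`, `xtest`, `read_img` and `batch_size` names from the code above; the shuffle/prefetch part is an optional tuning step, not required for the split itself.

# Sketch: verify the 8:1:1 proportions
n_total = len(images_total)
print(len(xtrain) / n_total)  # should be ~0.8
print(len(xvalid) / n_total)  # should be ~0.1
print(len(xtest) / n_total)   # should be ~0.1

# Optionally shuffle the training files and prefetch batches for throughput
train_dataset = (tf.data.Dataset.from_tensor_slices((xtrain, ytrain))
                 .shuffle(len(xtrain))
                 .map(read_img)
                 .batch(batch_size)
                 .prefetch(tf.data.experimental.AUTOTUNE))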

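If you prefer to stay with `image_dataset_from_directory`, you can also get a rough 8:1:1 split by requesting a single 20% validation subset (using the same `validation_split` and `seed` in both calls, unlike the 0.2/0.1 mix in the question) and then cutting that subset in half with `take`/`skip`. This is a minimal sketch assuming `dir`, `img_height`, `img_width` and `batch_size` are defined as in the question; note that the second split happens at batch granularity, so the 10%/10% is approximate.

import tensorflow as tf

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    dir,
    validation_split=0.2,          # 80% for training
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    dir,
    validation_split=0.2,          # the remaining 20%, same split and seed as above
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

# Split the 20% subset in half: ~10% validation, ~10% test
val_batches = tf.data.experimental.cardinality(val_ds)
test_ds = val_ds.take(val_batches // 2)
val_ds = val_ds.skip(val_batches // 2)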