TensorFlow datasets: why does the order appear random when iterating with a for loop?

I am creating some batched TensorFlow datasets with tf.keras.preprocessing.image_dataset_from_directory:

import os
import tensorflow as tf

# model_split_dir is defined elsewhere; it contains 'train' and 'test' subfolders.
image_size = (90, 120)
batch_size = 32
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(model_split_dir, 'train'),
    validation_split=0.25,
    subset="training",
    seed=1,
    image_size=image_size,
    batch_size=batch_size
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(model_split_dir, 'train'),
    validation_split=0.25,
    subset="validation",
    seed=1,
    image_size=image_size,
    batch_size=batch_size
)
test_ds = tf.keras.preprocessing.image_dataset_from_directory(
    os.path.join(model_split_dir, 'test'),
    seed=1,
    image_size=image_size,
    batch_size=batch_size
)

If I then use the following for loop to get the images and labels from one of the datasets, I get different output every time I run it:

for images, labels in test_ds:
    print(labels)

For example, on one run the first batch looks like this:

tf.Tensor([0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1], shape=(32,), dtype=int32)

But when the loop is run again, the result is completely different:

tf.Tensor([1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 0 0], shape=(32,), dtype=int32)

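How can the order be different on every loop? Are TensorFlow datasets unordered? From what I have found, they should be ordered, so I don't understand why the for loop returns the labels in a different order each time.

Any insight into this would be greatly appreciated.

Update: the random shuffling of the dataset order is working as intended. For my test data, I simply needed to set shuffle to False. Many thanks, @AloneTogether!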

The shuffle parameter of tf.keras.preprocessing.image_dataset_from_directory defaults to True. If you want deterministic results, try setting it to False:

import tensorflow as tf
import pathlib

# Download and extract the example flower photos.
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)

# shuffle=False keeps the files in their original order.
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    image_size=(28, 28),
    batch_size=5,
    shuffle=False)

# Print the labels of the first batch only.
for x, y in train_ds:
    print(y)
    break
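
With shuffle=False, the labels can be collected in a stable order, for example to line them up with model predictions on a test set. The following is only a minimal sketch: the small in-memory dataset (ds, images, labels) is a hypothetical stand-in for test_ds, so that the snippet runs without any image files.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for an unshuffled, batched (images, labels) dataset.
images = tf.zeros((7, 4, 4, 3))
labels = tf.constant([0, 0, 1, 1, 1, 0, 1])
ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(3)

# Concatenate the per-batch label tensors into one array, in dataset order.
all_labels = np.concatenate([y.numpy() for _, y in ds])
print(all_labels)
```

Because nothing is shuffled here, repeating the loop over ds gives the labels in the same order each time.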

By contrast, with shuffle=True and seed=None, the order is random and will differ from run to run:

train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    seed=None,
    image_size=(28, 28),
    batch_size=5,
    shuffle=True)
for x, y in train_ds:
    print(y)
    break

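If you set a random seed together with shuffle=True, the dataset is still shuffled, but the shuffle is seeded, so the first batch printed below is the same every time the script is run. Note that a seed alone does not guarantee the same order on a second pass over the same dataset object: the question set seed=1 and still saw the order change between loops.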

train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    seed=123,
    image_size=(28, 28),
    batch_size=5,
    shuffle=True)
for x, y in train_ds:
    print(y)
    break
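
The difference between "shuffled once" and "reshuffled on every pass" can be seen with plain tf.data, without any images. This is a sketch under an assumption: as far as I know, the Keras utility applies a tf.data shuffle internally, which would explain the behaviour in the question, but that is not confirmed here. The helper two_passes and the toy dataset base are made up for illustration.

```python
import tensorflow as tf

def two_passes(ds):
    """Iterate the same dataset object twice and return both orderings."""
    first = [int(x) for x in ds.as_numpy_iterator()]
    second = [int(x) for x in ds.as_numpy_iterator()]
    return first, second

# Toy dataset holding the integers 0..9 in order.
base = tf.data.Dataset.range(10)

# Default behaviour: the elements are reshuffled on every pass.
reshuffled = base.shuffle(buffer_size=10, seed=1)

# reshuffle_each_iteration=False: every pass replays the same shuffled order.
fixed = base.shuffle(buffer_size=10, seed=1, reshuffle_each_iteration=False)

print("reshuffled:", two_passes(reshuffled))
print("fixed:     ", two_passes(fixed))
print("unshuffled:", two_passes(base))
```

The two passes over fixed and over base print identical orderings, while the two passes over reshuffled will typically differ even though a seed is set.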
