Creating a TF dataset from lists of unequal size



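I am trying to create a tf dataset from this dictionary. The dataset should have four elements, and the last element holds lists of a different length from the others.

When I do this, I get the error ValueError: Can't convert non-rectangular Python sequence to Tensor.

The solution explained here, using tf.ragged.constant(data), does not work because I am using a dictionary. Is there a way to create such a dataset?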

import numpy as np
import tensorflow as tf

t_dic = {"uuid": np.array(["abc", "def", "ghi", "pqr"]),
         "a": [np.array([1, 2, 3]),
               np.array([6, 2, 3]),
               np.array([6, 8, 1]),
               np.array([6, 2, 3, 10])],
         "b": [np.array(["a", "f", "f"]),
               np.array(["aa", "ff", "fs"]),
               np.array(["aa", "ff", "fs"]),
               np.array(["aa", "ff", "fs", "ss"])]}
# raises the ValueError: the rows of "a" and "b" have different lengths
x = tf.data.Dataset.from_tensor_slices(t_dic)

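If you want to keep the shapes of the resulting tensors without adding padding, I suggest popping the keys whose lists have varying lengths and wrapping them with tf.ragged.constant() in a new dictionary.

In your example: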

import numpy as np
import tensorflow as tf

t_dic = {"uuid": np.array(["abc", "def", "ghi", "pqr"]),
         "a": [np.array([1, 2, 3]),
               np.array([6, 2, 3]),
               np.array([6, 8, 1]),
               np.array([6, 2, 3, 10])],
         "b": [np.array(["a", "f", "f"]),
               np.array(["aa", "ff", "fs"]),
               np.array(["aa", "ff", "fs"]),
               np.array(["aa", "ff", "fs", "ss"])]}
key_a = t_dic.pop("a")  # pop "a" from t_dic
key_b = t_dic.pop("b")  # pop "b" from t_dic
# build a new dictionary holding "a" and "b" as tf.ragged values
ragged_features = {"a": tf.ragged.constant(key_a), "b": tf.ragged.constant(key_b)}
# merge the two dictionaries (the | operator on dicts needs Python 3.9+)
preprocessed_data = t_dic | ragged_features
# convert to the desired dataset
x = tf.data.Dataset.from_tensor_slices(preprocessed_data)
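
For a dictionary with many keys, the keys to pop could be found automatically instead of being listed by hand. Below is a minimal sketch in plain Python; split_ragged_keys is a hypothetical helper name (not part of TensorFlow), and it assumes every value is a sequence of rows, as in the question.

```python
def split_ragged_keys(features):
    """Return (rectangular_keys, ragged_keys) for a dict of row sequences.

    Hypothetical helper for illustration; not a TensorFlow API.
    """
    rectangular, ragged = [], []
    for key, rows in features.items():
        # Scalars (numbers, strings, bytes) have no per-row length here,
        # so they are recorded as None.
        lengths = {
            len(row)
            if hasattr(row, "__len__") and not isinstance(row, (str, bytes))
            else None
            for row in rows
        }
        # More than one distinct row length means the key is ragged.
        (ragged if len(lengths) > 1 else rectangular).append(key)
    return rectangular, ragged


features = {
    "uuid": ["abc", "def", "ghi", "pqr"],
    "a": [[1, 2, 3], [6, 2, 3], [6, 8, 1], [6, 2, 3, 10]],
    "b": [["a", "f", "f"], ["aa", "ff", "fs"], ["aa", "ff", "fs"],
          ["aa", "ff", "fs", "ss"]],
}
rectangular, ragged = split_ragged_keys(features)
print(rectangular, ragged)
```

The keys returned in ragged are the ones to pass through tf.ragged.constant(); the rest can go into from_tensor_slices unchanged.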

I also found it useful to create a MapDataset from your x:

x2 = x.map(lambda element: {
    "uuid": element["uuid"],
    "a": element["a"],
    "b": element["b"]
})

The resulting x2 can be iterated, batched, and mapped, for example:

import pprint

for key in x2.take(3).as_numpy_iterator():
    pprint.pprint(key)
x2.element_spec  # useful to check that the shapes are what you want; here None means the length varies

which outputs:

{'a': array([1, 2, 3]),
 'b': array([b'a', b'f', b'f'], dtype=object),
 'uuid': b'abc'}
{'a': array([6, 2, 3]),
 'b': array([b'aa', b'ff', b'fs'], dtype=object),
 'uuid': b'def'}
{'a': array([6, 8, 1]),
 'b': array([b'aa', b'ff', b'fs'], dtype=object),
 'uuid': b'ghi'}
{'uuid': TensorSpec(shape=(), dtype=tf.string, name=None),
 'a': TensorSpec(shape=(None,), dtype=tf.int32, name=None),
 'b': TensorSpec(shape=(None,), dtype=tf.string, name=None)}

Finally, if you are going to batch the tf.data.Dataset, keep in mind that you may need to use dense_to_ragged_batch(), like this:

x2_batched = x2.apply(tf.data.experimental.dense_to_ragged_batch(batch_size=2))
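
As far as I know, recent TensorFlow 2.x releases also offer this batching as a Dataset method, ragged_batch(). The sketch below is self-contained and uses toy rows rather than the dictionary above. The from_generator setup is only an assumption to get variable-length dense elements, and the hasattr check falls back to the experimental transformation on versions that lack the method.

```python
import tensorflow as tf

# Toy variable-length rows standing in for the "a" feature.
rows = [[1, 2, 3], [6, 2, 3], [6, 8, 1], [6, 2, 3, 10]]

# Each element is a dense 1-D tensor whose length is unknown in advance.
ds = tf.data.Dataset.from_generator(
    lambda: rows,
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32),
)

if hasattr(ds, "ragged_batch"):
    # Dataset method, assumed to exist in newer TF 2.x releases.
    batched = ds.ragged_batch(batch_size=2)
else:
    # Experimental transformation used in the answer above.
    batched = ds.apply(
        tf.data.experimental.dense_to_ragged_batch(batch_size=2))

# Each batch is a tf.RaggedTensor; to_list() keeps the original row lengths.
batches = [batch.to_list() for batch in batched]
print(batches)
```

Either way the rows are grouped without padding, so the last batch still contains one row of length 3 and one of length 4.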

Link: Batching ragged tensors
