I am new to TensorFlow's tf.data.Dataset and I am trying to use it on my data. I loaded a pandas DataFrame as shown below.

Loaded input data (df_input):
   id   messages               Label
0  11   I am not driving home      0
1  11   Please pick me up          1
2  103  The car already park       1
3  103  No need for ticket         0
4  104  I will buy a car           1
5  104  I will buy truck           1
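(For anyone who wants to reproduce this end to end, here is a minimal sketch to rebuild the DataFrame, assuming pandas is available; the values are copied from the table above:)

import pandas as pd

# Rebuild df_input exactly as shown above, for reproducibility.
df_input = pd.DataFrame({
    "id": [11, 11, 103, 103, 104, 104],
    "messages": [
        "I am not driving home",
        "Please pick me up",
        "The car already park",
        "No need for ticket",
        "I will buy a car",
        "I will buy truck",
    ],
    "Label": [0, 1, 1, 0, 1, 1],
})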
I preprocess the text and apply text vectorization as follows:
import tensorflow as tf
from tensorflow.keras import layers

text_vectorizer = layers.TextVectorization(max_tokens=20, output_mode="int", output_sequence_length=6)
text_vectorizer.adapt(df_input.messages.values.tolist())

def encode(texts):
    encoded_texts = text_vectorizer(texts)
    return encoded_texts.numpy()

train_data = encode(df_input.messages.values)  # the training data
train_label = tf.keras.utils.to_categorical(df_input.Label.values, 2)  # the labels
Then I use TensorFlow's tf.data.Dataset to feed the preprocessed data to the model during training, as follows:
train_dataset_df = (
    tf.data.Dataset.from_tensor_slices((train_data, train_label))
    .shuffle(1000)
    .batch(2)
)
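As a sanity check, pulling one batch shows the shapes you would expect (they follow from batch(2), the 6-token sequences, and the 2-class one-hot labels):

for x, y in train_dataset_df.take(1):
    print(x.shape, y.shape)  # (2, 6) (2, 2)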
My question is: how can I transform the data at each training epoch by applying my custom function to the training data? From this post I saw an example that performs a transformation via the .map function:
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
My goal is to apply my own custom function as follows (it reorders the words in the text data):
def order_augment_sent(sentence):
    words = sentence.split(" ")
    words.sort()
    new_sentence = " ".join(words)
    return new_sentence
train_dataset_ds = (
    tf.data.Dataset.from_tensor_slices((train_data, train_label))
    .shuffle(1000)
    .batch(2)
    .map(lambda x, y: (order_augment_sent(x), y))
)
But I get the error:
AttributeError: 'Tensor' object has no attribute 'split'
Or, if I apply another of my custom functions, I get:
TypeError: To be compatible with tf.function, Python functions must return zero or more Tensors or ExtensionTypes or None values; in compilation of <function _tf_if_stmt.<locals>.aug_body at 0124f565>, found return value of type WarningException, which is not a Tensor or ExtensionType.
I am not sure how to achieve this, so I would appreciate any ideas or solutions.
The arguments you receive in the lambda function are the tokens coming out of the vectorizer, so they are ints. If you want to reorder the text data, you need to do it before text_vectorizer is applied.

So you should add the TextVectorization layer to your model instead; that way your map function will receive strings, and you can reorder the sentences before TextVectorization is called.
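You can confirm what the map sees by printing the element spec of your batched dataset; given output_sequence_length=6 and the 2-class one-hot labels, it should show something like:

print(train_dataset_df.element_spec)
# (TensorSpec(shape=(None, 6), dtype=tf.int64, name=None),
#  TensorSpec(shape=(None, 2), dtype=tf.float32, name=None))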
Here is an almost-working example; you only need to edit the order_augment_sent function with the code you need. I don't know what kind of sorting you want to do, so you may have to write a custom sort with numpy and wrap it (see https://www.tensorflow.org/api_docs/python/tf/py_function):
import tensorflow as tf
import numpy as np

train_data = ["I am not driving home", "Please pick me up", "The car already park", "No need for ticket", "I will buy a car", "I will buy truck"]
train_label = [0, 1, 1, 0, 1, 1]

text_dataset = tf.data.Dataset.from_tensor_slices(train_data)

max_features = 5000  # Maximum vocab size.
max_len = 4  # Sequence length to pad the outputs to.

# Create the layer.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_len)

# Now that the vocab layer has been created, call `adapt` on the text-only
# dataset to create the vocabulary. You don't have to batch, but for large
# datasets this means we're not keeping spare copies of the dataset.
vectorize_layer.adapt(train_data)

# Create the model that uses the vectorized text layer.
model = tf.keras.models.Sequential()

# Start by creating an explicit input layer. It needs to have a shape of
# (1,) (because we need to guarantee that there is exactly one string
# input per batch), and the dtype needs to be 'string'.
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))

# The first layer in our model is the vectorization layer. After this
# layer, we have a tensor of shape (batch_size, max_len) containing vocab
# indices.
model.add(vectorize_layer)

def apply_order_augment_sent(s):
    # Plain Python: decode the byte string, sort the words, re-join.
    sentence = s.decode('utf-8')
    words = sentence.split(" ")
    words.sort()
    return " ".join(words)

def order_augment_sent(x: np.ndarray, y: np.ndarray):
    # Applied to a whole batch of byte strings: reorder the words of
    # every sentence, keeping the labels untouched.
    new_x = []
    for i in range(len(x)):
        new_x.append(np.array([apply_order_augment_sent(x[i])]))
    return new_x, y

train_dataset_ds = tf.data.Dataset.from_tensor_slices((train_data, train_label))
train_dataset_ds = train_dataset_ds.shuffle(1000).batch(32)
train_dataset_ds = train_dataset_ds.map(lambda item1, item2: tf.numpy_function(
    order_augment_sent, [item1, item2], [tf.string, tf.int32]))

# Inspect the augmented batches, then run them through the vectorizing model.
list(train_dataset_ds.as_numpy_iterator())
model.predict(train_dataset_ds)
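Alternatively, if word reordering is all you need, you can keep the sort in plain Python and wrap it with tf.py_function, mapping per example before batching so the function sees one scalar string at a time. This is a sketch under those assumptions (helper names like py_sort_words are mine), not a drop-in replacement for the whole pipeline:

import tensorflow as tf

def py_sort_words(text):
    # Plain Python: decode the scalar byte string, sort its words, re-join.
    s = text.numpy().decode("utf-8")
    return " ".join(sorted(s.split(" ")))

def order_augment_sent_tf(text, label):
    new_text = tf.py_function(py_sort_words, [text], tf.string)
    new_text.set_shape([])  # py_function loses static shape; restore the scalar shape
    return new_text, label

train_dataset_ds = (
    tf.data.Dataset.from_tensor_slices((train_data, train_label))
    .shuffle(1000)
    .map(order_augment_sent_tf)  # re-runs on every iteration, i.e. every epoch
    .batch(32)
)

Because .map is re-executed each time the dataset is iterated, the transformation is applied afresh on every training epoch, which is what the question asks for.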