Creating a TensorFlow dataset from a Pandas DataFrame with many labels?



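I am trying to load a pandas DataFrame into a TensorFlow dataset. The columns are text [string] and labels [a list in string format].

A row looks like this: text: "Hi, this is me, ...." labels: [0, 1, 1, 0, 1, 0, 0, …]

Each text has a probability for each of 17 labels.

I can't find a way to load the dataset as an array and call model.fit(). I have read many answers and tried the following code inside df_to_dataset().

I can't figure out what I'm missing.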

labels = labels.apply(lambda x: np.asarray(literal_eval(x)))  # Cast to a list
labels = labels.apply(lambda x: [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # Straight out list ..
#  ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).

Printing one row (from the returned dataset) shows:

({'text': <tf.Tensor: shape=(), dtype=string, numpy=b'Text in here'>}, <tf.Tensor: shape=(), dtype=string, numpy=b'[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0, 0]'>)

When I don't apply any conversion, model.fit raises an exception because it cannot handle strings.

UnimplementedError:  Cast string to float is not supported
[[node sparse_categorical_crossentropy/Cast (defined at <ipython-input-102-71a9fbf2d907>:4) ]] [Op:__inference_train_function_1193273]

def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('labels')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    return ds

train_ds = df_to_dataset(df_train, batch_size=batch_size)
val_ds = df_to_dataset(df_val, batch_size=batch_size)
test_ds = df_to_dataset(df_test, batch_size=batch_size)

def build_classifier_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.2)(net)
    net = tf.keras.layers.Dense(17, activation='softmax', name='classifier')(net)
    return tf.keras.Model(text_input, net)

classifier_model = build_classifier_model()
loss = 'sparse_categorical_crossentropy'
metrics = ["accuracy"]
classifier_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)
history = classifier_model.fit(x=train_ds,
                               validation_data=val_ds,
                               epochs=epochs)

Maybe try preprocessing your DataFrame before using tf.data.Dataset.from_tensor_slices. Here is a simple working example:

import tensorflow as tf
import tensorflow_text as tf_text
import tensorflow_hub as hub
import pandas as pd

def build_classifier_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/1', name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer('https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2', trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.2)(net)
    net = tf.keras.layers.Dense(5, activation='softmax', name='classifier')(net)
    return tf.keras.Model(text_input, net)

def remove_and_split(s):
    s = s.replace('[', '')
    s = s.replace(']', '')
    return s.split(',')

def df_to_dataset(dataframe, shuffle=True, batch_size=2):
    dataframe = dataframe.copy()
    labels = tf.squeeze(tf.constant([dataframe.pop('labels')]), axis=0)
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels)).batch(
        batch_size)
    return ds

dummy_data = {'text': [
    "Improve the physical fitness of your goldfish by getting him a bicycle",
    "You are unsure whether or not to trust him but very thankful that you wore a turtle neck",
    "Not all people who wander are lost",
    "There is a reason that roses have thorns",
    "Charles ate the french fries knowing they would be his last meal",
    "He hated that he loved what she hated about hate",
], 'labels': ['[0, 1, 1, 1, 1]', '[1, 1, 1, 0, 0]', '[1, 0, 1, 0, 0]', '[1, 0, 1, 0, 0]', '[1, 1, 1, 0, 0]', '[1, 1, 1, 0, 0]']}
df = pd.DataFrame(dummy_data)
df["labels"] = df["labels"].apply(lambda x: [int(i) for i in remove_and_split(x)])

batch_size = 2
train_ds = df_to_dataset(df, batch_size=batch_size)
val_ds = df_to_dataset(df, batch_size=batch_size)
test_ds = df_to_dataset(df, batch_size=batch_size)

loss = 'categorical_crossentropy'
metrics = ["accuracy"]
classifier_model = build_classifier_model()
classifier_model.compile(optimizer='adam',
                         loss=loss,
                         metrics=metrics)
history = classifier_model.fit(x=train_ds,
                               validation_data=val_ds,
                               epochs=5)

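When using the BERT preprocessing layer, don't forget to batch the dataset you create with tf.data.Dataset.from_tensor_slices (call .batch(batch_size) on it). I also changed your loss function to categorical_crossentropy, because you are using one-hot encoded labels (at least that is what I infer from your question). The sparse_categorical_crossentropy loss function expects integer class labels, not one-hot encoded ones.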
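To make the difference between the two label formats concrete, here is a minimal sketch in plain Python (no TensorFlow needed). The helper names is_one_hot and to_class_index are hypothetical and exist only for illustration. A strictly one-hot row can be reduced to the single integer that sparse_categorical_crossentropy expects, whereas a row containing several 1s cannot.

```python
def is_one_hot(row):
    """Return True if the row contains exactly one 1 and otherwise only 0s."""
    return all(v in (0, 1) for v in row) and sum(row) == 1


def to_class_index(row):
    """Convert a one-hot row to the integer class index it encodes."""
    if not is_one_hot(row):
        raise ValueError(f"not a one-hot row: {row}")
    return row.index(1)


one_hot_labels = [[0, 0, 1], [1, 0, 0]]

# Integer labels: the format sparse_categorical_crossentropy expects.
sparse_labels = [to_class_index(r) for r in one_hot_labels]
print(sparse_labels)  # [2, 0]
```

A row such as [0, 1, 1, 1, 1] fails the is_one_hot check, so it has no single class index and has to stay in its vector form.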

You can use the tf.strings functions inside the dataset's map method.

import tensorflow as tf
x = ['[0, 1, 0]', '[1, 1, 0]']

def splitter(string):
    string = tf.strings.substr(string, 1, tf.strings.length(string) - 2)  # no brackets
    string = tf.strings.split(string, ', ')                               # isolate int
    string = tf.strings.to_number(string, out_type=tf.int32)              # as integer
    return string

ds = tf.data.Dataset.from_tensor_slices(x).map(splitter)
next(iter(ds))
<tf.Tensor: shape=(3,), dtype=int32, numpy=array([0, 1, 0])>

That being said, you could also change your DataFrame so that the targets are already one-hot encoded.
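As a rough sketch of that preprocessing step, the string-formatted label column can be parsed into real numeric lists before the dataset is built. The example uses only the standard library. parse_label_column is a hypothetical helper, and it assumes every entry is a Python-style list literal and that all rows have the same length.

```python
from ast import literal_eval


def parse_label_column(raw_labels):
    """Parse strings such as '[0, 1, 0]' into equal-length lists of floats."""
    parsed = [[float(v) for v in literal_eval(s)] for s in raw_labels]
    lengths = {len(row) for row in parsed}
    if len(lengths) > 1:
        # Ragged rows cannot be converted into one rectangular tensor.
        raise ValueError(f"rows have different lengths: {sorted(lengths)}")
    return parsed


raw = ['[0, 1, 1, 1, 1]', '[1, 1, 1, 0, 0]']
labels = parse_label_column(raw)
print(labels)  # [[0.0, 1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 0.0, 0.0]]
```

With a DataFrame, the same helper could presumably be applied to df['labels'].tolist(), and the resulting nested list passed as the label part of tf.data.Dataset.from_tensor_slices.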