Keras pad_sequences and Tokenizer



I'm practicing NLP on a Kaggle dataset of tweets. While tokenizing the tweets and padding them I ran into an error; I searched for a solution but couldn't find an answer.
# Get the max number of words in a tweet
texts = df['text']
LENGTH = texts.apply(lambda p: len(p.split()))
x = df['text']
y = df['target']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.30, random_state=41)

tokenize = Tokenizer()
tokenize.fit_on_texts(x)
x = tokenize.texts_to_sequences(x)
print('start padding ...')
# Padding tweets to be the same length
x = pad_sequences(x, maxlen=LENGTH)

I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_34/2607522322.py in <module>
8 
9 # Padding Tweets To Be The Same Length
---> 10 x = pad_sequences(x ,maxlen=LENGTH)
/opt/conda/lib/python3.7/site-packages/keras/preprocessing/sequence.py in pad_sequences(sequences, maxlen, dtype, padding, truncating, value)
152   return sequence.pad_sequences(
153       sequences, maxlen=maxlen, dtype=dtype,
--> 154       padding=padding, truncating=truncating, value=value)
155 
156 keras_export(
/opt/conda/lib/python3.7/site-packages/keras_preprocessing/sequence.py in pad_sequences(sequences, maxlen, dtype, padding, truncating, value)
83                          .format(dtype, type(value)))
84 
---> 85     x = np.full((num_samples, maxlen) + sample_shape, value, dtype=dtype)
86     for idx, s in enumerate(sequences):
87         if not len(s):
/opt/conda/lib/python3.7/site-packages/numpy/core/numeric.py in full(shape, fill_value, dtype, order, like)
340         fill_value = asarray(fill_value)
341         dtype = fill_value.dtype
--> 342     a = empty(shape, dtype, order)
343     multiarray.copyto(a, fill_value, casting='unsafe')
344     return a
TypeError: 'Series' object cannot be interpreted as an integer

The problem is that LENGTH is not an integer but a pandas Series. Try something like this:

from sklearn.model_selection import train_test_split
import pandas as pd
import tensorflow as tf

df = pd.DataFrame({'text': ['is upset that he cant update his Facebook by texting it... and might cry as a result  School today also. Blah!',
                            '@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds',
                            'my whole body feels itchy and like its on fire',
                            '@nationwideclass no, its not behaving at all. im mad. why am i here? because I cant see you all over there.',
                            '@Kwesidei not the whole crew'],
                   'target': [0, 1, 0, 0, 1]})
x = df['text'].values
y = df['target'].values
# maxlen must be a plain integer: here, the longest tweet measured in words
max_length = max([len(d.split()) for d in x])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.30, random_state=41)
tokenize = tf.keras.preprocessing.text.Tokenizer()
tokenize.fit_on_texts(x)
x = tokenize.texts_to_sequences(x)
print('start padding ...')
x = tf.keras.preprocessing.sequence.pad_sequences(x, maxlen=max_length)
print(x)
start padding ...
[[ 9 10 11 12  3 13 14 15 16 17 18  4 19 20 21 22 23 24 25 26 27]
 [ 0  0  0 28  1 29 30 31 32  2 33 34 35 36 37  2 38 39 40 41 42]
 [ 0  0  0  0  0  0  0  0  0  0  0 43  5 44 45 46  4 47  6 48 49]
 [50 51  6  7 52 53  8 54 55 56 57  1 58 59  1  3 60 61  8 62 63]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 64  7  2  5 65]]
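Equivalently, a minimal sketch of the fix applied to your original snippet, keeping your variable names: pass a single integer for maxlen (for example the maximum word count) instead of the whole Series.

# LENGTH is a pandas Series of per-tweet word counts; maxlen needs one integer
max_length = int(LENGTH.max())
x = pad_sequences(x, maxlen=max_length)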

If you want to use post-padding, run:

x = tf.keras.preprocessing.sequence.pad_sequences(x, maxlen=max_length, padding='post')
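If you then want to feed the splits from train_test_split into a model, a rough sketch would be to reuse the same fitted tokenizer and the same max_length for both splits so their shapes match (the x_train_seq / x_test_seq names are just illustrative):

# Assumed continuation of the answer's code: transform and pad both splits
x_train_seq = tf.keras.preprocessing.sequence.pad_sequences(
    tokenize.texts_to_sequences(x_train), maxlen=max_length, padding='post')
x_test_seq = tf.keras.preprocessing.sequence.pad_sequences(
    tokenize.texts_to_sequences(x_test), maxlen=max_length, padding='post')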
