Accuracy gets worse after text preprocessing



I am working on a multi-class text classification project.

After splitting the dataset into training and test sets, I applied the following function (a.k.a. preprocessing) to the training set:

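import re
from nltk.corpus import stopwords

# `stemmer` is not defined in this snippet; presumably a stemmer object (e.g. from NLTK) created elsewhere.
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    # lowercase text
    text = text.lower()

    # delete bad symbols (mentions, non-alphanumeric characters, URLs, leading "rt")
    text = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)|^rt|http.+?", "", text)

    # delete stopwords from text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)

    # stem the words
    text = ' '.join([stemmer.stem(word) for word in text.split()])

    return text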

To my surprise, after applying it to the training set I got worse results (i.e. val_accuracy) than with "doing nothing" (59% vs. 69%).

For the "doing nothing" run, I commented out the apply line in the section below:

all_data = dataset.sample(frac=1).reset_index(drop=True)
train_df, valid = train_test_split(all_data, test_size=0.2)
train_df['text'] = train_df['text'].apply(clean_text)

What am I missing? How can a preprocessing step lower the accuracy?

More information

I forgot to mention that I am using the following to tokenize the text:

X_train = train.iloc[:, :-1]
y_train = train.iloc[:, -1:]
X_test = valid.iloc[:, :-1]
y_test = valid.iloc[:, -1:]
weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(y_train),
                                            y=y_train.values.reshape(-1))
le = LabelEncoder()
le.fit(weights)
class_weights_dict = dict(zip(le.transform(list(le.classes_)), weights))

tokenizer = Tokenizer(num_words=vocab_size, oov_token='<OOV>')
tokenizer.fit_on_texts(X_train['text'])
train_seq = tokenizer.texts_to_sequences(X_train['text'])
train_padded = pad_sequences(train_seq, maxlen=max_length, padding=padding_type, truncating=trunc_type)
validation_seq = tokenizer.texts_to_sequences(X_test['text'])
validation_padded = pad_sequences(validation_seq, maxlen=max_length, padding=padding_type, truncating=trunc_type)

Later, I feed everything into the model as follows:

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=train_padded.shape[1]))
model.add(Conv1D(48, len(GROUPS), activation='relu', padding='valid'))
model.add(GlobalMaxPooling1D())
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dropout(0.5))
model.add(Dense(len(GROUPS), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
epochs = 100
batch_size = 32
history = model.fit(train_padded, training_labels, shuffle=True,
                    epochs=epochs, batch_size=batch_size,
                    class_weight=class_weights_dict,
                    validation_data=(validation_padded, validation_labels),
                    callbacks=[ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=0.0001),
                               EarlyStopping(monitor='val_loss', mode='min', patience=2, verbose=1),
                               EarlyStopping(monitor='val_accuracy', mode='max', patience=5, verbose=1)])

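You can debug which step lowers the accuracy, and check the guess about stemming given here, as follows: fix the train/test split, then switch one of the four preprocessing steps (lowercasing, symbol removal, stopword removal, stemming) off while keeping the other three on, and run your classification pipeline. Repeat this for each preprocessing step, always with the same split, and then compare the results.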
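The toggling procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the small `STOPWORDS` set and `toy_stem` are hypothetical stand-ins so the sketch runs without NLTK, the regex is the question's pattern with its backslashes restored, and `evaluate` is a hypothetical callback that would train and score the model on a fixed split using the given cleaning function.

```python
import re

# Hypothetical stand-ins so the sketch runs without NLTK; swap in the
# real STOPWORDS set and stemmer from the question.
STOPWORDS = {"the", "a", "is", "to", "and"}

def toy_stem(word):
    # Crude suffix stripping for illustration only, not a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# The four preprocessing steps, in the order the question applies them.
STEPS = {
    "lowercase": lambda t: t.lower(),
    "symbols": lambda t: re.sub(
        r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)|^rt|http.+?", "", t),
    "stopwords": lambda t: " ".join(w for w in t.split() if w not in STOPWORDS),
    "stemming": lambda t: " ".join(toy_stem(w) for w in t.split()),
}

def make_cleaner(disabled=None):
    """Return a clean_text variant that skips the step named in `disabled`."""
    active = [fn for name, fn in STEPS.items() if name != disabled]

    def clean(text):
        for fn in active:
            text = fn(text)
        return text

    return clean

def run_ablation(evaluate):
    """Call evaluate(clean_fn) with all steps on, then once per disabled step."""
    results = {"all steps": evaluate(make_cleaner())}
    for name in STEPS:
        results["without " + name] = evaluate(make_cleaner(disabled=name))
    return results
```

In practice `evaluate` would reuse the same `train_test_split` result (for example by passing a fixed `random_state`) for every variant, so that differences in validation accuracy come from the preprocessing and not from a different split.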

If I had to guess, the stemming step is what is causing your problem (the lower accuracy).

# Stemming the words
text = ' '.join([stemmer.stem(word) for word in text.split()])

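Stemming makes texts that differ subtly (and sometimes not so subtly), for example in their suffixes, identical. This can make the vector representations of texts from different classes more similar to each other than they would otherwise be, which may make them harder to tell apart and so lower your classification accuracy.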
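A small sketch of this effect, using a made-up two-class mini-corpus and a crude suffix stripper (`toy_stem`, a hypothetical stand-in and not a real stemming algorithm): before stemming the two classes share only one token, while after stemming they also share a stem. A real stemmer such as NLTK's PorterStemmer applies more elaborate rules, but the collapsing effect is the same in kind.

```python
def toy_stem(word):
    """Crude suffix stripper used only for illustration; not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def vocabulary(texts, stem=False):
    """Collect the set of distinct tokens, optionally after stemming."""
    tokens = (w for t in texts for w in t.lower().split())
    return {toy_stem(w) if stem else w for w in tokens}

# Hypothetical mini-corpus: two classes whose wording differs mainly in suffixes.
class_a = ["the user reported a bug", "users reporting bugs"]
class_b = ["the report is attached", "reports attached"]

raw_overlap = vocabulary(class_a) & vocabulary(class_b)
stemmed_overlap = vocabulary(class_a, stem=True) & vocabulary(class_b, stem=True)

print(sorted(raw_overlap))      # ['the']
print(sorted(stemmed_overlap))  # ['report', 'the']
```

Here "reported" and "reporting" (class A) and "report" and "reports" (class B) all collapse to the same stem, so a token that previously separated the classes no longer does.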
