Fine-tuning BERT for sequence classification on the Sentiment140 dataset gives very poor results



I am using:

  • the Sentiment140 dataset
  • BertTokenizerFast for text tokenization
  • TFBertForSequenceClassification for text classification

I want to fine-tune the model (TFBertForSequenceClassification) on the dataset (sentiment140).

When I do, my model performs really poorly.

With 10K tweets (~1 hour of training):

  • ROC AUC score: 0.131
  • Average Precision score: 0.325

With 1M tweets (~9 hours of training):

  • ROC AUC score: 0.883
  • Average Precision score: 0.822

The notebook runs are available on Kaggle.

I must be missing something obvious, but I really can't find what... Is it "just" a matter of training data volume? Or am I not using the right parameters/metrics/optimizer?

This is basically my code:

from tqdm import tqdm
# Maths modules
import numpy as np
import pandas as pd
import tensorflow as tf
# Load data from CSV
df = pd.read_csv(
    "../input/sentiment140/training.1600000.processed.noemoticon.csv",
    names=["target", "id", "date", "flag", "user", "text"],
    encoding="ISO-8859-1",
)
# Drop useless columns
df.drop(columns=["id", "date", "flag", "user"], inplace=True)

# Replace target values with labels
df.target.replace(
    {
        0: "NEGATIVE",
        2: "NEUTRAL",
        4: "POSITIVE",
    },
    inplace=True,
)
# And back to binary values
df.target.replace(
    {
        "NEGATIVE": 0,
        "POSITIVE": 1,
    },
    inplace=True,
)

# Sample data for development
TEXT_SAMPLE_SIZE = 10000  # <= 0 for all
# Sample data
if TEXT_SAMPLE_SIZE > 0:
    df = df.groupby("target", group_keys=False).apply(
        lambda x: x.sample(
            n=int(TEXT_SAMPLE_SIZE / df["target"].nunique()), random_state=42
        )
    ).reset_index(drop=True)

# Bert Tokenizers
from transformers import BertTokenizerFast
BERT_MODEL = "bert-base-uncased"
tokenizer = BertTokenizerFast.from_pretrained(BERT_MODEL, do_lower_case=True)
input_ids = np.asarray([tokenizer(sent, padding="max_length", truncation=True)["input_ids"] for sent in tqdm(df.text)])
attention_mask = np.asarray([tokenizer(sent, padding="max_length", truncation=True)["attention_mask"] for sent in tqdm(df.text)])
token_type_ids = np.asarray([tokenizer(sent, padding="max_length", truncation=True)["token_type_ids"] for sent in tqdm(df.text)])

from sklearn.model_selection import train_test_split

# Train-test split
(
    texts_train,
    texts_test,
    input_ids_train,
    input_ids_test,
    attention_mask_train,
    attention_mask_test,
    token_type_ids_train,
    token_type_ids_test,
    labels_train,
    labels_test,
) = train_test_split(
    df.text.values,
    input_ids,
    attention_mask,
    token_type_ids,
    df.target.values,
    test_size=0.2,
    stratify=df.target.values,
    random_state=42,
)

from transformers import TFBertForSequenceClassification
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import BinaryAccuracy

# Define NN model
print("Defining model...")
model = TFBertForSequenceClassification.from_pretrained(
    BERT_MODEL, num_labels=2
)
# compile NN network
print("Compiling model...")
model.compile(
    loss=BinaryCrossentropy(),
    optimizer=Adam(learning_rate=2e-5),  # Value recommended by the Bert team
    metrics=BinaryAccuracy(),
)
# fit NN model
print("Fitting model...")
model.fit(
    [input_ids_train, attention_mask_train, token_type_ids_train],
    labels_train,
    epochs=10,
    batch_size=8,
    validation_split=0.2,
    callbacks=[
        EarlyStopping(monitor="val_loss", patience=2),
    ],
    workers=4,
    use_multiprocessing=True,
)
print(model.summary())

# Get predictions
y_pred = model.predict([input_ids_test, attention_mask_test, token_type_ids_test])
y_pred_proba = [float(x[1]) for x in tf.nn.softmax(y_pred.logits)]
y_pred_label = [0 if x[0] > x[1] else 1 for x in tf.nn.softmax(y_pred.logits)]

# Evaluate the model
from sklearn.metrics import (
    confusion_matrix,
    roc_auc_score,
    average_precision_score,
)
print("Confusion Matrix : ")
print(confusion_matrix(labels_test, y_pred_label))
print("ROC AUC score : ", round(roc_auc_score(labels_test, y_pred_proba), 3))
print("Average Precision score : ", round(average_precision_score(labels_test, y_pred_proba), 3))

With this, I get the following logs during training:

Defining model...
All model checkpoint layers were used when initializing TFBertForSequenceClassification.
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Compiling model...
Fitting model...
Epoch 1/10
[...]
[==============================] - 5756s 72ms/step - loss: 0.4057 - binary_accuracy: 0.8482 - val_loss: 0.4579 - val_binary_accuracy: 0.8421
Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
=================================================================
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
None

And this classification performance:

Confusion Matrix : 
[[17631 82369]
[ 415 99585]]
ROC AUC score :  0.883
Average Precision score :  0.822

Thanks for your help!

First, you should try using BERTweet as the base model, which should improve performance: BERT_MODEL = "vinai/bertweet-base"
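
For example, here is a minimal sketch of swapping it into the TensorFlow pipeline from the question (BERTweet ships its own tokenizer, so AutoTokenizer is used instead of BertTokenizerFast; from_pt=True is only needed if the checkpoint publishes PyTorch weights only):

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

BERT_MODEL = "vinai/bertweet-base"

# BERTweet uses its own BPE tokenizer, so load it generically
tokenizer = AutoTokenizer.from_pretrained(BERT_MODEL)

model = TFAutoModelForSequenceClassification.from_pretrained(
    BERT_MODEL,
    num_labels=2,
    from_pt=True,  # convert the PyTorch weights if no TF weights are available
)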

Second, I am personally using PyTorch. Here is the implementation I used for my use case:


from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    pipeline,
)
from transformers.data.data_collator import DataCollatorWithPadding
import torch
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import (
    precision_recall_fscore_support,
    accuracy_score,
    classification_report,
)

train_texts = train_df.text.tolist()
val_texts = val_df.text.tolist()
train_encodings = tokenizer(train_texts, truncation=True, padding='max_length')
val_encodings = tokenizer(val_texts, truncation=True, padding='max_length')

class SentiDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = SentiDataset(train_encodings, train_df.label.tolist())
eval_dataset = SentiDataset(val_encodings, val_df.label.tolist())

def compute_metrics(pred, id2label):
    """
    Compute metrics for Trainer
    """
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    return get_metrics(preds, labels, id2label)

def get_metrics(preds, labels, id2label):
    ret = {}
    f1s, precs, recalls = [], [], []
    for i, cat in id2label.items():
        cat_labels, cat_preds = labels == i, preds == i
        precision, recall, f1, _ = precision_recall_fscore_support(
            cat_labels, cat_preds, average='binary', zero_division=0,
        )
        f1s.append(f1)
        precs.append(precision)
        recalls.append(recall)
        ret[cat.lower() + "_f1"] = f1
        ret[cat.lower() + "_precision"] = precision
        ret[cat.lower() + "_recall"] = recall
    _, _, micro_f1, _ = precision_recall_fscore_support(
        labels, preds, average="micro"
    )
    ret["micro_f1"] = micro_f1
    ret["macro_f1"] = torch.Tensor(f1s).mean()
    # ret["macro_precision"] = torch.Tensor(precs).mean()
    # ret["macro_recall"] = torch.Tensor(recalls).mean()
    # ret["acc"] = accuracy_score(labels, preds)
    return ret

batch_size = 16
eval_batch_size = 32
epochs = 1
total_steps = (epochs * len(train_dataset)) // batch_size
warmup_steps = total_steps // 10
training_args = TrainingArguments(
    output_dir='.results',
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=eval_batch_size,
    eval_steps=100,
    warmup_steps=warmup_steps,
    evaluation_strategy="steps",
    do_eval=False,
    save_strategy="epoch",
    weight_decay=0.01,
    logging_dir='./logs',
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=lambda x: compute_metrics(x, id2label=id2label),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
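
Note that the snippet assumes a few objects (tokenizer, model, id2label, data_collator, train_df, val_df) are already defined. As a purely illustrative sketch (the label names and DataFrame columns here are assumptions, not taken from the original code), they could look like this; once the Trainer is built, trainer.train() and trainer.evaluate() run the fine-tuning and evaluation:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers.data.data_collator import DataCollatorWithPadding

BERT_MODEL = "vinai/bertweet-base"

# Assumed binary label mapping (illustrative only)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {label: i for i, label in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained(BERT_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    BERT_MODEL, num_labels=len(id2label), id2label=id2label, label2id=label2id
)
# Collates examples into padded batches for the Trainer
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# train_df / val_df: pandas DataFrames with "text" and "label" columns (assumed)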

Try using num_labels=1 instead of num_labels=2 in:

model = TFBertForSequenceClassification.from_pretrained(
    BERT_MODEL, num_labels=2
)

You only need one class, a "sentiment" class (close to 0 => negative, close to 1 => positive).
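
A minimal sketch of that change, assuming the rest of the question's pipeline stays the same (with num_labels=1 the model returns one raw logit per tweet, so the loss and the accuracy metric are configured for logits here):

from transformers import TFBertForSequenceClassification
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import BinaryAccuracy

BERT_MODEL = "bert-base-uncased"

model = TFBertForSequenceClassification.from_pretrained(BERT_MODEL, num_labels=1)
model.compile(
    loss=BinaryCrossentropy(from_logits=True),  # one raw logit per example
    optimizer=Adam(learning_rate=2e-5),
    metrics=[BinaryAccuracy(threshold=0.0)],  # 0.0 is the decision boundary for logits
)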

I ran into the same problem and this worked for me (with 1K samples).

Luc (from OC.IA.p7 ;))
