I think I am missing something in the code below.
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
# Split into training and test sets
# Testing Count Vectorizer
X = df[['Spam']]
y = df['Value']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
sm = pd.concat([X_resampled, y_resampled], axis=1)
because I get the error
ValueError: could not convert string to float: ---> 19 X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
An example of the data is
Spam Value
Your microsoft account was compromised 1
Manchester United lost against PSG 0
I like cooking 0
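I would consider transforming both the training and test sets to fix the issue that causes the error, but I don't know how to apply the transformation to both of them. I tried some examples I found on Google, but they did not fix the problem.
Convert the text data to numeric features before applying SMOTE, as shown below. Fit the vectorizer on the training set only, then use it to transform both the training and test sets.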
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(X_train.values.ravel())
X_train=vectorizer.transform(X_train.values.ravel())
X_test=vectorizer.transform(X_test.values.ravel())
X_train=X_train.toarray()
X_test=X_test.toarray()
Then add your SMOTE code
X_train = pd.DataFrame(X_train)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
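You can use SMOTENC instead of SMOTE. SMOTENC handles categorical features directly, provided the dataset also contains at least one continuous feature.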
https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTENC.html#imblearn.over_sampling.SMOTENC
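Tokenizing the string data before feeding it to SMOTE is one option. You can use any tokenizer; a torch implementation would look something like this:
import torch
from imblearn.over_sampling import SMOTE
# `dataset` is assumed to be an already tokenized dataset whose items are
# dicts with fixed-length 'input_ids' and 'labels' tensors
dataloader = torch.utils.data.DataLoader(dataset, batch_size=64)
X, y = [], []
for batch in dataloader:
    input_ids = batch['input_ids']
    labels = batch['labels']
    X.append(input_ids)
    y.append(labels)
# Stack all batches into a single tensor each
X_tensor = torch.cat(X, dim=0)
y_tensor = torch.cat(y, dim=0)
# SMOTE works on NumPy arrays, not tensors
X = X_tensor.numpy()
y = y_tensor.numpy()
smote = SMOTE(random_state=42, sampling_strategy=0.6)
X_resampled, y_resampled = smote.fit_resample(X, y)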