Oversampling after splitting the dataset - text classification



I have some questions about the steps for oversampling a dataset. This is what I have done:

# Separate input features and target
y_up = df.Label
X_up = df.drop(columns=['Date','Links', 'Paths'], axis=1)
# setting up testing and training sets
X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(X_up, y_up, test_size=0.30, random_state=27)
class_0 = X_train_up[X_train_up.Label==0]
class_1 = X_train_up[X_train_up.Label==1]

# upsample minority
class_1_upsampled = resample(class_1,
                             replace=True,
                             n_samples=len(class_0),
                             random_state=27)
# combine majority and upsampled minority
upsampled = pd.concat([class_0, class_1_upsampled])

Since my dataset looks like this:

Label     Text 
1        bla bla bla
0        once upon a time 
1        some other sentences
1        a few sentences more
1        this is my dataset!

I applied a vectorizer to convert the strings into numbers:

X_train_up=upsampled[['Text']]
y_train_up=upsampled[['Label']]
X_train_up = pd.DataFrame(vectorizer.fit_transform(X_train_up['Text'].replace(np.NaN, "")).todense(), index=X_train_up.index)

Then I fitted a logistic regression:

upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up, y_train_up)

However, I get the following error at this step:

X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)
pred_up_log = upsampled_log.predict(X_test_up)

ValueError: X has 3021 features per sample; expecting 5542

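Since I was told that oversampling should be applied after splitting the dataset into train and test, I did not vectorize the test set beforehand. My doubts are the following:

  • Is it correct to vectorize the test set afterwards, like this: X_test_up = pd.DataFrame(vectorizer.fit_transform(X_test_up['Text'].replace(np.NaN, "")).todense(), index=X_test_up.index)
  • Is it correct to apply oversampling after splitting the dataset into train and test?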

Alternatively, I tried SMOTE. The code below works, but if possible I would prefer plain oversampling rather than SMOTE.

from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(df['Text'], df['Label'], test_size=0.2, random_state=42)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train_up)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

sm = SMOTE(random_state=2)
# fit_sample was renamed to fit_resample in newer imbalanced-learn releases
X_train_res, y_train_res = sm.fit_resample(X_train_tfidf, y_train_up)
print("Shape after smote is:", X_train_res.shape, y_train_res.shape)
nb = Pipeline([('clf', LogisticRegression())])
nb.fit(X_train_res, y_train_res)
# apply the same tf-idf weighting to the test counts before predicting
y_pred = nb.predict(tfidf_transformer.transform(count_vect.transform(X_test_up)))
print(accuracy_score(y_test_up, y_pred))

Any comments and suggestions would be greatly appreciated. Thanks.

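It is best to do the count vectorization and tf-idf transformation on the whole dataset, then split into test and train, and keep everything as a sparse matrix instead of converting back to a DataFrame.

For example, take this dataset: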

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'Text': ['This is bill', 'This is mac', 'here’s an old saying',
                            'at least old', 'data scientist years', 'data science is data wrangling',
                            'This rings particularly', 'true for data science leaders',
                            'who watch their data', 'scientists spend days',
                            'painstakingly picking apart', 'ossified corporate datasets',
                            'arcane Excel spreadsheets', 'Does data science really',
                            'they just delegate the job', 'Data Is More Than Just Numbers',
                            'The reason that',
                            'data wrangling is so difficult', 'data is more than text and numbers'],
                   'Label': [0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0]})

We perform the vectorization and transformation, and then split:

count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df['Text'])
tfidf_transformer = TfidfTransformer()
df_tfidf = tfidf_transformer.fit_transform(df_counts)
X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(df_tfidf, df['Label'].values,
                                                                test_size=0.2, random_state=42)

Upsampling can then be done by resampling the row indices of the minority class:

class_0 = np.where(y_train_up==0)[0]
class_1 = np.where(y_train_up==1)[0]
up_idx = np.concatenate((class_0,
                         np.random.choice(class_1, len(class_0), replace=True)
                         ))
upsampled_log = LogisticRegression(solver='liblinear').fit(X_train_up[up_idx,:], y_train_up[up_idx])

And prediction works:

upsampled_log.predict(X_test_up)
array([0, 1, 0, 0])
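
Prediction works here because the train and test rows share one fitted vocabulary. The ValueError in the question most likely comes from calling fit_transform a second time on the test set, which learns a new vocabulary with a different number of columns. A minimal sketch of the difference, using made-up toy sentences (the variable names below are illustrative and not from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences, for illustration only
train_texts = ["data science is data wrangling", "this is bill", "arcane excel spreadsheets"]
test_texts = ["data is more than numbers", "this is mac"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # vocabulary learned from train only

# Problematic: fitting again on the test set builds a different vocabulary,
# so the column count no longer matches what the model was trained on.
# (A fresh vectorizer is used here so the fitted one is not overwritten.)
X_test_refit = CountVectorizer().fit_transform(test_texts)

# Consistent: reuse the fitted vectorizer and only call transform
X_test = vectorizer.transform(test_texts)

print("train columns:", X_train.shape[1])
print("refit test columns:", X_test_refit.shape[1])
print("transformed test columns:", X_test.shape[1])
```

Words that appear only in the test sentences are simply ignored by transform, which is the expected behaviour for unseen vocabulary.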

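You may be concerned about data leakage, that is, some information from the test set making its way into training through the use of TfidfTransformer(). Honestly, I have yet to see a concrete proof or demonstration of this, but below is an alternative where tf-idf is fitted on the training part only: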

count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(df['Text'])
X_train_up, X_test_up, y_train_up, y_test_up = train_test_split(df_counts, df['Label'].values,
                                                                test_size=0.2, random_state=42)
class_0 = np.where(y_train_up==0)[0]
class_1 = np.where(y_train_up==1)[0]
up_idx = np.concatenate((class_0,
                         np.random.choice(class_1, len(class_0), replace=True)
                         ))
tfidf_transformer = TfidfTransformer()
upsample_Xtrain = tfidf_transformer.fit_transform(X_train_up[up_idx,:])
upsamle_y = y_train_up[up_idx]
upsampled_log = LogisticRegression(solver='liblinear').fit(upsample_Xtrain,upsamle_y)
X_test_up = tfidf_transformer.transform(X_test_up)
upsampled_log.predict(X_test_up)
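
If the vocabulary itself should also be learned from the training rows only, one option is to split the raw text first, upsample the training rows, and only then fit the vectorizer. The following is a rough sketch under a few assumptions: label 0 is the majority class (as in the question), the toy DataFrame is a subset of the example above, TfidfVectorizer is swapped in as a single-step equivalent of CountVectorizer followed by TfidfTransformer, and names such as train_df and minority_up are chosen for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data: 8 rows of class 0 and 4 rows of class 1 (illustrative only)
df = pd.DataFrame({
    "Text": ["This is bill", "at least old", "data science is data wrangling",
             "This rings particularly", "true for data science leaders",
             "who watch their data", "scientists spend days",
             "arcane Excel spreadsheets", "This is mac", "data scientist years",
             "Does data science really", "data wrangling is so difficult"],
    "Label": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# 1. Split the raw rows first, so nothing from the test set is seen later
train_df, test_df = train_test_split(
    df, test_size=0.25, random_state=42, stratify=df["Label"])

# 2. Upsample the minority class inside the training set only
majority = train_df[train_df["Label"] == 0]
minority = train_df[train_df["Label"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=27)
train_up = pd.concat([majority, minority_up])

# 3. Fit the vectorizer on training text, then only transform the test text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_up["Text"].fillna(""))
X_test = vectorizer.transform(test_df["Text"].fillna(""))

# 4. Train and predict; the matrices stay sparse throughout
clf = LogisticRegression(solver="liblinear").fit(X_train, train_up["Label"])
pred = clf.predict(X_test)
print(pred)
```

Because the test rows are never passed to fit or fit_transform, the feature count always matches the model and the test set stays untouched by both the resampling and the tf-idf statistics.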
