How to resample text (imbalanced classes) in a pipeline

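I'm trying to do some text classification with MultinomialNB, but I'm running into problems because my data is imbalanced. (For simplicity, some sample data is shown below; my real data set is much larger.) I'm trying to resample the data with oversampling, and ideally I'd like to build that step into this pipeline.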

The pipeline below works fine without oversampling, but my real data needs it because it is highly imbalanced.

With the current code, I keep getting this error: "TypeError: All intermediate steps should be transformers and implement fit and transform."

How do I build RandomOverSampler into this pipeline?

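import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline  # sklearn's Pipeline: this is what triggers the error
from imblearn.over_sampling import RandomOverSampler

data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],
    ['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
    ['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
    ['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'],
    ['small fruits', 'grapes']]
df = pd.DataFrame(data, columns=['Description','Type'])
X, y = df['Description'], df['Type']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Fails with the TypeError: RandomOverSampler is a sampler, not a transformer
text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('RUS', RandomOverSampler()),
                    ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print('Score:', text_clf.score(X_test, y_test))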

You should use the Pipeline implemented in the imblearn package instead of the one from sklearn. For example, this code runs fine:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],
    ['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
    ['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
    ['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'],
    ['small fruits', 'grapes']]
df = pd.DataFrame(data, columns=['Description','Type'])
X_train, X_test, y_train, y_test = train_test_split(df['Description'],
    df['Type'], random_state=0)
text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('RUS', RandomOverSampler()),
                    ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)
print('Score:',text_clf.score(X_test, y_test))
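
For intuition about what the oversampling step contributes, here is a minimal standard-library sketch of random oversampling. It is not imblearn's implementation: the function name `random_oversample` and its `seed` parameter are made up for illustration, and it works on plain lists rather than the sparse TF-IDF matrix the pipeline actually passes along. The idea is simply to duplicate randomly chosen minority-class rows until every class is as large as the biggest one.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Hypothetical helper: duplicate randomly chosen minority-class rows
    until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())

    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Indices of the original rows that belong to this class.
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Draw with replacement to make up the shortfall (nothing to draw
        # for the majority class, where target - count is 0).
        for i in rng.choices(idx, k=target - count):
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out


X = ['round red fruit that is sweet', 'long yellow fruit with a peel',
     'tiny fruits that grow in bunches', 'purple fruits', 'tiny fruits']
y = ['apple', 'banana', 'grapes', 'grapes', 'grapes']

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # every class now has the same number of rows
```

In the imblearn Pipeline the sampler is applied only while fitting, so the test data passed to predict and score is never resampled. That is the behaviour you want, since duplicated rows should only influence training.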
