Cross-validation and text classification

I have the same question as the one asked here:

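I have a question about using cross-validation in sklearn. Vectorizing all of the data before cross-validation is problematic, because the classifier will then have "seen" the vocabulary that occurs in the test data. Weka has FilteredClassifier to solve this problem. What is the sklearn equivalent of that feature? What I mean is that the feature set should be different for each fold, because the training data is different.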

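However, because I do a lot of processing on my data between the vectorization step and the classification step, I can't use a pipeline, so I am trying to implement cross-validation myself as an outer loop around the whole process. Any guidance on this would be appreciated, as I am quite new to both Python and scikit-learn.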

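I think using a cross-validation iterator as the outer loop is a good idea, and it is a starting point that keeps your steps clear and readable: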

import numpy as np
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20

X = np.array(["Science today", "Data science", "Titanic", "Batman"])  # raw text
y = np.array([1, 1, 2, 2])  # categories, e.g. Science, Movies
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    x_train, y_train = X[train_index], y[train_index]
    x_test, y_test = X[test_index], y[test_index]
    # Now continue with your pre-processing steps..
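To make the outer loop concrete, here is a minimal sketch of how each fold could fit its own vectorizer on the training split only, so that the test split never contributes vocabulary. The extra documents, the choice of CountVectorizer and LinearSVC, and the names fold_scores and fold_vocab_sizes are assumptions for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

# Toy data (assumed): three documents per category
X = np.array(["Science today", "Data science", "Science news",
              "Titanic", "Batman", "Batman returns"])
y = np.array([1, 1, 1, 2, 2, 2])

kf = KFold(n_splits=3, shuffle=True, random_state=0)
fold_scores = []
fold_vocab_sizes = []
for train_index, test_index in kf.split(X):
    x_train, y_train = X[train_index], y[train_index]
    x_test, y_test = X[test_index], y[test_index]

    # Learn the vocabulary from the training split only
    vectorizer = CountVectorizer()
    train_matrix = vectorizer.fit_transform(x_train)
    # Words that appear only in the test split are ignored here
    test_matrix = vectorizer.transform(x_test)

    # Any custom processing between vectorizing and classifying would go here

    clf = LinearSVC()
    clf.fit(train_matrix, y_train)
    fold_scores.append(clf.score(test_matrix, y_test))
    fold_vocab_sizes.append(len(vectorizer.vocabulary_))

print(fold_vocab_sizes)
print(np.mean(fold_scores))
```

The vocabulary size can differ from fold to fold, which is the behaviour the question asks for: the feature set depends only on that fold's training data.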

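I may be missing the point of your question, and I'm not familiar with Weka, but you can pass a vocabulary as a dictionary to the vectorizers used in sklearn. Here is an example that skips the word "second" in the test set by using only the features from the training set.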

from sklearn.feature_extraction.text import CountVectorizer
train_vectorizer = CountVectorizer()
train = [
    'this is the first',
    'set of documents'
    ]
train_matrix = train_vectorizer.fit_transform(train)
train_vocab = train_vectorizer.vocabulary_
test = [
    'this is the second',
    'set of documents'
    ]
test_vectorizer = CountVectorizer(vocabulary=train_vocab)
test_matrix = test_vectorizer.fit_transform(test)
print(train_vocab)
print(train_matrix.toarray())
print('\n')
print(test_vectorizer.vocabulary_)
print(test_matrix.toarray())
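As a hedged alternative to building a second vectorizer, the vectorizer fitted on the training set can be reused by calling transform on the test documents. The sketch below compares the two approaches on the same toy documents; the variable names via_transform, via_fixed_vocab, same and has_second are made up for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

train = ['this is the first', 'set of documents']
test = ['this is the second', 'set of documents']

# Fit on the training documents only
train_vectorizer = CountVectorizer()
train_vectorizer.fit(train)

# Option 1: reuse the fitted vectorizer; unseen words are dropped
via_transform = train_vectorizer.transform(test)

# Option 2: a second vectorizer with the training vocabulary fixed
fixed_vectorizer = CountVectorizer(vocabulary=train_vectorizer.vocabulary_)
via_fixed_vocab = fixed_vectorizer.fit_transform(test)

same = np.array_equal(via_transform.toarray(), via_fixed_vocab.toarray())
has_second = 'second' in train_vectorizer.vocabulary_
print(same, has_second)
```

Both options yield matrices with one column per training word, so either can feed a classifier trained on the training matrix.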

Also note that you can use your own preprocessing and/or tokenization in the vectorizers, for example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def preprocessor(string):
    # do logic here; placeholder: return the text unchanged
    return string

def tokenizer(string):
    # do logic here; placeholder: split on whitespace
    return string.split()

# the keyword argument is `preprocessor`, not `processor`
clf = Pipeline([('vect', TfidfVectorizer(preprocessor=preprocessor, tokenizer=tokenizer)), ('svm', LinearSVC())])
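If the custom processing can be expressed through the preprocessor and tokenizer hooks, the pipeline can be handed to cross_val_score together with the raw text, and the vectorizer is then refit on each training fold. The following is a rough sketch under that assumption; the toy documents, the lowercasing preprocessor and the whitespace tokenizer are placeholders, not the asker's actual logic.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def preprocessor(text):
    # Placeholder preprocessing: lowercase the raw text
    return text.lower()

def tokenizer(text):
    # Placeholder tokenization: split on whitespace
    return text.split()

# Toy data (assumed): three documents per category
docs = ["Science today", "Data science", "Science news",
        "Titanic movie", "Batman movie", "Batman returns"]
labels = [1, 1, 1, 2, 2, 2]

clf = Pipeline([
    ('vect', TfidfVectorizer(preprocessor=preprocessor, tokenizer=tokenizer)),
    ('svm', LinearSVC()),
])

# The raw text goes in; each fold refits the vectorizer on its training part
scores = cross_val_score(clf, docs, labels, cv=3)
print(scores)
```

scikit-learn may warn that token_pattern is unused when a custom tokenizer is supplied; that warning is harmless in this sketch.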
