sklearn.pipeline.Pipeline: Fitting CountVectorizer on a corpus different from the training text



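I am working through the "Sample pipeline for text feature extraction and evaluation" example from the scikit-learn documentation. There, they show the following pipeline: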

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
pipeline = Pipeline(
    [
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier()),
    ]
)

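They then go on to use this pipeline with GridSearchCV. In this example, the CountVectorizer is fitted on the training dataset and then used to extract features. What I would like to do instead is fit the CountVectorizer on a larger corpus and then apply it to the training data to obtain the feature vectors. Is there a simple way to do this while keeping the sklearn.pipeline.Pipeline API, i.e. without subclassing sklearn.pipeline.Pipeline and significantly changing its methods?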

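I want to keep the sklearn.pipeline.Pipeline API because I would like to take advantage of GridSearchCV, and structuring things this way would be very convenient and clean.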

from sklearn.feature_extraction.text import CountVectorizer
# Suppose corpus is your big corpus
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# First fit on the big corpus and get the feature names from it
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# Now vectorize the new dataset using the vocabulary learned above
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
vocabulary = vectorizer.get_feature_names_out()
new_train_corpus = ["how are you doing", "I am fine", "I am reading first document"]
new_vect = CountVectorizer(vocabulary=vocabulary)  # reuse the vocabulary from the big corpus
new_vect.fit_transform(new_train_corpus)
list(new_vect.get_feature_names_out())
# Output: words not in the supplied vocabulary are ignored; the vectorizer uses only that vocabulary

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

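Note: if you have a fixed list of keywords, you can pass it directly as the vocabulary. If you would rather learn the vocabulary (possibly with some feature selection), fit a vectorizer on the larger corpus first and then reuse that vocabulary on the training dataset.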
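To illustrate the fixed-keyword case mentioned above, here is a minimal sketch. The `keywords` list and the two sample documents are made up for illustration. With a fixed vocabulary, the output columns follow the order of the list that was passed in.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical fixed keyword list (an assumption for this sketch).
keywords = ["document", "first", "second"]

vect = CountVectorizer(vocabulary=keywords)
docs = [
    "This is the first document.",
    "A second document and a third document.",
]
# Fitting does not learn any new terms; only the keywords are counted.
X = vect.fit_transform(docs)
counts = X.toarray().tolist()
print(counts)  # columns: document, first, second
```

Words outside the keyword list (such as "third") simply do not get a column.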

The documentation shows how to use GridSearchCV with a Pipeline: https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py

pipeline = Pipeline(
    [
        ("vect", CountVectorizer(vocabulary=vocabulary)),  # pass the pre-built vocabulary here
        ("tfidf", TfidfTransformer()),
        ("clf", yourmodel()),  # placeholder: substitute your own estimator, e.g. SGDClassifier()
    ]
)

Set the parameters as needed and pass the pipeline to GridSearchCV:

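from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(pipeline, parameters)  # parameters: your dict of step__param -> candidate values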
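Putting the pieces together, the following is a rough end-to-end sketch rather than a definitive recipe. The toy training documents, labels, parameter grid, `cv=3`, and `random_state=0` are all assumptions chosen so the example runs on its own. It takes the vocabulary from the fitted vectorizer's `vocabulary_` attribute (a dict mapping each term to its column index), which `CountVectorizer` also accepts for its `vocabulary` parameter.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Stand-in for the larger corpus; used only to learn the vocabulary.
big_corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
vocabulary = CountVectorizer().fit(big_corpus).vocabulary_  # dict: term -> column index

# Hypothetical labelled training data (toy values for illustration only).
train_docs = [
    "the first document", "this is the first one", "first document here",
    "the second document", "this is the second one", "second document here",
]
train_labels = [0, 0, 0, 1, 1, 1]

pipeline = Pipeline(
    [
        ("vect", CountVectorizer(vocabulary=vocabulary)),  # fixed vocabulary
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier(random_state=0)),
    ]
)
# Assumed parameter grid; tune whatever steps you need.
parameters = {"tfidf__use_idf": [True, False], "clf__alpha": [1e-4, 1e-3]}

grid_search = GridSearchCV(pipeline, parameters, cv=3)
grid_search.fit(train_docs, train_labels)

# The vectorizer inside the best pipeline still uses only the big-corpus vocabulary.
fitted_vect = grid_search.best_estimator_.named_steps["vect"]
print(fitted_vect.vocabulary_ == dict(vocabulary))
print(grid_search.best_params_)
```

Note that only the vocabulary is fixed here. The TfidfTransformer still learns its IDF weights from the training folds during each fit.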
