scikit-learn: grid search over feature subsets of a FeatureUnion



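How can I use FeatureUnion in scikit-learn so that GridSearchCV can treat its parts selectively?

The code below works and sets up a FeatureUnion containing a TfidfVectorizer for words and a second vectorizer for characters (a CountVectorizer in the code as posted).

When running the grid search, in addition to testing the parameter space as defined, I would also like to test "vect__wordvect" alone with its ngram_range parameter (without the character vectorizer), and "vect__lettervect" alone with the lowercase parameter set to True and False, with the other vectorizer disabled.

EDIT: complete code example based on maxymoo's suggestion.

How can I do this?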

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.datasets import fetch_20newsgroups
# setup the featureunion
wordvect = TfidfVectorizer(analyzer='word')
lettervect = CountVectorizer(analyzer='char')
featureunionvect = FeatureUnion([("lettervect", lettervect), ("wordvect", wordvect)])
# setup the pipeline
classifier = LogisticRegression(class_weight='balanced')
pipeline = Pipeline([('vect', featureunionvect), ('classifier', classifier)])
# gridsearch parameters 
parameters = {
            'vect__wordvect__ngram_range': [(1, 1), (1, 2)],  # commenting out these two lines lets the
            'vect__lettervect__lowercase': [True, False],     # search run, but nothing is parameterized anymore
            'vect__transformer_list': [[('wordvect', wordvect)],
                                        [('lettervect', lettervect)],
                                        [('wordvect', wordvect), ('lettervect', lettervect)]]}
# data
newsgroups_train = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'sci.space'])
# gridsearch CV
gs_clf = GridSearchCV(pipeline, parameters)
gs_clf = gs_clf.fit(newsgroups_train.data, newsgroups_train.target)
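# grid_scores_ no longer exists in current scikit-learn; cv_results_ holds the same information
for params, mean_score in zip(gs_clf.cv_results_['params'], gs_clf.cv_results_['mean_test_score']):
    print("gridsearch scores: ", params, mean_score)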

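FeatureUnion has a parameter called transformer_weights (alongside transformer_list) that you can grid search over. Each transformer's output is multiplied by its weight, so a weight of 0 zeroes out that transformer's features and effectively switches it off. In your case, the grid search parameters would become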

parameters = {'vect__wordvect__ngram_range': [(1, 1), (1, 2)],
              'vect__lettervect__lowercase': [True, False],
              'vect__transformer_weights': [{"lettervect":1,"wordvect":0}, 
                                            {"lettervect":0,"wordvect":1}, 
                                            {"lettervect":1,"wordvect":1}]}
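
A note on the grid above: it is a full cross product, so ngram_range is still varied in the combinations where wordvect has weight 0. As an alternative (a different technique from the transformer_weights answer), GridSearchCV also accepts a list of parameter dicts, and each dict is searched separately. The sketch below assumes a scikit-learn version in which a FeatureUnion member can be replaced by the string 'drop' via set_params. The tiny corpus and its labels are made up so the example runs without downloading 20 newsgroups, and cv=2 is chosen only because the corpus is so small.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

# Made-up toy corpus (assumption for illustration, not the asker's data).
docs = [
    "the rocket reached orbit", "NASA launched a new satellite",
    "the shuttle docked with the station", "astronauts repaired the telescope",
    "the debate was about belief", "he argued that no god exists",
    "the essay questioned religion", "atheists discussed morality",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Same structure and step names as the pipeline in the question.
pipeline = Pipeline([
    ("vect", FeatureUnion([
        ("lettervect", CountVectorizer(analyzer="char")),
        ("wordvect", TfidfVectorizer(analyzer="word")),
    ])),
    ("classifier", LogisticRegression(class_weight="balanced")),
])

ngram_ranges = [(1, 1), (1, 2)]
lowercase_options = [True, False]

# One dict per scenario; GridSearchCV expands each dict on its own.
param_grid = [
    # word features only: drop the character vectorizer, vary ngram_range
    {"vect__lettervect": ["drop"],
     "vect__wordvect__ngram_range": ngram_ranges},
    # character features only: drop the word vectorizer, vary lowercase
    {"vect__wordvect": ["drop"],
     "vect__lettervect__lowercase": lowercase_options},
    # both vectorizers active: the original parameter space
    {"vect__wordvect__ngram_range": ngram_ranges,
     "vect__lettervect__lowercase": lowercase_options},
]

gs_clf = GridSearchCV(pipeline, param_grid, cv=2)
gs_clf.fit(docs, labels)

# Report every candidate that was evaluated together with its mean CV score.
results = gs_clf.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(params, mean_score)
```

With this layout the search evaluates 2 + 2 + 4 candidates instead of the 12 produced by the cross product with three weight settings, and no candidate tunes a vectorizer that is switched off.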
