GridSearch for the best model: saving and loading parameters



I like to run the following workflow:

  1. Select a model for text vectorization
  2. Define a list of parameters
  3. Apply a pipeline with GridSearchCV on those parameters, using LogisticRegression() as a baseline, to find the best model parameters
  4. Save the best model (parameters)
  5. Load the best model parameters so that we can apply a range of other classifiers on top of this defined model

Here is the code so you can reproduce it:

Grid search:

%%time
import numpy as np
import pandas as pd
import joblib  # sklearn.externals.joblib is deprecated; use the standalone joblib package
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess

np.random.seed(0)
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split([simple_preprocess(doc) for doc in data.text],
                                                    data.label, random_state=0)

# Find the best Tfidf parameters using LogisticRegression as a baseline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)),
    ('clf', LogisticRegression())
])
parameters = {
    'tfidf__max_df': [0.25, 0.5, 0.75, 1.0],
    'tfidf__smooth_idf': (True, False),
    'tfidf__norm': ('l1', 'l2', None),
}
grid = GridSearchCV(pipeline, parameters, cv=2, verbose=1)
grid.fit(X_train, y_train)
print(grid.best_params_)

# Save model
# joblib.dump(grid.best_estimator_, 'best_tfidf.pkl', compress=1)  # this unfortunately includes the LogReg
joblib.dump(grid.best_params_, 'best_tfidf.pkl', compress=1)  # only the best parameters

Fitting 2 folds for each of 24 candidates, totalling 48 fits
{'tfidf__smooth_idf': True, 'tfidf__norm': 'l2', 'tfidf__max_df': 0.25}
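
As a side note on step 4: if you would rather persist the fitted vectorizer itself instead of only its parameters, you could dump just the tfidf step of the best pipeline. A minimal sketch under that assumption (the filename best_tfidf_vectorizer.pkl is illustrative):

# Sketch: save only the fitted TfidfVectorizer step, leaving out the LogisticRegression
best_tfidf = grid.best_estimator_.named_steps['tfidf']
joblib.dump(best_tfidf, 'best_tfidf_vectorizer.pkl', compress=1)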

Load the model with the best parameters:

from sklearn.model_selection import GridSearchCV

# Load best parameters
tfidf_params = joblib.load('best_tfidf.pkl')

pipeline = Pipeline([
    ('vec', TfidfVectorizer(preprocessor=' '.join, tokenizer=None).set_params(**tfidf_params)),  # here is the issue?
    ('clf', LogisticRegression())
])

cval = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
print("Cross-Validation Score: %s" % (np.mean(cval)))

ValueError: Invalid parameter tfidf for estimator TfidfVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.float64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=<built-in method join of str object at 0x...>, smooth_idf=True, stop_words=None, strip_accents=None, sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True, vocabulary=None). Check the list of available parameters with `estimator.get_params().keys()`.
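
A quick way to see what set_params actually receives is to inspect the loaded dictionary; the keys carry the pipeline step prefix, which TfidfVectorizer itself does not recognise (a sketch, assuming the file saved above):

tfidf_params = joblib.load('best_tfidf.pkl')
print(tfidf_params)
# e.g. {'tfidf__smooth_idf': True, 'tfidf__norm': 'l2', 'tfidf__max_df': 0.25}
# TfidfVectorizer has no parameter called 'tfidf__max_df', hence the ValueError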

Question:

How can I load the best parameters of the Tfidf model?

This line:

joblib.dump(grid.best_params_, 'best_tfidf.pkl', compress = 1) # Only best parameters

saves the parameters of the pipeline, not of the TfidfVectorizer. Do this instead:

pipeline = Pipeline([
    # Use the same step name as before ('tfidf') so the prefixed keys match
    ('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)),
    ('clf', LogisticRegression())
])
pipeline.set_params(**tfidf_params)
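
Alternatively, if you want to keep a different step name such as 'vec', you can strip the 'tfidf__' prefix from the saved keys and apply the values to the vectorizer directly. A minimal sketch, not the only way to do it:

# Sketch: drop the pipeline prefix so the parameters apply to TfidfVectorizer itself
vec_params = {key.split('__', 1)[1]: value
              for key, value in tfidf_params.items()
              if key.startswith('tfidf__')}
vectorizer = TfidfVectorizer(preprocessor=' '.join, tokenizer=None).set_params(**vec_params)
pipeline = Pipeline([
    ('vec', vectorizer),
    ('clf', LogisticRegression())
])

Both variants end up with the same TfidfVectorizer configuration; the first simply lets the pipeline route the prefixed keys to the right step for you.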
