Logistic regression in Python: GridSearchCV not working



I'm fairly new to data mining, and to text analytics in particular. I'm training a logistic regression model on my dataset, and I'm trying to get the best accuracy I can, at least somewhere around 0.6 — but I can't seem to get above 0.5. Here is my dataset:

df = pd.read_csv('https://raw.githubusercontent.com/cpedroni/DMML2021_Microsoft/main/data/training_data.csv')
df_pred = pd.read_csv('https://raw.githubusercontent.com/cpedroni/DMML2021_Microsoft/main/data/unlabelled_test_data.csv')

I train my model using a pipeline with TF-IDF:

tfidf_params = dict(sublinear_tf=True,
                    min_df=4,
                    norm='l2',
                    ngram_range=(1, 4),
                    tokenizer=word_tokenize)
clf = Pipeline(steps=[
    ('features', TfidfVectorizer(**tfidf_params)),
    ('model', LogisticRegression(random_state=0, solver='lbfgs', max_iter=300))
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

def metrics(y_test, y_pred):
    print("Precision: " + str(precision_score(y_test, y_pred, average='micro')))
    print("Recall: " + str(recall_score(y_test, y_pred, average='micro')))
    print("F1: " + str(f1_score(y_test, y_pred, average='micro')))
    print("Accuracy: " + str(accuracy_score(y_test, y_pred)))

metrics(y_test, y_pred)

I get an accuracy score of 0.471875, but I'd like it to be higher, so I tried a grid search like this:

from sklearn.model_selection import GridSearchCV

param_grid_lr = {
    'max_iter': [20, 50, 100, 200, 500, 1000],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'class_weight': ['balanced']
}
logModel_grid = GridSearchCV(estimator=LogisticRegression(random_state=1234),
                             param_grid=param_grid_lr, verbose=1, cv=10, n_jobs=-1)
logModel_grid.fit(X_train, y_train)
print(logModel_grid.best_estimator_)

However, I get an error I don't understand: ValueError: could not convert string to float. It is raised on the logModel_grid.fit(X_train, y_train) line, yet I didn't get it when fitting the logistic model before the grid search. Do you know why running GridSearchCV causes this error?

You need to include the vectorizer in the estimator.
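That is exactly where the error comes from: your grid search wraps a bare `LogisticRegression`, so the raw sentences reach the model without ever going through `TfidfVectorizer`, and scikit-learn fails while trying to cast the strings to numbers. A minimal sketch reproducing this, with made-up sentences rather than your dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Raw text passed straight to the model, as happens when GridSearchCV
# is given LogisticRegression alone instead of the full pipeline
X_raw = np.array(["a first sentence", "a second sentence"]).reshape(-1, 1)
y = [0, 1]

try:
    LogisticRegression().fit(X_raw, y)
    msg = ""
except ValueError as err:
    msg = str(err)

print(msg)  # the familiar "could not convert string to float" error
```

Your first fit worked only because `clf` was a pipeline that vectorized the text before the model saw it.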

Assuming you proceed like this:

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from nltk.tokenize import word_tokenize
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('https://raw.githubusercontent.com/cpedroni/DMML2021_Microsoft/main/data/training_data.csv')
X_train, X_test, y_train, y_test = train_test_split(df['sentence'], df['difficulty'],
                                                    test_size=0.2, random_state=30,
                                                    stratify=df['difficulty'])

We use the same pipeline as in the first part of your code:

tfidf_params = dict(sublinear_tf=True,
                    min_df=4,
                    norm='l2',
                    ngram_range=(1, 4),
                    tokenizer=word_tokenize)
pipe = Pipeline(steps=[
    ('features', TfidfVectorizer(**tfidf_params)),
    ('model', LogisticRegression())
])

Define the parameters, prefixing each one with the pipeline step name and a double underscore (model__):

param_grid_lr = {
    'model__max_iter': [20, 50],
    'model__solver': ['newton-cg', 'lbfgs'],
    'model__class_weight': ['balanced']
}
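If you're ever unsure which names are valid, the pipeline itself can list every tunable parameter under that `<step>__<param>` convention. A quick sketch, using a default-configured pipeline for brevity:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline(steps=[
    ('features', TfidfVectorizer()),
    ('model', LogisticRegression())
])

# get_params() exposes every tunable parameter, prefixed by its step name
names = sorted(pipe.get_params().keys())
print('model__max_iter' in names)        # True
print('features__ngram_range' in names)  # True
```

Anything in that list — including TfidfVectorizer settings like `features__min_df` — can go into the grid.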

And fit:

logModel_grid = GridSearchCV(pipe, param_grid=param_grid_lr, verbose=1, cv=10, n_jobs=-1)
logModel_grid.fit(X_train, y_train)
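After fitting, `best_params_` and `best_score_` hold the winning combination, and the fitted GridSearchCV object predicts like any estimator. Here is a runnable end-to-end sketch on a tiny made-up corpus (not your dataset, and with a smaller hypothetical grid so it runs quickly without NLTK):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny synthetic corpus, just to demonstrate the mechanics
X = ["good movie", "great film", "nice plot",
     "bad movie", "awful film", "poor plot"] * 4
y = [1, 1, 1, 0, 0, 0] * 4

pipe = Pipeline(steps=[
    ('features', TfidfVectorizer()),
    ('model', LogisticRegression())
])
param_grid = {
    'model__C': [0.1, 1.0],
    'model__class_weight': [None, 'balanced']
}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=2)
grid.fit(X, y)

print(grid.best_params_)            # the winning parameter combination
print(grid.best_score_)             # mean cross-validated accuracy of the best model
print(grid.predict(["great movie"]))  # the fitted grid predicts directly
```

Because the vectorizer lives inside the pipeline, it is re-fit on each cross-validation split, which also avoids leaking vocabulary from the validation folds.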