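I'm fairly new to data mining, especially text analysis. I'm training a logistic regression model on my dataset and trying to get the best accuracy I can, at least around 0.6, but I can't seem to get above 0.5. Here is my dataset: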
df = pd.read_csv('https://raw.githubusercontent.com/cpedroni/DMML2021_Microsoft/main/data/training_data.csv')
df_pred = pd.read_csv('https://raw.githubusercontent.com/cpedroni/DMML2021_Microsoft/main/data/unlabelled_test_data.csv')
I train my model using a pipeline with TF-IDF:
tfidf_params = dict(sublinear_tf=True,
                    min_df=4,
                    norm='l2',
                    ngram_range=(1, 4),
                    tokenizer=word_tokenize)
clf = Pipeline(steps=[
    ('features', TfidfVectorizer(**tfidf_params)),
    ('model', LogisticRegression(random_state=0, solver='lbfgs', max_iter=300))
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def metrics(y_test, y_pred):
    # Micro-averaged precision and recall aggregate over all classes
    print("Precision: " + str(precision_score(y_test, y_pred, average='micro')))
    print("Recall: " + str(recall_score(y_test, y_pred, average='micro')))
    # Per-class F1 (one value per class); f1_score handles classes with no predictions
    print("F1: " + str(f1_score(y_test, y_pred, average=None)))
    print("Accuracy: " + str(accuracy_score(y_test, y_pred)))

metrics(y_test, y_pred)
I get an accuracy score of 0.471875, but I want it to be higher, so I tried a grid search like this:
from sklearn.model_selection import GridSearchCV
param_grid_lr = {
    'max_iter': [20, 50, 100, 200, 500, 1000],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'class_weight': ['balanced']
}
logModel_grid = GridSearchCV(estimator=LogisticRegression(random_state=1234), param_grid=param_grid_lr, verbose=1, cv=10, n_jobs=-1)
logModel_grid.fit(X_train, y_train)
print(logModel_grid.best_estimator_)
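However, I get an error I don't understand: ValueError: could not convert string to float. It is raised on the line logModel_grid.fit(X_train, y_train), but I did not get it when fitting the logistic model before the grid search. Do you know why using GridSearchCV causes this error?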
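You need to include the vectorizer in the estimator. In your grid search you pass a bare LogisticRegression, so the raw sentence strings reach the model without being turned into TF-IDF features first, and that is what raises the ValueError.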
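To make the cause concrete, here is a minimal sketch on made-up toy sentences (the texts, labels, and the helper fits_without_error are illustrative assumptions, not part of your dataset or of scikit-learn). It shows that a bare LogisticRegression rejects raw strings, while the same model wrapped in a pipeline with a TfidfVectorizer fits without complaint:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only
texts = np.array(
    ["le chat dort", "bonjour tout le monde", "il fait beau", "le chien mange"],
    dtype=object,
)
labels = ["A1", "A2", "A1", "A2"]


def fits_without_error(estimator, X, y):
    """Hypothetical helper: True if fit succeeds, False if it raises ValueError."""
    try:
        estimator.fit(X, y)
        return True
    except ValueError:
        return False


# Raw strings go straight into the model: this fails
bare_ok = fits_without_error(LogisticRegression(), texts, labels)

# The vectorizer converts strings to numeric features first: this works
pipe_ok = fits_without_error(
    Pipeline([("features", TfidfVectorizer()), ("model", LogisticRegression())]),
    texts,
    labels,
)

print("bare model fits:", bare_ok)
print("pipeline fits:", pipe_ok)
```

GridSearchCV simply calls fit on whatever estimator it is given, so handing it the whole pipeline rather than the model alone avoids the error.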
Suppose you set things up like this:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from nltk.tokenize import word_tokenize
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
df = pd.read_csv('https://raw.githubusercontent.com/cpedroni/DMML2021_Microsoft/main/data/training_data.csv')
# Stratify so each difficulty level keeps the same share in train and test
X_train, X_test, y_train, y_test = train_test_split(df['sentence'], df['difficulty'],
                                                    test_size=0.2, random_state=30,
                                                    stratify=df['difficulty'])
We use the same pipeline as in your first part:
tfidf_params = dict(sublinear_tf=True,
                    min_df=4,
                    norm='l2',
                    ngram_range=(1, 4),
                    tokenizer=word_tokenize)
pipe = Pipeline(steps=[
    ('features', TfidfVectorizer(**tfidf_params)),
    ('model', LogisticRegression())
])
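Define the parameters. Note the double underscore: each key is the pipeline step name, then two underscores, then the parameter name: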
param_grid_lr = {
    'model__max_iter': [20, 50],
    'model__solver': ['newton-cg', 'lbfgs'],
    'model__class_weight': ['balanced']
}
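If you are unsure which keys a grid may contain, one way to check is to list the pipeline's parameter names with get_params(). The sketch below rebuilds a pipeline with the same step names ('features' and 'model') but default settings, purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Same step names as above; default settings are an assumption for brevity
pipe = Pipeline(steps=[
    ('features', TfidfVectorizer()),
    ('model', LogisticRegression()),
])

# Every key returned here is a valid key for param_grid
valid_names = sorted(pipe.get_params().keys())

# Parameters of the logistic regression step, e.g. 'model__max_iter'
model_names = [name for name in valid_names if name.startswith('model__')]
# Parameters of the vectorizer step, e.g. 'features__ngram_range'
feature_names = [name for name in valid_names if name.startswith('features__')]

print(model_names)
print(feature_names)
```

A plain key such as 'max_iter' is not in that list, which is why the grid has to use the prefixed form once the model sits inside a pipeline.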
Fit:
logModel_grid = GridSearchCV(pipe, param_grid=param_grid_lr, verbose=1, cv=10, n_jobs=-1)
logModel_grid.fit(X_train, y_train)
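Once the search has run, you can read off the winning settings and use the refitted pipeline directly on raw text. The following is a rough, self-contained sketch on invented toy sentences (the data, the labels "easy" and "hard", the small grid, and cv=2 are all assumptions chosen so it runs quickly, not recommended values). It also shows that vectorizer settings can be searched in the same grid through the 'features__' prefix:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration only
texts = [
    "the cat sleeps", "a dog barks", "the bird sings",
    "a fish swims", "the cow eats", "a horse runs",
    "the committee deliberated extensively", "a paradigm shifts gradually",
    "the hypothesis remains unverified", "a consensus emerged reluctantly",
    "the legislation was subsequently repealed", "a precedent constrains interpretation",
]
labels = ["easy"] * 6 + ["hard"] * 6

pipe = Pipeline(steps=[
    ('features', TfidfVectorizer()),
    ('model', LogisticRegression()),
])

# Keys use the step-name prefix for both the model and the vectorizer
param_grid = {
    'features__ngram_range': [(1, 1), (1, 2)],
    'model__max_iter': [50, 200],
    'model__class_weight': ['balanced'],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=2)
search.fit(texts, labels)

# Best combination found and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)

# With the default refit=True, the search predicts with the best pipeline,
# so raw strings can be passed in directly
preds = search.predict(["the dog sleeps", "the committee remains unverified"])
print(preds)
```

On your real data you would call logModel_grid.best_params_ and logModel_grid.score(X_test, y_test) in the same way.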