How do KMeans and LogisticRegression interact with the MNIST dataset inside a Pipeline?



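I am working through the book "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow". One of its classification approaches for the MNIST dataset uses KMeans as a preprocessing step before a LogisticRegression model performs the classification.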

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)
pipeline = Pipeline([
    ("kmeans", KMeans(random_state=42)),
    ("log_reg", LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000, random_state=42)),
])
param_grid = dict(kmeans__n_clusters=range(45, 50))
grid_clf = GridSearchCV(pipeline, param_grid, cv=3, verbose=2)
grid_clf.fit(X_train, y_train)
predict = grid_clf.predict(X_test)

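The output of grid_clf.predict(X_test) is the original digits (0-9), not the clusters created by the KMeans step in the pipeline. My question is: how does grid_clf.predict() tie its predictions back to the original labels of the dataset?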

Setting the grid search aside, the code

pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=45)),
    ("log_reg", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

is equivalent to:

kmeans = KMeans(n_clusters=45)
log_reg = LogisticRegression()
new_X_train = kmeans.fit_transform(X_train)
log_reg.fit(new_X_train, y_train) 

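So KMeans is used only to transform the training data. The original data with 64 features is converted into data with 45 features, where each feature is the distance from the data point to one of the 45 cluster centers. KMeans ignores y_train entirely; the cluster indices are never used as labels. The transformed data is then used, together with the original labels, to fit the LogisticRegression, which is why the predictions are digits rather than cluster IDs.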
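To make the transformation concrete, here is a minimal sketch that inspects the output of the KMeans step on its own. The explicit n_init=10 and random_state=42 are my own choices for reproducibility, not part of the code above, and the comparison against np.linalg.norm is only a sanity check on the first training sample.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)

# n_init and random_state are assumptions added for reproducibility.
kmeans = KMeans(n_clusters=45, n_init=10, random_state=42)

# fit_transform clusters X_train and returns, for every sample,
# its distance to each of the 45 cluster centers. No labels are involved.
new_X_train = kmeans.fit_transform(X_train)

print(X_train.shape[1])      # number of original features (pixels)
print(new_X_train.shape[1])  # number of new features (one per cluster)

# Sanity check: row 0 of the transformed data should match the Euclidean
# distances from the first sample to all cluster centers.
manual_distances = np.linalg.norm(X_train[0] - kmeans.cluster_centers_, axis=1)
print(np.allclose(new_X_train[0], manual_distances))
```

Each row of new_X_train is therefore just a different numeric description of the same image, and it still lines up with the same entry of y_train.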

Prediction works the same way: the test data is first transformed by KMeans, and the LogisticRegression then predicts labels from the transformed data. So instead of

predict = pipeline.predict(X_test)

you could use:

predict = log_reg.predict(kmeans.transform(X_test))
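As a rough check of this equivalence, the sketch below fits the pipeline once and then replays the prediction by hand using the pipeline's own fitted steps (accessed through named_steps). The hyperparameters n_init=10, random_state=42 and max_iter=5000 are assumptions I picked so the example runs reproducibly; they are not required for the equivalence to hold.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)

# Hyperparameters here are illustrative assumptions, not tuned values.
pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=45, n_init=10, random_state=42)),
    ("log_reg", LogisticRegression(max_iter=5000)),
])
pipeline.fit(X_train, y_train)

# Pull out the fitted steps so the prediction can be replayed manually.
kmeans = pipeline.named_steps["kmeans"]
log_reg = pipeline.named_steps["log_reg"]

pipeline_pred = pipeline.predict(X_test)
manual_pred = log_reg.predict(kmeans.transform(X_test))

# Both routes go through the same fitted objects, so they should agree.
print(np.array_equal(pipeline_pred, manual_pred))

# The classifier learned its classes from y_train, i.e. the digit labels.
print(log_reg.classes_)
```

The classes_ attribute of the final step shows where the digit labels come from: the classifier was fitted on y_train, so it can only ever output values it saw there.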
