Getting the same results as an sklearn Pipeline without using one

How do I standardize data correctly without using a pipeline? I just want to make sure my code is correct and that there is no data leakage.

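So, if I standardize the whole dataset once at the start of my project and then go on to try different CV tests with different ML algorithms, is that the same as creating an sklearn Pipeline that performs the same standardization together with each ML algorithm?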

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

y = df['y']
X = df.drop(columns=['y', 'Date'])
# Standardize the whole dataset once, up front
scaler = StandardScaler().fit(X)
X_transformed = scaler.transform(X)
clf1 = DecisionTreeClassifier()
clf1.fit(X_transformed, y)
clf2 = SVC()
clf2.fit(X_transformed, y)
#### Is this the same as the code below? ####
steps1 = []
steps1.append(('standardize', StandardScaler()))
steps1.append(('clf1', DecisionTreeClassifier()))
pipeline1 = Pipeline(steps1)  # a plain list of steps has no .fit(); wrap it in Pipeline
pipeline1.fit(X_transformed, y)
steps2 = []
steps2.append(('standardize', StandardScaler()))
steps2.append(('clf2', SVC()))
pipeline2 = Pipeline(steps2)
pipeline2.fit(X_transformed, y)

Why would anyone choose the latter, other than personal preference?

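They are the same. You may prefer one or the other from a maintainability standpoint, but the resulting test-set predictions will be identical.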
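For the cross-validation case raised in the question, here is a minimal sketch of reproducing by hand what a Pipeline does inside `cross_val_score`: the scaler is fitted on each training fold only and then applied to the held-out fold. The synthetic data from `make_classification`, the 5-fold `KFold` split, and the choice of `SVC` are assumptions made for illustration, not part of the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the question's df (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Pipeline version: cross_val_score refits the scaler on each training fold.
pipe = Pipeline([("standardize", StandardScaler()), ("clf", SVC())])
pipe_scores = cross_val_score(pipe, X, y, cv=cv)

# Manual version: same splits, scaler fitted on the training fold only.
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    scaler = StandardScaler().fit(X[train_idx])
    clf = SVC().fit(scaler.transform(X[train_idx]), y[train_idx])
    manual_scores.append(clf.score(scaler.transform(X[test_idx]), y[test_idx]))
manual_scores = np.array(manual_scores)

# Compare the per-fold accuracies of the two approaches.
print(np.allclose(pipe_scores, manual_scores))
```

Written this way, the held-out fold never contributes to the scaler's mean and variance, which is the step a Pipeline handles automatically.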

Edit: note that this only holds because StandardScaler is idempotent. Oddly, you are fitting the pipelines on data that has already been scaled...
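The idempotence mentioned here can be checked with a small sketch: standardizing data that is already standardized should leave it unchanged, up to floating-point error. The random array below is made-up data used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data with a non-zero mean and non-unit variance.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(100, 4))

# Scale once, then fit a fresh scaler on the already-scaled data.
once = StandardScaler().fit_transform(X)
twice = StandardScaler().fit_transform(once)

# The second pass sees columns with mean ~0 and std ~1, so it changes nothing.
print(np.allclose(once, twice))
```

This is why fitting a pipeline that contains a StandardScaler on `X_transformed` gives the same model inputs as fitting the bare classifier on `X_transformed`.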
