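I have not been able to find any tutorial, guide, or example code that performs dataset splitting and balancing as part of an sklearn pipeline. Is this possible?
I have something like this: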
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
### can this be part of the pipeline?
X_train, X_test, y_train, y_test = train_test_split(
    df, df['target'].values, stratify=df['target'].values, test_size=0.7, random_state=42)
###:end can this be part of the pipeline?
pipeline = Pipeline([
# is there a splitter or balancer class that can be added to the pipeline here?
('scaler', StandardScaler()),
('K Nearest Neighbor', KNeighborsClassifier(n_neighbors=4))
])
pipeline.fit(X_train, y_train)
Would it be possible to use a pipeline like this instead?
pipeline = Pipeline([
('balancer', Balancer()), # is there some magical Balancer() class somewhere?
('splitter', Splitter()), # is there some magical Splitter() class somewhere?
('scaler', StandardScaler()),
('K Nearest Neighbor', KNeighborsClassifier(n_neighbors=4))
])
Thanks for your time 🙏
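No.
The purpose of the Pipeline object is to assemble a fixed sequence of data-processing steps and a final estimator.
A Pipeline object also only transforms the observed data, usually denoted X. Transformations that also involve the targets, usually denoted y, cannot be part of a pipeline. Splitting and balancing both add or remove rows, which changes y as well as X, so neither can be a pipeline step.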
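In practice that means the split and the balancing stay outside the pipeline. Here is a minimal sketch of one way to do that, assuming synthetic data from make_classification in place of the real DataFrame, and a hypothetical undersample helper (plain NumPy random undersampling, not an sklearn class) that is applied to the training portion only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the real DataFrame (assumption).
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=42)

# The split happens outside the pipeline.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42)


def undersample(X, y, random_state=0):
    """Hypothetical helper: randomly downsample every class to the rarest class size."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]


# Balance the training data only; the test set keeps its original distribution.
X_bal, y_bal = undersample(X_train, y_train)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=4)),
])
pipeline.fit(X_bal, y_bal)
test_accuracy = pipeline.score(X_test, y_test)
```

Leaving the test set unbalanced is a deliberate choice here, so the score reflects the class distribution the model would actually see.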
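To pick up on the comment about cross-validation: Pipeline is indeed meant for cross-validating the data-processing steps together with the estimator, but the cross-validation is not part of the Pipeline object itself: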
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([
('scaler', StandardScaler()),
('K Nearest Neighbor', KNeighborsClassifier(n_neighbors=4))
])
cv_results = cross_validate(pipeline, X, y, cv=3)
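The snippet above assumes X and y already exist. A self-contained version might look like the sketch below, where the synthetic data and the explicit StratifiedKFold splitter are my assumptions rather than part of the original answer. The splitter is passed to cross_validate and is not a pipeline step; on each fold the whole pipeline is refit on the training portion, so the scaler never sees the held-out rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real features and target (assumption).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=4)),
])

# Stratified folds keep the class proportions roughly equal in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
cv_results = cross_validate(pipeline, X, y, cv=cv)

# One test score per fold, plus their mean.
fold_scores = cv_results['test_score']
mean_score = fold_scores.mean()
```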