Can an sklearn pipeline be used to balance and split a dataset?



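I haven't been able to find any tutorial, guide, or sample code that performs dataset splitting and balancing as part of an sklearn pipeline. Is this possible?

I have something like this: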

from sklearn.model_selection import train_test_split 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler 
### can this be part of the pipeline?
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['target']),  # features only; leaving 'target' in X would leak the label
    df['target'].values, stratify=df['target'].values, test_size=0.7, random_state=42)
###:end can this be part of the pipeline?
pipeline = Pipeline([ 
# is there a splitter or balancer class that can be added to the pipeline here?
('scaler', StandardScaler()), 
('K Nearest Neighbor', KNeighborsClassifier(n_neighbors=4)) 
]) 
pipeline.fit(X_train, y_train)

Would it be possible to use a pipeline like this instead?

pipeline = Pipeline([ 
('balancer', Balancer()), # is there some magical Balancer() class somewhere?
('splitter', Splitter()), # is there some magical Splitter() class somewhere?
('scaler', StandardScaler()), 
('K Nearest Neighbor', KNeighborsClassifier(n_neighbors=4)) 
]) 

Thanks for your time 🙏


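The purpose of a Pipeline object is to assemble a fixed sequence of data-processing steps and a final estimator into a single object.

A Pipeline only transforms the observations, usually denoted X. Steps that also involve the target (usually denoted y), as splitting and balancing both do, cannot be part of a Pipeline.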
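
In practice this means the split and the balancing happen before the pipeline is fitted. Below is a minimal sketch of one way to do that with scikit-learn alone. The synthetic data from make_classification, the 30% test size, and the choice of downsampling with sklearn.utils.resample are all assumptions made for illustration; they are not taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the question's df (assumption).
X, y = make_classification(n_samples=1000, n_features=5,
                           weights=[0.9, 0.1], random_state=42)

# 1. Split first, outside the pipeline, so the test set keeps the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42)

# 2. Balance the training set only: downsample each class to the size of the rarest one.
classes, counts = np.unique(y_train, return_counts=True)
n_min = counts.min()
parts = [
    resample(X_train[y_train == c], y_train[y_train == c],
             replace=False, n_samples=n_min, random_state=42)
    for c in classes
]
X_bal = np.vstack([p[0] for p in parts])
y_bal = np.concatenate([p[1] for p in parts])

# 3. The pipeline itself only contains steps that transform X, plus the final estimator.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('K Nearest Neighbor', KNeighborsClassifier(n_neighbors=4)),
])
pipeline.fit(X_bal, y_bal)
test_accuracy = pipeline.score(X_test, y_test)
print(test_accuracy)
```

Balancing only the training portion keeps the test set representative of the real class distribution. If a third-party dependency is acceptable, the imbalanced-learn package provides its own Pipeline class that accepts resampling steps; that route is not shown here.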


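Picking up on the comment about cross-validation: a Pipeline is indeed meant for cross-validating the data-processing steps together with the estimator, but the splitting is done around the Pipeline, not as part of the Pipeline object itself: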

from sklearn.model_selection import cross_validate 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler 

pipeline = Pipeline([ 
('scaler', StandardScaler()), 
('K Nearest Neighbor', KNeighborsClassifier(n_neighbors=4)) 
])
# X and y are the feature matrix and the target, e.g. df.drop(columns=['target']) and df['target']
cv_results = cross_validate(pipeline, X, y, cv=3)
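
If you want the stratification from the original train_test_split call to carry over, the splitter can be passed explicitly through the cv argument. The following is a rough, self-contained sketch; the synthetic data, the shuffle and random_state settings, and the shortened step name 'knn' are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (assumption).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=4)),
])

# The splitter lives outside the pipeline and is handed to cross_validate.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# For each fold, the whole pipeline (scaler and classifier) is refit on the
# training part only, so the scaler never sees the held-out samples.
cv_results = cross_validate(pipeline, X, y, cv=cv, return_train_score=True)
test_scores = cv_results['test_score']
print(test_scores)
```

cv_results is a dict of arrays with one entry per fold, so test_scores.mean() gives a single summary figure.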