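I am trying to apply standardization and then impute missing values with KNN. After that I want to inverse-transform the values, because I will apply other transformations that need the data on its original scale. Is this possible inside a scikit-learn pipeline? Whatever I try, I get an error.
Note: the inverse transform must happen inside the pipeline, not after the pipeline has finished.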
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
ss = StandardScaler()
imputer = KNNImputer(n_neighbors=3, add_indicator=False)
ohe = OneHotEncoder()
df_example = pd.DataFrame(data={
    "num1": [1, 2, 3, np.nan, 6, 6, 9, 4, 5],
    "num2": [4, np.nan, 6, 5, 3, 8, 2, 8, 3],
    "cat1": ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'A', 'B'],
})
list_numeric_vars = ["num1", "num2"]
list_cat_vars = ["cat1"]
pipeline_num = Pipeline([
    ("standardizer", ss),
    ("imputer", imputer),
    ("standardizer_inverse", FunctionTransformer(ss.inverse_transform)),
])
pipeline_cat = Pipeline([
    ("ohe", ohe),
])
ct = ColumnTransformer(
    transformers=[
        ("pipeline_num", pipeline_num, list_numeric_vars),
        ("pipeline_cat", pipeline_cat, list_cat_vars),
    ],
    remainder="drop",
)
ct.fit(df_example) # Error
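StandardScaler is a per-column affine map, and KNNImputer fills each gap with the mean of the n nearest neighbours' values, so the scaling passes through the mean and is undone by the inverse step. Running standardizer >> imputer >> inverse_standardizer therefore gives the same result as running imputer alone, provided scaling does not change which neighbours are selected. That holds here: with two numeric columns, each incomplete row has a single observed column to compute distances from, and rescaling one column does not reorder those distances. With more columns, scaling can change the selected neighbours and the two results can differ.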
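The cancellation can be written out for a single imputed entry. This is a short derivation, assuming the same k neighbours are chosen with and without scaling; the symbols are introduced here for illustration: x_i are the neighbours' observed values in the column, and μ and σ are the column mean and standard deviation learned by the scaler.

```latex
\hat{x}
= \sigma \left( \frac{1}{k} \sum_{i=1}^{k} \frac{x_i - \mu}{\sigma} \right) + \mu
= \frac{1}{k} \sum_{i=1}^{k} (x_i - \mu) + \mu
= \frac{1}{k} \sum_{i=1}^{k} x_i
```

The last expression is the value the imputer would produce on the unscaled data.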
You can simplify your numeric pipeline as follows:
pipeline_num = Pipeline([
    ("imputer", imputer),
    # Add other processing steps here
])
Here is a "proof" that the imputation step alone produces the same result:
df1 = ss.fit_transform(df_example[list_numeric_vars])
df1 = imputer.fit_transform(df1)
df1 = ss.inverse_transform(df1)
print(f'Scale/Impute/Inverse-Scale:\n{df1}\n')
df2 = imputer.fit_transform(df_example[list_numeric_vars])
print(f'Impute Only:\n{df2}\n')
The output is:
Scale/Impute/Inverse-Scale:
[[1. 4. ]
[2. 6. ]
[3. 6. ]
[3.33333333 5. ]
[6. 3. ]
[6. 8. ]
[9. 2. ]
[4. 8. ]
[5. 3. ]]
Impute Only:
[[1. 4. ]
[2. 6. ]
[3. 6. ]
[3.33333333 5. ]
[6. 3. ]
[6. 8. ]
[9. 2. ]
[4. 8. ]
[5. 3. ]]
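If the scale / impute / inverse-scale round trip is still wanted as a step inside the pipeline, one option is a small custom transformer that owns its scaler and imputer. The original attempt most likely fails because ColumnTransformer fits clones of its transformers, so FunctionTransformer(ss.inverse_transform) stays bound to a scaler that was never fitted. The class name ScaledKNNImputer below is hypothetical, not part of scikit-learn; treat this as a minimal sketch rather than a definitive implementation.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ScaledKNNImputer(TransformerMixin, BaseEstimator):
    """Hypothetical helper: scale, KNN-impute, then undo the scaling."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        # The fitted scaler and imputer are created here, on this instance,
        # so cloning the unfitted transformer cannot detach them.
        self.scaler_ = StandardScaler().fit(X)  # NaNs are ignored when fitting
        self.imputer_ = KNNImputer(n_neighbors=self.n_neighbors)
        self.imputer_.fit(self.scaler_.transform(X))
        return self

    def transform(self, X):
        X_scaled = self.scaler_.transform(X)          # NaNs are kept as NaN
        X_filled = self.imputer_.transform(X_scaled)  # gaps filled on the scaled data
        return self.scaler_.inverse_transform(X_filled)  # back to original units


# Same numeric columns as in the question
df = pd.DataFrame({
    "num1": [1, 2, 3, np.nan, 6, 6, 9, 4, 5],
    "num2": [4, np.nan, 6, 5, 3, 8, 2, 8, 3],
})
pipeline_num = Pipeline([
    ("scaled_knn", ScaledKNNImputer(n_neighbors=3)),
    # Add other processing steps here
])
result = pipeline_num.fit_transform(df)
print(result)
```

Because the step is an ordinary estimator, this pipeline_num can be passed to the ColumnTransformer in place of the three-step version.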