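I have written several steps to impute a dataset, and I would like to pickle/save these steps so that they can be loaded and applied automatically when I analyse new samples.
The steps I use for imputation are: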
import pandas as pd
from missingpy import MissForest  # assumption: MissForest comes from the missingpy package

# `data` is the original DataFrame containing missing values
imputer = MissForest()
imputed_data = imputer.fit_transform(data)
imputed_data = pd.DataFrame(imputed_data, columns=data.columns)

# Drop 'id'
imputed_data_initial = imputed_data.drop('id', axis=1)

# Get the unique observed (non-missing) values of a column
def get_unique_values(col_name):
    return data[col_name].dropna().unique().tolist()

# Find the observed value closest to the imputed one (squared distance)
def find_closest_value(target, unique_values):
    chosen = unique_values[0]
    L2 = (target - chosen) ** 2
    for value in unique_values:
        if (target - value) ** 2 < L2:
            chosen = value
            L2 = (target - chosen) ** 2
    return chosen

# Imputation: snap every imputed value to the nearest observed value
columns_name_lst = imputed_data.columns  # must be defined before the loop
row_count = len(imputed_data)
for col_name in columns_name_lst:
    unique_values = get_unique_values(col_name)
    if 0 < len(unique_values) < 2000:
        for i in range(row_count):
            target = imputed_data.at[i, col_name]
            # .at writes into the frame itself; chained iloc[i][col_name] may write to a copy
            imputed_data.at[i, col_name] = find_closest_value(target, unique_values)
I would like to handle all of these steps as a single unit. What can I do in Python? Thanks!
You can use sklearn.pipeline.Pipeline. Create a list and pass each step as a (name, transformer) tuple. It is described here: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
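As a rough sketch of how the imputation and the "snap to nearest observed value" step could be wrapped for a Pipeline: the class name SnapImputer and its parameters are hypothetical (not part of scikit-learn), and SimpleImputer stands in for MissForest only so that the example runs with scikit-learn alone. The unique values are recorded inside the same transformer because a later pipeline step would only see already-imputed data. It assumes numeric columns that each have at least one observed value.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class SnapImputer(TransformerMixin, BaseEstimator):
    """Impute, then snap each value to the nearest value observed during fit.

    Hypothetical helper for illustration; not a scikit-learn class.
    """

    def __init__(self, imputer=None, max_unique=2000):
        self.imputer = imputer
        self.max_unique = max_unique

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Remember the observed (non-missing) values of every column.
        self.unique_values_ = [np.unique(col[~np.isnan(col)]) for col in X.T]
        base = self.imputer if self.imputer is not None else SimpleImputer()
        self.imputer_ = clone(base).fit(X)
        return self

    def transform(self, X):
        X = self.imputer_.transform(np.asarray(X, dtype=float))
        for j, values in enumerate(self.unique_values_):
            if 0 < len(values) < self.max_unique:
                # For every row, index of the closest observed value in column j.
                nearest = np.abs(X[:, [j]] - values).argmin(axis=1)
                X[:, j] = values[nearest]
        return X


# SimpleImputer is a stand-in; an estimator such as MissForest could be passed instead.
pipe = Pipeline([("impute_and_snap", SnapImputer(SimpleImputer(strategy="mean")))])

train = np.array([[1.0, 10.0], [2.0, np.nan], [4.0, 30.0], [np.nan, 30.0]])
result = pipe.fit_transform(train)

# A new sample is imputed and snapped using what was learned from `train`.
new_result = pipe.transform(np.array([[np.nan, 12.0]]))
```

Note that, like the loop in the question, this snaps every value in a column, so observed values in new samples are also moved to the nearest value seen during fitting.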
Saving your pipeline can be a little tricky. Have a look at this question: How to save a scikit-learn pipeline with a Keras regressor inside to disk?
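When the pipeline contains only picklable scikit-learn style steps (no Keras model inside), the standard pickle module is often enough. The following is a minimal sketch under that assumption; the two steps, the toy data, and the file name imputation_pipeline.pkl are illustrative choices, not taken from the question.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative pipeline; any picklable steps could be used here.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
pipe.fit(train)

# Save the fitted pipeline to disk (a temporary directory in this sketch).
path = Path(tempfile.mkdtemp()) / "imputation_pipeline.pkl"
with open(path, "wb") as f:
    pickle.dump(pipe, f)

# Later, e.g. in another script: load it and apply it to new samples.
with open(path, "rb") as f:
    loaded = pickle.load(f)

new_sample = np.array([[np.nan, 20.0]])
restored = loaded.transform(new_sample)
```

If the pipeline includes a custom transformer class, that class has to be importable in the process that loads the file, and loading is generally only reliable with the same library versions that were used for saving.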