How do I match the fit_transform output of an imputed dataset to the original dataset when handling missing data for an ML model?



When I try to fill in missing values with the KNN imputer using the following lines of code:

pd.DataFrame(knn_imputer.fit_transform(data),
index=data.index,
columns=data.columns)

I get this error message:

Traceback (most recent call last):
File "c:UsersmynameDesktopProjectPythonToolcalculatordatabase-analyzerdatabase_analyzer.py", line 384, in <module>
main()
File "c:UsersmynameDesktopProjectPythonToolcalculatordatabase-analyzerdatabase_analyzer.py", line 232, in main
train_data_engineered = missingvalue_handler(train_data_engineered)
File "c:UsersmynameDesktopProjectPythonToolcalculatordatabase-analyzerutilities_module.py", line 1268, in missingvalue_handler
return pd.DataFrame(knn_imputer.fit_transform(new_data),
File "C:ProgramDataAnaconda3envstflibsite-packagespandascoreframe.py", line 695, in __init__
mgr = ndarray_to_mgr(
File "C:ProgramDataAnaconda3envstflibsite-packagespandascoreinternalsconstruction.py", line 351, in ndarray_to_mgr    
_check_values_indices_shape_match(values, index, columns)
File "C:ProgramDataAnaconda3envstflibsite-packagespandascoreinternalsconstruction.py", line 422, in _check_values_indices_shape_match
raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (196, 1032), indices imply (196, 1033)

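I know this happens because the imputer actually dropped a column, reducing the count from 1033 to 1032. How can I fix this without knowing which column was dropped?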

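I actually figured it out. I don't need to know the exact column name. The imputer discards columns that contain only missing values, so I made the following change to make sure the number of columns in the imputed array matches the number of column labels passed in when building the pandas DataFrame from the imputed dataset.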

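# Before: the labels still include the all-NaN column that the imputer dropped
pd.DataFrame(knn_imputer.fit_transform(data),
             index=data.index,
             columns=data.columns)

# After: drop all-NaN columns from the labels so they match the imputed array
pd.DataFrame(knn_imputer.fit_transform(data),
             index=data.index,
             columns=data.dropna(axis=1, how='all').columns)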
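
For anyone who wants to reproduce this, here is a minimal sketch of the same fix on a toy dataset. The column names, values, and `n_neighbors=2` are made up for illustration and are not from my project. It assumes the imputer's default behaviour of dropping columns that are entirely NaN during fit, and it also shows one way to list which columns were dropped.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: column "b" is entirely NaN, so the imputer drops it.
data = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan, 4.0],
        "b": [np.nan, np.nan, np.nan, np.nan],
        "c": [10.0, 20.0, 30.0, 40.0],
    },
    index=["w", "x", "y", "z"],
)

knn_imputer = KNNImputer(n_neighbors=2)
imputed_values = knn_imputer.fit_transform(data)  # NumPy array with one column fewer

# Labels of the columns that have at least one observed value.
kept_columns = data.dropna(axis=1, how="all").columns
# Labels of the columns that were entirely NaN (the ones the imputer dropped).
dropped_columns = data.columns.difference(kept_columns)

imputed = pd.DataFrame(imputed_values, index=data.index, columns=kept_columns)

print("original shape:", data.shape)
print("imputed shape:", imputed.shape)
print("dropped columns:", list(dropped_columns))
```

Because the imputer preserves the order of the remaining columns, the labels from `dropna(axis=1, how="all")` line up with the columns of the imputed array.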
