Can't seem to pass a pandas DataFrame into feature_engine.selection.DropHighPSI



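I can't get the code that computes PSI values to work, and I'm not very familiar with the feature_engine library or with ML-related work in general.

The code I'm currently trying to run is: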

import pandas as pd
from feature_engine.selection import DropHighPSIFeatures

def test():
    # oot_path, test_path, train_path and key_mapping_path are assumed to be defined elsewhere
    long_list = merge_into_df(oot_path, test_path, train_path, key_mapping_path)
    long_list.drop(columns=['Unnamed: 0_x', 'CLIENT_ID', 'SET'], inplace=True)
    long_list['REF_DATE'] = pd.to_datetime(long_list.REF_DATE)
    print(long_list.head())
    transformer = DropHighPSIFeatures(
        cut_off=pd.to_datetime("2019/09/30"),  # the cut_off date
        split_col='REF_DATE',  # the date variable
        strategy='equal_frequency',
        bins=8,
        threshold=0.1,
        missing_values='ignore'
    )
    transformer.fit_transform(long_list)
    return transformer.psi_values_

The error message I get back is:

Traceback (most recent call last):
  File "C:\Users\Dell\Pipeline\modelling.py", line 124, in <module>
    test()
  File "C:\Users\Dell\Pipeline\modelling.py", line 98, in test
  File "C:\ProgramData\Miniconda3\lib\site-packages\feature_engine\selection\drop_psi_features.py", line 364, in fit
    test_discrete = bucketer.transform(test_df[[feature]].dropna())
  File "C:\ProgramData\Miniconda3\lib\site-packages\feature_engine\discretisation\base_discretiser.py", line 74, in transform
    X = super().transform(X)
  File "C:\ProgramData\Miniconda3\lib\site-packages\feature_engine\base_transformers.py", line 146, in transform
    X = check_X(X)
  File "C:\ProgramData\Miniconda3\lib\site-packages\feature_engine\dataframe_checks.py", line 82, in check_X
    raise ValueError(
ValueError: 0 feature(s) (shape=(0, 1)) while a minimum of 1 is required.
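
Reading the traceback, the failing call is `bucketer.transform(test_df[[feature]].dropna())`, and `shape=(0, 1)` describes a one-column frame with zero rows. That suggests at least one feature has no non-null values left on one side of the cut-off once `dropna()` is applied. The following is a minimal pandas-only sketch for listing such columns. The helper name `find_empty_after_split` and the toy columns `A` and `B` are made up for illustration, and the `<=` / `>` split around the cut-off is an assumption about how the data is divided, not something taken from the feature_engine source.

```python
import numpy as np
import pandas as pd


def find_empty_after_split(df, split_col, cut_off):
    """Return numeric columns with no non-null values on either side of cut_off."""
    # Assumption: rows with split_col <= cut_off form one side, later rows the other.
    before = df[df[split_col] <= cut_off]
    after = df[df[split_col] > cut_off]
    features = df.select_dtypes(include="number").columns
    return [
        col for col in features
        if before[col].notna().sum() == 0 or after[col].notna().sum() == 0
    ]


# Toy data standing in for long_list; "A" and "B" are hypothetical feature names.
toy = pd.DataFrame({
    "REF_DATE": pd.to_datetime(
        ["2019-06-30", "2019-08-31", "2019-10-31", "2019-12-31"]
    ),
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [5.0, 6.0, np.nan, np.nan],  # no values after the cut-off
})
empty_cols = find_empty_after_split(toy, "REF_DATE", pd.to_datetime("2019/09/30"))
print(empty_cols)
```

On the real data the call would be something like `find_empty_after_split(long_list, "REF_DATE", pd.to_datetime("2019/09/30"))`; any column it returns is a candidate for the empty frame in the error.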

The output of the print(long_list.head()) statement in that code is:

   ID  TARGET  GROUP_ID  BRANCH_ID  ...  SON_4_12AY_7_12AY_EKOD_1  SON_4_12AY_7_12AY_EKOD_U  Unnamed: 0_y   REF_DATE
0   0       0         0       1020  ...                         0                         0             0 2016-12-31
1   2       0         0       2280  ...                         0                         0             2 2016-12-31
2   3       0         0       1150  ...                         0                         0             3 2016-12-31
3   4       1         0       1000  ...                         0                         0             4 2016-12-31
4   5       0         0       1090  ...                         0                         0             5 2016-12-31
[5 rows x 1976 columns]

So I don't think there is anything wrong with the DataFrame itself (apart from the Unnamed: 0_y column).

Just in case, though, this is how I build the DataFrame from three long-list CSV files and one key-mapping CSV file:

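import pandas as pd

def merge_into_df(oot_path, test_path, train_path, key_mapping_path):
    # Read the three long-list files and the key-mapping file
    train_df = pd.read_csv(train_path, low_memory=False)
    test_df = pd.read_csv(test_path, low_memory=False)
    oot_df = pd.read_csv(oot_path, low_memory=False)
    key_mapping_df = pd.read_csv(key_mapping_path)
    # Stack the three sets row-wise, then attach the key mapping on ID
    long_list_df = pd.concat([train_df, test_df, oot_df], axis=0)
    long_list_final_df = long_list_df.merge(key_mapping_df, on="ID", how="inner", sort=True)
    return long_list_final_df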

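It turned out the problem was caused by the DataFrame (long_list) either being too sparse (too many NaN values) or being too large. I haven't run experiments to work out which of the two it is, but once I dropped the columns containing a large number of NaN values, the problem went away.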
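
For reference, here is a minimal sketch of dropping columns with a large share of NaN values before fitting the transformer. The helper name `drop_sparse_columns`, the 0.5 threshold, and the toy column names are assumptions chosen for illustration; the post does not say which threshold or method was actually used.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_nan_fraction=0.5, keep=()):
    """Drop columns whose share of NaN values exceeds max_nan_fraction."""
    nan_fraction = df.isna().mean()  # per-column share of missing values
    to_drop = [
        col for col in df.columns
        if nan_fraction[col] > max_nan_fraction and col not in keep
    ]
    return df.drop(columns=to_drop), to_drop


# Toy data standing in for long_list; "dense" and "sparse" are hypothetical names.
toy = pd.DataFrame({
    "REF_DATE": pd.to_datetime(
        ["2019-06-30", "2019-08-31", "2019-10-31", "2019-12-31"]
    ),
    "dense": [1.0, 2.0, 3.0, 4.0],
    "sparse": [np.nan, np.nan, np.nan, 7.0],  # 75% missing
})
cleaned, dropped = drop_sparse_columns(toy, max_nan_fraction=0.5, keep=("REF_DATE",))
print(dropped)
print(list(cleaned.columns))
```

Passing the split column in `keep` guards against removing it by accident. A global NaN threshold is a blunt filter: a column can pass it and still be entirely empty on one side of the cut-off date, so it may be worth checking each side separately as well.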
