Imputing with Sklearn changes 'numerical' columns to 'object' (in addition to filling in the missing data)



Before imputing, I have numerical columns in "X_train": `numerical_cols = [col for col in X_train.columns if X_train[col].dtype in ['int64', 'float64']]`

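After imputing, the new dataframe "imputed_X_train_missing" has no numerical columns left: every column that used to be numerical is now 'object'. This is a potential problem when applying XGBRegressor.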

Here is my code:

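```
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Work on copies (the X_train_missing line is assumed; it is missing from the excerpt)
X_train_missing = X_train.copy()
X_valid_missing = X_valid.copy()
my_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
my_imputer.fit(X_train_missing)
# transform() returns a NumPy array, so wrap it back into a DataFrame
imputed_X_train_missing = pd.DataFrame(my_imputer.transform(X_train_missing))
imputed_X_valid_missing = pd.DataFrame(my_imputer.transform(X_valid_missing))
imputed_X_train_missing.columns = X_train_missing.columns
imputed_X_valid_missing.columns = X_valid_missing.columns
```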

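This might be considered treating the symptom rather than the cause, but you can convert the resulting dtypes back to numeric data types.

Using astype(): see "Pandas: convert dtype 'object' to int"

Using to_numeric(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html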
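As a rough illustration of the two options above, the sketch below imputes a small made-up frame and then casts the numeric columns back. The frame, its column names, and the variable names are hypothetical and not taken from the question; it assumes the original dtypes are still available from the un-imputed frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: one string column and two numeric columns.
df = pd.DataFrame(
    {"c1": ["a", "a", "b"], "c2": [2.0, np.nan, 5.0], "c3": [3, 6, 9]}
)

imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imputed = pd.DataFrame(imp.fit_transform(df), columns=df.columns, index=df.index)

# Columns that were numeric before imputation.
numeric_cols = df.select_dtypes(include="number").columns

# Option 1: astype() with the dtypes the numeric columns had before imputation.
restored = imputed.astype(df.dtypes[numeric_cols].to_dict())

# Option 2: pd.to_numeric() column by column, letting pandas infer the dtype.
converted = imputed.copy()
for col in numeric_cols:
    converted[col] = pd.to_numeric(converted[col])

print(restored.dtypes)
print(converted.dtypes)
```

Option 1 restores the exact original dtypes, while option 2 is handy when the original frame is no longer around and you only know which columns should be numeric.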

The problem is the imputer itself when one of the columns is 'object'. After imputation, all columns come out as 'object':

```
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_train = [['dddd', 2, 3], ['dddd', np.nan, 6], ['dddd', 5, 9]]
X_test = [[np.nan, 2, 3], ['dddd', np.nan, 6], ['dddd', np.nan, 9]]
col_names = ['c1', 'c2', 'c3']
df_x_train = pd.DataFrame(X_train, columns=col_names)
df_x_test = pd.DataFrame(X_test, columns=col_names)
print(df_x_train.info())
```

```
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   c1      3 non-null      object
 1   c2      2 non-null      float64
 2   c3      3 non-null      int64
```

```
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp.fit(df_x_train)
# The mixed-dtype frame is handled as a single object array
imputed_x_train = pd.DataFrame(imp.transform(df_x_train))
imputed_x_train.dtypes
```

Now every column comes out as object:

```
0    object
1    object
2    object
dtype: object
```
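One way to address the cause rather than the symptom is to keep the 'object' columns away from the numeric ones during imputation. The following is only a sketch under that assumption, not something proposed in the thread: it reuses the `df_x_train` example above, fits one `SimpleImputer` per group of columns, and glues the results back together. The names `num_cols`, `obj_cols`, `num_imp`, and `obj_imp` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df_x_train = pd.DataFrame(
    [["dddd", 2, 3], ["dddd", np.nan, 6], ["dddd", 5, 9]],
    columns=["c1", "c2", "c3"],
)

# Split the columns by dtype so the numeric block never becomes an object array.
num_cols = list(df_x_train.select_dtypes(include="number").columns)
obj_cols = [c for c in df_x_train.columns if c not in num_cols]

num_imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
obj_imp = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# Each imputer only sees columns of a compatible dtype.
num_part = pd.DataFrame(
    num_imp.fit_transform(df_x_train[num_cols]),
    columns=num_cols,
    index=df_x_train.index,
)
obj_part = pd.DataFrame(
    obj_imp.fit_transform(df_x_train[obj_cols]),
    columns=obj_cols,
    index=df_x_train.index,
)

# Reassemble in the original column order; per-column dtypes are preserved.
imputed_x_train = pd.concat([num_part, obj_part], axis=1)[df_x_train.columns]
print(imputed_x_train.dtypes)
```

With this layout the numeric columns come back as a numeric dtype (integers are upcast to float by the numeric imputer), and the same fitted imputers can be applied to a validation frame with `transform()` in place of `fit_transform()`.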
