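I'm new to the ML world and I'm trying to learn preprocessing.
I have an outcome variable that takes five values: 0, 1, 2, 3, 4.
0 means no disease, and 1 to 4 denote different types of disease.
I want to binarize it into two classes: 0 for "no heart disease", and 1-4 for "heart disease".
My code: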
binarize_outcome['Outcome']=pd.cut(outcome_variable['Outcome'], bins=[0,1,4], labels=["no heart disease","heart diseases"])
binarize_outcome
Output:
0 NaN
1 no heart disease
2 no heart disease
3 NaN
4 NaN
...
299 no heart disease
300 no heart disease
301 no heart disease
302 NaN
Outcome 0 NaN
1 heart disease...
Name: Outcome, Length: 304, dtype: object
As you can see, this is not the output I expected: my code labels 0 as NaN and mislabels the rest.
I hope you can help me fix this.
Thanks in advance, Art
Your condition is binary, so you can use np.where from numpy:
>>> import numpy as np
>>> df
Type
0 2
1 2
2 3
3 0
4 2
.. ...
95 2
96 4
97 0
98 0
99 1
[100 rows x 1 columns]
>>> df["Outcome"] = np.where(df["Type"] == 0, "no heart disease", "heart disease")
>>> df
Type Outcome
0 2 heart disease
1 2 heart disease
2 3 heart disease
3 0 no heart disease
4 2 heart disease
.. ... ...
95 2 heart disease
96 4 heart disease
97 0 no heart disease
98 0 no heart disease
99 1 heart disease
[100 rows x 2 columns]
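For anyone who wants to run this without the original data, here is a minimal self-contained sketch of the same idea. The six sample values and the extra `HasDisease` column are my own assumptions for illustration; only the `Type` and `Outcome` names come from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column name "Type" follows the example above.
df = pd.DataFrame({"Type": [2, 0, 3, 0, 1, 4]})

# Compare a single column (a Series), so np.where returns a 1-D array
# with exactly one label per row.
df["Outcome"] = np.where(df["Type"] == 0, "no heart disease", "heart disease")

# Optional 0/1 encoding (assumed column name), often handier as a model target.
df["HasDisease"] = (df["Type"] > 0).astype(int)

print(df)
```

The string labels are convenient for reading and plotting, while the integer column is probably what you would feed to a classifier.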
Or with pd.cut from pandas:
>>> df["Outcome"] = pd.cut(df["Type"], [0, 0.9999999, 4],
...                        labels=["no heart disease", "heart disease"],
...                        include_lowest=True)
>>> df
Type Outcome
0 2 heart disease
1 2 heart disease
2 3 heart disease
3 0 no heart disease
4 2 heart disease
.. ... ...
95 2 heart disease
96 4 heart disease
97 0 no heart disease
98 0 no heart disease
99 1 heart disease
[100 rows x 2 columns]
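As for why the code in the question produced NaN: by default pd.cut builds intervals that are open on the left and closed on the right, so bins=[0, 1, 4] means (0, 1] and (1, 4]. The value 0 falls outside both, and 1 ends up in the first bin. The sketch below compares those bins with shifted integer edges; the five-value Series and the variable names are assumptions made up for the demonstration.

```python
import pandas as pd

# Hypothetical input: one row for each possible code.
values = pd.Series([0, 1, 2, 3, 4])
labels = ["no heart disease", "heart disease"]

# Bins from the question: intervals (0, 1] and (1, 4].
# 0 lies outside both intervals, and 1 lands in the first one.
original = pd.cut(values, bins=[0, 1, 4], labels=labels)

# Shifted edges: intervals (-1, 0] and (0, 4], so only 0 is "no heart disease".
shifted = pd.cut(values, bins=[-1, 0, 4], labels=labels)

print(pd.DataFrame({"value": values, "original": original, "shifted": shifted}))
```

Using -1 as the lowest edge is one alternative to the 0.9999999 trick with include_lowest=True, assuming the codes are always non-negative integers.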
Or with pd.IntervalIndex.from_breaks:
>>> interval = pd.IntervalIndex.from_breaks([0, 1, 5], closed="left")
>>> df["Outcome"] = (pd.cut(df["Type"], interval)
...                  .cat.rename_categories(["no heart disease", "heart disease"]))
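Whichever variant you pick, it is worth checking that every original code ended up under the label you intended. A possible sanity check, sketched here with made-up sample data and a hypothetical `check` variable, is to cross-tabulate the codes against the new labels:

```python
import pandas as pd

# Hypothetical sample data covering every code from 0 to 4.
df = pd.DataFrame({"Type": [0, 1, 2, 3, 4, 0, 2]})

# Left-closed intervals [0, 1) and [1, 5): 0 falls in the first, 1-4 in the second.
interval = pd.IntervalIndex.from_breaks([0, 1, 5], closed="left")
df["Outcome"] = (
    pd.cut(df["Type"], interval)
    .cat.rename_categories(["no heart disease", "heart disease"])
)

# Cross-tabulate codes against labels; the labels are cast to plain strings
# so the table lists only the labels that actually occur.
check = pd.crosstab(df["Type"], df["Outcome"].astype(str))
print(check)
```

In the resulting table, row 0 should have counts only under "no heart disease" and rows 1 to 4 only under "heart disease".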