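I have a dataset with the following columns:
['sex', 'age', 'relationship_status']
The "relationship_status" column contains some NaN values, and I want to replace each of them with the most common value in its group, where groups are defined by age and sex.
I know how to group the data and count the values:
df2.groupby(['age','sex'])['relationship_status'].value_counts()
This returns: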
age   sex     relationship_status
17.0  female  Married with kids      1
18.0  female  In relationship        5
              Married                4
              Single                 4
              Married with kids      2
      male    In relationship        9
              Single                 5
              Married                4
              Married with kids      4
              Divorced               3
.
.
.
86.0  female  In relationship        1
92.0  male    Married                1
97.0  male    In relationship        1
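So, to restate: whenever "relationship_status" is missing, I need the program to replace it with the most common value for that person's age and sex.
Can anyone suggest how to do this?
Kind regards.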
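Something like this:
# transform (unlike agg) returns one value per row, aligned with df2's index, so fillna can match rows.
# Assumes every (age, sex) group has at least one non-NaN value; otherwise mode() is empty and .iloc[0] raises.
mode = df2.groupby(['age','sex'])['relationship_status'].transform(lambda x: x.mode().iloc[0])
df2['relationship_status'] = df2['relationship_status'].fillna(mode)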
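To see the idea end to end, here is a minimal sketch on made-up data (the rows are invented; only the column names come from the question). It is one possible variant rather than the only way to do it: it builds a small lookup table with one mode per (age, sex) group, joins it back onto the rows by key, and fills only the missing entries. The helper `first_mode` and the column name `group_mode` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample data using the column names from the question.
df2 = pd.DataFrame({
    "age": [18, 18, 18, 18, 18, 18, 18],
    "sex": ["female", "female", "female", "male", "male", "male", "male"],
    "relationship_status": ["Single", "Single", np.nan,
                            "Married", np.nan, "Married", "Single"],
})

def first_mode(s):
    """Return the most frequent non-null value, or NaN if there is none."""
    m = s.mode()  # mode() ignores NaN by default
    return m.iloc[0] if not m.empty else np.nan

# One row per (age, sex) group holding that group's most common status.
group_mode = (
    df2.groupby(["age", "sex"])["relationship_status"]
       .agg(first_mode)
       .rename("group_mode")
       .reset_index()
)

# A left merge keeps the rows of df2 in their original order,
# so the looked-up modes can be matched to rows by position.
merged = df2.merge(group_mode, on=["age", "sex"], how="left")

# Overwrite only the rows whose status is missing.
mask = df2["relationship_status"].isna()
df2.loc[mask, "relationship_status"] = merged.loc[mask.to_numpy(), "group_mode"].to_numpy()

print(df2)
```

In this sample the missing female row becomes "Single" and the missing male row becomes "Married". A group with no known status at all would simply stay NaN under this sketch.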
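Check this out. When an (age, sex) subgroup contains only NaNs, it returns "ALL_NAN":
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, 25, 25, 25, 25, 25],
    'sex': ['F', 'F', 'F', 'M', 'M', 'M'],
    'status': ['married', np.nan, 'married', np.nan, np.nan, 'single'],
})
# Per-group mode, broadcast to every row; mode() skips NaN, so it is empty for an all-NaN group.
group_mode = df.groupby(['age', 'sex'])['status'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else 'ALL_NAN')
df.loc[df['status'].isna(), 'status'] = group_mode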
Output:
   age sex   status
0   25   F  married
1   25   F  married
2   25   F  married
3   25   M   single
4   25   M   single
5   25   M   single
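Every subgroup in the sample above has at least one known status, so the "ALL_NAN" fallback never fires there. The sketch below uses hypothetical data in which the (30, 'M') subgroup is entirely NaN, to show the placeholder being applied. The helper name `mode_or_placeholder` is an assumption made for illustration, not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the (30, 'M') subgroup has no known status at all.
df = pd.DataFrame({
    "age": [25, 25, 25, 30, 30],
    "sex": ["F", "F", "F", "M", "M"],
    "status": ["married", np.nan, "married", np.nan, np.nan],
})

def mode_or_placeholder(s, placeholder="ALL_NAN"):
    """Most frequent non-null value of s, or the placeholder if s is all NaN."""
    m = s.mode()  # NaN is ignored by default, so m is empty for an all-NaN group
    return m.iloc[0] if not m.empty else placeholder

# transform broadcasts each group's result to every row of that group.
fill = df.groupby(["age", "sex"])["status"].transform(mode_or_placeholder)

# Only rows that are currently NaN take the fill value.
df["status"] = df["status"].fillna(fill)

print(df)
```

Whether a string placeholder is the right choice depends on what happens downstream; leaving such rows as NaN, or falling back to the overall mode of the column, are equally reasonable alternatives.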