Pandas - Replace NaN values with the most common value in the group



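I have a dataset with the following columns:

['sex', 'age', 'relationship_status']

The "relationship_status" column contains some NaN values, and I want to replace each of them with the most common value in its group, where groups are defined by age and sex.

I know how to group and count the values: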

df2.groupby(['age','sex'])['relationship_status'].value_counts()

It returns:

age   sex     relationship_status
17.0  female  Married with kids       1
18.0  female  In relationship         5
              Married                 4
              Single                  4
              Married with kids       2
      male    In relationship         9
              Single                  5
              Married                 4
              Married with kids       4
              Divorced                3
.
.
.
86.0  female  In relationship         1
92.0  male    Married                 1
97.0  male    In relationship         1

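So, to restate: whenever "relationship_status" is NaN, I need the program to replace it with the most common value for that person's age and sex.

Can anyone suggest how to do this?

Kind regards.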

Something like this:

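import numpy as np
# transform returns one value per row (aligned with df2's index), so fillna can match rows; all-NaN groups stay NaN
mode = df2.groupby(['age','sex'])['relationship_status'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
df2['relationship_status'] = df2['relationship_status'].fillna(mode)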
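If it helps to see the lookup step spelled out, here is a minimal sketch of one way to do it: compute one mode per (age, sex) group, join that table back onto the rows, and fill only the gaps. The sample data, the helper `first_mode`, and the name `group_mode` are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not the asker's dataset.
df2 = pd.DataFrame({
    'age': [18, 18, 18, 18, 30, 30],
    'sex': ['female', 'female', 'female', 'male', 'male', 'male'],
    'relationship_status': ['Single', 'Single', np.nan, np.nan, 'Married', np.nan],
})

def first_mode(s):
    """Return the most common non-null value, or NaN if the group has none."""
    m = s.mode()  # mode() ignores NaN by default
    return m.iloc[0] if not m.empty else np.nan

# One entry per (age, sex) group holding that group's most common status.
group_mode = (
    df2.groupby(['age', 'sex'])['relationship_status']
       .agg(first_mode)
       .rename('group_mode')
)

# Look up each row's group mode by its (age, sex) key, then fill only the gaps.
lookup = df2.join(group_mode, on=['age', 'sex'])['group_mode']
df2['relationship_status'] = df2['relationship_status'].fillna(lookup)

print(df2)
```

In this sketch the (18, male) group has no known status at all, so its row stays NaN rather than being filled with a guess.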

Check this out. It returns "ALL_NAN" when an (age, sex) subgroup contains only NaNs:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'age': [25, 25, 25, 25, 25, 25],
     'sex': ['F', 'F', 'F', 'M', 'M', 'M'],
     'status': ['married', np.nan, 'married', np.nan, np.nan, 'single']
    })

# Fill each missing status with the mode of its (age, sex) group; fall back to 'ALL_NAN' if the group has no non-null values
df.loc[df['status'].isna(), 'status'] = df.groupby(['age','sex'])['status'].transform(lambda x: x.mode()[0] if not x.mode().empty else 'ALL_NAN')

Output:

   age sex   status
0   25   F  married
1   25   F  married
2   25   F  married
3   25   M   single
4   25   M   single
5   25   M   single
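Note that the sample above never actually hits the fallback, because both groups contain at least one known status. A rough sketch that does exercise it, and also shows what happens on a tie, might look like this. The data and the helper name `group_mode` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the (30, 'M') group has no known status at all,
# and the (25, 'M') group is tied between 'married' and 'single'.
df = pd.DataFrame({
    'age': [25, 25, 25, 25, 25, 30, 30],
    'sex': ['F', 'F', 'M', 'M', 'M', 'M', 'M'],
    'status': ['married', np.nan, 'single', 'married', np.nan, np.nan, np.nan],
})

def group_mode(s):
    """Most common non-null value in the group, or 'ALL_NAN' if there is none."""
    m = s.mode()  # NaN is ignored by default, so an all-NaN group gives an empty result
    return m.iloc[0] if not m.empty else 'ALL_NAN'

# transform broadcasts the per-group value back to every row of that group.
fill = df.groupby(['age', 'sex'])['status'].transform(group_mode)
df['status'] = df['status'].fillna(fill)

print(df)
```

Here the two (30, 'M') rows end up as 'ALL_NAN'. For the tied (25, 'M') group, mode() returns its candidates in sorted order, so taking the first one picks 'married'. If ties should be resolved differently, that choice has to be made explicitly.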
