删除基于价值的其他列行:熊猫groupby



假设我有以下数据帧:

Id       title                  field  parent_field            label
1   Hardwareentwickler (m/w)    4036     2172           Hardware-Entwicklung, Engineering
1   Hardwareentwickler (m/w)    2172     NaN            IT, EDV
1   Hardwareentwickler (m/w)    3081     NaN            Technik, Ingenieurwesen
1   Hardwareentwickler (m/w)    1000     NaN            Consultant
1   Hardwareentwickler (m/w)    5000    1000            IT Consultant
2   Accountant                  4321     NaN            Finanz
2   Accountant                  1234     NaN            Kostenrechner

我想做的是删除所有记录,其中field存在于列parent_field(按ID/标题分组)。

最后,我想得到以下结果:

Id       title                  field  parent_field            label
1   Hardwareentwickler (m/w)    4036     2172           Hardware-Entwicklung, Engineering
1   Hardwareentwickler (m/w)    3081     NaN            Technik, Ingenieurwesen
1   Hardwareentwickler (m/w)    5000    1000            IT Consultant
2   Accountant                  4321     NaN            Finanz
2   Accountant                  1234     NaN            Kostenrechner

,行

Id       title                  field  parent_field            label
1   Hardwareentwickler (m/w)    2172     NaN            IT, EDV
1   Hardwareentwickler (m/w)    1000     NaN            Consultant
删除

,因为field值在其他行(按id分组)的parent_field中表示

简单解决方法

将数据框按idtitleapply分组,使用lambda函数检查field是否出现在parent_field中以创建布尔值mask

mask = df.groupby(['Id', 'title'], group_keys=False)
.apply(lambda x: x['field'].isin(x['parent_field']))

>>> df[~mask]
Id                     title  field  parent_field                              label
0   1  Hardwareentwickler (m/w)   4036        2172.0  Hardware-Entwicklung, Engineering
2   1  Hardwareentwickler (m/w)   3081           NaN            Technik, Ingenieurwesen
4   1  Hardwareentwickler (m/w)   5000        1000.0                      IT Consultant
5   2                Accountant   4321           NaN                             Finanz
6   2                Accountant   1234           NaN                      Kostenrechner

使用经典连接方法的少一行代码:

df_copy = df.copy()
df_copy ['marker'] = 1
# left join to add a marker where the macthed condition.
joined = pd.merge(df, df_copy[['parent_field','marker']], left_on='field', rigth_on='parent_field', how='left')
# drop where is marker == 1 so that field and parent field are matched.
joined[joined.marker.isna()]

的好处是你有一个标记,所以你可以做一个基于丢弃在同一df上的分析。

如果我理解正确的话,让我们假设您的数据框架名为'df'

df[~df['field'].isin(list(df['parent_field']))]

最新更新