假设我有以下数据帧:
Id title field parent_field label
1 Hardwareentwickler (m/w) 4036 2172 Hardware-Entwicklung, Engineering
1 Hardwareentwickler (m/w) 2172 NaN IT, EDV
1 Hardwareentwickler (m/w) 3081 NaN Technik, Ingenieurwesen
1 Hardwareentwickler (m/w) 1000 NaN Consultant
1 Hardwareentwickler (m/w) 5000 1000 IT Consultant
2 Accountant 4321 NaN Finanz
2 Accountant 1234 NaN Kostenrechner
我想做的是删除所有记录,其中field
存在于列parent_field
(按ID/标题分组)。
最后,我想得到以下结果:
Id title field parent_field label
1 Hardwareentwickler (m/w) 4036 2172 Hardware-Entwicklung, Engineering
1 Hardwareentwickler (m/w) 3081 NaN Technik, Ingenieurwesen
1 Hardwareentwickler (m/w) 5000 1000 IT Consultant
2 Accountant 4321 NaN Finanz
2 Accountant 1234 NaN Kostenrechner
,行
Id title field parent_field label
1 Hardwareentwickler (m/w) 2172 NaN IT, EDV
1 Hardwareentwickler (m/w) 1000 NaN Consultant
删除,因为field
值在其他行(按id分组)的parent_field
中表示
简单解决方法
将数据框按id
和title
和apply
分组,使用lambda函数检查field
是否出现在parent_field
中以创建布尔值mask
。
mask = df.groupby(['Id', 'title'], group_keys=False)
.apply(lambda x: x['field'].isin(x['parent_field']))
>>> df[~mask]
Id title field parent_field label
0 1 Hardwareentwickler (m/w) 4036 2172.0 Hardware-Entwicklung, Engineering
2 1 Hardwareentwickler (m/w) 3081 NaN Technik, Ingenieurwesen
4 1 Hardwareentwickler (m/w) 5000 1000.0 IT Consultant
5 2 Accountant 4321 NaN Finanz
6 2 Accountant 1234 NaN Kostenrechner
使用经典连接方法的少一行代码:
df_copy = df.copy()
df_copy ['marker'] = 1
# left join to add a marker where the macthed condition.
joined = pd.merge(df, df_copy[['parent_field','marker']], left_on='field', rigth_on='parent_field', how='left')
# drop where is marker == 1 so that field and parent field are matched.
joined[joined.marker.isna()]
的好处是你有一个标记,所以你可以做一个基于丢弃在同一df上的分析。
如果我理解正确的话,让我们假设您的数据框架名为'df'
df[~df['field'].isin(list(df['parent_field']))]