根据第二个DF中给出的防御,从pd.DataFrame中删除行



我有两个数据帧xy。DFx包含两个分组变量SA以及一个值变量V。我想从这个DF中删除y中定义的组对(S,A(的行。DFy包含变量S和变量A_D,它们一起定义了需要从x删除哪个(S,a(对。

但是,y['A_D']中的每个元素(可以(都包含一个字符串,由A中的元素组成,用逗号分隔。这些单独的元素中的每一个都应该从x中删除(对于S中的特定元素(。此外,如果y['A_D']包含元素all,则对于S中的特定元素,它应该删除x中的整个S-群。

我找到了一个能满足我需求的解决方案,但我的问题是,有没有更简单或更Python的方法来解决这个问题?

import pandas as pd
import numpy as np
# Define x
x = pd.DataFrame({'S': np.repeat(['s1','s2','s3'], 5),
'A': [j for i in range(3) for j in ['a','b','c','d','e']],
'V': np.random.uniform(size=15) })
# Define y. Which (S,A) pairs should be deleted from x. For 's1' all rows should be deleted.
# For 's2' 'a' and 'd' rows should be deleted and for 's3' the 'c' row should be deleted.
y = pd.DataFrame({'S':['s1','s2','s3'],
'A_D':['all','a, d', 'c']})
# My solution:
# expand y to a new DF z. Comma separated elements in 'A_D' become separate elements. Also strip whitespace.
z = []
for i, r in y.iterrows():
z.append(pd.DataFrame({'S'  : r[0],
'A_D': [u.strip() for u in str(r[1]).split(',')]}))
z = pd.concat(z)
# first delete S-groups defined by `all`
x_d = x.merge(z[z['A_D']=='all'],how='left')
x_d = x_d[x_d['A_D']!='all'].drop(columns= 'A_D')
# then drop (S,A) pairs.
x_d = x_d.merge(z[z['A_D']!='all'],how='left', left_on = ['S','A'], right_on = ['S', 'A_D'])
x_d = x_d[pd.isna(x_d['A_D'])].drop(columns= 'A_D').reset_index(drop=True)
# The required result:
print(x_d)

为了清晰起见,物体看起来像这样:

x
Out[1]: 
S  A         V
0   s1  a  0.758516
1   s1  b  0.522200
2   s1  c  0.190511
3   s1  d  0.544617
4   s1  e  0.480378
5   s2  a  0.191016
6   s2  b  0.714625
7   s2  c  0.852788
8   s2  d  0.142410
9   s2  e  0.909382
10  s3  a  0.895031
11  s3  b  0.153444
12  s3  c  0.751675
13  s3  d  0.227501
14  s3  e  0.586467
y
Out[2]: 
S   A_D
0  s1   all
1  s2  a, d
2  s3     c
z
Out[3]: 
S  A_D
0  s1  all
0  s2    a
1  s2    d
0  s3    c
x_d
Out[4]: 
S  A         V
0  s2  b  0.714625
1  s2  c  0.852788
2  s2  e  0.909382
3  s3  a  0.895031
4  s3  b  0.153444
5  s3  d  0.227501
6  s3  e  0.586467
x
###
S  A         V
0   s1  a  0.490194
1   s1  b  0.875381
2   s1  c  0.384808
3   s1  d  0.063960
4   s1  e  0.003159
5   s2  a  0.188624
6   s2  b  0.400527
7   s2  c  0.137458
8   s2  d  0.162291
9   s2  e  0.337899
10  s3  a  0.101296
11  s3  b  0.464031
12  s3  c  0.407629
13  s3  d  0.222498
14  s3  e  0.802472

', '' ,'','分离的

y
###
S      A_D
0  s1      all
1  s2  a, d ,c
2  s3      c,d

y['A_D'] = y['A_D'].replace('all', ', '.join(x['A'].unique()))
y = y.assign(A_D=y['A_D'].str.split(',')).explode('A_D')
y['A_D'] = y['A_D'].str.strip()
output = x[~x.set_index(['S','A']).index.isin(y.set_index(['S','A_D']).index)].reset_index(drop=True)
output
###
S  A         V
0  s2  b  0.400527
1  s2  e  0.337899
2  s3  a  0.101296
3  s3  b  0.464031
4  s3  e  0.802472

这是我的解决方案,它至少更短:(

def filter_group(group, filter_rule):
return (None if filter_rule == 'all'
else group[~group["A"].isin(filter_rule.replace(' ', '').split(','))])
x_d = pd.concat(filter_group(x.groupby('S').get_group(grp), filter_rule) 
for grp, filter_rule in dict(zip(y["S"], y["A_D"])).items())