我有以下数据帧,df:
Subject Marks1 Marks2
English 1 10
English 1.5 20
English 1.7 30
English 3 40
Science 1 10
Science 1.5 20
Science 1.7 15
Science 3 35
我想按主题分组并检查 Marks2 是否随着分数的增加而严格增加 1。如果没有,那么我想从 df 中删除该组并将其放在另一个问题数据帧中。所以最后我会有, DF:
Subject Marks1 Marks2
English 1 10
English 1.5 20
English 1.7 30
English 3 40
问题:
Subject Marks1 Marks2
Science 1 10
Science 1.5 20
Science 1.7 15
Science 3 35
使用DataFrameGroupBy.diff
比较较少的值,例如所有列的0
DataFrame.any
,然后按Series.isin
获取主题和过滤器输出的vals
:
m = df.groupby('Subject').diff().le(0).any(axis=1)
vals = df.loc[m, 'Subject']
mask = df['Subject'].isin(vals)
df1 = df[mask]
print (df1)
Subject Marks1 Marks2
4 Science 1.0 10
5 Science 1.5 20
6 Science 1.7 15
7 Science 3.0 35
df2 = df[~mask]
print (df2)
Subject Marks1 Marks2
0 English 1.0 10
1 English 1.5 20
2 English 1.7 30
3 English 3.0 40
编辑:瓶颈是每个组的差异,如果所有组都排序,可以通过以下方式提高性能:
#columns used for difference (passed to groupby())
cols = ['Subject','col1','col2']
#sorting by all columns (if possible and if necessary)
df = df.sort_values(cols)
m = df[['Marks1','Marks2']].diff().le(0).any(axis=1) & df.duplicated(cols)
vals = df.loc[m, 'Subject']
mask = df['Subject'].isin(vals)
df1 = df[mask]
.filter()
使用lambda
函数查找.diff()
来识别问题
issues=df.groupby('Subject').filter(lambda x : ((x.Marks1.diff()>0)&(x.Marks2.diff()<0)).any())
print(issues)
Subject Marks1 Marks2
4 Science 1.0 10
5 Science 1.5 20
6 Science 1.7 15
7 Science 3.0 35
Noissues=df[~df.index.isin(issues.index)]
print(Noissues)
Subject Marks1 Marks2
0 English 1.0 10
1 English 1.5 20
2 English 1.7 30
3 English 3.0 40