Is there a way to keep the rows that meet a specific condition and drop the rows that do not?



I have the following DataFrame (df):

ID  start                        end                     Diff
A   1/8/2020 12:00:00 AM         1/8/2020 12:00:10 AM    10
A   1/8/2020 12:00:05 AM         1/8/2020 12:00:10 AM    5
B   1/9/2020 1:00:05 AM          1/9/2020 1:00:10 AM     5
B   1/9/2020 1:00:06 AM          1/9/2020 1:00:10 AM     4
B   1/9/2020 1:00:20 AM          1/9/2020 1:00:25 AM     5
C   1/10/2020 5:00:05 AM         1/10/2020 5:00:25 AM    20
C   1/10/2020 5:00:40 AM         1/10/2020 5:00:45 AM    5
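
For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: it assumes start and end are stored as plain strings and Diff as integer seconds, which the question does not state explicitly.

```python
import pandas as pd

# Rebuild the sample data from the question.
# Assumption: start/end are plain strings, Diff is integer seconds.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "B", "C", "C"],
    "start": [
        "1/8/2020 12:00:00 AM", "1/8/2020 12:00:05 AM",
        "1/9/2020 1:00:05 AM", "1/9/2020 1:00:06 AM", "1/9/2020 1:00:20 AM",
        "1/10/2020 5:00:05 AM", "1/10/2020 5:00:40 AM",
    ],
    "end": [
        "1/8/2020 12:00:10 AM", "1/8/2020 12:00:10 AM",
        "1/9/2020 1:00:10 AM", "1/9/2020 1:00:10 AM", "1/9/2020 1:00:25 AM",
        "1/10/2020 5:00:25 AM", "1/10/2020 5:00:45 AM",
    ],
    "Diff": [10, 5, 5, 4, 5, 20, 5],
})

print(df)
```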

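The problem is this: when the end time is the same, I want to drop the rows with the longer duration and keep the row with the shortest duration.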

Desired result:

ID  start                      end                     Diff
A   1/8/2020 12:00:05 AM       1/8/2020 12:00:10 AM    5
B   1/9/2020 1:00:06 AM        1/9/2020 1:00:10 AM     4
B   1/9/2020 1:00:20 AM        1/9/2020 1:00:25 AM     5
C   1/10/2020 5:00:05 AM       1/10/2020 5:00:25 AM    20
C   1/10/2020 5:00:40 AM       1/10/2020 5:00:45 AM    5

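Essentially, when the end time is the same, I want the rows with the longer duration removed. I have tried the following, but it does not take the condition into account (when the end time is the same, keep the row with the shorter duration):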

df.sort_values(['Diff']).drop_duplicates(subset=['ID'])

Any suggestions are appreciated.
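
To see why the attempt in the question falls short, it may help to run it on the sample data. The sketch below keeps only the ID, end, and Diff columns (start is left out for brevity, which is an assumption about what matters here). Because drop_duplicates(subset=['ID']) keeps a single row per ID, rows that have a different end time within the same ID are dropped as well.

```python
import pandas as pd

# Sample data from the question; the start column is omitted for brevity.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "B", "C", "C"],
    "end": [
        "1/8/2020 12:00:10 AM", "1/8/2020 12:00:10 AM",
        "1/9/2020 1:00:10 AM", "1/9/2020 1:00:10 AM", "1/9/2020 1:00:25 AM",
        "1/10/2020 5:00:25 AM", "1/10/2020 5:00:45 AM",
    ],
    "Diff": [10, 5, 5, 4, 5, 20, 5],
})

# The attempt from the question: one surviving row per ID.
attempt = df.sort_values(["Diff"]).drop_duplicates(subset=["ID"])

kept = sorted(attempt.index)
wanted = [1, 3, 4, 5, 6]  # index labels of the desired result
missing = sorted(set(wanted) - set(kept))

print(kept)     # index labels the attempt keeps
print(missing)  # wanted rows that the attempt drops
```

The duplicate check has to involve end, not just ID, for rows 4 and 5 to survive.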

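Use groupby on the end column with transform('min') on Diff, compare the result with df['Diff'], and keep the rows where the comparison returns True. The output further down shows how transform returns the group minimum for every row of the group: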

df[df['Diff'].eq(df.groupby('end')['Diff'].transform('min'))]

ID                 start                   end  Diff
1  A  1/8/2020 12:00:05 AM  1/8/2020 12:00:10 AM     5
3  B   1/9/2020 1:00:06 AM   1/9/2020 1:00:10 AM     4
4  B   1/9/2020 1:00:20 AM   1/9/2020 1:00:25 AM     5
5  C  1/10/2020 5:00:05 AM  1/10/2020 5:00:25 AM    20
6  C  1/10/2020 5:00:40 AM  1/10/2020 5:00:45 AM     5

Output of groupby + transform:

print(df.groupby('end')['Diff'].transform('min'))
0     5
1     5
2     4
3     4
4     5
5    20
6     5

print(df['Diff'].eq(df.groupby('end')['Diff'].transform('min')))
0    False
1     True
2    False
3     True
4     True
5     True
6     True
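
Putting those steps together, a minimal end-to-end sketch might look like the one below. It groups on both ID and end rather than end alone. That is an assumption about the intent (two different IDs that happen to share an end time would otherwise be compared with each other); on the sample data both groupings give the same rows. The start column is omitted for brevity.

```python
import pandas as pd

# Sample data from the question; the start column is omitted for brevity.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "B", "C", "C"],
    "end": [
        "1/8/2020 12:00:10 AM", "1/8/2020 12:00:10 AM",
        "1/9/2020 1:00:10 AM", "1/9/2020 1:00:10 AM", "1/9/2020 1:00:25 AM",
        "1/10/2020 5:00:25 AM", "1/10/2020 5:00:45 AM",
    ],
    "Diff": [10, 5, 5, 4, 5, 20, 5],
})

# Smallest Diff within each (ID, end) group, broadcast back to every row.
group_min = df.groupby(["ID", "end"])["Diff"].transform("min")

# True where a row's Diff equals its group's minimum.
mask = df["Diff"].eq(group_min)

result = df[mask]
print(result)
```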

We can also use Series.map:

df[df['Diff'].eq(df['end'].map(df.groupby('end')['Diff'].min()))]
ID                 start                   end  Diff
1  A  1/8/2020 12:00:05 AM  1/8/2020 12:00:10 AM     5
3  B   1/9/2020 1:00:06 AM   1/9/2020 1:00:10 AM     4
4  B   1/9/2020 1:00:20 AM   1/9/2020 1:00:25 AM     5
5  C  1/10/2020 5:00:05 AM  1/10/2020 5:00:25 AM    20
6  C  1/10/2020 5:00:40 AM  1/10/2020 5:00:45 AM     5
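
The map variant can be read as a two-step lookup: groupby(...).min() produces one minimum per end value, and map broadcasts those minima back onto the rows. The following sketch (start column omitted, variable names chosen here for illustration) shows the intermediate lookup Series and checks that it leads to the same values as transform('min').

```python
import pandas as pd

# Sample data from the question; the start column is omitted for brevity.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "B", "C", "C"],
    "end": [
        "1/8/2020 12:00:10 AM", "1/8/2020 12:00:10 AM",
        "1/9/2020 1:00:10 AM", "1/9/2020 1:00:10 AM", "1/9/2020 1:00:25 AM",
        "1/10/2020 5:00:25 AM", "1/10/2020 5:00:45 AM",
    ],
    "Diff": [10, 5, 5, 4, 5, 20, 5],
})

# One entry per distinct end value: the lookup table.
min_per_end = df.groupby("end")["Diff"].min()

# Broadcast the lookup back to the original rows.
mapped = df["end"].map(min_per_end)

# Same values as the transform-based approach.
transformed = df.groupby("end")["Diff"].transform("min")
same = mapped.tolist() == transformed.tolist()

print(min_per_end)
print(same)
```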

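Sort by 'start' so that, within each ID and end, the row with the shorter duration naturally comes last, then use drop_duplicates with keep='last'. (The output below shows start and end already parsed as datetimes.)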

df.sort_values(['ID', 'start', 'end']).drop_duplicates(['ID', 'end'], keep='last')
ID               start                 end  Diff
1  A 2020-01-08 00:00:05 2020-01-08 00:00:10     5
3  B 2020-01-09 01:00:06 2020-01-09 01:00:10     4
4  B 2020-01-09 01:00:20 2020-01-09 01:00:25     5
5  C 2020-01-10 05:00:05 2020-01-10 05:00:25    20
6  C 2020-01-10 05:00:40 2020-01-10 05:00:45     5
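
The output above shows start and end as datetimes, so a conversion step presumably ran before the sort. One possible way to do that is sketched below. The format string is an assumption based on the sample values (month/day/year with a 12-hour clock). Sorting the raw strings is not chronological in general, which is why the conversion matters for this approach.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "B", "C", "C"],
    "start": [
        "1/8/2020 12:00:00 AM", "1/8/2020 12:00:05 AM",
        "1/9/2020 1:00:05 AM", "1/9/2020 1:00:06 AM", "1/9/2020 1:00:20 AM",
        "1/10/2020 5:00:05 AM", "1/10/2020 5:00:40 AM",
    ],
    "end": [
        "1/8/2020 12:00:10 AM", "1/8/2020 12:00:10 AM",
        "1/9/2020 1:00:10 AM", "1/9/2020 1:00:10 AM", "1/9/2020 1:00:25 AM",
        "1/10/2020 5:00:25 AM", "1/10/2020 5:00:45 AM",
    ],
    "Diff": [10, 5, 5, 4, 5, 20, 5],
})

# Assumed format: month/day/year, 12-hour clock with AM/PM.
fmt = "%m/%d/%Y %I:%M:%S %p"
df["start"] = pd.to_datetime(df["start"], format=fmt)
df["end"] = pd.to_datetime(df["end"], format=fmt)

# Within one (ID, end) pair the latest start has the shortest duration,
# so after sorting by start it is the last duplicate.
result = (
    df.sort_values(["ID", "start", "end"])
      .drop_duplicates(["ID", "end"], keep="last")
)
print(result)
```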

Sort by ID, end, and Diff, then take the first row of each (ID, end) group, which is the one with the smallest Diff.

>>> df.sort_values(['ID', 'end', 'Diff']).groupby(['ID', 'end'], sort=False).head(1)
ID                 start                   end  Diff
1  A  1/8/2020 12:00:05 AM  1/8/2020 12:00:10 AM     5
3  B   1/9/2020 1:00:06 AM   1/9/2020 1:00:10 AM     4
4  B   1/9/2020 1:00:20 AM   1/9/2020 1:00:25 AM     5
5  C  1/10/2020 5:00:05 AM  1/10/2020 5:00:25 AM    20
6  C  1/10/2020 5:00:40 AM  1/10/2020 5:00:45 AM     5
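
One difference between the approaches only shows up when two rows in the same group tie for the smallest Diff, which the sample data does not contain. The sketch below uses small hypothetical data (the end label "e1" is a placeholder, not a real timestamp) to illustrate it: the boolean-mask approach keeps every tied row, while sort + groupby + head(1) keeps exactly one row per group.

```python
import pandas as pd

# Hypothetical data: two rows share the same ID, end, and minimal Diff.
tied = pd.DataFrame({
    "ID": ["A", "A", "A"],
    "end": ["e1", "e1", "e1"],
    "Diff": [5, 5, 10],
})

# Mask approach: keeps all rows equal to the group minimum.
group_min = tied.groupby(["ID", "end"])["Diff"].transform("min")
by_mask = tied[tied["Diff"].eq(group_min)]

# head(1) approach: keeps a single row per group.
by_head = (
    tied.sort_values(["ID", "end", "Diff"])
        .groupby(["ID", "end"], sort=False)
        .head(1)
)

print(len(by_mask))
print(len(by_head))
```

Which behavior is preferable depends on whether tied rows should all be kept or collapsed to one.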
