I have the following DataFrame (df):
ID start end Diff
A 1/8/2020 12:00:00 AM 1/8/2020 12:00:10 AM 10
A 1/8/2020 12:00:05 AM 1/8/2020 12:00:10 AM 5
B 1/9/2020 1:00:05 AM 1/9/2020 1:00:10 AM 5
B 1/9/2020 1:00:06 AM 1/9/2020 1:00:10 AM 4
B 1/9/2020 1:00:20 AM 1/9/2020 1:00:25 AM 5
C 1/10/2020 5:00:05 AM 1/10/2020 5:00:25 AM 20
C 1/10/2020 5:00:40 AM 1/10/2020 5:00:45 AM 5
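The problem: when several rows share the same end time, I want to drop the rows with the longer duration and keep the row with the shortest duration.
Desired result: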
ID start end Diff
A 1/8/2020 12:00:05 AM 1/8/2020 12:00:10 AM 5
B 1/9/2020 1:00:06 AM 1/9/2020 1:00:10 AM 4
B 1/9/2020 1:00:20 AM 1/9/2020 1:00:25 AM 5
C 1/10/2020 5:00:05 AM 1/10/2020 5:00:25 AM 20
C 1/10/2020 5:00:40 AM 1/10/2020 5:00:45 AM 5
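In essence, when end times are equal, I want to remove the rows with the longer duration. I have tried the following, but it does not take the condition into account (when end times are equal, keep the row with the shorter duration):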
df.sort_values(['Diff']).drop_duplicates(subset=['ID'])
Any suggestions are appreciated.
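For anyone who wants to reproduce the problem, here is a minimal sketch that rebuilds the sample frame and runs the attempt above. The timestamps are kept as plain strings, which is an assumption; the question does not say what dtype the columns have. The variable name `attempt` is introduced only for illustration.

```python
import pandas as pd

# Sample data copied from the question; timestamps are kept as strings (assumption).
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "B", "C", "C"],
    "start": [
        "1/8/2020 12:00:00 AM", "1/8/2020 12:00:05 AM",
        "1/9/2020 1:00:05 AM", "1/9/2020 1:00:06 AM", "1/9/2020 1:00:20 AM",
        "1/10/2020 5:00:05 AM", "1/10/2020 5:00:40 AM",
    ],
    "end": [
        "1/8/2020 12:00:10 AM", "1/8/2020 12:00:10 AM",
        "1/9/2020 1:00:10 AM", "1/9/2020 1:00:10 AM", "1/9/2020 1:00:25 AM",
        "1/10/2020 5:00:25 AM", "1/10/2020 5:00:45 AM",
    ],
    "Diff": [10, 5, 5, 4, 5, 20, 5],
})

# The attempt from the question: deduplicating on 'ID' alone keeps a single
# row per ID, so rows with a *different* end time are dropped as well.
attempt = df.sort_values(["Diff"]).drop_duplicates(subset=["ID"])
print(attempt)
```

Only three rows survive (one per ID), so rows 4 and 5 from the desired result are lost. The deduplication key has to involve end, not only ID.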
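Group by the end column and take the minimum of Diff, then compare that with df['Diff'] and keep the rows where the comparison returns True. Note that transform returns the group minimum broadcast to every row of the group: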
df[df['Diff'].eq(df.groupby('end')['Diff'].transform('min'))]
ID start end Diff
1 A 1/8/2020 12:00:05 AM 1/8/2020 12:00:10 AM 5
3 B 1/9/2020 1:00:06 AM 1/9/2020 1:00:10 AM 4
4 B 1/9/2020 1:00:20 AM 1/9/2020 1:00:25 AM 5
5 C 1/10/2020 5:00:05 AM 1/10/2020 5:00:25 AM 20
6 C 1/10/2020 5:00:40 AM 1/10/2020 5:00:45 AM 5
Intermediate output of groupby + transform:
print(df.groupby('end')['Diff'].transform('min'))
0 5
1 5
2 4
3 4
4 5
5 20
6 5
print(df['Diff'].eq(df.groupby('end')['Diff'].transform('min')))
0 False
1 True
2 False
3 True
4 True
5 True
6 True
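The mask above groups by end only. If the same end time could occur under two different IDs, grouping by both columns may be safer. The sketch below is a variation under that assumption, not part of the original answer; only the columns needed for the filter are reproduced, and `group_min` and `result` are illustrative names.

```python
import pandas as pd

# Columns needed for the filter, copied from the question.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "B", "C", "C"],
    "end": [
        "1/8/2020 12:00:10 AM", "1/8/2020 12:00:10 AM",
        "1/9/2020 1:00:10 AM", "1/9/2020 1:00:10 AM", "1/9/2020 1:00:25 AM",
        "1/10/2020 5:00:25 AM", "1/10/2020 5:00:45 AM",
    ],
    "Diff": [10, 5, 5, 4, 5, 20, 5],
})

# Minimum Diff per (ID, end) pair, broadcast back to the original row index.
group_min = df.groupby(["ID", "end"])["Diff"].transform("min")

# Keep only the rows whose Diff equals the minimum of their group.
result = df[df["Diff"].eq(group_min)]
print(result)
```

On the sample data this keeps the same rows (1, 3, 4, 5, 6) as grouping by end alone, because no end time is shared across IDs.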
We can also use Series.map:
df[df['Diff'].eq(df['end'].map(df.groupby('end')['Diff'].min()))]
ID start end Diff
1 A 1/8/2020 12:00:05 AM 1/8/2020 12:00:10 AM 5
3 B 1/9/2020 1:00:06 AM 1/9/2020 1:00:10 AM 4
4 B 1/9/2020 1:00:20 AM 1/9/2020 1:00:25 AM 5
5 C 1/10/2020 5:00:05 AM 1/10/2020 5:00:25 AM 20
6 C 1/10/2020 5:00:40 AM 1/10/2020 5:00:45 AM 5
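To make the map variant easier to follow, here is the same one-liner split into steps. This is a sketch on the sample data; `per_end_min`, `row_min` and `result` are names introduced for illustration.

```python
import pandas as pd

# Columns needed for the filter, copied from the question.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "B", "C", "C"],
    "end": [
        "1/8/2020 12:00:10 AM", "1/8/2020 12:00:10 AM",
        "1/9/2020 1:00:10 AM", "1/9/2020 1:00:10 AM", "1/9/2020 1:00:25 AM",
        "1/10/2020 5:00:25 AM", "1/10/2020 5:00:45 AM",
    ],
    "Diff": [10, 5, 5, 4, 5, 20, 5],
})

# Step 1: one minimum per distinct end value (a Series indexed by 'end').
per_end_min = df.groupby("end")["Diff"].min()

# Step 2: look up each row's end value in that Series.
row_min = df["end"].map(per_end_min)

# Step 3: keep the rows whose Diff equals the looked-up minimum.
result = df[df["Diff"].eq(row_min)]
print(result)
```

Here `row_min` holds the same values as the transform('min') output shown earlier; the difference is only in how the per-group minimum is aligned back to the rows.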
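Sort by 'start', so that within each (ID, end) pair the row with the "shorter" duration naturally comes last (this relies on start and end being datetime values, as in the output below). Then use drop_duplicates with keep='last':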
df.sort_values(['ID', 'start', 'end']).drop_duplicates(['ID', 'end'], keep='last')
ID start end Diff
1 A 2020-01-08 00:00:05 2020-01-08 00:00:10 5
3 B 2020-01-09 01:00:06 2020-01-09 01:00:10 4
4 B 2020-01-09 01:00:20 2020-01-09 01:00:25 5
5 C 2020-01-10 05:00:05 2020-01-10 05:00:25 20
6 C 2020-01-10 05:00:40 2020-01-10 05:00:45 5
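The sort-based approach depends on start sorting chronologically, which is not guaranteed for string timestamps. A minimal sketch of the conversion step follows; the format string assumes month/day/year with a 12-hour clock, which matches the datetime output above, and the name `fmt` is illustrative.

```python
import pandas as pd

# Sample data copied from the question, initially as strings.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "B", "C", "C"],
    "start": [
        "1/8/2020 12:00:00 AM", "1/8/2020 12:00:05 AM",
        "1/9/2020 1:00:05 AM", "1/9/2020 1:00:06 AM", "1/9/2020 1:00:20 AM",
        "1/10/2020 5:00:05 AM", "1/10/2020 5:00:40 AM",
    ],
    "end": [
        "1/8/2020 12:00:10 AM", "1/8/2020 12:00:10 AM",
        "1/9/2020 1:00:10 AM", "1/9/2020 1:00:10 AM", "1/9/2020 1:00:25 AM",
        "1/10/2020 5:00:25 AM", "1/10/2020 5:00:45 AM",
    ],
    "Diff": [10, 5, 5, 4, 5, 20, 5],
})

# Assumed layout: month/day/year, 12-hour clock with AM/PM.
fmt = "%m/%d/%Y %I:%M:%S %p"
df["start"] = pd.to_datetime(df["start"], format=fmt)
df["end"] = pd.to_datetime(df["end"], format=fmt)

# For a fixed end, the latest start is the shortest interval, so keep the last row.
result = (
    df.sort_values(["ID", "start", "end"])
      .drop_duplicates(["ID", "end"], keep="last")
)
print(result)
```

With string columns, a value such as "1/10/2020 ..." would sort before "1/9/2020 ...", so converting first is the safer choice even when the sample data happens to sort correctly.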
Sort by ID, end and Diff, then take the first row of each (ID, end) group, which is the one with the shortest Diff.
>>> df.sort_values(['ID', 'end', 'Diff']).groupby(['ID', 'end'], sort=False).head(1)
ID start end Diff
1 A 1/8/2020 12:00:05 AM 1/8/2020 12:00:10 AM 5
3 B 1/9/2020 1:00:06 AM 1/9/2020 1:00:10 AM 4
4 B 1/9/2020 1:00:20 AM 1/9/2020 1:00:25 AM 5
5 C 1/10/2020 5:00:05 AM 1/10/2020 5:00:25 AM 20
6 C 1/10/2020 5:00:40 AM 1/10/2020 5:00:45 AM 5
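One behavioral difference between the approaches in this thread shows up only when two rows tie for the minimum, which the sample data never does. The sketch below uses hypothetical data (the frame `tied` is made up for illustration) to show that sort + head(1) keeps exactly one row per group, while the eq(min) mask keeps every tied row.

```python
import pandas as pd

# Hypothetical data with a tie: two rows share ID, end and the minimum Diff.
tied = pd.DataFrame({
    "ID": ["A", "A", "A"],
    "end": ["1/8/2020 12:00:10 AM"] * 3,
    "Diff": [10, 5, 5],
})

# sort + head(1): exactly one row per (ID, end) group.
one_per_group = (
    tied.sort_values(["ID", "end", "Diff"])
        .groupby(["ID", "end"], sort=False)
        .head(1)
)

# eq(min) mask: every row matching the group minimum survives.
group_min = tied.groupby(["ID", "end"])["Diff"].transform("min")
all_minima = tied[tied["Diff"].eq(group_min)]

print(len(one_per_group), len(all_minima))
```

Which behavior is preferable depends on whether duplicate shortest intervals should be collapsed into one row or preserved.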