对具有多个条件的分组数据帧进行自定义筛选



请在下面找到我的输入/期望输出:

输入

Id   Status        Date
0  Id001      off  01/01/2022
1  Id001   online  01/06/2022
2  Id002      off  05/12/2021
3  Id002      off  30/08/2021
4  Id003  running  02/03/2021
5  Id004   online  21/10/2021
6  Id004  running  01/04/2022
7  Id005   online  01/02/2022
8  Id005   online  02/07/2022

输出(所需(

Id   Status        Date
0  Id001   online  01/06/2022
1  Id002      off  05/12/2021
2  Id003  running  02/03/2021
3  Id004   online  21/10/2021
4  Id005   online  01/02/2022

目标首先是在输出中每个Id有一行
输出将基于以下简单语句:

if a group of Id has «online» as a Status then:  
if «online» occurences > 1 then: 
Choose the oldest Id that has «online»
else:  
choose the Id that has «online»
else:
Choose the most recent Id

这就是我迄今为止所尝试的:


import pandas as pd
import numpy as np
df = pd.DataFrame({'Id': ['Id001','Id001','Id002','Id002','Id003','Id004','Id004','Id005','Id005'],
'Status': ['off','online','off','off','running','online','running','online','online'],
'Date': ['01/01/2022','01/06/2022','05/12/2021','30/08/2021','02/03/2021','21/10/2021','01/04/2022','01/02/2022','02/07/2022']})
df.groupby(['Id', 'Status'], as_index=False).agg({'Date' : [np.min, np.max]}).reset_index()

你有什么建议吗
如有任何帮助,我们将不胜感激!

答案完全编辑-首先只过滤online行,按Date排序,并按第一个Id:删除重复项

df1 = df[df['Status'].eq('online')].sort_values('Date').drop_duplicates('Id')
print (df1)
Id  Status       Date
5  Id004  online 2021-10-21
7  Id005  online 2022-02-01
1  Id001  online 2022-06-01

然后过滤不匹配的Id并按降序排序:

df2 =df[~df['Id'].isin(df1['Id'])].sort_values('Date',ascending=False).drop_duplicates('Id')
print (df2)
Id   Status       Date
2  Id002      off 2021-12-05
4  Id003  running 2021-03-02

最后加入两个数据帧:

df = pd.concat([df1, df2]).sort_values('Id', ignore_index=True)
print (df)
Id   Status       Date
0  Id001   online 2022-06-01
1  Id002      off 2021-12-05
2  Id003  running 2021-03-02
3  Id004   online 2021-10-21
4  Id005   online 2022-02-01

应更改原始解决方案:

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df1 = (df.assign(s = df['Status'].eq('online')).groupby(['Id','s'])
.agg(Date_min=('Date','idxmin'), Date_max=('Date','idxmax')))
df1 = df1[~df1.index.get_level_values(0).duplicated(keep='last')].reset_index()
print (df1)
Id      s  Date_min  Date_max
0  Id001   True         1         1
1  Id002  False         3         2
2  Id003  False         4         4
3  Id004   True         5         5
4  Id005   True         7         8
df = df.loc[np.where(df1['s'], df1['Date_min'], df1['Date_max'])]
print (df)
Id   Status       Date
1  Id001   online 2022-06-01
2  Id002      off 2021-12-05
4  Id003  running 2021-03-02
5  Id004   online 2021-10-21
7  Id005   online 2022-02-01

最新更新