请在下面找到我的输入/期望输出:
输入
Id Status Date
0 Id001 off 01/01/2022
1 Id001 online 01/06/2022
2 Id002 off 05/12/2021
3 Id002 off 30/08/2021
4 Id003 running 02/03/2021
5 Id004 online 21/10/2021
6 Id004 running 01/04/2022
7 Id005 online 01/02/2022
8 Id005 online 02/07/2022
输出(所需(
Id Status Date
0 Id001 online 01/06/2022
1 Id002 off 05/12/2021
2 Id003 running 02/03/2021
3 Id004 online 21/10/2021
4 Id005 online 01/02/2022
目标首先是在输出中每个Id有一行
输出将基于以下简单语句:
if a group of Id has «online» as a Status then:
if «online» occurences > 1 then:
Choose the oldest Id that has «online»
else:
choose the Id that has «online»
else:
Choose the most recent Id
这就是我迄今为止所尝试的:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Id': ['Id001','Id001','Id002','Id002','Id003','Id004','Id004','Id005','Id005'],
'Status': ['off','online','off','off','running','online','running','online','online'],
'Date': ['01/01/2022','01/06/2022','05/12/2021','30/08/2021','02/03/2021','21/10/2021','01/04/2022','01/02/2022','02/07/2022']})
df.groupby(['Id', 'Status'], as_index=False).agg({'Date' : [np.min, np.max]}).reset_index()
你有什么建议吗
如有任何帮助,我们将不胜感激!
答案完全编辑-首先只过滤online
行,按Date
排序,并按第一个Id
:删除重复项
df1 = df[df['Status'].eq('online')].sort_values('Date').drop_duplicates('Id')
print (df1)
Id Status Date
5 Id004 online 2021-10-21
7 Id005 online 2022-02-01
1 Id001 online 2022-06-01
然后过滤不匹配的Id
并按降序排序:
df2 =df[~df['Id'].isin(df1['Id'])].sort_values('Date',ascending=False).drop_duplicates('Id')
print (df2)
Id Status Date
2 Id002 off 2021-12-05
4 Id003 running 2021-03-02
最后加入两个数据帧:
df = pd.concat([df1, df2]).sort_values('Id', ignore_index=True)
print (df)
Id Status Date
0 Id001 online 2022-06-01
1 Id002 off 2021-12-05
2 Id003 running 2021-03-02
3 Id004 online 2021-10-21
4 Id005 online 2022-02-01
应更改原始解决方案:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df1 = (df.assign(s = df['Status'].eq('online')).groupby(['Id','s'])
.agg(Date_min=('Date','idxmin'), Date_max=('Date','idxmax')))
df1 = df1[~df1.index.get_level_values(0).duplicated(keep='last')].reset_index()
print (df1)
Id s Date_min Date_max
0 Id001 True 1 1
1 Id002 False 3 2
2 Id003 False 4 4
3 Id004 True 5 5
4 Id005 True 7 8
df = df.loc[np.where(df1['s'], df1['Date_min'], df1['Date_max'])]
print (df)
Id Status Date
1 Id001 online 2022-06-01
2 Id002 off 2021-12-05
4 Id003 running 2021-03-02
5 Id004 online 2021-10-21
7 Id005 online 2022-02-01