Pandas: find the last edit date of each element in a DataFrame



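I hope I express myself correctly below, as this seems to be a complex problem.

My department regularly creates random daily snapshots of our current project portfolio website. Below, I have filtered the whole table for all snapshots with project_id = 1. I filter here to make the example easier to follow, since there are many projects. I have also reduced the number of columns for this example.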

df_table

project_id  project_name  region  style   effect   representative lazy  timestamp
1           PullPressure  EU      A-B-C   Pull     Martin         DCA   10/01/20
1           PullPressure  EU      A-B-C   Pull     Martin         DCA   09/05/20
1           PushPressure  EU      A-B-C   Push     Martin               08/20/20
1           PressurePush  EU      A-B-C   Push     Martin               04/06/20
1           PressurePush  US      A-B-C   Push     Johnsson             12/31/19
1           PressurePush  US      A-B-C   Push     Johnsson             10/15/19
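To make the example reproducible, the table above could be rebuilt roughly as follows. This is only a sketch: it assumes the blank lazy cells are missing values (NaN) and that timestamp is still stored as text in month/day/year order.

```python
import numpy as np
import pandas as pd

# Sample snapshots for project_id == 1, copied from the table above.
# Assumption: the blank "lazy" cells are missing values (NaN).
df = pd.DataFrame({
    "project_id": [1, 1, 1, 1, 1, 1],
    "project_name": ["PullPressure", "PullPressure", "PushPressure",
                     "PressurePush", "PressurePush", "PressurePush"],
    "region": ["EU", "EU", "EU", "EU", "US", "US"],
    "style": ["A-B-C"] * 6,
    "effect": ["Pull", "Pull", "Push", "Push", "Push", "Push"],
    "representative": ["Martin"] * 4 + ["Johnsson"] * 2,
    "lazy": ["DCA", "DCA", np.nan, np.nan, np.nan, np.nan],
    "timestamp": ["10/01/20", "09/05/20", "08/20/20",
                  "04/06/20", "12/31/19", "10/15/19"],
})
print(df)
```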

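My goal is to find out when the last change happened in any column for a given project_id (or, more generally, for any key column), i.e. when was each cell for a given id last edited?

I would like to end up with something like this: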

df_table_new:

project_id  project_name region       style        effect       representative  lazy        timestamp
1           08/20/20     04/06/20     10/15/19     09/05/20     04/06/20        09/05/20    10/01/20
1           08/20/20     04/06/20     10/15/19     09/05/20     04/06/20        09/05/20    09/05/20
1           08/20/20     04/06/20     10/15/19     10/15/19     04/06/20        10/15/19    08/20/20
1           10/15/19     04/06/20     10/15/19     10/15/19     04/06/20        10/15/19    04/06/20
1           10/15/19     10/15/19     10/15/19     10/15/19     10/15/19        10/15/19    12/31/19
1           10/15/19     10/15/19     10/15/19     10/15/19     10/15/19        10/15/19    10/15/19

Please let me know if anything is unclear!

Edit: there are empty values in some columns, which leads to NaT results, for example:

lazy
09/05/20
09/05/20
NaT
NaT
NaT
NaT

However, instead of NaT, these fields should contain the oldest available timestamp from the timestamp column, i.e. 10/15/19.
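A likely cause of the NaT values is that groupby excludes rows whose grouping key is missing, so those rows belong to no group and get no timestamp. The sketch below illustrates this on a reduced copy of the sample data; the variable names are made up for illustration, and the dropna=False argument assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Reduced copy of the sample: only the key, the "lazy" column and the timestamp.
df = pd.DataFrame({
    "project_id": [1] * 6,
    "lazy": ["DCA", "DCA", np.nan, np.nan, np.nan, np.nan],
    "timestamp": pd.to_datetime(
        ["10/01/20", "09/05/20", "08/20/20", "04/06/20", "12/31/19", "10/15/19"],
        format="%m/%d/%y"),
})

# By default, rows with a NaN key are left out of every group.
default_sizes = df.groupby(["project_id", "lazy"]).size()

# With dropna=False, NaN is kept as a group of its own.
all_sizes = df.groupby(["project_id", "lazy"], dropna=False).size()

# The value the question wants in place of NaT: the oldest timestamp.
oldest = df["timestamp"].min()

print(default_sizes.sum(), all_sizes.sum(), oldest)
```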

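Edit 2: Solved with @jezrael's solution by adding the corresponding element to the function. Thank you very much!

Use GroupBy.transform with GroupBy.last for each column selected by Index.difference: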

import pandas as pd
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%m/%d/%y')
# For each non-key column, take the last timestamp of every (project_id, value) group
for c in df.columns.difference(['project_id','timestamp']):
    df[c] = df.groupby(['project_id',c], sort=False)['timestamp'].transform('last')
print(df)
   project_id project_name     region      style     effect representative  timestamp
0           1   2020-09-05 2020-04-06 2019-10-15 2020-09-05     2020-04-06 2020-10-01
1           1   2020-09-05 2020-04-06 2019-10-15 2020-09-05     2020-04-06 2020-09-05
2           1   2020-08-20 2020-04-06 2019-10-15 2019-10-15     2020-04-06 2020-08-20
3           1   2019-10-15 2020-04-06 2019-10-15 2019-10-15     2020-04-06 2020-04-06
4           1   2019-10-15 2019-10-15 2019-10-15 2019-10-15     2019-10-15 2019-12-31
5           1   2019-10-15 2019-10-15 2019-10-15 2019-10-15     2019-10-15 2019-10-15

If you need the original date format, add Series.dt.strftime:

df['timestamp'] = pd.to_datetime(df['timestamp'], format='%m/%d/%y')
for c in df.columns.difference(['project_id','timestamp']):
    # Same as above, then format the result back to MM/DD/YY strings
    df[c] = (df.groupby(['project_id',c], sort=False)['timestamp'].transform('last')
               .dt.strftime('%m/%d/%y'))
print(df)
   project_id project_name    region     style    effect representative  timestamp
0           1     09/05/20  04/06/20  10/15/19  09/05/20       04/06/20 2020-10-01
1           1     09/05/20  04/06/20  10/15/19  09/05/20       04/06/20 2020-09-05
2           1     08/20/20  04/06/20  10/15/19  10/15/19       04/06/20 2020-08-20
3           1     10/15/19  04/06/20  10/15/19  10/15/19       04/06/20 2020-04-06
4           1     10/15/19  10/15/19  10/15/19  10/15/19       10/15/19 2019-12-31
5           1     10/15/19  10/15/19  10/15/19  10/15/19       10/15/19 2019-10-15

Edit: fill the missing values with the minimum timestamp using fillna:

df['timestamp'] = pd.to_datetime(df['timestamp'], format='%m/%d/%y')
# Oldest timestamp in the table, used wherever a cell was empty
min1 = df['timestamp'].min()
for c in df.columns.difference(['project_id','timestamp']):
    df[c] = df.groupby(['project_id',c], sort=False)['timestamp'].transform('last').fillna(min1)
print(df)
   project_id project_name     region      style     effect representative       lazy  timestamp
0           1   2020-09-05 2020-04-06 2019-10-15 2020-09-05     2020-04-06 2020-09-05 2020-10-01
1           1   2020-09-05 2020-04-06 2019-10-15 2020-09-05     2020-04-06 2020-09-05 2020-09-05
2           1   2020-08-20 2020-04-06 2019-10-15 2019-10-15     2020-04-06 2019-10-15 2020-08-20
3           1   2019-10-15 2020-04-06 2019-10-15 2019-10-15     2020-04-06 2019-10-15 2020-04-06
4           1   2019-10-15 2019-10-15 2019-10-15 2019-10-15     2019-10-15 2019-10-15 2019-12-31
5           1   2019-10-15 2019-10-15 2019-10-15 2019-10-15     2019-10-15 2019-10-15 2019-10-15
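The snippets above take the minimum over the whole table and rely on the rows being sorted newest first. With several projects in one table, a variation along the following lines might be safer. It is only a sketch, not the answer's exact method: last_edit_dates is a hypothetical helper name, it uses 'min' instead of 'last' so the result does not depend on row order, and it fills empty cells with the oldest timestamp of the same project rather than of the whole table.

```python
import numpy as np
import pandas as pd

def last_edit_dates(df, key="project_id", ts="timestamp"):
    """Hypothetical helper: replace each non-key cell with the oldest
    timestamp at which its (key, value) combination appears."""
    out = df.copy()
    out[ts] = pd.to_datetime(out[ts], format="%m/%d/%y")
    # Oldest snapshot per key, used for cells that were empty.
    oldest_per_key = out.groupby(key)[ts].transform("min")
    for c in out.columns.difference([key, ts]):
        stamped = out.groupby([key, c], sort=False)[ts].transform("min")
        # Keep the group result where the cell had a value, else fall back.
        out[c] = stamped.where(out[c].notna(), oldest_per_key)
    return out

# Reduced copy of the sample data (fewer columns, same rows).
df = pd.DataFrame({
    "project_id": [1] * 6,
    "project_name": ["PullPressure", "PullPressure", "PushPressure",
                     "PressurePush", "PressurePush", "PressurePush"],
    "region": ["EU", "EU", "EU", "EU", "US", "US"],
    "lazy": ["DCA", "DCA", np.nan, np.nan, np.nan, np.nan],
    "timestamp": ["10/01/20", "09/05/20", "08/20/20",
                  "04/06/20", "12/31/19", "10/15/19"],
})
result = last_edit_dates(df)
print(result)
```

On the sample data this should agree with the fillna output above, because the rows are already sorted newest first and there is only one project.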
