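I hope I can express myself correctly below, since this seems to be a complicated problem.
My department regularly creates random daily snapshots of our current project portfolio website. Below, I have filtered the full table down to all snapshots with project_id=1. I filter here to make the example easier to follow, because there are many projects. I have also reduced the number of columns for this example.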
df_table
project_id project_name region style effect representative lazy timestamp
1 PullPressure EU A-B-C Pull Martin DCA 10/01/20
1 PullPressure EU A-B-C Pull Martin DCA 09/05/20
1 PushPressure EU A-B-C Push Martin 08/20/20
1 PressurePush EU A-B-C Push Martin 04/06/20
1 PressurePush US A-B-C Push Johnsson 12/31/19
1 PressurePush US A-B-C Push Johnsson 10/15/19
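For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: it assumes the blank `lazy` cells are missing values (None/NaN) and that the timestamps use the `%m/%d/%y` format.

```python
import pandas as pd

# Rows copied from the table above; blank `lazy` cells are assumed to be missing.
df_table = pd.DataFrame({
    "project_id": [1, 1, 1, 1, 1, 1],
    "project_name": ["PullPressure", "PullPressure", "PushPressure",
                     "PressurePush", "PressurePush", "PressurePush"],
    "region": ["EU", "EU", "EU", "EU", "US", "US"],
    "style": ["A-B-C"] * 6,
    "effect": ["Pull", "Pull", "Push", "Push", "Push", "Push"],
    "representative": ["Martin", "Martin", "Martin", "Martin",
                       "Johnsson", "Johnsson"],
    "lazy": ["DCA", "DCA", None, None, None, None],
    "timestamp": ["10/01/20", "09/05/20", "08/20/20",
                  "04/06/20", "12/31/19", "10/15/19"],
})

# Parse the snapshot dates so they can be compared and aggregated.
df_table["timestamp"] = pd.to_datetime(df_table["timestamp"], format="%m/%d/%y")
print(df_table)
```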
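My goal is to find out when the last change happened in any column for a project_id (or, more generally, for any key_column), i.e. when was each cell last edited for a given id?
I would like to end up with something like this: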
df_table_new:
project_id project_name region style effect representative lazy timestamp
1 08/20/20 04/06/20 10/15/19 09/05/20 04/06/20 09/05/20 10/01/20
1 08/20/20 04/06/20 10/15/19 09/05/20 04/06/20 09/05/20 09/05/20
1 08/20/20 04/06/20 10/15/19 10/15/19 04/06/20 10/15/19 08/20/20
1 10/15/19 04/06/20 10/15/19 10/15/19 04/06/20 10/15/19 04/06/20
1 10/15/19 10/15/19 10/15/19 10/15/19 10/15/19 10/15/19 12/31/19
1 10/15/19 10/15/19 10/15/19 10/15/19 10/15/19 10/15/19 10/15/19
Please let me know if anything is unclear!
EDIT: There are empty values in some columns, which lead to NaT, for example:
lazy
09/05/20
09/05/20
NaT
NaT
NaT
NaT
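However, instead of NaT, those fields should refer to the oldest available timestamp in the timestamp column, i.e. 10/15/19.
EDIT 2: Solved with @jezrael's solution by adding the corresponding element to the function. Thanks a lot!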
Use GroupBy.transform with GroupBy.last for each column selected by Index.difference:
import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'], format='%m/%d/%y')
# for every value column, take the last timestamp of each (project_id, value) group
for c in df.columns.difference(['project_id','timestamp']):
    df[c] = df.groupby(['project_id',c], sort=False)['timestamp'].transform('last')
print (df)
project_id project_name region style effect representative
0 1 2020-09-05 2020-04-06 2019-10-15 2020-09-05 2020-04-06
1 1 2020-09-05 2020-04-06 2019-10-15 2020-09-05 2020-04-06
2 1 2020-08-20 2020-04-06 2019-10-15 2019-10-15 2020-04-06
3 1 2019-10-15 2020-04-06 2019-10-15 2019-10-15 2020-04-06
4 1 2019-10-15 2019-10-15 2019-10-15 2019-10-15 2019-10-15
5 1 2019-10-15 2019-10-15 2019-10-15 2019-10-15 2019-10-15
timestamp
0 2020-10-01
1 2020-09-05
2 2020-08-20
3 2020-04-06
4 2019-12-31
5 2019-10-15
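One thing worth noting: `'last'` picks the last row of each group in the current row order, so it only yields the oldest timestamp because the snapshots are listed newest first. If that order is not guaranteed, sorting first should help. A minimal sketch on made-up, deliberately unordered rows (the small frame and the name `oldest_seen` are only for illustration):

```python
import pandas as pd

# Illustrative rows, deliberately not sorted by timestamp.
df = pd.DataFrame({
    "project_id": [1, 1, 1],
    "region": ["US", "EU", "EU"],
    "timestamp": pd.to_datetime(["10/15/19", "10/01/20", "04/06/20"],
                                format="%m/%d/%y"),
})

# Sort newest first so the last row of every group is its oldest snapshot.
df = df.sort_values(["project_id", "timestamp"], ascending=[True, False])

oldest_seen = (df.groupby(["project_id", "region"], sort=False)["timestamp"]
                 .transform("last"))
print(oldest_seen)
```

The result keeps the original index labels, so it can be assigned back to a column of the sorted frame.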
If you need the original format, add Series.dt.strftime:
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%m/%d/%y')
for c in df.columns.difference(['project_id','timestamp']):
    # convert the result back to MM/DD/YY strings
    df[c] = (df.groupby(['project_id',c], sort=False)['timestamp'].transform('last')
               .dt.strftime('%m/%d/%y'))
print (df)
project_id project_name region style effect representative
0 1 09/05/20 04/06/20 10/15/19 09/05/20 04/06/20
1 1 09/05/20 04/06/20 10/15/19 09/05/20 04/06/20
2 1 08/20/20 04/06/20 10/15/19 10/15/19 04/06/20
3 1 10/15/19 04/06/20 10/15/19 10/15/19 04/06/20
4 1 10/15/19 10/15/19 10/15/19 10/15/19 10/15/19
5 1 10/15/19 10/15/19 10/15/19 10/15/19 10/15/19
timestamp
0 2020-10-01
1 2020-09-05
2 2020-08-20
3 2020-04-06
4 2019-12-31
5 2019-10-15
EDIT: Add fillna with the minimal timestamp:
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%m/%d/%y')
# oldest timestamp in the whole frame, used to replace NaT from missing values
min1 = df['timestamp'].min()
for c in df.columns.difference(['project_id','timestamp']):
    df[c] = df.groupby(['project_id',c], sort=False)['timestamp'].transform('last').fillna(min1)
print (df)
project_id project_name region style effect representative
0 1 2020-09-05 2020-04-06 2019-10-15 2020-09-05 2020-04-06
1 1 2020-09-05 2020-04-06 2019-10-15 2020-09-05 2020-04-06
2 1 2020-08-20 2020-04-06 2019-10-15 2019-10-15 2020-04-06
3 1 2019-10-15 2020-04-06 2019-10-15 2019-10-15 2020-04-06
4 1 2019-10-15 2019-10-15 2019-10-15 2019-10-15 2019-10-15
5 1 2019-10-15 2019-10-15 2019-10-15 2019-10-15 2019-10-15
lazy timestamp
0 2020-09-05 2020-10-01
1 2020-09-05 2020-09-05
2 2019-10-15 2020-08-20
3 2019-10-15 2020-04-06
4 2019-10-15 2019-12-31
5 2019-10-15 2019-10-15
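With many projects in one table, the global minimum `min1` may belong to a different project. A possible variant, sketched here under the assumption that the fallback should be the oldest timestamp of the same key, wraps the loop in a function. The name `last_change` and its parameters are hypothetical, not part of pandas:

```python
import pandas as pd

def last_change(df, key="project_id", ts="timestamp", fmt="%m/%d/%y"):
    """Hypothetical helper: per cell, the oldest timestamp at which the
    cell's value occurs for that key; missing cells fall back to the
    oldest timestamp of the same key."""
    out = df.copy()
    out[ts] = pd.to_datetime(out[ts], format=fmt)
    # Newest first, so 'last' within a group is the oldest snapshot.
    out = out.sort_values([key, ts], ascending=[True, False])
    # Per-key fallback instead of one global minimum.
    oldest = out.groupby(key)[ts].transform("min")
    for c in out.columns.difference([key, ts]):
        out[c] = (out.groupby([key, c], sort=False)[ts]
                     .transform("last")
                     .fillna(oldest))
    return out

# Usage on a reduced version of the sample data (two value columns only).
sample = pd.DataFrame({
    "project_id": [1] * 6,
    "region": ["EU", "EU", "EU", "EU", "US", "US"],
    "lazy": ["DCA", "DCA", None, None, None, None],
    "timestamp": ["10/01/20", "09/05/20", "08/20/20",
                  "04/06/20", "12/31/19", "10/15/19"],
})
result = last_change(sample)
print(result)
```

For a single project this should give the same values as the `fillna(min1)` version above; the difference only shows up once several project_id values share one table.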