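I have a dataframe that can receive new rows, but I need to know whether a new row coming into the dataframe is a modification of an existing record or, on the contrary, a new record.
For example, the input dataframe: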
A | B | Population | Start | End | timestamp
---|---|---|---|---|---
A1 | B1 | 100 | 2021-05-15 00:00:00 | 2021-06-30 00:00:00 | 2021-07-06 00:00:00
A1 | B1 | 250 | 2021-05-30 00:00:00 | 2021-06-02 00:00:00 | 2021-06-06 00:00:00
A2 | B3 | 350 | 2021-05-10 00:00:00 | 2021-05-12 00:00:00 | 2021-07-06 00:00:00
A2 | B4 | 125 | 2021-06-02 00:00:00 | 2021-06-04 00:00:00 | 2021-07-06 00:00:00
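To follow along, the sample input can be rebuilt roughly like this. The column names are taken from the outputs shown further down, and keeping the timestamps as plain strings is an assumption on my part (it matches how they print in this post).

```python
import pandas as pd

# Sample input reconstructed from the table above.
# The timestamps are kept as ISO-formatted strings, which sort chronologically;
# pd.to_datetime could be applied to Start, End and timestamp if real
# datetimes are preferred.
df = pd.DataFrame(
    [
        ('A1', 'B1', 100, '2021-05-15 00:00:00', '2021-06-30 00:00:00', '2021-07-06 00:00:00'),
        ('A1', 'B1', 250, '2021-05-30 00:00:00', '2021-06-02 00:00:00', '2021-06-06 00:00:00'),
        ('A2', 'B3', 350, '2021-05-10 00:00:00', '2021-05-12 00:00:00', '2021-07-06 00:00:00'),
        ('A2', 'B4', 125, '2021-06-02 00:00:00', '2021-06-04 00:00:00', '2021-07-06 00:00:00'),
    ],
    columns=['A', 'B', 'Population', 'Start', 'End', 'timestamp'],
)
```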
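So if you sort by timestamp and use `groupby` on the columns that define a unique row, you can get all the information you need. Use `last` to get the last row of each group and `nth` to get the second-to-last row: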
>>> groups = df.sort_values('timestamp').groupby(['A', 'B'])
>>> groups.last()
Population Start End timestamp
A B
A1 B1 100 2021-05-15 00:00:00 2021-06-30 00:00:00 2021-07-06 00:00:00
A2 B3 350 2021-05-10 00:00:00 2021-05-12 00:00:00 2021-07-06 00:00:00
B4 125 2021-06-02 00:00:00 2021-06-04 00:00:00 2021-07-06 00:00:00
>>> groups.nth(-2)
A1 B1 250 2021-05-30 00:00:00 2021-06-02 00:00:00 2021-06-06 00:00:00
Now all of these dataframes are indexed on columns `A` and `B`, so you can simply `join` them with a suffix, reset the index, and you're done:
>>> mod = groups.last().join(groups.nth(-2), rsuffix='_prev').reset_index()
>>> mod
A B Population Start End timestamp Population_prev Start_prev End_prev timestamp_prev
0 A1 B1 100 2021-05-15 00:00:00 2021-06-30 00:00:00 2021-07-06 00:00:00 250.0 2021-05-30 00:00:00 2021-06-02 00:00:00 2021-06-06 00:00:00
1 A2 B3 350 2021-05-10 00:00:00 2021-05-12 00:00:00 2021-07-06 00:00:00 NaN NaN NaN NaN
2 A2 B4 125 2021-06-02 00:00:00 2021-06-04 00:00:00 2021-07-06 00:00:00 NaN NaN NaN NaN
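One caveat: as far as I know, since pandas 2.0 `groupby(...).nth()` behaves as a filter and keeps the original row index instead of indexing the result by the group keys, so the `join` above may not line up on newer versions. Here is a minimal sketch that sidesteps `nth` by using `groupby(...).shift(1)` to pull the previous row's values alongside each row. The names `keys`, `value_cols`, `ordered` and `prev` are mine, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ('A1', 'B1', 100, '2021-05-15 00:00:00', '2021-06-30 00:00:00', '2021-07-06 00:00:00'),
        ('A1', 'B1', 250, '2021-05-30 00:00:00', '2021-06-02 00:00:00', '2021-06-06 00:00:00'),
        ('A2', 'B3', 350, '2021-05-10 00:00:00', '2021-05-12 00:00:00', '2021-07-06 00:00:00'),
        ('A2', 'B4', 125, '2021-06-02 00:00:00', '2021-06-04 00:00:00', '2021-07-06 00:00:00'),
    ],
    columns=['A', 'B', 'Population', 'Start', 'End', 'timestamp'],
)

keys = ['A', 'B']                                         # columns that identify a record
value_cols = ['Population', 'Start', 'End', 'timestamp']  # columns to compare

# Oldest first within each group; a stable sort keeps ties in input order.
ordered = df.sort_values('timestamp', kind='stable')

# For every row, the values of the previous row in the same group
# (NaN when the row is the first one for its key).
prev = ordered.groupby(keys)[value_cols].shift(1).add_suffix('_prev')

# Keep only the most recent row of each group, with its predecessor attached.
mod = (
    ordered.join(prev)
    .groupby(keys)
    .tail(1)
    .sort_values(keys)
    .reset_index(drop=True)
)
print(mod)
```

The result has the same columns as the `mod` frame above, so the remaining steps should apply unchanged.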
Then a few finishing touches to make it look like what you have:
>>> col_order = [
... *df.columns[:2],
... *(new_col for col in df.columns[2:-1] for new_col in [col, f'{col}_prev']),
... 'type', 'timestamp'
... ]
>>> row_type = mod['timestamp_prev'].isna().map({True: 'New', False: 'Mod'})
>>> mod.join(row_type.rename('type')).reindex(col_order, axis='columns')
A B Population Population_prev Start Start_prev End End_prev type timestamp
0 A1 B1 100 250.0 2021-05-15 00:00:00 2021-05-30 00:00:00 2021-06-30 00:00:00 2021-06-02 00:00:00 Mod 2021-07-06 00:00:00
1 A2 B3 350 NaN 2021-05-10 00:00:00 NaN 2021-05-12 00:00:00 NaN New 2021-07-06 00:00:00
2 A2 B4 125 NaN 2021-06-02 00:00:00 NaN 2021-06-04 00:00:00 NaN New 2021-07-06 00:00:00
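If all you need is the New/Mod label itself, a shorter route is to count the rows per key. This is only a sketch and it assumes that any key appearing more than once in the frame counts as a modification; `counts` and `row_type` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ('A1', 'B1', 100, '2021-07-06 00:00:00'),
        ('A1', 'B1', 250, '2021-06-06 00:00:00'),
        ('A2', 'B3', 350, '2021-07-06 00:00:00'),
        ('A2', 'B4', 125, '2021-07-06 00:00:00'),
    ],
    columns=['A', 'B', 'Population', 'timestamp'],
)

# Number of rows seen for each (A, B) key.
counts = df.groupby(['A', 'B']).size()

# More than one row for a key means the latest row modifies an earlier one.
row_type = counts.gt(1).map({True: 'Mod', False: 'New'}).rename('type')
result = row_type.reset_index()
print(result)
```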
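Another technique, which works with any number of repeated values, is to use `pivot`. Let's use the same groupby, but with `cumcount()` to define the order of the columns: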
>>> num = df.sort_values('timestamp').groupby(['A', 'B']).cumcount().rename('num')
>>> num
1 0
0 1
2 0
3 0
Name: num, dtype: int64
>>> pvt = df.join(num).pivot(index=['A', 'B'], columns='num', values=['Population', 'Start', 'End'])
>>> pvt
Population Start End
num 0 1 0 1 0 1
A B
A1 B1 250 100 2021-05-30 00:00:00 2021-05-15 00:00:00 2021-06-02 00:00:00 2021-06-30 00:00:00
A2 B3 350 NaN 2021-05-10 00:00:00 NaN 2021-05-12 00:00:00 NaN
B4 125 NaN 2021-06-02 00:00:00 NaN 2021-06-04 00:00:00 NaN
As you can see, this gives you what you want, but with a MultiIndex in the columns. Let's flatten it into ordinary columns and we're done:
>>> pvt.columns = [f'{col}_prev{n if n > 1 else ""}' if n > 0 else col for col, n in pvt.columns]
>>> pvt.reset_index()
A B Population Population_prev Start Start_prev End End_prev
0 A1 B1 250 100 2021-05-30 00:00:00 2021-05-15 00:00:00 2021-06-02 00:00:00 2021-06-30 00:00:00
1 A2 B3 350 NaN 2021-05-10 00:00:00 NaN 2021-05-12 00:00:00 NaN
2 A2 B4 125 NaN 2021-06-02 00:00:00 NaN 2021-06-04 00:00:00 NaN
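Note that in the output above `num` 0 is the oldest row of each group (the sort is ascending), so the unsuffixed columns carry the older values and the `_prev` columns the newer ones. If you would rather have the latest values unsuffixed, as in the first approach, one option is to number the rows from the newest with `cumcount(ascending=False)`. A sketch, using the sample data from the question and my own variable names:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ('A1', 'B1', 100, '2021-05-15 00:00:00', '2021-06-30 00:00:00', '2021-07-06 00:00:00'),
        ('A1', 'B1', 250, '2021-05-30 00:00:00', '2021-06-02 00:00:00', '2021-06-06 00:00:00'),
        ('A2', 'B3', 350, '2021-05-10 00:00:00', '2021-05-12 00:00:00', '2021-07-06 00:00:00'),
        ('A2', 'B4', 125, '2021-06-02 00:00:00', '2021-06-04 00:00:00', '2021-07-06 00:00:00'),
    ],
    columns=['A', 'B', 'Population', 'Start', 'End', 'timestamp'],
)
keys = ['A', 'B']

# Sort oldest first, then count backwards so the newest row of each group is 0.
ordered = df.sort_values('timestamp', kind='stable')
num = ordered.groupby(keys).cumcount(ascending=False).rename('num')

pvt = df.join(num).pivot(index=keys, columns='num',
                         values=['Population', 'Start', 'End'])

# Flatten the (column, num) pairs: 0 -> col, 1 -> col_prev, 2 -> col_prev2, ...
pvt.columns = [col if n == 0 else f'{col}_prev{n if n > 1 else ""}'
               for col, n in pvt.columns]
flat = pvt.reset_index()
print(flat)
```

With more than two rows per key, the older versions would land in `_prev2`, `_prev3` and so on, following the same naming rule as the original list comprehension.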