How to tell whether a record in a pandas DataFrame is a modification or a new entry



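I have a DataFrame that can receive new rows, and for each incoming row I need to know whether it modifies an existing record or is instead a brand-new record.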

For example, the input DataFrame:

A   B   Population  Start                End                  timestamp
A1  B1  100         2021-05-15 00:00:00  2021-06-30 00:00:00  2021-07-06 00:00:00
A1  B1  250         2021-05-30 00:00:00  2021-06-02 00:00:00  2021-06-06 00:00:00
A2  B3  350         2021-05-10 00:00:00  2021-05-12 00:00:00  2021-07-06 00:00:00
A2  B4  125         2021-06-02 00:00:00  2021-06-04 00:00:00  2021-07-06 00:00:00
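
To follow along with the examples in the answer, the input can be rebuilt roughly as follows. This is a sketch: the column names are taken from the outputs shown in the answer, and keeping the dates as plain strings is an assumption (the NaN placeholders in those outputs suggest strings rather than datetimes).

```python
import pandas as pd

# Dates are kept as ISO-formatted strings (an assumption). Such strings sort
# chronologically as text, so sort_values('timestamp') still orders them correctly.
df = pd.DataFrame({
    'A': ['A1', 'A1', 'A2', 'A2'],
    'B': ['B1', 'B1', 'B3', 'B4'],
    'Population': [100, 250, 350, 125],
    'Start': ['2021-05-15 00:00:00', '2021-05-30 00:00:00',
              '2021-05-10 00:00:00', '2021-06-02 00:00:00'],
    'End': ['2021-06-30 00:00:00', '2021-06-02 00:00:00',
            '2021-05-12 00:00:00', '2021-06-04 00:00:00'],
    'timestamp': ['2021-07-06 00:00:00', '2021-06-06 00:00:00',
                  '2021-07-06 00:00:00', '2021-07-06 00:00:00'],
})
```

Rows 0 and 1 share the key (A1, B1), so the row with the later timestamp is the modification of the earlier one.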

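If you sort by timestamp and group by the columns that define a unique row, you can get all the information you need. Use last to get the last row of each group and nth(-2) to get the second-to-last row: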

>>> groups = df.sort_values('timestamp').groupby(['A', 'B'])
>>> groups.last()
       Population                Start                  End            timestamp
A  B
A1 B1         100  2021-05-15 00:00:00  2021-06-30 00:00:00  2021-07-06 00:00:00
A2 B3         350  2021-05-10 00:00:00  2021-05-12 00:00:00  2021-07-06 00:00:00
   B4         125  2021-06-02 00:00:00  2021-06-04 00:00:00  2021-07-06 00:00:00
>>> groups.nth(-2)
       Population                Start                  End            timestamp
A  B
A1 B1         250  2021-05-30 00:00:00  2021-06-02 00:00:00  2021-06-06 00:00:00

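Both of these DataFrames are indexed on columns A and B (this is how nth behaves in pandas versions before 2.0), so you can simply join them with a suffix, reset the index, and you're done: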

>>> mod = groups.last().join(groups.nth(-2), rsuffix='_prev').reset_index()
>>> mod
    A   B  Population                Start                  End            timestamp  Population_prev           Start_prev             End_prev       timestamp_prev
0  A1  B1         100  2021-05-15 00:00:00  2021-06-30 00:00:00  2021-07-06 00:00:00            250.0  2021-05-30 00:00:00  2021-06-02 00:00:00  2021-06-06 00:00:00
1  A2  B3         350  2021-05-10 00:00:00  2021-05-12 00:00:00  2021-07-06 00:00:00              NaN                  NaN                  NaN                  NaN
2  A2  B4         125  2021-06-02 00:00:00  2021-06-04 00:00:00  2021-07-06 00:00:00              NaN                  NaN                  NaN                  NaN
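
Since pandas 2.0, nth acts as a filter and keeps the original row index instead of the group keys, so the join above may not line up on newer versions. One possible variant that does not depend on the index returned by nth is sketched below; it uses groupby shift to pull the previous row's values alongside each row and then keeps only the newest row per key. The names keys, value_cols, ordered and prev are illustrative, and the sample frame is rebuilt under the assumption that dates are stored as strings.

```python
import pandas as pd

# Sample input rebuilt for a self-contained example (dates as strings is an assumption).
df = pd.DataFrame({
    'A': ['A1', 'A1', 'A2', 'A2'],
    'B': ['B1', 'B1', 'B3', 'B4'],
    'Population': [100, 250, 350, 125],
    'Start': ['2021-05-15 00:00:00', '2021-05-30 00:00:00',
              '2021-05-10 00:00:00', '2021-06-02 00:00:00'],
    'End': ['2021-06-30 00:00:00', '2021-06-02 00:00:00',
            '2021-05-12 00:00:00', '2021-06-04 00:00:00'],
    'timestamp': ['2021-07-06 00:00:00', '2021-06-06 00:00:00',
                  '2021-07-06 00:00:00', '2021-07-06 00:00:00'],
})

keys = ['A', 'B']
value_cols = ['Population', 'Start', 'End', 'timestamp']

ordered = df.sort_values('timestamp')

# For every row, the values of the previous row within the same (A, B) group;
# missing where the row is the first one seen for its key.
prev = ordered.groupby(keys)[value_cols].shift(1).add_suffix('_prev')

# Keep only the newest row per key, with the previous values next to it.
mod = (
    ordered.join(prev)
    .drop_duplicates(keys, keep='last')
    .sort_values(keys)
    .reset_index(drop=True)
)
```

The resulting mod has the same columns as the joined frame above, so the remaining steps apply unchanged. This assumes timestamps are unique within each (A, B) group.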

Then a few finishing touches to make it look like your expected output:

>>> col_order = [
...     *df.columns[:2],
...     *(new_col for col in df.columns[2:-1] for new_col in [col, f'{col}_prev']),
...     'type', 'timestamp'
... ]
>>> row_type = mod['timestamp_prev'].isna().map({True: 'New', False: 'Mod'})
>>> mod.join(row_type.rename('type')).reindex(col_order, axis='columns')
    A   B  Population  Population_prev                Start           Start_prev                  End             End_prev  type            timestamp
0  A1  B1         100            250.0  2021-05-15 00:00:00  2021-05-30 00:00:00  2021-06-30 00:00:00  2021-06-02 00:00:00   Mod  2021-07-06 00:00:00
1  A2  B3         350              NaN  2021-05-10 00:00:00                  NaN  2021-05-12 00:00:00                  NaN   New  2021-07-06 00:00:00
2  A2  B4         125              NaN  2021-06-02 00:00:00                  NaN  2021-06-04 00:00:00                  NaN   New  2021-07-06 00:00:00

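Another technique, which works for any number of repeated values, is to use pivot. Let's use the same groupby, but with cumcount() to define the order of the columns: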

>>> num = df.sort_values('timestamp').groupby(['A', 'B']).cumcount().rename('num')
>>> num
1    0
0    1
2    0
3    0
Name: num, dtype: int64
>>> pvt = df.join(num).pivot(index=['A', 'B'], columns='num', values=['Population', 'Start', 'End'])
>>> pvt
      Population                     Start                                       End
num            0    1                    0                    1                    0                    1
A  B
A1 B1        250  100  2021-05-30 00:00:00  2021-05-15 00:00:00  2021-06-02 00:00:00  2021-06-30 00:00:00
A2 B3        350  NaN  2021-05-10 00:00:00                  NaN  2021-05-12 00:00:00                  NaN
   B4        125  NaN  2021-06-02 00:00:00                  NaN  2021-06-04 00:00:00                  NaN

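As you can see, this gives the layout you want, but with a MultiIndex on the columns. Note that num counts from the oldest row, so column 0 holds the earliest record of each group. Let's flatten it into ordinary columns and we're done: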

>>> pvt.columns = [f'{col}_prev{n if n > 1 else ""}' if n > 0 else col for col, n in pvt.columns]
>>> pvt.reset_index()
    A   B Population Population_prev                Start           Start_prev                  End             End_prev
0  A1  B1        250             100  2021-05-30 00:00:00  2021-05-15 00:00:00  2021-06-02 00:00:00  2021-06-30 00:00:00
1  A2  B3        350             NaN  2021-05-10 00:00:00                  NaN  2021-05-12 00:00:00                  NaN
2  A2  B4        125             NaN  2021-06-02 00:00:00                  NaN  2021-06-04 00:00:00                  NaN
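
In the output above, the unsuffixed columns hold the oldest record of each group (Population 250 for A1/B1) and the _prev columns hold the newer one, which is the reverse of the join-based result. If the newest record should be the unsuffixed one, a possible adjustment (a sketch, not part of the original answer) is to number the rows from the end of each group with cumcount(ascending=False). The sample frame is rebuilt here with dates as strings, which is an assumption.

```python
import pandas as pd

# Sample input rebuilt for a self-contained example (dates as strings is an assumption).
df = pd.DataFrame({
    'A': ['A1', 'A1', 'A2', 'A2'],
    'B': ['B1', 'B1', 'B3', 'B4'],
    'Population': [100, 250, 350, 125],
    'Start': ['2021-05-15 00:00:00', '2021-05-30 00:00:00',
              '2021-05-10 00:00:00', '2021-06-02 00:00:00'],
    'End': ['2021-06-30 00:00:00', '2021-06-02 00:00:00',
            '2021-05-12 00:00:00', '2021-06-04 00:00:00'],
    'timestamp': ['2021-07-06 00:00:00', '2021-06-06 00:00:00',
                  '2021-07-06 00:00:00', '2021-07-06 00:00:00'],
})

keys = ['A', 'B']

# ascending=False numbers rows from the end of each group,
# so the newest row gets 0, the one before it 1, and so on.
num = (
    df.sort_values('timestamp')
    .groupby(keys)
    .cumcount(ascending=False)
    .rename('num')
)

pvt = df.join(num).pivot(index=keys, columns='num',
                         values=['Population', 'Start', 'End'])

# Flatten the (column, num) MultiIndex: 0 -> col, 1 -> col_prev, 2 -> col_prev2, ...
pvt.columns = [
    col if n == 0 else f'{col}_prev{n if n > 1 else ""}'
    for col, n in pvt.columns
]
result = pvt.reset_index()
```

With this numbering, Population for A1/B1 is 100 and Population_prev is 250, matching the join-based result.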
