Efficient data sampling with sparse timestamps over a predefined date range

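Consider a DataFrame with sparse temporal data. The timestamps can be very old (e.g. several years ago) or very recent.

As an example, take the following DataFrame: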

                    tstamp     item_id   budget
2016-07-01 14:56:51.882649  0Szr8SuNbY  5000.00
2016-07-20 14:57:23.856878  0Szr8SuNbY  5700.00
2016-07-17 16:32:27.838435  0Lzu1xOM87   303.51
2016-07-30 21:50:03.655102  0Lzu1xOM87    94.79
2016-08-01 14:56:31.081140  0HzuoujTsN   100.00
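
For anyone who wants to reproduce this, here is a minimal sketch that builds the sample DataFrame above. It assumes tstamp should be a real datetime column rather than strings, which the reindex-based approaches rely on.

```python
import pandas as pd

# Sample data copied from the table above; tstamp is parsed to datetime.
df = pd.DataFrame(
    {
        "tstamp": pd.to_datetime(
            [
                "2016-07-01 14:56:51.882649",
                "2016-07-20 14:57:23.856878",
                "2016-07-17 16:32:27.838435",
                "2016-07-30 21:50:03.655102",
                "2016-08-01 14:56:31.081140",
            ]
        ),
        "item_id": [
            "0Szr8SuNbY",
            "0Szr8SuNbY",
            "0Lzu1xOM87",
            "0Lzu1xOM87",
            "0HzuoujTsN",
        ],
        "budget": [5000.00, 5700.00, 303.51, 94.79, 100.00],
    }
)
print(df)
```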

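Suppose we need to resample this DataFrame for each item_id so that, using forward fill, we obtain a dense DataFrame with one data point per day over a predefined date range.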

In other words, if I resample over the time interval

pd.date_range(date(2016,7,15), date(2016,7,31))

I should get:

        date     item_id   budget
  2016-07-15  0Szr8SuNbY  5000.00
  2016-07-16  0Szr8SuNbY  5000.00
  2016-07-17  0Szr8SuNbY  5000.00
  ...
  2016-07-31  0Szr8SuNbY  5700.00
  2016-07-15  0Lzu1xOM87      NaN
  2016-07-16  0Lzu1xOM87      NaN
  2016-07-17  0Lzu1xOM87   303.51
  ...
  2016-07-31  0Lzu1xOM87    94.79
  2016-07-15  0HzuoujTsN      NaN
  2016-07-16  0HzuoujTsN      NaN
  2016-07-17  0HzuoujTsN      NaN
  ...
  2016-07-31  0HzuoujTsN      NaN

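Note that the original DataFrame contains sparse timestamps and a potentially very large number of unique item_ids. In other words, I am looking for a computationally efficient way to resample this data at daily frequency over a predefined time period.

What is the best we can do in pandas, NumPy, or plain Python?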

You can do a groupby on 'item_id' and call reindex on each group:

import pandas as pd

# Define the new time interval.
new_dates = pd.date_range('2016-07-15', '2016-07-31', name='date')
# Set the current timestamp as the index and sort it, because reindex with a
# fill method requires a monotonic index.
df = df.set_index('tstamp').sort_index()
# For each item, reindex its budget onto the new dates with forward fill.
df = df.groupby('item_id').apply(lambda grp: grp['budget'].reindex(new_dates, method='ffill').to_frame())
# Reset the index to move 'item_id' and 'date' back into columns.
df = df.reset_index()
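
To check the behaviour on the sample data, here is a self-contained sketch of the same groupby-and-reindex idea. The function name resample_daily_groupby is made up for illustration, and the per-group results are combined with an explicit pd.concat instead of apply, which should be equivalent. Note that each new date is treated as midnight, so an observation made later on a given day first shows up on the following day.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "tstamp": pd.to_datetime(
            [
                "2016-07-01 14:56:51.882649",
                "2016-07-20 14:57:23.856878",
                "2016-07-17 16:32:27.838435",
                "2016-07-30 21:50:03.655102",
                "2016-08-01 14:56:31.081140",
            ]
        ),
        "item_id": ["0Szr8SuNbY", "0Szr8SuNbY", "0Lzu1xOM87", "0Lzu1xOM87", "0HzuoujTsN"],
        "budget": [5000.00, 5700.00, 303.51, 94.79, 100.00],
    }
)


def resample_daily_groupby(frame, start, end):
    """Hypothetical helper: one forward-filled row per item per day."""
    new_dates = pd.date_range(start, end, freq="D", name="date")
    # reindex with method='ffill' needs a monotonic index, so sort by time first.
    indexed = frame.set_index("tstamp").sort_index()
    # Reindex each item's budget series onto the daily grid.
    pieces = {
        item_id: grp["budget"].reindex(new_dates, method="ffill")
        for item_id, grp in indexed.groupby("item_id")
    }
    # Stack the per-item series into one long frame.
    stacked = pd.concat(pieces, names=["item_id", "date"]).rename("budget")
    return stacked.reset_index()


out = resample_daily_groupby(df, "2016-07-15", "2016-07-31")
print(out.head())
```

Items whose first observation falls after a given date (such as 0HzuoujTsN here) come out as NaN for that date, since there is nothing earlier to fill from.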

Another option is to pivot, reindex, and unstack:

# Define the new time interval.
new_dates = pd.date_range('2016-07-15', '2016-07-31', name='date')
# Pivot to have 'item_id' columns with 'budget' values.
df = df.pivot(index='tstamp', columns='item_id', values='budget').ffill()
# Reindex with the new dates.
df = df.reindex(new_dates, method='ffill')
# Unstack and reset the index to return to the original format.
df = df.unstack().reset_index().rename(columns={0:'budget'})
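
One thing to watch: the expected output in the question shows 303.51 for 0Lzu1xOM87 on 2016-07-17, even though that observation was recorded at 16:32 that day. Reindexing onto midnight timestamps would give NaN for that row. If same-day values are what you want, one possible adjustment (an assumption on my part, not something the question states) is to floor the timestamps to the day before pivoting and keep the last observation per item per day. The helper name resample_daily_pivot is hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "tstamp": pd.to_datetime(
            [
                "2016-07-01 14:56:51.882649",
                "2016-07-20 14:57:23.856878",
                "2016-07-17 16:32:27.838435",
                "2016-07-30 21:50:03.655102",
                "2016-08-01 14:56:31.081140",
            ]
        ),
        "item_id": ["0Szr8SuNbY", "0Szr8SuNbY", "0Lzu1xOM87", "0Lzu1xOM87", "0HzuoujTsN"],
        "budget": [5000.00, 5700.00, 303.51, 94.79, 100.00],
    }
)


def resample_daily_pivot(frame, start, end):
    """Hypothetical helper: pivot-based daily resampling with same-day alignment."""
    new_dates = pd.date_range(start, end, freq="D", name="date")
    # Floor each timestamp to its calendar day (assumed alignment rule).
    daily = frame.assign(date=frame["tstamp"].dt.normalize())
    # Keep only the latest observation per item per day so pivot sees no duplicates.
    daily = daily.sort_values("tstamp").drop_duplicates(["item_id", "date"], keep="last")
    # One column per item; forward-fill so every row carries the last known value.
    wide = daily.pivot(index="date", columns="item_id", values="budget").ffill()
    # Align to the requested daily grid.
    wide = wide.reindex(new_dates, method="ffill")
    # Back to long format: item_id, date, budget.
    return wide.unstack().rename("budget").reset_index()


out = resample_daily_pivot(df, "2016-07-15", "2016-07-31")
print(out[out["item_id"] == "0Lzu1xOM87"].head())
```

Keep in mind that the pivoted frame has one column per item, so with a very large number of item_ids it can get wide; whether this or the groupby version is faster probably depends on your data, so it is worth timing both.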
