Evenly distribute repeated timestamps in a column



Given a DataFrame in which the date 2020-01-02 is repeated three times:

df_original
time
0 2020-01-02 00:00:00
1 2020-01-02 00:00:00
2 2020-01-02 00:00:00
3 2020-01-03 00:00:00

I want to convert it into the following form, where the three 2020-01-02 timestamps are evenly distributed between 2020-01-01 and 2020-01-02.

df_expected
time
0 2020-01-01 00:00:00
1 2020-01-01 12:00:00
2 2020-01-02 00:00:00
3 2020-01-03 00:00:00

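Is there an efficient way to do this for a DataFrame whose time column has about 10_000_000 rows?

First I need to find the duplicated timestamps, which I think can be done with df.duplicated(). Then I would have to loop over every block of duplicates and do something like this:

pd.date_range(timestamps[0] - pd.Timedelta(days=1), timestamps[0], periods=len(timestamps))

But I am not sure how to do this without a loop.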
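For reference, the loop described above could look like the sketch below. It is only a baseline to compare faster approaches against. The function name spread_with_loop and the delta and time_column parameters are made up for illustration and are not from the original post.

```python
import pandas as pd

def spread_with_loop(df, delta=pd.Timedelta(days=1), time_column="time"):
    """Loop baseline: rebuild each block of duplicated timestamps with date_range."""
    out = df.copy()
    # Mark every row whose timestamp occurs more than once.
    dup = out[time_column].duplicated(keep=False)
    # Iterate only over the groups that actually contain duplicates.
    for ts, block in out[dup].groupby(time_column):
        # Spread the block evenly over [ts - delta, ts], ending on the original value.
        out.loc[block.index, time_column] = pd.date_range(
            ts - delta, ts, periods=len(block)
        ).to_numpy()
    return out

df = pd.DataFrame({"time": pd.to_datetime(["2020-01-02"] * 3 + ["2020-01-03"])})
result = spread_with_loop(df)
print(result)
```

One Python-level iteration per duplicated timestamp is likely to be slow when there are millions of distinct duplicated values, which is why a vectorized alternative is being asked for.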

Edit: adding an additional DataFrame with the expected output.

df_original
time
0  2020-01-01 00:00:00
1  2020-01-02 00:00:00
2  2020-01-02 00:00:00
3  2020-01-03 00:00:00
4  2020-01-03 00:00:00
5  2020-01-03 00:00:00
6  2020-01-04 00:00:00
7  2020-01-04 00:00:00
8  2020-01-04 00:00:00
9  2020-01-04 00:00:00
10 2020-01-05 00:00:00

df_expected
time
0  2020-01-01 00:00:00
1  2020-01-01 00:00:00
2  2020-01-02 00:00:00
3  2020-01-02 00:00:00
4  2020-01-02 12:00:00
5  2020-01-03 00:00:00
6  2020-01-03 00:00:00
7  2020-01-03 08:00:00
8  2020-01-03 16:00:00
9  2020-01-04 00:00:00
10 2020-01-05 00:00:00 

Note that in these examples I use Timedelta(days=1) as the time span over which the duplicates are spread, but this span could be any Timedelta.

Try groupby with cumcount(ascending=False), then multiply the result by 12 hours:

df['time'] = pd.to_datetime(df['time'])
df['time'] = df['time'] - df.groupby('time').cumcount(ascending=False).mul(pd.Timedelta(hours=12))
print(df)

Output:

time
0 2020-01-01 00:00:00
1 2020-01-01 12:00:00
2 2020-01-02 00:00:00
3 2020-01-03 00:00:00
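The fixed 12-hour step matches the example only because the duplicated timestamp occurs exactly three times. A possible generalization, sketched below, derives the step per group as delta / (group size - 1). The function name spread_vectorized and its parameters are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

def spread_vectorized(df, delta=pd.Timedelta(days=1), time_column="time"):
    """Spread duplicates backwards over `delta` without a Python-level loop."""
    out = df.copy()
    groups = out.groupby(time_column)
    # Number of rows sharing each timestamp, aligned with the original rows.
    size = groups[time_column].transform("size")
    # Position counted from the end of each group: the last row gets 0.
    rank_from_end = groups.cumcount(ascending=False)
    # Step between neighbours; singleton groups get divisor 1 and rank 0, so no shift.
    step = delta / (size - 1).clip(lower=1)
    out[time_column] = out[time_column] - rank_from_end * step
    return out

dates = (["2020-01-01"] + ["2020-01-02"] * 2 + ["2020-01-03"] * 3
         + ["2020-01-04"] * 4 + ["2020-01-05"])
df = pd.DataFrame({"time": pd.to_datetime(dates)})
result = spread_vectorized(df)
print(result)
```

On the extended example from the question's edit, this should reproduce df_expected: the pair is spaced one day apart, the triple 12 hours apart, and the group of four 8 hours apart.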

Use a custom function with GroupBy.transform, applied only to the duplicated values of the time column:

df['time'] = pd.to_datetime(df['time'])
m = df['time'].duplicated(keep=False)
f = lambda x: pd.date_range(x.iat[0] - pd.Timedelta(days=1), x.iat[0], periods=len(x))
df.loc[m, 'time'] = df[m].groupby('time')['time'].transform(f)
print(df)
time
0 2020-01-01 00:00:00
1 2020-01-01 12:00:00
2 2020-01-02 00:00:00
3 2020-01-03 00:00:00

For 4 values:

df['time'] = pd.to_datetime(df['time'])
m = df['time'].duplicated(keep=False)
f = lambda x: pd.date_range(x.iat[0] - pd.Timedelta(days=1), x.iat[0], periods=len(x))
df.loc[m, 'time'] = df[m].groupby('time')['time'].transform(f)
print(df)
time
0 2020-01-01 00:00:00
1 2020-01-01 08:00:00
2 2020-01-01 16:00:00
3 2020-01-02 00:00:00
4 2020-01-03 00:00:00

You can use value_counts and date_range:

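# Column names 'index' and 'time' follow pandas < 2.0; from pandas 2.0 on,
# value_counts().reset_index() names them 'time' and 'count' instead.
gen_dates = lambda x: pd.date_range(x['index'] - pd.Timedelta(days=1),
                                    x['index'],
                                    periods=x['time'])
df['time'] = (df['time'].value_counts().reset_index()
                        .apply(gen_dates, axis=1)
                        .explode().reset_index(drop=True))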
>>> df
time
0 2020-01-01 00:00:00
1 2020-01-01 12:00:00
2 2020-01-02 00:00:00
3 2020-01-02 00:00:00

For 4 values:

>>> df
time
0 2020-01-01 00:00:00
1 2020-01-01 08:00:00
2 2020-01-01 16:00:00
3 2020-01-02 00:00:00
4 2020-01-02 00:00:00
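In both outputs above, the non-duplicated 2020-01-03 row ends up as 2020-01-02, because date_range with periods=1 returns only the start of the range. The sketch below is one way to keep the counting idea while leaving unique timestamps untouched. It assumes the time column is already sorted, and the name spread_by_counts is made up for illustration. It still loops over the unique timestamps, so it is not a fully vectorized solution.

```python
import pandas as pd

def spread_by_counts(times, delta=pd.Timedelta(days=1)):
    """Rebuild a sorted datetime Series from per-timestamp counts."""
    # sort_index() orders the counts by timestamp rather than by frequency,
    # so the rebuilt values line up with a sorted input column.
    counts = times.value_counts().sort_index()
    values = []
    for ts, n in counts.items():
        if n > 1:
            # Duplicates: n evenly spaced points ending on the original value.
            values.extend(pd.date_range(ts - delta, ts, periods=int(n)))
        else:
            # Unique timestamps are kept exactly as they are.
            values.append(ts)
    return pd.Series(values, index=times.index, name=times.name)

df = pd.DataFrame({"time": pd.to_datetime(["2020-01-02"] * 3 + ["2020-01-03"])})
df["time"] = spread_by_counts(df["time"])
print(df)
```

Because it iterates over the counts with items(), the sketch does not depend on the column names produced by reset_index(), which differ between pandas versions.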

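I modified @U12-Forward's answer and wrapped it in a function that handles all of my cases. @jezrael's answer is more direct and does solve the problem, but unfortunately it was too slow for my DataFrame of about 20_000_000 rows.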

import pandas as pd
from pandas import DataFrame, Timedelta, Timestamp

def evenly_distribute_timestamps(df: DataFrame,
                                 spread_timedelta: Timedelta = Timedelta(seconds=1),
                                 ascending=True, time_column='time', inplace=False):
    if not inplace:
        df = df.copy()
    # Rows whose timestamp occurs more than once
    duplicates = df[time_column].duplicated(keep=False)
    # Step per timestamp: the spread divided by (group size - 1)
    spread_factors = spread_timedelta / (df.loc[duplicates, time_column].value_counts() - 1)
    # Position of each row within its group of equal timestamps
    df['cumcount'] = df.groupby(time_column).cumcount(ascending=ascending)
    cumcount = df[duplicates].set_index(time_column)['cumcount']
    df.drop('cumcount', inplace=True, axis=1)
    # Offset per duplicated row, aligned on the timestamp index
    timedelta = (cumcount * spread_factors).reset_index(drop=True)
    if ascending:
        df.loc[duplicates, time_column] = df.loc[duplicates, time_column] + timedelta.values
    else:
        df.loc[duplicates, time_column] = df.loc[duplicates, time_column] - timedelta.values
    return df
>>> df = pd.DataFrame({'time': [Timestamp('2020-01-01 00:00:00'),
...                             Timestamp('2020-01-02 00:00:00'),
...                             Timestamp('2020-01-02 00:00:00'),
...                             Timestamp('2020-01-03 00:00:00'),
...                             Timestamp('2020-01-03 00:00:00'),
...                             Timestamp('2020-01-03 00:00:00'),
...                             Timestamp('2020-01-04 00:00:00'),
...                             Timestamp('2020-01-04 00:00:00'),
...                             Timestamp('2020-01-04 00:00:00'),
...                             Timestamp('2020-01-04 00:00:00'),
...                             Timestamp('2020-01-05 00:00:00')]})
>>> df
time
0  2020-01-01
1  2020-01-02
2  2020-01-02
3  2020-01-03
4  2020-01-03
5  2020-01-03
6  2020-01-04
7  2020-01-04
8  2020-01-04
9  2020-01-04
10 2020-01-05
>>> evenly_distribute_timestamps(df, Timedelta(days=1), ascending=True)     
time
0  2020-01-01 00:00:00
1  2020-01-02 00:00:00
2  2020-01-03 00:00:00
3  2020-01-03 00:00:00
4  2020-01-03 12:00:00
5  2020-01-04 00:00:00
6  2020-01-04 00:00:00
7  2020-01-04 08:00:00
8  2020-01-04 16:00:00
9  2020-01-05 00:00:00
10 2020-01-05 00:00:00
>>> evenly_distribute_timestamps(df, Timedelta(days=1), ascending=False)
time
0  2020-01-01 00:00:00
1  2020-01-01 00:00:00
2  2020-01-02 00:00:00
3  2020-01-02 00:00:00
4  2020-01-02 12:00:00
5  2020-01-03 00:00:00
6  2020-01-03 00:00:00
7  2020-01-03 08:00:00
8  2020-01-03 16:00:00
9  2020-01-04 00:00:00
10 2020-01-05 00:00:00
