Given a DataFrame in which the date 2020-01-02 is repeated three times:
df_original
time
0 2020-01-02 00:00:00
1 2020-01-02 00:00:00
2 2020-01-02 00:00:00
3 2020-01-03 00:00:00
I want to convert it to the following form, where the three 2020-01-02 timestamps are evenly distributed between 2020-01-01 and 2020-01-02:
df_expected
time
0 2020-01-01 00:00:00
1 2020-01-01 12:00:00
2 2020-01-02 00:00:00
3 2020-01-03 00:00:00
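For a DataFrame whose time column has ~10_000_000 rows, is there an efficient way to do this?
First I have to find the duplicated timestamps, which I think can be done with df.duplicated(). Then I would have to iterate over every block of duplicates and do something like this:
pd.date_range(timestamps[0] - Timedelta(days=1), timestamps[0], periods=len(timestamps))
But I am not sure how to achieve this without a loop.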
Edit: adding an additional DataFrame with the expected output
df_original
time
0 2020-01-01 00:00:00
1 2020-01-02 00:00:00
2 2020-01-02 00:00:00
3 2020-01-03 00:00:00
4 2020-01-03 00:00:00
5 2020-01-03 00:00:00
6 2020-01-04 00:00:00
7 2020-01-04 00:00:00
8 2020-01-04 00:00:00
9 2020-01-04 00:00:00
10 2020-01-05 00:00:00
df_expected
time
0 2020-01-01 00:00:00
1 2020-01-01 00:00:00
2 2020-01-02 00:00:00
3 2020-01-02 00:00:00
4 2020-01-02 12:00:00
5 2020-01-03 00:00:00
6 2020-01-03 00:00:00
7 2020-01-03 08:00:00
8 2020-01-03 16:00:00
9 2020-01-04 00:00:00
10 2020-01-05 00:00:00
Note that in these examples I use Timedelta(days=1) as the time window over which the duplicates are spread, but this window can be any Timedelta value.
Try combining groupby with cumcount(ascending=False), then multiplying the result by 12 hours:
df['time'] = pd.to_datetime(df['time'])
df['time'] = df['time'] - df.groupby('time').cumcount(ascending=False).mul(pd.Timedelta(hours=12))
print(df)
Output:
time
0 2020-01-01 00:00:00
1 2020-01-01 12:00:00
2 2020-01-02 00:00:00
3 2020-01-03 00:00:00
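The 12-hour step above fits a group of three timestamps spread over one day (1 day / (3 - 1)). As a sketch of how the step could instead be derived from each group's size, assuming a one-day spread window (the names `window`, `size`, `rank`, and `step` are illustrative and not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({"time": pd.to_datetime(
    ["2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03"])})

window = pd.Timedelta(days=1)  # assumed spread window

groups = df.groupby("time")
size = groups["time"].transform("size")   # number of rows sharing each timestamp
rank = groups.cumcount(ascending=False)   # n-1, ..., 1, 0 within each group

# Step between neighbours; clip avoids dividing by zero for unique timestamps,
# whose rank is 0 and which therefore stay unchanged.
step = window / (size - 1).clip(lower=1)

df["time"] = df["time"] - rank * step
print(df)
```

Because every operation here works on whole columns, no Python-level loop over the groups is needed.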
Use a custom function in GroupBy.transform, applied only to the duplicated values of the time column:
df['time'] = pd.to_datetime(df['time'])
m = df['time'].duplicated(keep=False)
f = lambda x: pd.date_range(x.iat[0] - pd.Timedelta(days=1), x.iat[0], len(x))
df.loc[m, 'time'] = df[m].groupby('time')['time'].transform(f)
print (df)
time
0 2020-01-01 00:00:00
1 2020-01-01 12:00:00
2 2020-01-02 00:00:00
3 2020-01-03 00:00:00
For 4 values:
df['time'] = pd.to_datetime(df['time'])
m = df['time'].duplicated(keep=False)
f = lambda x: pd.date_range(x.iat[0] - pd.Timedelta(days=1), x.iat[0], len(x))
df.loc[m, 'time'] = df[m].groupby('time')['time'].transform(f)
print (df)
time
0 2020-01-01 00:00:00
1 2020-01-01 08:00:00
2 2020-01-01 16:00:00
3 2020-01-02 00:00:00
4 2020-01-03 00:00:00
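The mask m relies on duplicated(keep=False) marking every member of a repeated group, not only the later occurrences. A minimal illustration on the question's first example (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(
    ["2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03"]))

later_only = s.duplicated()            # default keep="first": first occurrence is not marked
all_members = s.duplicated(keep=False)  # every row of a repeated value is marked

print(later_only.tolist())   # [False, True, True, False]
print(all_members.tolist())  # [True, True, True, False]
```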
You can use value_counts and date_range:
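# The column names 'index' and 'time' are what reset_index() yields on pandas < 2.0;
# on pandas >= 2.0 they are 'time' and 'count' instead.
gen_dates = lambda x: pd.date_range(x['index'] - pd.Timedelta(days=1),
                                    x['index'],
                                    periods=x['time'])
# Parentheses let the method chain span several lines.
df['time'] = (df['time'].value_counts().reset_index()
                        .apply(gen_dates, axis=1)
                        .explode().reset_index(drop=True))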
>>> df
time
0 2020-01-01 00:00:00
1 2020-01-01 12:00:00
2 2020-01-02 00:00:00
3 2020-01-02 00:00:00
For 4 values:
>>> df
time
0 2020-01-01 00:00:00
1 2020-01-01 08:00:00
2 2020-01-01 16:00:00
3 2020-01-02 00:00:00
4 2020-01-02 00:00:00
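Note that in both outputs above the last row reads 2020-01-02 rather than the 2020-01-03 of the expected result, apparently because the unique timestamp is also passed through date_range. Below is a sketch of a variant that leaves unique timestamps untouched and does not depend on the column names produced by reset_index. It assumes the frame is already sorted by time, and `window`, `counts`, and `pieces` are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"time": pd.to_datetime(
    ["2020-01-02", "2020-01-02", "2020-01-02", "2020-01-03"])})

window = pd.Timedelta(days=1)  # assumed spread window

# One entry per distinct timestamp, in chronological order
counts = df.groupby("time").size()

# Spread repeated timestamps over the window; keep unique ones as they are
pieces = [
    pd.date_range(ts - window, ts, periods=int(n)) if n > 1 else pd.DatetimeIndex([ts])
    for ts, n in counts.items()
]

# Assumes df is sorted by "time", so the concatenated pieces line up row by row
df["time"] = np.concatenate([p.to_numpy() for p in pieces])
print(df)
```

This still loops over the distinct timestamps in Python, so it is meant as an illustration of the idea rather than as a fast path for very large frames.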
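I modified @U12-Forward's answer and wrote a function that handles all of my cases. @jezrael's answer is more direct and solves the problem, but unfortunately it was slow on my ~20_000_000-row DataFrame.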
import pandas as pd
from pandas import DataFrame, Timedelta, Timestamp

def evenly_distribute_timestamps(df: DataFrame,
                                 spread_timedelta: Timedelta = Timedelta(seconds=1),
                                 ascending=True, time_column='time', inplace=False):
    if not inplace:
        df = df.copy()
    # Rows whose timestamp occurs more than once
    duplicates = df[time_column].duplicated(keep=False)
    # Step between neighbours in each group: window / (group size - 1)
    spread_factors = spread_timedelta / (df.loc[duplicates, time_column].value_counts() - 1)
    # Position of each row within its group of equal timestamps
    df['cumcount'] = df.groupby(time_column).cumcount(ascending=ascending)
    cumcount = df[duplicates].set_index(time_column)['cumcount']
    df.drop('cumcount', inplace=True, axis=1)
    # Offset per duplicated row, matched to its step via the timestamp index
    timedelta = (cumcount * spread_factors).reset_index(drop=True)
    if ascending:
        df.loc[duplicates, time_column] = df.loc[duplicates, time_column] + timedelta.values
    else:
        df.loc[duplicates, time_column] = df.loc[duplicates, time_column] - timedelta.values
    return df
>>> df = pd.DataFrame({'time': [Timestamp('2020-01-01 00:00:00'),
...                             Timestamp('2020-01-02 00:00:00'),
...                             Timestamp('2020-01-02 00:00:00'),
...                             Timestamp('2020-01-03 00:00:00'),
...                             Timestamp('2020-01-03 00:00:00'),
...                             Timestamp('2020-01-03 00:00:00'),
...                             Timestamp('2020-01-04 00:00:00'),
...                             Timestamp('2020-01-04 00:00:00'),
...                             Timestamp('2020-01-04 00:00:00'),
...                             Timestamp('2020-01-04 00:00:00'),
...                             Timestamp('2020-01-05 00:00:00')]})
>>> df
time
0 2020-01-01
1 2020-01-02
2 2020-01-02
3 2020-01-03
4 2020-01-03
5 2020-01-03
6 2020-01-04
7 2020-01-04
8 2020-01-04
9 2020-01-04
10 2020-01-05
>>> evenly_distribute_timestamps(df, Timedelta(days=1), ascending=True)
time
0 2020-01-01 00:00:00
1 2020-01-02 00:00:00
2 2020-01-03 00:00:00
3 2020-01-03 00:00:00
4 2020-01-03 12:00:00
5 2020-01-04 00:00:00
6 2020-01-04 00:00:00
7 2020-01-04 08:00:00
8 2020-01-04 16:00:00
9 2020-01-05 00:00:00
10 2020-01-05 00:00:00
>>> evenly_distribute_timestamps(df, Timedelta(days=1), ascending=False)
time
0 2020-01-01 00:00:00
1 2020-01-01 00:00:00
2 2020-01-02 00:00:00
3 2020-01-02 00:00:00
4 2020-01-02 12:00:00
5 2020-01-03 00:00:00
6 2020-01-03 00:00:00
7 2020-01-03 08:00:00
8 2020-01-03 16:00:00
9 2020-01-04 00:00:00
10 2020-01-05 00:00:00
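Since speed on large frames was the deciding factor here, a rough timing harness can help compare candidates on one's own data. The sketch below is only an illustration under stated assumptions: `make_frame`, `time_call`, `cumcount_spread`, the row count, and the seed are all hypothetical, and the fixed 12-hour cumcount expression from the earlier answer is timed merely as a placeholder candidate.

```python
import time

import numpy as np
import pandas as pd


def make_frame(n_rows=100_000, n_days=20_000, seed=0):
    """Build a sorted frame with many repeated daily timestamps (synthetic data)."""
    rng = np.random.default_rng(seed)
    days = np.sort(rng.integers(0, n_days, size=n_rows))
    times = pd.Timestamp("2020-01-01") + pd.to_timedelta(days, unit="D")
    return pd.DataFrame({"time": times})


def time_call(func, frame):
    """Return (result, elapsed seconds) for a single call of func(frame)."""
    start = time.perf_counter()
    result = func(frame)
    return result, time.perf_counter() - start


def cumcount_spread(frame):
    # Fixed 12-hour step, as in the cumcount-based answer above
    rank = frame.groupby("time").cumcount(ascending=False)
    return frame["time"] - rank.mul(pd.Timedelta(hours=12))


frame = make_frame()
result, elapsed = time_call(cumcount_spread, frame)
print(f"{len(frame):_} rows: {elapsed:.3f} s")
```

A single call gives only a coarse number; repeating the measurement and swapping in other candidate functions gives a fairer comparison.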