Pandas: combining two columns and converting to datetime is slow



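I have two columns in a DataFrame, one holding dates and one holding times. Both start out as strings. I want to combine them into a single column of datetime type.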

The head of the DataFrame is:

           Date  variable    value
0  '04/10/2020'   '00:30'   81.310
1  '05/10/2020'   '00:30'  121.245
2  '06/10/2020'   '00:30'   77.020
3  '07/10/2020'   '00:30'  100.705
4  '08/10/2020'   '00:30'  114.370

They are in a DataFrame named df_flattened with about 20k rows, and the code I currently use is:

df_flattened['DateTime'] = df_flattened.apply(lambda x: x['Date'] + ' ' + x['variable'], axis=1)
df_flattened['DateTime'] = pd.to_datetime(df_flattened['DateTime'])

However, this takes about 2.6 seconds to run, and the dataset will grow in the future. Can anyone suggest a faster way?

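You can join the columns with + instead of apply. String concatenation on whole columns is vectorized, whereas apply with axis=1 calls the lambda once per row: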

df_flattened['DateTime'] = pd.to_datetime(df_flattened['Date'] + ' ' + df_flattened['variable'])

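You can also specify the format of the joined datetime strings, which skips format inference and makes the day-first order explicit: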

df_flattened['DateTime'] = pd.to_datetime(df_flattened['Date'] + ' ' + df_flattened['variable'], format='%d/%m/%Y %H:%M')
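
As a self-contained illustration, here is a minimal sketch that rebuilds the first rows of the sample frame and applies the vectorized version. It assumes the dates are day-first (DD/MM/YYYY), which the consecutive sample values suggest. The names `combined` and `ambiguous` are only illustrative. The last lines show why the explicit format matters: with no format given, pandas defaults to month-first for an ambiguous string such as '04/10/2020'.

```python
import pandas as pd

# Sample data copied from the question (first three rows).
df_flattened = pd.DataFrame({
    "Date": ["04/10/2020", "05/10/2020", "06/10/2020"],
    "variable": ["00:30", "00:30", "00:30"],
    "value": [81.310, 121.245, 77.020],
})

# Vectorized string concatenation: no per-row Python call.
combined = df_flattened["Date"] + " " + df_flattened["variable"]

# Explicit day-first format (assumption: dates are DD/MM/YYYY).
df_flattened["DateTime"] = pd.to_datetime(combined, format="%d/%m/%Y %H:%M")
print(df_flattened["DateTime"].iloc[0])  # 2020-10-04 00:30:00

# Without a format, an ambiguous string is read month-first by default.
ambiguous = pd.to_datetime("04/10/2020 00:30")
print(ambiguous)  # 2020-04-10 00:30:00
```

If the format string is inconvenient, passing dayfirst=True to pd.to_datetime is another way to state the same assumption.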

Performance on 20k rows:

# 20k rows (the 5 sample rows repeated 4000 times)
df_flattened = pd.concat([df_flattened] * 4000, ignore_index=True)

In [44]: %%timeit
...: df_flattened['DateTime'] = df_flattened.apply(lambda x: x['Date'] + ' ' + x['variable'], axis=1)
...: df_flattened['DateTime'] = pd.to_datetime(df_flattened['DateTime'])
...: 
...: 
325 ms ± 26.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [45]: %timeit df_flattened['DateTime'] = pd.to_datetime(df_flattened['Date'] + ' ' + df_flattened['variable'])
11.9 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [46]: %timeit df_flattened['DateTime'] = pd.to_datetime(df_flattened['Date'] + ' ' + df_flattened['variable'], format='%d/%m/%Y %H:%M')
9.55 ms ± 96.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
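
Outside IPython, the comparison can be reproduced with the standard timeit module. The following is a rough sketch, not the original benchmark: the function names `with_apply` and `vectorized` are made up for illustration, both variants use the explicit day-first format so that they return identical results, and the absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Five sample rows from the question, repeated to reach 20k rows.
base = pd.DataFrame({
    "Date": ["04/10/2020", "05/10/2020", "06/10/2020", "07/10/2020", "08/10/2020"],
    "variable": ["00:30"] * 5,
})
df = pd.concat([base] * 4000, ignore_index=True)

FMT = "%d/%m/%Y %H:%M"  # assumption: dates are day-first


def with_apply():
    # Row-wise concatenation: one Python call per row.
    joined = df.apply(lambda x: x["Date"] + " " + x["variable"], axis=1)
    return pd.to_datetime(joined, format=FMT)


def vectorized():
    # Column-wise concatenation: a single vectorized operation.
    return pd.to_datetime(df["Date"] + " " + df["variable"], format=FMT)


runs = 3
t_apply = timeit.timeit(with_apply, number=runs) / runs
t_vec = timeit.timeit(vectorized, number=runs) / runs
print(f"apply:      {t_apply * 1000:.1f} ms per run")
print(f"vectorized: {t_vec * 1000:.1f} ms per run")
```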
