How to quickly process dates in pandas

I have a dataframe with 200,000 rows. Each record has a timestamp, and I need to group them by date. So I do this:

In [67]: df['result_date'][0]
Out[67]: Timestamp('2017-09-01 09:12:00')
In [68]: %timeit df['result_day'] = df['result_date'].apply(lambda x: str(x.date()))
2.26 s ± 73.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [69]: df['result_day'][0]
Out[69]: '2017-09-01'

In [70]: %timeit df['result_day'] = df['result_date'].apply(lambda x: x.date())
2.05 s ± 213 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [71]: df['result_day'][0]
Out[71]: datetime.date(2017, 9, 1)

Either way, it takes ~2 s. Can I do it faster?

UPD:

In [75]: df.shape
Out[75]: (228217, 18)
In [77]: %timeit df['result_date'].dt.date
1.44 s ± 42.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using jezrael's example. You almost never want to actually use .date; that creates Python objects. .normalize() sets the time on the timestamps to 00:00:00, effectively turning them into dates, but keeps them in the performant datetime64[ns] format.

In [32]: rng = pd.date_range('2000-04-03', periods=200000, freq='2H')
    ...: df = pd.DataFrame({'result_date': rng})  
    ...: 
In [33]: %timeit df['result_date'].dt.date
482 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [34]: %timeit df['result_date'].dt.normalize()
16.3 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
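The dtype difference is the whole point. A minimal sketch (column name reused from the question) showing that .dt.date drops to object dtype while .dt.normalize() stays in datetime64[ns]:

```python
import pandas as pd

# Small frame mirroring the benchmark setup above
rng = pd.date_range('2000-04-03', periods=5, freq='2H')
df = pd.DataFrame({'result_date': rng})

# .dt.date materializes Python datetime.date objects -> object dtype
dates = df['result_date'].dt.date
print(dates.dtype)    # object

# .dt.normalize() zeroes the time but keeps the vectorized dtype
norm = df['result_date'].dt.normalize()
print(norm.dtype)     # datetime64[ns]
```

Everything downstream of an object column (grouping, comparisons) falls back to slow Python-level loops, which is where the ~30x gap above comes from.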

Grouping

In [39]: %timeit df.groupby(df['result_date'].dt.date).size()
506 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [40]: %timeit df.groupby(df['result_date'].dt.normalize()).size()
24.2 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Or, idiomatically:

In [38]: %timeit df.resample('D', on='result_date').size()
5.47 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
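For contiguous tz-naive data the two approaches agree; a sketch verifying that resample('D') yields the same daily counts as grouping by the normalized timestamps (resample would additionally emit zero-count rows for any missing days):

```python
import pandas as pd

# 4 full days of 2-hourly timestamps, no gaps
rng = pd.date_range('2000-04-03', periods=48, freq='2H')
df = pd.DataFrame({'result_date': rng})

by_resample = df.resample('D', on='result_date').size()
by_groupby = df.groupby(df['result_date'].dt.normalize()).size()

# Same days, same counts
print(list(by_resample) == list(by_groupby))  # True
```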

With dt.date — it also works very nicely with NaT:

df['result_day'] = df['result_date'].dt.date

rng = pd.date_range('2000-04-03', periods=200000, freq='2H')
df = pd.DataFrame({'result_date': rng})  
In [216]: %timeit df['result_date'].dt.date
1 loop, best of 3: 474 ms per loop
In [217]: %timeit df['result_date'].apply(lambda x: str(x.date()))
1 loop, best of 3: 740 ms per loop
In [218]: %timeit df['result_date'].apply(lambda x: x.date())
1 loop, best of 3: 559 ms per loop

EDIT:

I think floor is faster than normalize:

# home's notebook, so timings differ from those above
In [3]: %timeit df['result_date'].dt.date
1 loop, best of 3: 854 ms per loop
In [4]: %timeit df['result_date'].dt.normalize()
10 loops, best of 3: 27.8 ms per loop
In [5]: %timeit df['result_date'].dt.floor('D')
100 loops, best of 3: 13.1 ms per loop
In [6]: %timeit df.groupby(df['result_date'].dt.date).size()
1 loop, best of 3: 883 ms per loop
In [7]: %timeit df.groupby(df['result_date'].dt.normalize()).size()
10 loops, best of 3: 40.2 ms per loop
In [8]: %timeit df.groupby(df['result_date'].dt.floor('D')).size()
10 loops, best of 3: 25.9 ms per loop
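A quick sanity check that the swap is safe — for tz-naive timestamps, floor('D') and normalize() produce identical results, so grouping by either gives the same answer:

```python
import pandas as pd

rng = pd.date_range('2000-04-03', periods=1000, freq='2H')
df = pd.DataFrame({'result_date': rng})

# Element-wise identical for tz-naive data
same = df['result_date'].dt.floor('D').equals(df['result_date'].dt.normalize())
print(same)  # True
```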

EDIT1:

A numpy alternative is even faster, but as Jeff pointed out:

It's faster, but you lose the timezone and any higher-level methods.

In [9]: %timeit df['result_date'].values.astype('datetime64[D]')
100 loops, best of 3: 2.39 ms per loop
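A sketch of what that trade-off looks like in practice — the numpy route returns a plain ndarray, so the .dt accessor machinery (and any tz info) is gone:

```python
import pandas as pd
import numpy as np

rng = pd.date_range('2000-04-03', periods=1000, freq='2H')
df = pd.DataFrame({'result_date': rng})

# Drops down to a raw ndarray truncated to day precision
days = df['result_date'].values.astype('datetime64[D]')

print(type(days))   # <class 'numpy.ndarray'>
print(days.dtype)   # datetime64[D]
print(days[0])      # 2000-04-03
```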
