Performance of creating a DatetimeIndex from Unix timestamps and adding the local timezone



pandas version 0.14.1

I do the following:

import numpy as np
import dateutil
from pandas import DataFrame, DatetimeIndex
import time
cur_size = 1000000
columns = ['A', 'B', 'C', 'D', 'E', 'F']
mdf = np.empty(shape=(cur_size, len(columns)), dtype=np.float32)
# nanosecond Unix timestamps, spaced 1 ms apart
idf = np.arange(1213424324300000000, 1213424324300000000+cur_size*1000000, 1000000, dtype=np.int64)
# fill in mdf,idf
index = DatetimeIndex(idf).tz_localize('UTC').tz_convert(dateutil.tz.tzlocal())
frame = DataFrame(mdf, columns = columns, index = index)

All of this is fast, until I try to add a new column to the frame, for example:

start = time.time()
frame['dfd'] = 0
print 'took', time.time()-start

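This takes a long time (10.59 seconds), but only the first time; adding more columns afterwards is fast again. The profiler shows pandas doing something very strange: it appears to run a timezone conversion for each element, which should not be needed here: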

   4275752 function calls (4275746 primitive calls) in 6.461 seconds
   Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.503    6.503 string:2(<module>)
        1    0.000    0.000    6.503    6.503 frame.py:1994(__setitem__)
        1    0.000    0.000    6.499    6.499 indexing.py:1520(_convert_to_index_sliceable)
        1    0.000    0.000    6.499    6.499 index.py:1299(_get_string_slice)
     10/4    0.000    0.000    6.499    1.625 {getattr}
        1    0.001    0.001    6.499    6.499 index.py:1414(inferred_freq)
        1    0.000    0.000    6.498    6.498 frequencies.py:626(infer_freq)
        1    0.000    0.000    6.490    6.490 frequencies.py:694(__init__)
        1    0.000    0.000    6.489    6.489 frequencies.py:669(_tz_convert_with_transitions)
        1    0.006    0.006    6.489    6.489 function_base.py:1660(__call__)
        1    0.234    0.234    6.483    6.483 function_base.py:1746(_vectorize_call)
   534416    0.220    0.000    6.217    0.000 frequencies.py:676(<lambda>)
   534416    3.741    0.000    5.997    0.000 {pandas.tslib.tz_convert_single}
   534417    0.295    0.000    1.863    0.000 tz.py:107(utcoffset)
   534417    0.792    0.000    1.568    0.000 tz.py:123(_isdst)
   534417    0.701    0.000    0.701    0.000 {time.localtime}
   534417    0.232    0.000    0.393    0.000 tz.py:154(__eq__)
   534470    0.161    0.000    0.161    0.000 {isinstance}
   534417    0.074    0.000    0.074    0.000 {method 'toordinal' of 'datetime.date' objects}
       20    0.032    0.002    0.032    0.002 {numpy.core.multiarray.array}
        1    0.000    0.000    0.009    0.009 frequencies.py:716(get_freq)
        1    0.000    0.000    0.009    0.009 frequencies.py:708(deltas)
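
A profile like the one above can be collected with the standard-library cProfile and pstats modules. Below is a minimal sketch written for Python 3; `profile_call` and `add_column` are hypothetical helper names, not part of pandas, and a plain dict stands in for the frame so the snippet runs without pandas installed.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, report sorted by internal time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # 'tottime' is the "internal time" ordering; keep only the top 20 rows
    stats.sort_stats('tottime').print_stats(20)
    return result, buffer.getvalue()


def add_column(frame, name, value):
    """The operation under test: assign a new column (works on a dict or a DataFrame)."""
    frame[name] = value
    return frame


# A plain dict is used here as a stand-in; pass the real DataFrame instead.
result, report = profile_call(add_column, {}, 'dfd', 0)
print(report)
```

Passing the real `frame` in place of the dict should show where the time of the first column assignment goes.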

This slowdown is fixed in master/0.15.0 (due for release in early October 2014). The closest issue I can recall is https://github.com/pydata/pandas/pull/7798

There were quite a few fixes related to DST transition checks (which is the root cause of the problem here); see the What's New notes for 0.15.0.
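
To check whether a given pandas installation still shows the slow first assignment, something like the following sketch could be used. It rebuilds the frame from the question at a smaller size and times the first column assignment. `build_frame` and `time_first_assignment` are hypothetical helper names, the row count is an arbitrary choice, and the code is written for Python 3 with a recent pandas, so treat it as an illustration rather than a benchmark.

```python
import time

import dateutil.tz
import numpy as np
import pandas as pd


def build_frame(n_rows, start_ns=1213424324300000000, step_ns=1000000):
    """Build a float32 frame indexed by local-time stamps from epoch nanoseconds."""
    stamps = start_ns + step_ns * np.arange(n_rows, dtype=np.int64)
    # int64 values are interpreted as nanoseconds since the Unix epoch
    index = pd.DatetimeIndex(stamps).tz_localize('UTC').tz_convert(dateutil.tz.tzlocal())
    data = np.zeros((n_rows, 6), dtype=np.float32)
    return pd.DataFrame(data, columns=list('ABCDEF'), index=index)


def time_first_assignment(frame, name='dfd'):
    """Return the wall-clock seconds taken by adding one new column."""
    start = time.time()
    frame[name] = 0
    return time.time() - start


frame = build_frame(100000)
elapsed = time_first_assignment(frame)
print('first column assignment took %.3f s' % elapsed)
```

If the fix is in place, the first assignment should take about as long as later ones.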
