Wild pandas DataFrame index type conversion



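I am pulling pandas DataFrames out of an HDF5 file and running some analysis on them. For some reason, the index of one of these DataFrames undergoes a seemingly random type conversion after I apply a filter: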

(Pdb) ORD_ticks
                          Codes    Price  Size
Time
2015-02-12 11:35:28-05:00    OC  148.200     0
2015-02-12 14:51:25-05:00    OC  148.870     0
2015-02-12 14:55:21-05:00    OC  146.550     0
2015-02-12 14:55:57-05:00    OC  148.230     0
2015-02-12 14:58:27-05:00    OC  148.542     0
2015-02-12 15:01:28-05:00    OC  148.200     0
2015-02-12 15:07:32-05:00    OC  148.400     0
...                         ...      ...   ...
2015-05-19 11:35:14-04:00    OC  152.000     0
2015-05-19 14:51:26-04:00    OC  151.980     0
2015-05-19 14:55:21-04:00    OC  151.500     0
2015-05-19 14:55:56-04:00    OC  151.800     0
2015-05-19 14:58:32-04:00    OC  151.966     0
2015-05-19 15:01:32-04:00    OC  152.110     0
2015-05-19 15:07:39-04:00    OC  152.000     0
[462 rows x 3 columns]
(Pdb) type(ORD_ticks.index)
<class 'pandas.tseries.index.DatetimeIndex'>

I then apply the following filter to ORD_ticks to get ORD_prices:

    ORD_prices = ORD_ticks.ix[indicator.index.map(lambda t: ORD_ticks.index.asof(t)).tolist()].groupby(level=0).last()

Afterwards, ORD_prices looks like this:

(Pdb) ORD_prices
             Codes   Price  Size
1.423772e+18    OC  148.40     0
1.423858e+18    OC  148.29     0
1.424204e+18    OC  146.15     0
1.424290e+18    OC  146.51     0
1.424376e+18    OC  146.22     0
1.424463e+18    OC  145.08     0
1.424722e+18    OC  147.72     0
...            ...     ...   ...
1.431371e+18    OC  149.95     0
1.431458e+18    OC  145.58     0
1.431544e+18    OC  145.22     0
1.431630e+18    OC  148.01     0
1.431717e+18    OC  148.91     0
1.431976e+18    OC  148.89     0
1.432062e+18    OC  152.00     0
[63 rows x 3 columns]
(Pdb) type(ORD_prices.index)
<class 'pandas.core.index.Float64Index'>

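The strange thing is that I have run exactly the same operation on about 100 different datasets, and it only happens with this one! What is going on?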

Here is indicator:

(Pdb) indicator
Empty DataFrame
Columns: []
Index: [2015-02-09 15:30:00-05:00, 2015-02-10 15:30:00-05:00, 2015-02-11 15:30:0
0-05:00, 2015-02-12 15:30:00-05:00, 2015-02-13 15:30:00-05:00, 2015-02-17 15:30:
00-05:00, 2015-02-18 15:30:00-05:00, 2015-02-19 15:30:00-05:00, 2015-02-20 15:30
:00-05:00, 2015-02-23 15:30:00-05:00, 2015-02-24 15:30:00-05:00, 2015-02-25 15:3
0:00-05:00, 2015-02-26 15:30:00-05:00, 2015-02-27 15:30:00-05:00, 2015-03-02 15:
30:00-05:00, 2015-03-03 15:30:00-05:00, 2015-03-04 15:30:00-05:00, 2015-03-05 15
:30:00-05:00, 2015-03-06 15:30:00-05:00, 2015-03-09 15:30:00-04:00, 2015-03-10 1
5:30:00-04:00, 2015-03-11 15:30:00-04:00, 2015-03-12 15:30:00-04:00, 2015-03-13
15:30:00-04:00, 2015-03-16 15:30:00-04:00, 2015-03-17 15:30:00-04:00, 2015-03-18
 15:30:00-04:00, 2015-03-19 15:30:00-04:00, 2015-03-20 15:30:00-04:00, 2015-03-2
3 15:30:00-04:00, 2015-03-24 15:30:00-04:00, 2015-03-25 15:30:00-04:00, 2015-03-
26 15:30:00-04:00, 2015-03-27 15:30:00-04:00, 2015-03-30 15:30:00-04:00, 2015-03
-31 15:30:00-04:00, 2015-04-01 15:30:00-04:00, 2015-04-07 15:30:00-04:00, 2015-0
4-08 15:30:00-04:00, 2015-04-09 15:30:00-04:00, 2015-04-10 15:30:00-04:00, 2015-
04-13 15:30:00-04:00, 2015-04-14 15:30:00-04:00, 2015-04-15 15:30:00-04:00, 2015
-04-16 15:30:00-04:00, 2015-04-17 15:30:00-04:00, 2015-04-20 15:30:00-04:00, 201
5-04-21 15:30:00-04:00, 2015-04-22 15:30:00-04:00, 2015-04-23 15:30:00-04:00, 20
15-04-24 15:30:00-04:00, 2015-04-27 15:30:00-04:00, 2015-04-28 15:30:00-04:00, 2
015-04-29 15:30:00-04:00, 2015-05-04 15:30:00-04:00, 2015-05-05 15:30:00-04:00,
2015-05-06 15:30:00-04:00, 2015-05-07 15:30:00-04:00, 2015-05-08 15:30:00-04:00,
 2015-05-11 15:30:00-04:00, 2015-05-12 15:30:00-04:00, 2015-05-13 15:30:00-04:00
, 2015-05-14 15:30:00-04:00, 2015-05-15 15:30:00-04:00, 2015-05-18 15:30:00-04:0
0, 2015-05-19 15:30:00-04:00]
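One possible explanation, offered as an assumption rather than something the post confirms: the first entries of indicator (2015-02-09 to 2015-02-11) fall before the first tick shown in ORD_ticks (2015-02-12), and `Index.asof` returns a missing value when no label exists at or before the requested time. A list that mixes timestamps with a missing value could plausibly be coerced to floats by the old `.ix` indexer, and the values around 1.42e+18 are consistent with nanoseconds since the epoch. The sketch below uses hypothetical timestamps only to show the `asof` behaviour; whether the missing value is `NaN` or `NaT` depends on the pandas version.

```python
import pandas as pd

# Hypothetical tick timestamps (stand-ins for ORD_ticks.index).
ticks = pd.DatetimeIndex(
    ["2015-02-12 11:35:28", "2015-02-12 14:51:25"], tz="US/Eastern"
)

# Hypothetical indicator timestamps: one before the first tick, one after it.
before_first = pd.Timestamp("2015-02-09 15:30:00", tz="US/Eastern")
after_first = pd.Timestamp("2015-02-12 15:30:00", tz="US/Eastern")

missing = ticks.asof(before_first)  # no tick at or before this time
matched = ticks.asof(after_first)   # last tick at or before this time

print(pd.isna(missing))  # a missing value, not a timestamp
print(matched)
```

If this is the cause, it would also explain why only one dataset is affected: the others may simply have ticks that start before the first indicator timestamp.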

Using .reindex(method='nearest') does much the same thing as what you are doing (but faster). This requires pandas 0.16.0 or later.

In [41]: df = pd.DataFrame({'A' : range(10) },index=pd.date_range('20130101',freq='2S',periods=10,tz='US/Eastern'))
In [42]: df
Out[42]: 
                           A
2013-01-01 00:00:00-05:00  0
2013-01-01 00:00:02-05:00  1
2013-01-01 00:00:04-05:00  2
2013-01-01 00:00:06-05:00  3
2013-01-01 00:00:08-05:00  4
2013-01-01 00:00:10-05:00  5
2013-01-01 00:00:12-05:00  6
2013-01-01 00:00:14-05:00  7
2013-01-01 00:00:16-05:00  8
2013-01-01 00:00:18-05:00  9
In [43]: idx = pd.date_range('20130101 00:00:00',periods=20,freq='5s',tz='US/Eastern')
In [44]: df.reindex(idx,method='nearest')
Out[44]: 
                           A
2013-01-01 00:00:00-05:00  0
2013-01-01 00:00:05-05:00  3
2013-01-01 00:00:10-05:00  5
2013-01-01 00:00:15-05:00  8
2013-01-01 00:00:20-05:00  9
2013-01-01 00:00:25-05:00  9
2013-01-01 00:00:30-05:00  9
2013-01-01 00:00:35-05:00  9
2013-01-01 00:00:40-05:00  9
2013-01-01 00:00:45-05:00  9
2013-01-01 00:00:50-05:00  9
2013-01-01 00:00:55-05:00  9
2013-01-01 00:01:00-05:00  9
2013-01-01 00:01:05-05:00  9
2013-01-01 00:01:10-05:00  9
2013-01-01 00:01:15-05:00  9
2013-01-01 00:01:20-05:00  9
2013-01-01 00:01:25-05:00  9
2013-01-01 00:01:30-05:00  9
2013-01-01 00:01:35-05:00  9
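One detail worth keeping in mind: `asof` looks backwards only (the last label at or before the target), which corresponds to `method='pad'`, whereas `method='nearest'` may pick a later label. The following is a minimal sketch with hypothetical tick data and target times (none of these values come from the question) that compares the two and shows that the result keeps a DatetimeIndex either way.

```python
import pandas as pd

# Hypothetical ticks and target times, chosen only for illustration.
ticks = pd.DataFrame(
    {"Price": [148.2, 148.87, 146.55]},
    index=pd.DatetimeIndex(
        ["2015-02-12 11:35:28", "2015-02-12 14:51:25", "2015-02-12 14:55:21"],
        tz="US/Eastern",
    ),
)
targets = pd.DatetimeIndex(
    ["2015-02-09 15:30:00", "2015-02-12 14:54:00", "2015-02-12 15:30:00"],
    tz="US/Eastern",
)

# 'pad' takes the last tick at or before each target (asof-like);
# a target earlier than every tick yields a NaN row.
padded = ticks.reindex(targets, method="pad")

# 'nearest' takes the closest tick in either direction.
nearest = ticks.reindex(targets, method="nearest")

print(padded)
print(nearest)
print(type(padded.index))

# Rows with no earlier tick can be dropped explicitly if they are not wanted.
print(padded.dropna())
```

With `method='pad'`, the rows are labelled by the target times rather than by the matched tick times, so the `groupby(level=0).last()` step from the question is no longer needed.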
