Inconsistent behavior when applying time-series operations to a MultiIndex DataFrame



This may be a bug: grouped time-series operations fail silently on a MultiIndex DataFrame.

import pandas as pd
import pandas.io.data as web
# Get some market data
df = web.DataReader(['AAPL', 'GOOG'], 'yahoo', pd.Timestamp('2013'), pd.Timestamp('2014')).to_frame()
df.index.names = ('dt', 'symbol')
In [21]: df.head()
Out[21]: 
                        Open       High        Low      Close     Volume  
dt         symbol                                                          
2013-01-02 AAPL    553.82001  555.00000  541.62994  549.03003  140129500   
2013-01-03 AAPL    547.88000  549.67004  541.00000  542.10004   88241300   
2013-01-04 AAPL    536.96997  538.63000  525.82996  527.00000  148583400   
2013-01-07 AAPL    522.00000  529.30005  515.20001  523.90002  121039100   
2013-01-08 AAPL    529.21002  531.89001  521.25000  525.31000  114676800   
                   Adj Close  
dt         symbol             
2013-01-02 AAPL     74.63931  
2013-01-03 AAPL     73.69719  
2013-01-04 AAPL     71.64438  
2013-01-07 AAPL     71.22294  
2013-01-08 AAPL     71.41463  

Suppose we want to resample this to monthly data. The following fails silently and returns an empty frame:

df_M = df.groupby(level='symbol').resample('M', how='mean')
In [23]: df_M
Out[23]: 
Empty DataFrame
Columns: []
Index: []

This, however, works, but it requires a seemingly unnecessary re-indexing:

df_M = df.reset_index().set_index('dt').groupby('symbol').resample('M', how='mean')
In [26]: df_M.head()
Out[26]: 
                   Adj Close       Close        High         Low        Open  
symbol dt                                                                      
AAPL   2013-01-31  67.677750  497.822382  504.407623  492.969997  500.083329   
       2013-02-28  62.388477  456.808942  463.231056  452.106325  458.503692   
       2013-03-31  60.417287  441.841000  446.803495  437.337996  442.011512   
       2013-04-30  57.398619  419.765001  425.553183  414.722271  419.766820   
       2013-05-31  61.340151  446.452734  451.658190  441.495455  446.400919   
                         Volume  
symbol dt                        
AAPL   2013-01-31  1.562312e+08  
       2013-02-28  1.229478e+08  
       2013-03-31  1.147110e+08  
       2013-04-30  1.245851e+08  
       2013-05-31  1.073583e+08  

The fact that you need to do reset_index().set_index('dt') and then groupby('symbol') instead of groupby(level='symbol') seems to defeat the purpose of a MultiIndex! What gives?

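I also realize that data like this may be better suited to a Panel than a DataFrame, but when working with large amounts of (often sparse) data, the 3D Panel structure raises performance and memory issues compared with a flat DataFrame.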

This was indeed a bug, and it has since been fixed: https://github.com/pydata/pandas/pandas/issues/10063
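For anyone landing here on a recent pandas release, here is a minimal sketch of one way to get per-symbol monthly means directly from the MultiIndex, without the reset_index().set_index('dt') round trip. It groups with pd.Grouper on both index levels rather than calling resample with how='mean', since as far as I know the how= argument is no longer accepted in current versions. The data below is synthetic (the prices frame and its Close values are made up for illustration, not the Yahoo download from the question), and the month-end offset object is used in place of the 'M' alias only to avoid depending on alias spelling across versions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded market data (hypothetical values).
dates = pd.bdate_range("2013-01-01", "2013-03-29", name="dt")
index = pd.MultiIndex.from_product(
    [dates, ["AAPL", "GOOG"]], names=["dt", "symbol"]
)
prices = pd.DataFrame({"Close": np.arange(len(index), dtype=float)}, index=index)

# Group by the 'symbol' level and by month-end bins on the 'dt' level,
# then take the mean within each (symbol, month) group.
monthly = prices.groupby(
    [
        pd.Grouper(level="symbol"),
        pd.Grouper(level="dt", freq=pd.offsets.MonthEnd()),
    ]
).mean()

print(monthly)
```

The result is indexed by (symbol, dt) with one row per symbol per month, which is the same shape the workaround in the question produces.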
