I need to run a simple covariance analysis on a time series. My raw data looks like this:
WEEK_END_DATE TITLE_SHORT SALES
2012-02-25 00:00:00.000000 "Bob" (EBK) 1
"Bob" (EBK) 1
2012-03-31 00:00:00.000000 "Bob" (EBK) 1
"Bob" (EBK) 1
2012-03-03 00:00:00.000000 "Sally" (EBK) 1
2012-03-10 00:00:00.000000 "Sally" (EBK) 1
2012-03-17 00:00:00.000000 "Sally" (EBK) 1
"Sally" (EBK) 1
2012-04-07 00:00:00.000000 "Sally" (EBK) 1
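As you can see, there are some duplicates. Unless I'm missing something, I need this data as a set of vectors, one per title, so that I can use numpy.cov.
Question:
How do I find the rows that duplicate both date and title and aggregate them by summing? I've been trying pandas groupby on WEEK_END_DATE and TITLE_SHORT, but the result is indexed in a way I don't understand.
Edit: specifically, when I try df.groupby(["WEEK_END_DATE", "TITLE_SHORT"]), I get this: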
>df.ix[0:3]
WEEK_END_DATE TITLE_SHORT
2012-02-04 00:00:00.000000 'SALEM'S LOT (EBK) <pandas.core.indexing._NDFrameIndexer object a...
'TIS THE SEASON! (EBK) <pandas.core.indexing._NDFrameIndexer object a...
(NOT THAT YOU ASKED) (EBK) <pandas.core.indexing._NDFrameIndexer object a...
dtype: object
and trying to select df.ix[1,] gives this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/series.py", line 613, in __getitem__
return self.index.get_value(self, key)
File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 1630, in get_value
loc = self.get_loc(key)
File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 2285, in get_loc
result = slice(*self.slice_locs(key, key))
File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 2226, in slice_locs
start_slice = self._partial_tup_index(start, side='left')
File "/Library/Python/2.7/site-packages/pandas-0.11.0rc1_20130415-py2.7-macosx-10.8-intel.egg/pandas/core/index.py", line 2250, in _partial_tup_index
raise Exception('Level type mismatch: %s' % lab)
Exception: Level type mismatch: 3
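I'm not entirely sure I understand what went wrong, but here is how I would start. First, read in the data (for me it's fixed-width) and forward-fill the blank dates: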
>>> df = pd.read_fwf("weekend.dat", widths=(26, 20, 9), parse_dates=[0])
>>> df = df.fillna(method="ffill")
>>> df
WEEK_END_DATE TITLE_SHORT SALES
0 2012-02-25 00:00:00 "Bob" (EBK) 1
1 2012-02-25 00:00:00 "Bob" (EBK) 1
2 2012-03-31 00:00:00 "Bob" (EBK) 1
3 2012-03-31 00:00:00 "Bob" (EBK) 1
4 2012-03-03 00:00:00 "Sally" (EBK) 1
5 2012-03-10 00:00:00 "Sally" (EBK) 1
6 2012-03-17 00:00:00 "Sally" (EBK) 1
7 2012-03-17 00:00:00 "Sally" (EBK) 1
8 2012-04-07 00:00:00 "Sally" (EBK) 1
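If you are on a recent pandas, note that the method= argument of fillna has been deprecated, and ffill() does the same forward fill. Here is a minimal sketch of that step on a made-up four-row frame (the inline data is only an illustration, not your file), assuming the blank dates were read in as missing values:

```python
import pandas as pd

# Illustrative frame: every second date is missing, as in the raw listing.
df = pd.DataFrame({
    "WEEK_END_DATE": pd.to_datetime(["2012-02-25", None, "2012-03-31", None]),
    "TITLE_SHORT": ['"Bob" (EBK)'] * 4,
    "SALES": [1, 1, 1, 1],
})

# Carry the last seen date forward into the blank rows.
filled = df.ffill()
print(filled)
```

This only makes sense if a blank date really means "same week as the row above", which is what the listing suggests but is still an assumption about your data.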
Then aggregate the duplicates:
>>> g = df.groupby(["WEEK_END_DATE", "TITLE_SHORT"]).sum().reset_index()
>>> g
WEEK_END_DATE TITLE_SHORT SALES
0 2012-02-25 00:00:00 "Bob" (EBK) 2
1 2012-03-03 00:00:00 "Sally" (EBK) 1
2 2012-03-10 00:00:00 "Sally" (EBK) 1
3 2012-03-17 00:00:00 "Sally" (EBK) 2
4 2012-03-31 00:00:00 "Bob" (EBK) 2
5 2012-04-07 00:00:00 "Sally" (EBK) 1
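On the indexing confusion: as far as I can tell, grouping on two columns and summing gives a result indexed by a two-level MultiIndex, so rows are addressed by a (date, title) pair rather than by position, and reset_index() turns those levels back into ordinary columns. A small sketch on a made-up three-row frame (illustrative data only):

```python
import pandas as pd

df = pd.DataFrame({
    "WEEK_END_DATE": pd.to_datetime(["2012-02-25", "2012-02-25", "2012-03-03"]),
    "TITLE_SHORT": ['"Bob" (EBK)', '"Bob" (EBK)', '"Sally" (EBK)'],
    "SALES": [1, 1, 1],
})

# Summing per (date, title) collapses the two duplicate Bob rows into one.
summed = df.groupby(["WEEK_END_DATE", "TITLE_SHORT"]).sum()
print(summed.index.nlevels)  # the index has one level per grouping key

# Select a row with a (date, title) tuple, not an integer position.
bob = summed.loc[(pd.Timestamp("2012-02-25"), '"Bob" (EBK)'), "SALES"]

# Flatten back to plain columns with a default integer index.
flat = summed.reset_index()
print(flat)
```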
Then do whatever cov computation you need (note that cov is also a Series/DataFrame/GroupBy method, so you shouldn't need to call np.cov specifically).
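For the "one vector per title" part, one possible approach (a sketch, not the only way) is to pivot the aggregated frame so each title becomes a column and each week a row, then call cov on that. Treating a week with no row for a title as zero sales is my assumption here; drop the fillna(0) if missing should stay missing.

```python
import pandas as pd

# Sample rows from the question, with the blank dates already forward-filled.
df = pd.DataFrame({
    "WEEK_END_DATE": pd.to_datetime([
        "2012-02-25", "2012-02-25", "2012-03-31", "2012-03-31",
        "2012-03-03", "2012-03-10", "2012-03-17", "2012-03-17", "2012-04-07",
    ]),
    "TITLE_SHORT": ['"Bob" (EBK)'] * 4 + ['"Sally" (EBK)'] * 5,
    "SALES": [1] * 9,
})

# Sum duplicate (date, title) rows.
weekly = df.groupby(["WEEK_END_DATE", "TITLE_SHORT"], as_index=False)["SALES"].sum()

# One row per week, one column per title; absent weeks become 0 (assumption).
wide = weekly.pivot(index="WEEK_END_DATE", columns="TITLE_SHORT", values="SALES").fillna(0)

# Pairwise sample covariance between the title columns.
cov_matrix = wide.cov()
print(cov_matrix)
```

With only six weeks of data the numbers mean very little, but the shape of the result is what you want: a titles-by-titles matrix.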