Group by - pandas:不需要的格式结果.我如何groupby().sum()给出表格结构



我已经搜索了熊猫文档,不幸的是,我找不到答案。

实际上,经过一些数据争用之后,我得到了数据框架

    ticker_id          close_date           sector  sector_index
0           1 2014-02-28 00:00:00   Consumer Goods     31.106653
1           1 2014-02-27 00:00:00   Consumer Goods     30.951213
2           2 2014-02-28 00:00:00   Consumer Goods     19.846387
3           2 2014-02-27 00:00:00   Consumer Goods     19.671747
4           3 2014-02-28 00:00:00   Consumer Goods   1208.552000
5           3 2014-02-27 00:00:00   Consumer Goods   1193.352000
6           4 2014-02-28 00:00:00   Consumer Goods      9.893989
7           4 2014-02-27 00:00:00   Consumer Goods      9.857385
8           5 2014-02-28 00:00:00   Consumer Goods     52.196757
9           5 2014-02-27 00:00:00   Consumer Goods     53.101520
10          6 2014-02-28 00:00:00         Services      5.449554
11          6 2014-02-27 00:00:00         Services      5.440019
12          7 2014-02-28 00:00:00  Basic Materials   4149.237000
13          7 2014-02-27 00:00:00  Basic Materials   4130.704000

And I run groupby

df_all2 = df_all.groupby(['close_date','sector']).sum()
print df_all2

结果是这样的

                            ticker_id  sector_index
close_date sector                                  
2014-02-27 Basic Materials          7   4130.704000
           Consumer Goods          15   1306.933865
           Services                 6      5.440019
2014-02-28 Basic Materials          7   4149.237000
           Consumer Goods          15   1321.595786
           Services                 6      5.449554

但是在这种形式下,我不能正确地上传到mysql。因此,为了正确上传到mysql,我需要做这个和其他一些事情。

data2 = list(tuple(x) for x in df_all2.values)

但是data2有无意义的垃圾。

长话短说,我如何让groupby给我以下结果(其中close_date全部正确填写并且列标题是表格)。

close_date sector           ticker_id  sector_index
2014-02-27 Basic Materials          7   4130.704000
2014-02-27 Consumer Goods          15   1306.933865
2014-02-27 Services                 6      5.440019
2014-02-28 Basic Materials          7   4149.237000
2014-02-28 Consumer Goods          15   1321.595786
2014-02-28 Services                 6      5.449554

另外,为了帮助社区,我应该如何修改标题,以便其他遇到这个问题的熊猫用户也能找到你的解决方案?非常感谢您的帮助。

在使用to_sql *:

之前必须在MultiIndex上使用reset_index
In [11]: df.groupby(['close_date','sector']).sum().reset_index()
Out[11]:
  close_date           sector  ticker_id  sector_index
0 2014-02-27  Basic Materials          7   4130.704000
1 2014-02-27   Consumer Goods         15   1306.933865
2 2014-02-27         Services          6      5.440019
3 2014-02-28  Basic Materials          7   4149.237000
4 2014-02-28   Consumer Goods         15   1321.595786
5 2014-02-28         Services          6      5.449554

也可以在groupby中使用as_index=False:

In [12]: df.groupby(['close_date','sector'], as_index=False).sum()
Out[12]:
  close_date           sector  ticker_id  sector_index
0 2014-02-27  Basic Materials          7   4130.704000
1 2014-02-27   Consumer Goods         15   1306.933865
2 2014-02-27         Services          6      5.440019
3 2014-02-28  Basic Materials          7   4149.237000
4 2014-02-28   Consumer Goods         15   1321.595786
5 2014-02-28         Services          6      5.449554

*注意:这应该从0.14起修复,即您应该能够将MultiIndex保存到sql.

请参见如何通过mysqldb将pandas数据框插入数据库。

最新更新