转置和组合熊猫的数据范围



假设我有格式的数据

option,subcase,prop1,prop2,prop3,...

.csv

给出

我现在想为每个option创建统计信息,并为每个subcase独立创建其他统计信息。

如果我只想打印全部,并且对置信区间不感兴趣,那么看起来可能有点像这样:

import numpy as np
import pandas as pd
import sys
df = pd.read_csv(sys.argv[1]) # note to self: argv[0] is script file content
options = df.option.unique()
option_data = {}
subcases = df.subcase.unique()
data = {}
for o in options:
    option_data[o] = df[df.option.apply(lambda row: o in row)]
    print(o)
    print(pd.DataFrame.describe(option_data[o]))
    for s in subcases:
        label = o + '_' + s
        data[label] = option_data[o][option_data[o].subcase.apply(lambda row: s in row)]        
        print(label)
        print(pd.DataFrame.describe(data[label]))
    print()

但是,这很难阅读。

如何最好地结合数据帧S.T.我最终得到了

之类的框架
prop1    mean    std    min    25%    ...
A
A_a
A_b
A_c
B
B_a
B_c
...
prop2    mean    std    min    25%    ...
A
A_a
A_b
A_c
B
B_a
B_c
...

我的意思是,i 可以手动循环通过所有框架...但是必须有更有效的东西。

编辑

,例如

option,subcase,cost,time
A,sub1,4,3
A,sub1,2,0
A,sub2,3,8
A,sub2,1,2
B,sub1,13,0
B,sub1,11,0
B,sub2,5,2
B,sub2,3,4

应该产生两个帧:

费用

,mean,std,min,25%,50%,75%,max
A,2.5,1.290994,1,1.75,2.5,3.25,4
A_sub1,3,1.414214,2,2.5,3,3.5,4
A_sub2,2,1.414214,1,1.5,2,2.5,3
B,8,4.760952,3,4.5,8,11.5,13
B_sub1,12,1.414214,11,11.5,12,12.5,13
B_sub2,4,1.414214,3,3.5,4,4.5,5

时间

,mean,std,min,25%,50%,75%,max
A,3.25,3.40343,0,1.5,2.5,4.25,8
A_sub1,1.5,2.12132,0,0.75,1.5,2.25,3
A_sub2,5,4.242641,2,3.5,5,6.5,8
B,1.5,1.914854,0,0,1,2.5,4
B_sub1,0,0,0,0,0,0,0
B_sub2,3,1.414214,2,2.5,3,3.5,4 

基于所有相应子案例的条目计算AB行的条目。

In [79]: df.groupby(['option','subcase']).describe()
Out[79]:
                cost                                                time
               count  mean       std   min   25%   50%   75%   max count mean       std  min   25%  50%   75%  max
option subcase
A      sub1      2.0   3.0  1.414214   2.0   2.5   3.0   3.5   4.0   2.0  1.5  2.121320  0.0  0.75  1.5  2.25  3.0
       sub2      2.0   2.0  1.414214   1.0   1.5   2.0   2.5   3.0   2.0  5.0  4.242641  2.0  3.50  5.0  6.50  8.0
B      sub1      2.0  12.0  1.414214  11.0  11.5  12.0  12.5  13.0   2.0  0.0  0.000000  0.0  0.00  0.0  0.00  0.0
       sub2      2.0   4.0  1.414214   3.0   3.5   4.0   4.5   5.0   2.0  3.0  1.414214  2.0  2.50  3.0  3.50  4.0

更新:

In [97]: r = df.groupby(['option','subcase']).describe()
In [100]: t = df.groupby('option').describe().set_index(np.array([''] * df['option'].nunique()), append=True)
In [101]: r.append(t).sort_index()
Out[101]:
                cost                                                  time
               count  mean       std   min    25%   50%    75%   max count  mean       std  min   25%  50%   75%  max
option subcase
A                4.0   2.5  1.290994   1.0   1.75   2.5   3.25   4.0   4.0  3.25  3.403430  0.0  1.50  2.5  4.25  8.0
       sub1      2.0   3.0  1.414214   2.0   2.50   3.0   3.50   4.0   2.0  1.50  2.121320  0.0  0.75  1.5  2.25  3.0
       sub2      2.0   2.0  1.414214   1.0   1.50   2.0   2.50   3.0   2.0  5.00  4.242641  2.0  3.50  5.0  6.50  8.0
B                4.0   8.0  4.760952   3.0   4.50   8.0  11.50  13.0   4.0  1.50  1.914854  0.0  0.00  1.0  2.50  4.0
       sub1      2.0  12.0  1.414214  11.0  11.50  12.0  12.50  13.0   2.0  0.00  0.000000  0.0  0.00  0.0  0.00  0.0
       sub2      2.0   4.0  1.414214   3.0   3.50   4.0   4.50   5.0   2.0  3.00  1.414214  2.0  2.50  3.0  3.50  4.0

使用pd.concat

df1=df.groupby('option').cost.describe()
df2=df.groupby(['option','subcase']).cost.describe()
df2.index=df2.index.map('_'.join)
pd.concat([df1,df2]).sort_index()

Out[256]: 
        count  mean       std   min    25%   50%    75%   max
A         4.0   2.5  1.290994   1.0   1.75   2.5   3.25   4.0
A_sub1    2.0   3.0  1.414214   2.0   2.50   3.0   3.50   4.0
A_sub2    2.0   2.0  1.414214   1.0   1.50   2.0   2.50   3.0
B         4.0   8.0  4.760952   3.0   4.50   8.0  11.50  13.0
B_sub1    2.0  12.0  1.414214  11.0  11.50  12.0  12.50  13.0
B_sub2    2.0   4.0  1.414214   3.0   3.50   4.0   4.50   5.0

最新更新