假设我有格式的数据
option,subcase,prop1,prop2,prop3,...
以.csv
我现在想为每个option
创建统计信息,并为每个subcase
独立创建其他统计信息。
如果我只想打印全部,并且对置信区间不感兴趣,那么看起来可能有点像这样:
import numpy as np
import pandas as pd
import sys
df = pd.read_csv(sys.argv[1]) # note to self: argv[0] is script file content
options = df.option.unique()
option_data = {}
subcases = df.subcase.unique()
data = {}
for o in options:
option_data[o] = df[df.option.apply(lambda row: o in row)]
print(o)
print(pd.DataFrame.describe(option_data[o]))
for s in subcases:
label = o + '_' + s
data[label] = option_data[o][option_data[o].subcase.apply(lambda row: s in row)]
print(label)
print(pd.DataFrame.describe(data[label]))
print()
但是,这很难阅读。
如何最好地结合数据帧S.T.我最终得到了
之类的框架prop1 mean std min 25% ...
A
A_a
A_b
A_c
B
B_a
B_c
...
prop2 mean std min 25% ...
A
A_a
A_b
A_c
B
B_a
B_c
...
我的意思是,i 可以手动循环通过所有框架...但是必须有更有效的东西。
编辑
,例如
option,subcase,cost,time
A,sub1,4,3
A,sub1,2,0
A,sub2,3,8
A,sub2,1,2
B,sub1,13,0
B,sub1,11,0
B,sub2,5,2
B,sub2,3,4
应该产生两个帧:
费用
,mean,std,min,25%,50%,75%,max
A,2.5,1.290994,1,1.75,2.5,3.25,4
A_sub1,3,1.414214,2,2.5,3,3.5,4
A_sub2,2,1.414214,1,1.5,2,2.5,3
B,8,4.760952,3,4.5,8,11.5,13
B_sub1,12,1.414214,11,11.5,12,12.5,13
B_sub2,4,1.414214,3,3.5,4,4.5,5
和
时间
,mean,std,min,25%,50%,75%,max
A,3.25,3.40343,0,1.5,2.5,4.25,8
A_sub1,1.5,2.12132,0,0.75,1.5,2.25,3
A_sub2,5,4.242641,2,3.5,5,6.5,8
B,1.5,1.914854,0,0,1,2.5,4
B_sub1,0,0,0,0,0,0,0
B_sub2,3,1.414214,2,2.5,3,3.5,4
基于所有相应子案例的条目计算A
和B
行的条目。
In [79]: df.groupby(['option','subcase']).describe()
Out[79]:
cost time
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
option subcase
A sub1 2.0 3.0 1.414214 2.0 2.5 3.0 3.5 4.0 2.0 1.5 2.121320 0.0 0.75 1.5 2.25 3.0
sub2 2.0 2.0 1.414214 1.0 1.5 2.0 2.5 3.0 2.0 5.0 4.242641 2.0 3.50 5.0 6.50 8.0
B sub1 2.0 12.0 1.414214 11.0 11.5 12.0 12.5 13.0 2.0 0.0 0.000000 0.0 0.00 0.0 0.00 0.0
sub2 2.0 4.0 1.414214 3.0 3.5 4.0 4.5 5.0 2.0 3.0 1.414214 2.0 2.50 3.0 3.50 4.0
更新:
In [97]: r = df.groupby(['option','subcase']).describe()
In [100]: t = df.groupby('option').describe().set_index(np.array([''] * df['option'].nunique()), append=True)
In [101]: r.append(t).sort_index()
Out[101]:
cost time
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
option subcase
A 4.0 2.5 1.290994 1.0 1.75 2.5 3.25 4.0 4.0 3.25 3.403430 0.0 1.50 2.5 4.25 8.0
sub1 2.0 3.0 1.414214 2.0 2.50 3.0 3.50 4.0 2.0 1.50 2.121320 0.0 0.75 1.5 2.25 3.0
sub2 2.0 2.0 1.414214 1.0 1.50 2.0 2.50 3.0 2.0 5.00 4.242641 2.0 3.50 5.0 6.50 8.0
B 4.0 8.0 4.760952 3.0 4.50 8.0 11.50 13.0 4.0 1.50 1.914854 0.0 0.00 1.0 2.50 4.0
sub1 2.0 12.0 1.414214 11.0 11.50 12.0 12.50 13.0 2.0 0.00 0.000000 0.0 0.00 0.0 0.00 0.0
sub2 2.0 4.0 1.414214 3.0 3.50 4.0 4.50 5.0 2.0 3.00 1.414214 2.0 2.50 3.0 3.50 4.0
使用pd.concat
df1=df.groupby('option').cost.describe()
df2=df.groupby(['option','subcase']).cost.describe()
df2.index=df2.index.map('_'.join)
pd.concat([df1,df2]).sort_index()
Out[256]:
count mean std min 25% 50% 75% max
A 4.0 2.5 1.290994 1.0 1.75 2.5 3.25 4.0
A_sub1 2.0 3.0 1.414214 2.0 2.50 3.0 3.50 4.0
A_sub2 2.0 2.0 1.414214 1.0 1.50 2.0 2.50 3.0
B 4.0 8.0 4.760952 3.0 4.50 8.0 11.50 13.0
B_sub1 2.0 12.0 1.414214 11.0 11.50 12.0 12.50 13.0
B_sub2 2.0 4.0 1.414214 3.0 3.50 4.0 4.50 5.0