所以我想创建一个表,显示平均值、标准偏差,并统计导入的CSV数据文件中两个变量的任何缺失值。csv文件如下所示:
Group Var1 Var2
1 10 100
1 NA 200
2 30 NA
2 40 NA
3 50 500
3 60 600
因此,我的程序将导入这个CSV文件,然后使用pandas生成一个表,显示平均值、标准偏差和缺失值,同时按组号聚合它们。我正在寻找这样的输出:
Variables Missing Values Group 1 Group 2 Group 3
Var1 1 mean1(sd1) mean2(sd2) mean3(sd3)
Var2 2 mean1(sd1) mean2(sd2) mean3(sd3)
使用:
df = pd.DataFrame({'Group':[1, 1, 2, 2, 2],
'Var1':[20, np.NaN, 30, 20, np.NaN],
'Var2':[np.NaN, np.NaN, 30, 20, 3]})
df
Group Var1 Var2
0 1 20.0 NaN
1 1 NaN NaN
2 2 30.0 30.0
3 2 20.0 20.0
4 2 NaN 3.0
miss = len(df) - df.describe().iloc[0,1:]
dfn = df.groupby(['Group'])['Var1','Var2'].apply(lambda x: x.describe()[1:3]).T#.add_suffix('_')
dfn['missing values'] = miss
dfn
输出:
Group 1 2 missing values
mean std mean std
Var1 20.0 NaN 25.000000 7.071068 2.0
Var2 NaN NaN 17.666667 13.650397 2.0
您可以使用以下代码
>>> import numpy as np
>>> import pandas as pd
>>>
>>> df = pd.DataFrame([
... [1, 10, 100],
... [1, np.nan, 200],
... [2, 30, np.nan],
... [2, 40, np.nan],
... [3, 50, 500],
... [3, 60, 600]])
>>>
>>> df.columns = ["Group", "Var1", "Var2"]
>>>
>>> groupCol = "Group"
>>> nan_df = df.isna().groupby(groupCol).sum().transpose()
>>> nan_df.columns = ['Missing Values']
>>> std_df = df.groupby(groupCol).std().round(3).transpose()
>>> mean_df = df.groupby(groupCol).mean().round(3).transpose()
>>> # get mean and standard deviation into one column
>>> for i in range(len(mean_df.columns)):
... mean_df.loc[:, mean_df.columns[i]] = mean_df[mean_df.columns[i]].astype(str)+'('+std_df[std_df.columns[i]].astype(str)+')'
...
>>> # change the column names
>>> mean_df.columns = ["Group "+ str(each_group) for each_group in mean_df.columns]
>>> # add missing value data
>>> mean_df = mean_df.join(nan_df)
>>> mean_df
Group 1 Group 2 Group 3 Missing Values
Var1 10.0(nan) 35.0(7.071) 55.0(7.071) 1
Var2 150.0(70.711) nan(nan) 550.0(70.711) 2
>>>
上述操作,都使用了数据帧的groupby
方法。只需一点操作,就可以轻松地获得所需格式的数据。