How to summarize descriptive statistics of data in R



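How can I write a short script that creates a new data frame reporting the following descriptive statistics for each continuous column of the survey below: mean, standard deviation, median, minimum, maximum, and sample size?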

   Distance Age Height Coning
1      21.4  18    3.3    Yes
2      13.9  17    3.4    Yes
3      23.9  16    2.9    Yes
4       8.7  18    3.6     No
5     241.8   6    0.7     No
6      44.5  17    1.3    Yes
7      30.0  15    2.5    Yes
8      32.3  16    1.8    Yes
9      31.4  17    5.0     No
10     32.8  13    1.6     No
11     53.3  12    2.0     No
12     54.3   6    0.9     No
13     96.3  11    2.6     No
14    133.6   4    0.6     No
15     32.1  15    2.3     No
16     57.9  12    2.4    Yes
17     30.8  17    1.8     No
18     59.9   7    0.8     No
19     42.7  15    2.0    Yes
20     20.6  18    1.7    Yes
21     62.0   8    1.3     No
22     53.1   7    1.6     No
23     28.9  16    2.2    Yes
24    177.4   5    1.1     No
25     24.8  14    1.5    Yes
26     75.3  14    2.3    Yes
27     51.6   7    1.4     No
28     36.1   9    1.1     No
29    116.1   6    1.1     No
30     28.1  16    2.5    Yes
31      8.7  19    2.2    Yes
32    105.1   6    0.8     No
33     46.0  15    3.0    Yes
34    102.6   7    1.2     No
35     15.8  15    2.2     No
36     60.0   7    1.3     No
37     96.4  13    2.6     No
38     24.2  14    1.7     No
39     14.5  15    2.4     No
40     36.6  14    1.5     No
41     65.7   5    0.6     No
42    116.3   7    1.6     No
43    113.6   8    1.0     No
44     16.7  15    4.3    Yes
45     66.0   7    1.0     No
46     60.7   7    1.0     No
47     90.6   7    0.7     No
48     91.3   7    1.3     No
49     14.4  18    3.1    Yes
50     72.8  14    3.0    Yes

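You can write your own function that puts such a summary into a table (sapply returns a matrix, which as.data.frame can convert to a data.frame):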

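# Defining the function: returns a named numeric vector of statistics
my.summary <- function(x, na.rm=TRUE){
  c(Mean=mean(x, na.rm=na.rm),
    SD=sd(x, na.rm=na.rm),
    Median=median(x, na.rm=na.rm),
    Min=min(x, na.rm=na.rm),
    Max=max(x, na.rm=na.rm),
    # count only non-missing values when NAs are dropped from the other statistics
    N=if (na.rm) sum(!is.na(x)) else length(x))
}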
# identifying numeric columns
ind <- sapply(df, is.numeric)

# applying the function to numeric columns only
sapply(df[, ind], my.summary)  
        Distance       Age     Height
Mean    58.67200 11.840000  1.9160000
SD      45.48137  4.604168  0.9796626
Median  48.80000 13.500000  1.7000000
Min      8.70000  4.000000  0.6000000
Max    241.80000 19.000000  5.0000000
N       50.00000 50.000000 50.0000000
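
Since the question asks for a new data frame, the matrix returned by sapply can be transposed and converted. The following is a minimal sketch, assuming a small stand-in data frame built from the first five survey rows (the name `stats` is illustrative, not from the original answer):

```r
# Summary function from the answer above
my.summary <- function(x, na.rm = TRUE) {
  c(Mean = mean(x, na.rm = na.rm),
    SD = sd(x, na.rm = na.rm),
    Median = median(x, na.rm = na.rm),
    Min = min(x, na.rm = na.rm),
    Max = max(x, na.rm = na.rm),
    N = length(x))
}

# Stand-in data: the first five rows of the survey (assumption for illustration)
df <- data.frame(
  Distance = c(21.4, 13.9, 23.9, 8.7, 241.8),
  Age      = c(18, 17, 16, 18, 6),
  Height   = c(3.3, 3.4, 2.9, 3.6, 0.7),
  Coning   = c("Yes", "Yes", "Yes", "No", "No")
)

# Keep numeric columns, summarize, then turn variables into rows
ind <- sapply(df, is.numeric)
stats <- as.data.frame(t(sapply(df[, ind], my.summary)))
stats
```

Each row of `stats` is one variable and each column one statistic, which is often easier to export or merge than the matrix layout.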

Alternatively, you can use the basicStats function from the fBasics package to get a more detailed summary:

> library(fBasics)
> basicStats(df[, ind])
               Distance        Age    Height
nobs          50.000000  50.000000 50.000000
NAs            0.000000   0.000000  0.000000
Minimum        8.700000   4.000000  0.600000
Maximum      241.800000  19.000000  5.000000
1. Quartile   28.300000   7.000000  1.125000
3. Quartile   74.675000  15.750000  2.475000
Mean          58.672000  11.840000  1.916000
Median        48.800000  13.500000  1.700000
Sum         2933.600000 592.000000 95.800000
SE Mean        6.432037   0.651128  0.138545
LCL Mean      45.746337  10.531510  1.637583
UCL Mean      71.597663  13.148490  2.194417
Variance    2068.555118  21.198367  0.959739
Stdev         45.481371   4.604168  0.979663
Skewness       1.711028  -0.158853  0.905415
Kurtosis       3.753948  -1.574527  0.578684

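The following use of do.call, rbind, and sapply gives a summary for every column of class "numeric". If you need statistics other than those from summary, you can write your own function (see @Jilber's answer).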

mtcars$carb = as.factor(mtcars$carb)  # Forcing one column to a factor
do.call('rbind', sapply(mtcars, function(x) if(is.numeric(x)) summary(x)))
       Min. 1st Qu.  Median     Mean 3rd Qu.    Max.
mpg  10.400  15.420  19.200  20.0900   22.80  33.900
cyl   4.000   4.000   6.000   6.1880    8.00   8.000
disp 71.100 120.800 196.300 230.7000  326.00 472.000
hp   52.000  96.500 123.000 146.7000  180.00 335.000
drat  2.760   3.080   3.695   3.5970    3.92   4.930
wt    1.513   2.581   3.325   3.2170    3.61   5.424
qsec 14.500  16.890  17.710  17.8500   18.90  22.900
vs    0.000   0.000   0.000   0.4375    1.00   1.000
am    0.000   0.000   0.000   0.4062    1.00   1.000
gear  3.000   3.000   4.000   3.6880    4.00   5.000
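
The same pattern works with a custom statistics function in place of summary. This is a sketch under the assumption that my.summary is defined as in @Jilber's answer. `Filter` and `lapply` are swapped in here for the `sapply`/`if` combination, and the names `cars`, `num_cols`, and `res` are illustrative:

```r
# Summary function as defined in @Jilber's answer
my.summary <- function(x, na.rm = TRUE) {
  c(Mean = mean(x, na.rm = na.rm),
    SD = sd(x, na.rm = na.rm),
    Median = median(x, na.rm = na.rm),
    Min = min(x, na.rm = na.rm),
    Max = max(x, na.rm = na.rm),
    N = length(x))
}

# Work on a copy so the built-in mtcars is left untouched
cars <- transform(mtcars, carb = as.factor(carb))

# Keep only the numeric columns, then stack one row of statistics per column
num_cols <- Filter(is.numeric, cars)
res <- do.call(rbind, lapply(num_cols, my.summary))
res
```

Dropping the non-numeric columns up front avoids the NULL entries that the `if (is.numeric(x))` guard produces for factor columns.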

Here are some examples using data.table. I am using the function defined in the previous answer.

# Returns a named numeric vector of statistics
my.summary <- function(x, na.rm=TRUE){
  c(Mean=mean(x, na.rm=na.rm),
    SD=sd(x, na.rm=na.rm),
    Median=median(x, na.rm=na.rm),
    Min=min(x, na.rm=na.rm),
    Max=max(x, na.rm=na.rm),
    N=length(x))
}
set.seed(123)
df <- data.frame(id = 1:1000,
                 Distance = rnorm(1000, 50, 100),
                 Age = rnorm(1000, 50, 100),
                 Height = rnorm(1000, 50, 100)
                 )
df$Coning <- as.factor(ifelse(df$Distance > 0, "Yes", "No"))
library(fBasics)
library(data.table)
DT <- data.table(df)
setkey(DT, id)

Grouping by the factor variable "Coning"

DT[,lapply(.SD,my.summary),by="Coning"]
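
The grouped result stacks six unlabeled rows per group, because the names of the vector returned by my.summary are dropped. One possible way to keep them is to add a label column in `j`. The sketch below assumes a smaller simulated table than the one above, and `stat_names` and the `Stat` column are illustrative names:

```r
library(data.table)

my.summary <- function(x, na.rm = TRUE) {
  c(Mean = mean(x, na.rm = na.rm),
    SD = sd(x, na.rm = na.rm),
    Median = median(x, na.rm = na.rm),
    Min = min(x, na.rm = na.rm),
    Max = max(x, na.rm = na.rm),
    N = length(x))
}

# Simulated data (assumption for illustration)
set.seed(123)
DT <- data.table(Distance = rnorm(100, 50, 100),
                 Height = rnorm(100, 50, 100))
DT[, Coning := factor(ifelse(Distance > 0, "Yes", "No"))]

# Names of the statistics, in the order my.summary returns them
stat_names <- names(my.summary(DT$Distance))

# One labeled block of six rows per level of Coning
res <- DT[, c(list(Stat = stat_names), lapply(.SD, my.summary)),
          by = Coning, .SDcols = c("Distance", "Height")]
res
```

Restricting `.SDcols` to the measurement columns also keeps identifier columns such as `id` out of the summary.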

Numeric variables only, using my.summary() and basicStats()

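DT[, lapply(.SD, my.summary), .SDcols = names(DT)[2:4]]
BS <- DT[, sapply(.SD, basicStats), .SDcols = names(DT)[2:4]]
# znames was not defined in the original; it is assumed here to hold the
# statistic names, i.e. the row names of the basicStats() output
znames <- rownames(basicStats(DT$Distance))
BS[, summary := znames]
setnames(BS, 1:3, names(DT)[2:4])
BS
DT[, lapply(.SD, summary), .SDcols = names(DT)[2:4]]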

Numeric variables, using summary()

DT[,sapply(.SD, function(x) if(is.numeric(x)) summary(x)),, .SDcols = names(DT)[2:4]]

Factor variables

DT[,sapply(.SD, function(x) if(is.factor(x)) summary(x)),, .SDcols = names(DT)[5]]

The quantile function is also very useful:

DT[,sapply(.SD, function(x) if(is.numeric(x)) quantile(x)),, .SDcols = names(DT)[2:4]]
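
quantile defaults to the minimum, quartiles, and maximum, but its `probs` argument accepts other cut points. A minimal base R sketch, assuming simulated data similar to the table above (the chosen probabilities and the name `q` are illustrative):

```r
# Simulated data (assumption for illustration)
set.seed(123)
df <- data.frame(Distance = rnorm(1000, 50, 100),
                 Age = rnorm(1000, 50, 100))

# 5th, 50th, and 95th percentiles of every column
probs <- c(0.05, 0.5, 0.95)
q <- sapply(df, quantile, probs = probs, na.rm = TRUE)
q
```

The result is a matrix with one row per requested percentile and one column per variable.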

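The collapse package provides qsu, a fast and efficient generator of summary statistics. I had been looking for an R function similar to Stata's su, and this is the best one I have found.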

https://sebkrantz.github.io/collapse/articles/collapse_intro.html
