R相当于用于汇总统计信息的"table,contents()"Stata命令



我正在尝试模仿R中的tableStata命令,该命令执行汇总统计表。该命令允许您在生成的单元格中创建具有不同统计信息的交叉表。例如,在下面的例子中,我将三个变量(category1category2category3(交叉,得到metric1的平均值和标准差以及metric2的平均值与标准差作为列向量。

所述行为通过Stata上的以下单行获得。

table category1 category2 category3 ,c(mean metric1 sd metric1 mean metric2 sd metric2) 

期望输出:表格说明

这里得到的交叉表的每个列向量,假设交叉表的X包含X = [mean(metric1),sd(metric1), mean(metric2),sd(metric2)]'

----------------------------------------------------------------------------
|                     category3 and category2                     
| ------------ First -----------    ----------- Second -----------
category1 |      A        B       C   Total         A       B       C   Total
----------+-----------------------------------------------------------------
1 |  mean(metric1)  
|  sd(metric1)  
|  mean(metric2)  
|  sd(metric1)   

所需输出(!(:Stata上的结果表


----------------------------------------------------------------------------
|                     category3 and category2                     
| ------------ First -----------    ----------- Second -----------
category1 |      A       B       C   Total         A       B       C   Total
----------+-----------------------------------------------------------------
1 |  5.778   7.200   2.571   5.048     6.667   3.000   3.000   4.222
|  2.906   3.347   2.507   3.324     2.309   1.414   1.155   2.333
| -1.556  -2.000  -1.143  -1.524    -2.000  -2.000  -3.000  -2.444
|  1.667   0.000   1.069   1.250     0.000   2.828   1.155   1.333
| 
2 |  3.200   6.333   4.200   4.571     4.889   5.000   5.000   4.947
|  2.280   3.445   2.741   2.976     3.180   3.464   2.449   2.857
| -0.800  -2.000  -2.000  -1.714    -2.222  -1.500  -1.000  -1.684
|  1.095   1.265   1.333   1.309     1.563   1.000   1.673   1.529
| 
3 |  8.667   4.667   5.167   5.667     5.667   6.667   6.000   6.000
|  2.309   2.309   2.758   2.849     3.445   4.163   3.464   3.303
| -3.333  -2.667  -2.000  -2.333    -2.333  -2.000  -1.333  -2.000
|  1.155   1.155   1.477   1.414     0.816   2.000   1.155   1.206
| 
Total |  5.529   6.286   4.207   5.067     5.444   5.111   4.615   5.100
|  3.125   3.124   2.795   3.047     3.053   3.333   2.501   2.898
| -1.647  -2.143  -1.793  -1.833    -2.222  -1.778  -1.692  -1.950
|  1.618   0.949   1.346   1.342     1.166   1.563   1.601   1.395
----------------------------------------------------------------------------

生成上述结果的Stata代码

clear all
set obs 100
set seed 777
gen category1 = runiformint(1,3)
gen category2_num = runiformint(1,3)
gen category2 = "A" if category2_num ==1
replace category2 = "B" if category2_num ==2
replace category2 = "C" if category2_num ==3
drop category2_num
gen category3_num = runiformint(1,2)
gen category3 = "First" if category3_num ==1
replace category3 = "Second" if category3_num ==2
drop category3_num
gen metric1 = round(runiform()*10,2)
gen metric2 = round(runiform()*-4,2)
table category1 category2 category3 /// List of the variables that will create the crosstab
,c(mean metric1 sd metric1 /// Mean and std.dev of metric1 as 1st and 2nd rows
mean metric2 sd metric2)   /// Mean and std.dev of metric2 as 3rd and 4th rows
row col                    /// Add the over all statistics total rows and cols
format(%9.3f)              // Decimal style setting.

R尝试

以下是我解决这个问题的方法。然而,我离我想要的结果还很远。尽管我在屏幕上显示了相同的信息,但我在R上显示信息的方式可读性非常差。此外,我还没有计算行和列总数的平均值和标准差,我在Stata输出中也计算过。

最后,在我看来,对于这样一个简单的问题,这个程序似乎是一个过度的解决方案。在我的上下文中,包是允许的,因此,欢迎dplyrdata.table的建议。

包括Stata生成的数据+复制例程

df <- as.data.frame(structure(list(category1 = structure(c(1, 3, 1, 2, 3, 1, 3, 1,1, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 3, 1, 3, 3, 1, 3, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 3, 3, 2, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2, 1,3, 3, 3, 2, 2, 1, 1, 1, 3, 2, 3, 1, 2, 2, 1, 3, 1, 3, 1, 1, 3,1, 1, 2, 1, 3, 2, 2, 3, 3, 3, 1, 2, 3, 2, 3, 2, 1, 1, 1, 2, 2,2, 1, 3, 2, 2, 2, 3, 3), format.stata = "%9.0g"), 
category2 = structure(c("C", "A", "A", "A", "C", "C", "A", "A", "A", "A", "B", "A", "A", "A","A", "B", "A", "C", "C", "B", "C", "A", "A", "C", "A", "B", "C", "B", "C", "C", "A", "C", "B", "B", "A", "B", "C", "A", "B", "B","C", "A", "A", "C", "C", "B", "C", "A", "A", "C", "C", "B", "C", "C", "A", "C", "A", "A", "C", "B", "A", "C", "C", "C", "B", "B","C", "C", "A", "A", "C", "C", "A", "C", "B", "B", "C", "C", "C", "C", "A", "C", "C", "C", "C", "B", "B", "B", "B", "C", "A", "A","C", "C", "A", "A", "A", "B", "B", "C"), format.stata = "%9s"), 
category3 = structure(c("First", "Second", "First", "First", "First", "First", "Second", "Second", "First", "Second", "First", "First", "Second", "Second", "First", "Second", "Second", "First", "Second", "First", "First", "First", "First","Second", "First", "First", "Second", "First", "First", "First","First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First","Second", "Second", "First", "Second", "Second", "Second","First", "First", "First", "Second", "Second", "First", "First","Second", "First", "First", "First", "First", "Second", "First","Second", "Second", "First", "Second", "First", "Second", "First", "Second", "First", "First", "First", "First", "Second","First", "First", "First", "Second", "Second", "First", "First","First", "Second", "First", "Second", "First", "Second","Second", "First", "Second", "First", "First", "Second","Second", "Second", "Second", "First"), format.stata = "%9s"),
metric1 = structure(c(0, 10, 0, 0, 8, 4, 4, 8, 8, 2, 4, 4, 6, 2, 6, 8, 6, 4, 4, 10, 10, 4, 6, 8, 6, 2, 4, 4, 6, 0, 6,0, 10, 8, 2, 2, 2, 0, 2, 10, 2, 8, 4, 6, 8, 2, 2, 6, 0, 2,4, 6, 2, 2, 8, 6, 8, 8, 2, 8, 10, 4, 4, 4, 4, 10, 4, 2, 6,4, 6, 4, 10, 2, 8, 6, 8, 2, 6, 6, 6, 4, 8, 6, 8, 2, 10, 2, 6, 2, 10, 4, 8, 0, 10, 6, 4, 2, 8, 8), format.stata = "%9.0g"),
metric2 = structure(c(0, -4, 0, 0, -2, -2, -2, -2, -4, -2, -2, -2, -2, -4, 0, 0, -2, -2, -4, -2, 0, -2, -4, -2, -2, -2, -2, -2, -4, 0, -4, -4, -2, -2, -2, -2, -2, -2, -4, -2, -2, -2, -2, -2, 0, -2, -4, -4, -2, -2, 0, -4, -2, 0, -2,-2, 0, -2, -4, 0, -2, -2, 0, 0, -4, -4, 0, -2, 0, -2, -2, -4, 0, -2, -2, -2, 0, -2, -2, -2, -2, -2, -2, 0, 0, 0, -2, 0, -2, -4, 0, 0, 0, -2, -4, -4, 0, -2, -2, -4), format.stata = "%9.0g")), 
row.names = c(NA,-100L), class = c("tbl_df", "tbl", "data.frame")))
# expand grid for every possible value
prs <- expand.grid(cat1 = unique(df$category1)   ,
cat2 = unique(df$category2) ,
cat3 = unique(df$category3))
#Number of total combinations 
N <-   nrow(prs)
#Loop over the combinations to get the desired statistis
A <- lapply(1:N, FUN = function(i){
mean1 <- mean(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])
sd1   <- sd(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])

mean2 <- mean(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])
sd2   <- sd(df[(df$category1 == prs$cat1[i] &  df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])

r_list<- list(cat1 = prs$cat1[i],cat2 = prs$cat2[i], cat3 = prs$cat3[i],
mean1 = mean1,  sd1 = sd1 , mean2 = mean2, sd2 = sd2)
return(r_list)
})
#List to data.frame
df_stats <- do.call(rbind.data.frame, A)

获得的输出(但是,不是我想要的输出(!((

# cat1 cat2   cat3    mean1      sd1     mean2       sd2
# 2     1    C  First 2.571429 2.507133 -1.142857 1.0690450
# 21    3    C  First 5.166667 2.757909 -2.000000 1.4770979
# 3     2    C  First 4.200000 2.740641 -2.000000 1.3333333
# 4     1    A  First 5.777778 2.905933 -1.555556 1.6666667
# 5     3    A  First 8.666667 2.309401 -3.333333 1.1547005
# 6     2    A  First 3.200000 2.280351 -0.800000 1.0954451
# 7     1    B  First 7.200000 3.346640 -2.000000 0.0000000
# 8     3    B  First 4.666667 2.309401 -2.666667 1.1547005
# 9     2    B  First 6.333333 3.444803 -2.000000 1.2649111
# 10    1    C Second 3.000000 1.154701 -3.000000 1.1547005
# 11    3    C Second 6.000000 3.464102 -1.333333 1.1547005
# 12    2    C Second 5.000000 2.449490 -1.000000 1.6733201
# 13    1    A Second 6.666667 2.309401 -2.000000 0.0000000
# 14    3    A Second 5.666667 3.444803 -2.333333 0.8164966
# 15    2    A Second 4.888889 3.179797 -2.222222 1.5634719
# 16    1    B Second 3.000000 1.414214 -2.000000 2.8284271
# 17    3    B Second 6.666667 4.163332 -2.000000 2.0000000
# 18    2    B Second 5.000000 3.464102 -1.500000 1.0000000

您可以使用data.tablemagrittr包,如下所示:

library(magrittr)
library(data.table)
# function to compute the mean and sd
fun <- function(x, y) list(metric1_meam=mean(x), metric1_sd=sd(x), metric2_meam=mean(y), metric2_sd=sd(y))
# compute the Total column, and A,B,C columns of the desired output as follows and bind them 
setDT(df)[, 'category1' := as.character(category1)]
Y <- rbind(
df[, fun(metric1, metric2), by=.(category1, category2, category3)],
df[, fun(metric1, metric2), by=.(category1, category3)][, category2 := 'Total'],
df[, fun(metric1, metric2), by=.(category2, category3)][, category1 := 'Total'],
df[, fun(metric1, metric2), by=.(category3)][, c('category1', 'category2') := 'Total']
)
# generate the desired output
melt(Y, measure=patterns('metric')) %>% 
xtabs(formula = value ~ .) %>% 
ftable(col.vars = c('category3', 'category2'))


category3      First                                      Second                                 
category2          A          B          C      Total          A          B          C      Total
category1 variable                                                                                                      
1         metric1_meam            5.7777778  7.2000000  2.5714286  5.0476190  6.6666667  3.0000000  3.0000000  4.2222222
metric1_sd              2.9059326  3.3466401  2.5071327  3.3237959  2.3094011  1.4142136  1.1547005  2.3333333
metric2_meam           -1.5555556 -2.0000000 -1.1428571 -1.5238095 -2.0000000 -2.0000000 -3.0000000 -2.4444444
metric2_sd              1.6666667  0.0000000  1.0690450  1.2497619  0.0000000  2.8284271  1.1547005  1.3333333
2         metric1_meam            3.2000000  6.3333333  4.2000000  4.5714286  4.8888889  5.0000000  5.0000000  4.9473684
metric1_sd              2.2803509  3.4448028  2.7406406  2.9760952  3.1797973  3.4641016  2.4494897  2.8572264
metric2_meam           -0.8000000 -2.0000000 -2.0000000 -1.7142857 -2.2222222 -1.5000000 -1.0000000 -1.6842105
metric2_sd              1.0954451  1.2649111  1.3333333  1.3093073  1.5634719  1.0000000  1.6733201  1.5294382
3         metric1_meam            8.6666667  4.6666667  5.1666667  5.6666667  5.6666667  6.6666667  6.0000000  6.0000000
metric1_sd              2.3094011  2.3094011  2.7579087  2.8491485  3.4448028  4.1633320  3.4641016  3.3028913
metric2_meam           -3.3333333 -2.6666667 -2.0000000 -2.3333333 -2.3333333 -2.0000000 -1.3333333 -2.0000000
metric2_sd              1.1547005  1.1547005  1.4770979  1.4142136  0.8164966  2.0000000  1.1547005  1.2060454
Total     metric1_meam            5.5294118  6.2857143  4.2068966  5.0666667  5.4444444  5.1111111  4.6153846  5.1000000
metric1_sd              3.1248529  3.1238185  2.7951400  3.0469027  3.0529103  3.3333333  2.5012817  2.8982753
metric2_meam           -1.6470588 -2.1428571 -1.7931034 -1.8333333 -2.2222222 -1.7777778 -1.6923077 -1.9500000
metric2_sd              1.6179144  0.9492623  1.3464055  1.3424827  1.1659662  1.5634719  1.6012815  1.3950462

您可以利用aggregate的力量。

FUN <- function(x) c(mean=mean(x), sd=sd(x))
aggregate(cbind(metric1, metric2) ~ ., df, FUN)
#    category1 category2 category3 metric1.mean metric1.sd metric2.mean metric2.sd
# 1          1         A     First     5.777778   2.905933   -1.5555556  1.6666667
# 2          2         A     First     3.200000   2.280351   -0.8000000  1.0954451
# 3          3         A     First     8.666667   2.309401   -3.3333333  1.1547005
# 4          1         B     First     7.200000   3.346640   -2.0000000  0.0000000
# 5          2         B     First     6.333333   3.444803   -2.0000000  1.2649111
# 6          3         B     First     4.666667   2.309401   -2.6666667  1.1547005
# 7          1         C     First     2.571429   2.507133   -1.1428571  1.0690450
# 8          2         C     First     4.200000   2.740641   -2.0000000  1.3333333
# 9          3         C     First     5.166667   2.757909   -2.0000000  1.4770979
# 10         1         A    Second     6.666667   2.309401   -2.0000000  0.0000000
# 11         2         A    Second     4.888889   3.179797   -2.2222222  1.5634719
# 12         3         A    Second     5.666667   3.444803   -2.3333333  0.8164966
# 13         1         B    Second     3.000000   1.414214   -2.0000000  2.8284271
# 14         2         B    Second     5.000000   3.464102   -1.5000000  1.0000000
# 15         3         B    Second     6.666667   4.163332   -2.0000000  2.0000000
# 16         1         C    Second     3.000000   1.154701   -3.0000000  1.1547005
# 17         2         C    Second     5.000000   2.449490   -1.0000000  1.6733201
# 18         3         C    Second     6.000000   3.464102   -1.3333333  1.1547005         

对于交叉列表,请尝试xtabs

当应用多个函数时,aggregate会生成列中的矩阵(请参阅这个答案,为什么(,所以首先我们要去掉它们。

r <- do.call(data.frame, aggregate(cbind(metric1, metric2) ~ ., df, FUN))

现在我们可以应用xtabs,例如,对于每个类别3。

xtabs(cbind(metric1.mean, metric1.sd) ~ ., r[r$category3 == "First", 1:5])
# , , category3 = First,  = metric1.mean
# 
#          category2
# category1        A        B        C
#         1 5.777778 7.200000 2.571429
#         2 3.200000 6.333333 4.200000
#         3 8.666667 4.666667 5.166667
# 
# , , category3 = First,  = metric1.sd
# 
#           category2
# category1        A        B        C
#         1 2.905933 3.346640 2.507133
#         2 2.280351 3.444803 2.740641
#         3 2.309401 2.309401 2.757909
xtabs(cbind(metric1.mean, metric1.sd) ~ ., r[r$category3 == "Second", 1:5])
# , , category3 = Second,  = metric1.mean
# 
#          category2
# category1        A        B        C
#         1 6.666667 3.000000 3.000000
#         2 4.888889 5.000000 5.000000
#         3 5.666667 6.666667 6.000000
# 
# , , category3 = Second,  = metric1.sd
# 
#          category2
# category1        A        B        C
#         1 2.309401 1.414214 1.154701
#         2 3.179797 3.464102 2.449490
#         3 3.444803 4.163332 3.464102

或者使用sapply一步到位。

sapply(c("First", "Second"), function(c3) 
xtabs(cbind(metric1.mean, metric1.sd) ~ ., r[r$category3 == c3, 1:5]),
simplify="array")
# , , category3 = First,  = metric1.mean,  = First
# 
#          category2
# category1        A        B        C
#         1 5.777778 7.200000 2.571429
#         2 3.200000 6.333333 4.200000
#         3 8.666667 4.666667 5.166667
# 
# , , category3 = First,  = metric1.sd,  = First
# 
#          category2
# category1        A        B        C
#         1 2.905933 3.346640 2.507133
#         2 2.280351 3.444803 2.740641
#         3 2.309401 2.309401 2.757909
# 
# , , category3 = First,  = metric1.mean,  = Second
# 
#          category2
# category1        A        B C
#         1 6.666667 3.000000 3
#         2 4.888889 5.000000 5
#         3 5.666667 6.666667 6
# 
# , , category3 = First,  = metric1.sd,  = Second
# 
#          category2
# category1        A        B        C
#         1 2.309401 1.414214 1.154701
#         2 3.179797 3.464102 2.449490
#         3 3.444803 4.163332 3.464102

相关内容

最新更新