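I am trying to mimic in R the Stata command table, which produces tables of summary statistics. The command lets you build a cross-tabulation with several different statistics in each cell. In the example below, I cross three variables (category1, category2, and category3) and get the mean and standard deviation of metric1, and the mean and standard deviation of metric2, as a column vector in each cell.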
In Stata, this behavior is obtained with the following one-liner.
table category1 category2 category3 ,c(mean metric1 sd metric1 mean metric2 sd metric2)
Desired output: explanation of the table
Each cell of the cross-tabulation holds a column vector X, where X = [mean(metric1), sd(metric1), mean(metric2), sd(metric2)]'
----------------------------------------------------------------------------
| category3 and category2
| ------------ First ----------- ----------- Second -----------
category1 | A B C Total A B C Total
----------+-----------------------------------------------------------------
1 | mean(metric1)
| sd(metric1)
| mean(metric2)
| sd(metric2)
Desired output: the resulting table in Stata
----------------------------------------------------------------------------
| category3 and category2
| ------------ First ----------- ----------- Second -----------
category1 | A B C Total A B C Total
----------+-----------------------------------------------------------------
1 | 5.778 7.200 2.571 5.048 6.667 3.000 3.000 4.222
| 2.906 3.347 2.507 3.324 2.309 1.414 1.155 2.333
| -1.556 -2.000 -1.143 -1.524 -2.000 -2.000 -3.000 -2.444
| 1.667 0.000 1.069 1.250 0.000 2.828 1.155 1.333
|
2 | 3.200 6.333 4.200 4.571 4.889 5.000 5.000 4.947
| 2.280 3.445 2.741 2.976 3.180 3.464 2.449 2.857
| -0.800 -2.000 -2.000 -1.714 -2.222 -1.500 -1.000 -1.684
| 1.095 1.265 1.333 1.309 1.563 1.000 1.673 1.529
|
3 | 8.667 4.667 5.167 5.667 5.667 6.667 6.000 6.000
| 2.309 2.309 2.758 2.849 3.445 4.163 3.464 3.303
| -3.333 -2.667 -2.000 -2.333 -2.333 -2.000 -1.333 -2.000
| 1.155 1.155 1.477 1.414 0.816 2.000 1.155 1.206
|
Total | 5.529 6.286 4.207 5.067 5.444 5.111 4.615 5.100
| 3.125 3.124 2.795 3.047 3.053 3.333 2.501 2.898
| -1.647 -2.143 -1.793 -1.833 -2.222 -1.778 -1.692 -1.950
| 1.618 0.949 1.346 1.342 1.166 1.563 1.601 1.395
----------------------------------------------------------------------------
Stata code that generates the results above
clear all
set obs 100
set seed 777
gen category1 = runiformint(1,3)
gen category2_num = runiformint(1,3)
gen category2 = "A" if category2_num ==1
replace category2 = "B" if category2_num ==2
replace category2 = "C" if category2_num ==3
drop category2_num
gen category3_num = runiformint(1,2)
gen category3 = "First" if category3_num ==1
replace category3 = "Second" if category3_num ==2
drop category3_num
gen metric1 = round(runiform()*10,2)
gen metric2 = round(runiform()*-4,2)
table category1 category2 category3 /// List of the variables that will create the crosstab
,c(mean metric1 sd metric1 /// Mean and std.dev of metric1 as 1st and 2nd rows
mean metric2 sd metric2) /// Mean and std.dev of metric2 as 3rd and 4th rows
row col /// Add the overall totals as an extra row and column
format(%9.3f) // Decimal style setting.
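For readers without Stata, the data-generating step above can be approximated in R. This is only a sketch: R's random number generator differs from Stata's, so the same seed will not reproduce the values shown in the table, and the name sim is made up for illustration. Note that Stata's round(x, 2) rounds to the nearest multiple of 2, not to 2 decimal places, which is why the metrics only take even values.

```r
# Rough R analogue of the Stata simulation (values will NOT match Stata's).
set.seed(777)
n <- 100

sim <- data.frame(
  category1 = sample(1:3, n, replace = TRUE),
  category2 = sample(c("A", "B", "C"), n, replace = TRUE),
  category3 = sample(c("First", "Second"), n, replace = TRUE),
  # Stata's round(x, 2) means "nearest multiple of 2"
  metric1   = 2 * round(runif(n) * 10 / 2),
  metric2   = 2 * round(runif(n) * -4 / 2)
)

str(sim)
```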
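R attempt
Below is my approach to the problem. However, I am still far from the result I want. Although it puts the same information on screen, the way it is displayed in R is much harder to read. I also have not computed the mean and standard deviation for the row and column totals, which the Stata output includes.
Finally, this routine seems like overkill for such a simple problem. Packages are allowed in my context, so suggestions using dplyr or data.table are welcome.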
The data generated in Stata, plus the routine that reproduces my attempt
df <- as.data.frame(structure(list(category1 = structure(c(1, 3, 1, 2, 3, 1, 3, 1,1, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 3, 1, 3, 3, 1, 3, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 3, 3, 2, 2, 2, 3, 1, 2, 3, 2, 3, 2, 2, 1,3, 3, 3, 2, 2, 1, 1, 1, 3, 2, 3, 1, 2, 2, 1, 3, 1, 3, 1, 1, 3,1, 1, 2, 1, 3, 2, 2, 3, 3, 3, 1, 2, 3, 2, 3, 2, 1, 1, 1, 2, 2,2, 1, 3, 2, 2, 2, 3, 3), format.stata = "%9.0g"),
category2 = structure(c("C", "A", "A", "A", "C", "C", "A", "A", "A", "A", "B", "A", "A", "A","A", "B", "A", "C", "C", "B", "C", "A", "A", "C", "A", "B", "C", "B", "C", "C", "A", "C", "B", "B", "A", "B", "C", "A", "B", "B","C", "A", "A", "C", "C", "B", "C", "A", "A", "C", "C", "B", "C", "C", "A", "C", "A", "A", "C", "B", "A", "C", "C", "C", "B", "B","C", "C", "A", "A", "C", "C", "A", "C", "B", "B", "C", "C", "C", "C", "A", "C", "C", "C", "C", "B", "B", "B", "B", "C", "A", "A","C", "C", "A", "A", "A", "B", "B", "C"), format.stata = "%9s"),
category3 = structure(c("First", "Second", "First", "First", "First", "First", "Second", "Second", "First", "Second", "First", "First", "Second", "Second", "First", "Second", "Second", "First", "Second", "First", "First", "First", "First","Second", "First", "First", "Second", "First", "First", "First","First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First", "First", "Second", "First", "First","Second", "Second", "First", "Second", "Second", "Second","First", "First", "First", "Second", "Second", "First", "First","Second", "First", "First", "First", "First", "Second", "First","Second", "Second", "First", "Second", "First", "Second", "First", "Second", "First", "First", "First", "First", "Second","First", "First", "First", "Second", "Second", "First", "First","First", "Second", "First", "Second", "First", "Second","Second", "First", "Second", "First", "First", "Second","Second", "Second", "Second", "First"), format.stata = "%9s"),
metric1 = structure(c(0, 10, 0, 0, 8, 4, 4, 8, 8, 2, 4, 4, 6, 2, 6, 8, 6, 4, 4, 10, 10, 4, 6, 8, 6, 2, 4, 4, 6, 0, 6,0, 10, 8, 2, 2, 2, 0, 2, 10, 2, 8, 4, 6, 8, 2, 2, 6, 0, 2,4, 6, 2, 2, 8, 6, 8, 8, 2, 8, 10, 4, 4, 4, 4, 10, 4, 2, 6,4, 6, 4, 10, 2, 8, 6, 8, 2, 6, 6, 6, 4, 8, 6, 8, 2, 10, 2, 6, 2, 10, 4, 8, 0, 10, 6, 4, 2, 8, 8), format.stata = "%9.0g"),
metric2 = structure(c(0, -4, 0, 0, -2, -2, -2, -2, -4, -2, -2, -2, -2, -4, 0, 0, -2, -2, -4, -2, 0, -2, -4, -2, -2, -2, -2, -2, -4, 0, -4, -4, -2, -2, -2, -2, -2, -2, -4, -2, -2, -2, -2, -2, 0, -2, -4, -4, -2, -2, 0, -4, -2, 0, -2,-2, 0, -2, -4, 0, -2, -2, 0, 0, -4, -4, 0, -2, 0, -2, -2, -4, 0, -2, -2, -2, 0, -2, -2, -2, -2, -2, -2, 0, 0, 0, -2, 0, -2, -4, 0, 0, 0, -2, -4, -4, 0, -2, -2, -4), format.stata = "%9.0g")),
row.names = c(NA,-100L), class = c("tbl_df", "tbl", "data.frame")))
# expand grid for every possible value
prs <- expand.grid(cat1 = unique(df$category1) ,
cat2 = unique(df$category2) ,
cat3 = unique(df$category3))
#Number of total combinations
N <- nrow(prs)
#Loop over the combinations to get the desired statistics
A <- lapply(1:N, FUN = function(i){
mean1 <- mean(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])
sd1 <- sd(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric1"])
mean2 <- mean(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])
sd2 <- sd(df[(df$category1 == prs$cat1[i] & df$category2 == prs$cat2[i] & df$category3 == prs$cat3[i] ), "metric2"])
r_list<- list(cat1 = prs$cat1[i],cat2 = prs$cat2[i], cat3 = prs$cat3[i],
mean1 = mean1, sd1 = sd1 , mean2 = mean2, sd2 = sd2)
return(r_list)
})
#List to data.frame
df_stats <- do.call(rbind.data.frame, A)
Output obtained (however, not the output I want)
# cat1 cat2 cat3 mean1 sd1 mean2 sd2
# 2 1 C First 2.571429 2.507133 -1.142857 1.0690450
# 21 3 C First 5.166667 2.757909 -2.000000 1.4770979
# 3 2 C First 4.200000 2.740641 -2.000000 1.3333333
# 4 1 A First 5.777778 2.905933 -1.555556 1.6666667
# 5 3 A First 8.666667 2.309401 -3.333333 1.1547005
# 6 2 A First 3.200000 2.280351 -0.800000 1.0954451
# 7 1 B First 7.200000 3.346640 -2.000000 0.0000000
# 8 3 B First 4.666667 2.309401 -2.666667 1.1547005
# 9 2 B First 6.333333 3.444803 -2.000000 1.2649111
# 10 1 C Second 3.000000 1.154701 -3.000000 1.1547005
# 11 3 C Second 6.000000 3.464102 -1.333333 1.1547005
# 12 2 C Second 5.000000 2.449490 -1.000000 1.6733201
# 13 1 A Second 6.666667 2.309401 -2.000000 0.0000000
# 14 3 A Second 5.666667 3.444803 -2.333333 0.8164966
# 15 2 A Second 4.888889 3.179797 -2.222222 1.5634719
# 16 1 B Second 3.000000 1.414214 -2.000000 2.8284271
# 17 3 B Second 6.666667 4.163332 -2.000000 2.0000000
# 18 2 B Second 5.000000 3.464102 -1.500000 1.0000000
You can use the data.table and magrittr packages as follows:
library(magrittr)
library(data.table)
# function to compute the mean and sd
fun <- function(x, y) list(metric1_meam=mean(x), metric1_sd=sd(x), metric2_meam=mean(y), metric2_sd=sd(y))
# compute the Total column, and A,B,C columns of the desired output as follows and bind them
setDT(df)[, 'category1' := as.character(category1)]
Y <- rbind(
df[, fun(metric1, metric2), by=.(category1, category2, category3)],
df[, fun(metric1, metric2), by=.(category1, category3)][, category2 := 'Total'],
df[, fun(metric1, metric2), by=.(category2, category3)][, category1 := 'Total'],
df[, fun(metric1, metric2), by=.(category3)][, c('category1', 'category2') := 'Total']
)
# generate the desired output
melt(Y, measure=patterns('metric')) %>%
xtabs(formula = value ~ .) %>%
ftable(col.vars = c('category3', 'category2'))
category3 First Second
category2 A B C Total A B C Total
category1 variable
1 metric1_meam 5.7777778 7.2000000 2.5714286 5.0476190 6.6666667 3.0000000 3.0000000 4.2222222
metric1_sd 2.9059326 3.3466401 2.5071327 3.3237959 2.3094011 1.4142136 1.1547005 2.3333333
metric2_meam -1.5555556 -2.0000000 -1.1428571 -1.5238095 -2.0000000 -2.0000000 -3.0000000 -2.4444444
metric2_sd 1.6666667 0.0000000 1.0690450 1.2497619 0.0000000 2.8284271 1.1547005 1.3333333
2 metric1_meam 3.2000000 6.3333333 4.2000000 4.5714286 4.8888889 5.0000000 5.0000000 4.9473684
metric1_sd 2.2803509 3.4448028 2.7406406 2.9760952 3.1797973 3.4641016 2.4494897 2.8572264
metric2_meam -0.8000000 -2.0000000 -2.0000000 -1.7142857 -2.2222222 -1.5000000 -1.0000000 -1.6842105
metric2_sd 1.0954451 1.2649111 1.3333333 1.3093073 1.5634719 1.0000000 1.6733201 1.5294382
3 metric1_meam 8.6666667 4.6666667 5.1666667 5.6666667 5.6666667 6.6666667 6.0000000 6.0000000
metric1_sd 2.3094011 2.3094011 2.7579087 2.8491485 3.4448028 4.1633320 3.4641016 3.3028913
metric2_meam -3.3333333 -2.6666667 -2.0000000 -2.3333333 -2.3333333 -2.0000000 -1.3333333 -2.0000000
metric2_sd 1.1547005 1.1547005 1.4770979 1.4142136 0.8164966 2.0000000 1.1547005 1.2060454
Total metric1_meam 5.5294118 6.2857143 4.2068966 5.0666667 5.4444444 5.1111111 4.6153846 5.1000000
metric1_sd 3.1248529 3.1238185 2.7951400 3.0469027 3.0529103 3.3333333 2.5012817 2.8982753
metric2_meam -1.6470588 -2.1428571 -1.7931034 -1.8333333 -2.2222222 -1.7777778 -1.6923077 -1.9500000
metric2_sd 1.6179144 0.9492623 1.3464055 1.3424827 1.1659662 1.5634719 1.6012815 1.3950462
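The last step of the pipeline above (long data, then xtabs, then ftable) may be easier to follow on a tiny example. The sketch below uses base R only and made-up numbers; the data frame long is a hypothetical stand-in for what melt returns, with one row per cell and statistic.

```r
# Hypothetical long-format data: one row per (category1, category2, statistic).
long <- data.frame(
  category1 = c("1", "1", "2", "2", "1", "1", "2", "2"),
  category2 = rep(c("A", "B"), each = 4),
  variable  = rep(c("mean", "sd"), times = 4),
  value     = c(3, 1.5, 7, 2, 2, 1, 7, 3)
)

# Every column except `value` becomes a dimension of the array.
# Each combination occurs once, so the "sum" in a cell is the value itself.
tab <- xtabs(value ~ ., long)

# Flatten the 3-way array: category2 across the top, the rest down the side.
ft <- ftable(tab, col.vars = "category2")
print(ft)
```

Because xtabs sums value within each cell, this only works as a display device when each combination of the grouping columns appears exactly once, as it does after the aggregation step.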
You can harness the power of aggregate.
FUN <- function(x) c(mean=mean(x), sd=sd(x))
aggregate(cbind(metric1, metric2) ~ ., df, FUN)
# category1 category2 category3 metric1.mean metric1.sd metric2.mean metric2.sd
# 1 1 A First 5.777778 2.905933 -1.5555556 1.6666667
# 2 2 A First 3.200000 2.280351 -0.8000000 1.0954451
# 3 3 A First 8.666667 2.309401 -3.3333333 1.1547005
# 4 1 B First 7.200000 3.346640 -2.0000000 0.0000000
# 5 2 B First 6.333333 3.444803 -2.0000000 1.2649111
# 6 3 B First 4.666667 2.309401 -2.6666667 1.1547005
# 7 1 C First 2.571429 2.507133 -1.1428571 1.0690450
# 8 2 C First 4.200000 2.740641 -2.0000000 1.3333333
# 9 3 C First 5.166667 2.757909 -2.0000000 1.4770979
# 10 1 A Second 6.666667 2.309401 -2.0000000 0.0000000
# 11 2 A Second 4.888889 3.179797 -2.2222222 1.5634719
# 12 3 A Second 5.666667 3.444803 -2.3333333 0.8164966
# 13 1 B Second 3.000000 1.414214 -2.0000000 2.8284271
# 14 2 B Second 5.000000 3.464102 -1.5000000 1.0000000
# 15 3 B Second 6.666667 4.163332 -2.0000000 2.0000000
# 16 1 C Second 3.000000 1.154701 -3.0000000 1.1547005
# 17 2 C Second 5.000000 2.449490 -1.0000000 1.6733201
# 18 3 C Second 6.000000 3.464102 -1.3333333 1.1547005
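For the cross-tabulation, try xtabs.
When several functions are applied, aggregate produces matrix columns in the result (see this answer for why), so first we need to flatten them into ordinary columns.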
r <- do.call(data.frame, aggregate(cbind(metric1, metric2) ~ ., df, FUN))
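To see what the do.call(data.frame, ...) step is doing, here is a small sketch on made-up data (toy, agg, and flat are illustrative names, not part of the question's data):

```r
# Toy data, unrelated to the question's df.
toy <- data.frame(
  g = c("a", "a", "b", "b"),
  x = c(1, 3, 5, 9)
)
FUN <- function(x) c(mean = mean(x), sd = sd(x))

agg <- aggregate(x ~ g, toy, FUN)
ncol(agg)         # only 2 columns: g and x
is.matrix(agg$x)  # x is a single matrix column holding both mean and sd

# data.frame() splits a matrix argument into one column per matrix column.
flat <- do.call(data.frame, agg)
names(flat)       # g, x.mean, x.sd
```

After flattening, the statistics can be addressed by name (x.mean, x.sd), which is what the xtabs formulas rely on.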
Now we can apply xtabs, e.g., for each level of category3.
xtabs(cbind(metric1.mean, metric1.sd) ~ ., r[r$category3 == "First", 1:5])
# , , category3 = First, = metric1.mean
#
# category2
# category1 A B C
# 1 5.777778 7.200000 2.571429
# 2 3.200000 6.333333 4.200000
# 3 8.666667 4.666667 5.166667
#
# , , category3 = First, = metric1.sd
#
# category2
# category1 A B C
# 1 2.905933 3.346640 2.507133
# 2 2.280351 3.444803 2.740641
# 3 2.309401 2.309401 2.757909
xtabs(cbind(metric1.mean, metric1.sd) ~ ., r[r$category3 == "Second", 1:5])
# , , category3 = Second, = metric1.mean
#
# category2
# category1 A B C
# 1 6.666667 3.000000 3.000000
# 2 4.888889 5.000000 5.000000
# 3 5.666667 6.666667 6.000000
#
# , , category3 = Second, = metric1.sd
#
# category2
# category1 A B C
# 1 2.309401 1.414214 1.154701
# 2 3.179797 3.464102 2.449490
# 3 3.444803 4.163332 3.464102
Or use sapply to do it in one step.
sapply(c("First", "Second"), function(c3)
xtabs(cbind(metric1.mean, metric1.sd) ~ ., r[r$category3 == c3, 1:5]),
simplify="array")
# , , category3 = First, = metric1.mean, = First
#
# category2
# category1 A B C
# 1 5.777778 7.200000 2.571429
# 2 3.200000 6.333333 4.200000
# 3 8.666667 4.666667 5.166667
#
# , , category3 = First, = metric1.sd, = First
#
# category2
# category1 A B C
# 1 2.905933 3.346640 2.507133
# 2 2.280351 3.444803 2.740641
# 3 2.309401 2.309401 2.757909
#
# , , category3 = First, = metric1.mean, = Second
#
# category2
# category1 A B C
# 1 6.666667 3.000000 3
# 2 4.888889 5.000000 5
# 3 5.666667 6.666667 6
#
# , , category3 = First, = metric1.sd, = Second
#
# category2
# category1 A B C
# 1 2.309401 1.414214 1.154701
# 2 3.179797 3.464102 2.449490
# 3 3.444803 4.163332 3.464102
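The aggregate and xtabs output above does not include the Total row and column of the Stata table. One possible way to get them in base R, sketched here on made-up data with two grouping variables and one metric, is to stack copies of the data in which a grouping variable is relabelled "Total" before aggregating. The helper with_total is hypothetical, not a function from any package.

```r
# Toy data, unrelated to the question's df.
toy <- data.frame(
  category1 = c("1", "1", "2", "2", "1", "1", "2", "2"),
  category2 = c("A", "A", "A", "A", "B", "B", "B", "B"),
  metric1   = c(2, 4, 6, 8, 1, 3, 5, 9)
)

# Append a copy of the data with `var` set to "Total", so that pooled
# statistics come out of an ordinary aggregate() call.
with_total <- function(d, var) {
  extra <- d
  extra[[var]] <- "Total"
  rbind(d, extra)
}
stacked <- with_total(with_total(toy, "category1"), "category2")

FUN <- function(x) c(mean = mean(x), sd = sd(x))
r <- do.call(data.frame,
             aggregate(metric1 ~ category1 + category2, stacked, FUN))

# One row per cell, so xtabs simply lays the means out as a table.
means <- xtabs(metric1.mean ~ category1 + category2, r)
print(round(means, 3))
```

The totals obtained this way are computed from the pooled observations, which is what Stata's row and col options report; averaging the cell means instead would give different numbers whenever the cells have unequal sizes.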