r语言 - 重构数据-基于多个分组列得到均值


library(tidyverse)
set.seed(10)
dat <- data.frame(age = sample(14:79, size = 15, replace = TRUE),
sex = sample(c("m", "f"), size = 15, replace = TRUE),
region = sample(c("A", "B", "C"), size = 15, replace = TRUE),
var1 = runif(15, min = 0, max=100)) %>%
mutate(agegrp = cut(age, breaks = c(-Inf, 20, 50, 70, Inf), labels = c("<= 20", "21-50", "51-70", ">70"))) 

表是这样的:

age sex region      var1 agegrp
1   79   m      A 64.669975    >70
2   78   m      C 92.789062    >70
3   23   m      A 69.626845  21-50
4   25   m      C  5.074013  21-50
5   60   f      C 10.340510  51-70
6   36   f      B 90.294240  21-50
7   23   m      A 12.769088  21-50
8   27   f      A 43.892321  21-50
9   35   f      B 99.793467  21-50
10  40   f      C 94.284903  21-50
11  25   m      A 98.829001  21-50
12  55   m      A 98.007185  51-70
13  43   f      A 37.491168  21-50
14  68   m      A 90.051414  51-70
15  76   f      B 13.567239    >70

不幸的是,我们的客户需要的数据格式非常奇怪,就像这样:

split  value  var1_mean
1 agegrp 21-50       61.3
2 agegrp 51-70       66.1
3 agegrp >70         57.0
4 sex    m           55.7
5 sex    f           66.5
6 region A           64.4
7 region B           67.9
8 region C           50.6

我可以使用下面的代码轻松地做到这一点,但它非常不美观:
age <- dat %>% group_by(agegrp) %>%
summarise(var1_mean = mean(var1)) %>%
mutate(value = agegrp,
split = "agegrp") %>%
select(split, value, var1_mean)
sex <- dat %>% group_by(sex) %>%
summarise(var1_mean = mean(var1)) %>%
mutate(value = "sex",
split = "sex") %>%
select(split, value, var1_mean)
region <- dat %>% group_by(region) %>%
summarise(var1_mean = mean(var1)) %>%
mutate(value = "region",
split = "region") %>%
select(split, value, var1_mean)
rbind(age, sex, region)

有没有办法在不"堆叠"的情况下使它更容易呢?几个表手动(可能使用dplyr)?

您可以转向长和总结,或者,迭代感兴趣的变量:

library(dplyr)
library(purrr)
library(tidyr)
dat %>%
pivot_longer(-c(var1, age), names_to = "split") %>%
group_by(split, value) %>%
summarise(var1_mean = mean(var1))

或:

map_df(set_names(c("agegrp", "sex", "region")), ~ dat %>% 
group_by(across(.x)) %>% 
summarise(var1_mean = mean(var1)) %>%
rename(value = .x), .id = "split")
# A tibble: 8 × 3
split  value var1_mean
<chr>  <chr>     <dbl>
1 agegrp <= 20      54.6
2 agegrp 21-50      44.7
3 agegrp 51-70      46.4
4 sex    f          37.4
5 sex    m          55.5
6 region A          67.3
7 region B          47.9
8 region C          26.1

循环遍历列和聚合,然后行绑定结果:

do.call(rbind, 
lapply(c("agegrp", "sex", "region"), function(i){
cbind(split = i,
setNames(aggregate(as.formula(paste("age ~", i)), mean, data = dat), 
c("value", "var1_mean")))
}))
#    split value var1_mean
# 1 agegrp <= 20  20.00000
# 2 agegrp 21-50  33.30000
# 3 agegrp 51-70  60.50000
# 4    sex     f  39.62500
# 5    sex     m  39.71429
# 6 region     A  35.75000
# 7 region     B  39.00000
# 8 region     C  43.60000

注意:输出不同,可能是因为种子数据和显示数据不匹配。

最新更新