r-使用dplyr对具有不同功能的不同列进行有效总结



下面的代码正是我想要它做的。获取一个具有多个分组约束(即id和日期(的数据帧,并为每个数据帧返回一行,其中不同的函数应用于不同的列。在这种情况下,折叠为字符并通过分组获取列的最大值。

问题是将这种方法扩展到具有10万行和组的更大的数据帧。我怎样才能更有效地得出同样的结论?如果这在不同的结构中是可能的,那么这是优选的(虽然不是优选的,但数据表解决方案也将被接受(。我希望保持灵活性,以指定要包含在函数调用中的不同列,并添加额外的摘要调用(例如summary_at(值的总和((。

data <- tibble(id = c(1,1,1,2,3,4,5,5,6,6,6), date = dmy("01/01/2020"), var1 =  1:11, var2 = 12:22, var3 = 1:11)
data %>% 
group_by(id, date) %>%
{data.frame(
summarise_at(., vars(var1, var2), list(~ paste(unique(.), collapse = " AND "))), # return character string
summarise_at(., vars(var3), list(~ max(., na.rm = T))),#, # return max in group
summarise(., count = n(), .groups = "keep") # return count of cases in group
)} %>% 
select(-matches("[.]1$|[.]2$|[.]3$")) %>% # remove unwanted columns
as_tibble()
# A tibble: 6 x 6
id date       var1            var2              var3 count
<dbl> <date>     <chr>           <chr>            <int> <int>
1     1 2020-01-01 1 AND 2 AND 3   12 AND 13 AND 14     3     3
2     2 2020-01-01 4               15                   4     1
3     3 2020-01-01 5               16                   5     1
4     4 2020-01-01 6               17                   6     1
5     5 2020-01-01 7 AND 8         18 AND 19            8     2
6     6 2020-01-01 9 AND 10 AND 11 20 AND 21 AND 22    11     3

还有这个问题的链接。如何使用summary_at将不同的函数应用于不同的列?

这是根据@MichaelDewar的答案改编的tidyverse方法,它更整洁,但我认为效率没有任何真正的提高。此外,在我看来,一个有10万行的数据帧也没什么大不了的。我认为tidyverse解决方案很好。

library(dplyr)
data %>% 
group_by(id, date) %>% 
summarise(
across(c(var1, var2), ~paste(unique(.), collapse = " AND ")), 
across(var3, max, na.rm = T), 
count = n(), .groups = "keep"
)

但如果你真的想提高效率,也许可以试试这个数据表解决方案

library(data.table)
setDT(data)[, c(
lapply(c(var1 = "var1", var2 = "var2"), function(x) paste(unique(.SD[[x]]), collapse = " AND ")), 
list(var3 = max(var3, na.rm = T), count = .N)
), by = c("id", "date")]

基准

set.seed(2020)
data2 <- data[sample.int(nrow(data), 1e5, T), ]
data22 <- data.table::copy(data2)
f1 <- 
. %>% 
group_by(id, date) %>%
{data.frame(
summarise_at(., vars(var1, var2), list(~ paste(unique(.), collapse = " AND "))), # return character string
summarise_at(., vars(var3), list(~ max(., na.rm = T))),#, # return max in group
summarise(., count = n(), .groups = "keep") # return count of cases in group
)} %>% 
select(-matches("[.]1$|[.]2$|[.]3$")) %>% # remove unwanted columns
as_tibble()
f2 <- 
. %>% 
group_by(id, date) %>% 
summarise(
across(c(var1, var2), ~paste(unique(.), collapse = " AND ")), 
across(var3, max, na.rm = T), 
count = n(), 
.groups = "keep"
)
f3 <- function(dt) {
setDT(dt)[, c(
lapply(c(var1 = "var1", var2 = "var2"), function(x) paste(unique(.SD[[x]]), collapse = " AND ")), 
list(var3 = max(var3, na.rm = T), count = .N)
), by = c("id", "date")]
}
microbenchmark::microbenchmark(f1(data2), f2(data2), f3(data22))

结果

Unit: milliseconds
expr     min       lq      mean   median       uq     max neval cld
f1(data2) 19.6730 20.27990 20.841344 20.50850 20.85045 29.2799   100   c
f2(data2) 13.5455 14.09240 14.705967 14.34585 14.64625 20.5914   100  b 
f3(data22)  6.9186  7.80615  8.598227  8.32035  8.68040 15.8358   100 a  

使用summarize(across()):

paste_distinct <- function(list){
list %>% unique %>% sort %>% paste(collapse = " AND ")
}
data %>% 
group_by(id, date) %>%
summarize(across(c(var1,var2),  paste_distinct),
across(var3, max, na.rm = TRUE),
count = n())

最新更新