R - Aggregate data.frame rows using data.table with multiple collapse functions



I have a large data.frame with the following example structure:

df <- data.frame(id = rep(c("a","b","c"),4), sex = rep(c("M","F"),6), score = 1:12)

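I want to aggregate it efficiently by the id column, pasting the unique sex values into a comma-separated string and keeping the maximum score value.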

How can I modify this data.table call to achieve that:

setDT(df)[, lapply(.SD, function(x) paste(unique(x), collapse = ",")), by = list(id)]

Are you sure you want a string that you would later have to strsplit? How about keeping the sex values as a list instead? Like this:

df[ , .(list(sex), max(score)), by = id]
#    id      V1 V2
# 1:  a M,F,M,F 10
# 2:  b F,M,F,M 11
# 3:  c M,F,M,F 12

(We can of course name the columns whatever we like.)
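To make the naming and the list-versus-string trade-off concrete, here is a minimal sketch on the example data from the question. The names `dt`, `res_list`, `res_str` and `max_score` are illustrative choices, not part of the original answer, and `unique()` is added so that each group keeps only its distinct sex values.

```r
library(data.table)

# Example data from the question (stringsAsFactors set explicitly for older R versions)
df <- data.frame(id = rep(c("a", "b", "c"), 4),
                 sex = rep(c("M", "F"), 6),
                 score = 1:12,
                 stringsAsFactors = FALSE)
dt <- as.data.table(df)  # works on a copy, so df itself is left unchanged

# List column: each group keeps its unique sex values as a character vector
res_list <- dt[, .(sex = list(unique(sex)), max_score = max(score)), by = id]

# String column: the same values collapsed into one comma-separated string
res_str <- dt[, .(sex = paste(unique(sex), collapse = ","),
                  max_score = max(score)), by = id]

# The list column can be used directly; the string has to be split again first
res_list$sex[[1]]
strsplit(res_str$sex[1], ",", fixed = TRUE)[[1]]
```

Which form is preferable depends on what happens next: the string version is easier to print or write to a file, while the list version avoids a later `strsplit` if the individual values are needed again.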

As for timing, here is list vs. paste in data.table vs. paste in dplyr. On a non-trivially sized data set, both data.table versions are faster than dplyr:

library(data.table)
library(dplyr)  # provides %>%, group_by and summarise used below
set.seed(102349)
NN <- 1e6
DT <- data.table(id = sample(c("a","b","c"), NN, TRUE),
                 sex = sample(c("M","F"), NN, TRUE),
                 score = sample(12, NN, TRUE))
library(microbenchmark)
microbenchmark(times = 1000L,
               mikec = DT[ , .(list(unique(sex)), max(score)), by = id],
               mikec_str = DT[ , .(paste(unique(sex), collapse = ","),
                                   score = max(score)), by = id],
               count = DT %>% group_by(id) %>% 
                 summarise(score = max(score), 
                           sex = paste(unique(sex),collapse=",")))
# Unit: milliseconds
#       expr      min       lq     mean   median       uq      max neval cld
#      mikec 20.31309 20.73779 30.47556 21.95649 35.02822 241.6299  1000  a 
#  mikec_str 20.34941 20.76544 32.05443 22.40155 35.32093 325.3754  1000  a 
#      count 27.20780 29.11735 47.38582 42.93207 44.54086 334.8008  1000   b
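Before reading too much into timings like these, it can help to confirm that the compared expressions return the same values. The following is a small sketch on a made-up four-row table (`small`, `dt_res` and `dp_res` are hypothetical names, not part of the benchmark). It uses `keyby` instead of `by` so that the data.table result is sorted by id, since `by` keeps groups in order of first appearance whereas dplyr's `summarise` returns them sorted.

```r
library(data.table)
library(dplyr)

# Tiny, made-up input with groups deliberately out of order
small <- data.table(id = c("b", "a", "b", "a"),
                    sex = c("M", "F", "F", "F"),
                    score = c(3L, 1L, 7L, 5L))

# data.table version; keyby sorts the result by id
dt_res <- small[, .(score = max(score),
                    sex = paste(unique(sex), collapse = ",")), keyby = id]

# dplyr version on a plain data.frame copy of the same data
dp_res <- as.data.frame(small) %>%
  group_by(id) %>%
  summarise(score = max(score),
            sex = paste(unique(sex), collapse = ","))

# Compare column by column to sidestep class differences (data.table vs tibble)
identical(dt_res$id, dp_res$id)
identical(dt_res$score, dp_res$score)
identical(dt_res$sex, dp_res$sex)
```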

You can try:

require(dplyr)
df %>% group_by(id) %>% summarise(score = max(score), sex = paste(unique(sex),collapse=","))
