很抱歉标题不正确-我正在努力实现以下目标:我有一个数据表dt,它有两个分类变量";a";以及";b";。如您所见,a具有5个唯一值b有三个。现在,例如分类变量的组合("a=1"one_answers"b=3"(不在数据中。
library(data.table)
set.seed(1)
a <- sample(1:5, 10, replace = TRUE)
b <- sample(1:3, 10, replace = TRUE)
y <- rnorm(10)
dt <- data.table(a = a, b = b, y = y)
dt[order(a, b), .N, by = c("a", "b")]
# a b N
#1: 1 1 2
#2: 1 2 1
#3: 2 2 1
#4: 2 3 1
#5: 3 1 1
#6: 3 2 1
#7: 3 3 1
#8: 4 1 1
#9: 5 2 1
如果我简单地求和";a";以及";b";,诸如("a=1"和b=3"(的组将被简单地忽略:
group_sum <- dt[, lapply(.SD, sum), by = c("a", "b")]
group_sum
# a b y
#1: 1 1 -0.7702614
#2: 4 1 -0.2894616
#3: 1 2 -0.2992151
#4: 2 2 -0.4115108
#5: 5 2 0.2522234
#6: 3 2 -0.8919211
#7: 2 3 0.4356833
#8: 3 1 -1.2375384
#9: 3 3 -0.2242679
在数据表中是否存在一种内部方式来";保持";这样的缺失组,并分配0或NA?
实现我目标的一种方法是创建一个网格并在第二步中合并:
grid <- unique(expand.grid(a = dt$a, b = dt$b)) # dim
setDT(grid)
res <- merge(grid, group_sum, by = c("a", "b"), all.x = TRUE)
head(res)
# a b y
#1: 1 1 -0.7702614
#2: 1 2 -0.2992151
#3: 1 3 NA
#4: 2 1 NA
#5: 2 2 -0.4115108
#6: 2 3 0.4356833
实现这一点的一种方法是使用CJ()
函数进行键控交叉连接,然后使用.EACHI
来注意y
应该对i
中的每一行执行。
library(data.table)
set.seed(1)
a <- sample(1:5, 10, replace = TRUE)
b <- sample(1:3, 10, replace = TRUE)
y <- rnorm(10)
dt <- data.table(a = a, b = b, y = y)
setkeyv(dt, c("a", "b"))
dt[CJ(a, b, unique = TRUE), lapply(.SD, sum), by = .EACHI]
#> a b y
#> 1: 1 1 -0.7702614
#> 2: 1 2 -0.2992151
#> 3: 1 3 NA
#> 4: 2 1 NA
#> 5: 2 2 -0.4115108
#> 6: 2 3 0.4356833
#> 7: 3 1 -1.2375384
#> 8: 3 2 -0.8919211
#> 9: 3 3 -0.2242679
#> 10: 4 1 -0.2894616
#> 11: 4 2 NA
#> 12: 4 3 NA
#> 13: 5 1 NA
#> 14: 5 2 0.2522234
#> 15: 5 3 NA
由reprex包(v0.3.0(于2020-10-03创建
如果您想跳过密钥设置步骤,您也可以设置on
参数:
dt <- data.table(a = a, b = b, y = y) # Set no key
dt[CJ(a, b, unique = TRUE), lapply(.SD, sum), by = .EACHI, on = c("a", "b")]
您还可以将dplyr和tidyr与complete((函数一起使用:
library(dplyr)
library(tidyr)
dt %>%
group_by(a,b) %>%
complete(a,b) %>%
summarize_all(sum)
# A tibble: 15 x 3
# Groups: a [5]
a b y
<fct> <fct> <dbl>
1 1 1 -6.93
2 1 2 -2.69
3 1 3 NA
4 2 1 NA
5 2 2 -3.70
6 2 3 3.92
7 3 1 -11.1
8 3 2 -8.03
9 3 3 -2.02
10 4 1 -2.61
11 4 2 NA
12 4 3 NA
13 5 1 NA
14 5 2 2.27
15 5 3 NA