My data.table looks as follows (see the bottom of the post for copy/paste data). Both the id and category variables are grouping variables.
id category
1: 1 B100
2: 1 B100
3: 1 D300
4: 1 D300
5: 1 B100
6: 2 B100
7: 2 F500
8: 2 F500
9: 2 E600
10: 2 E600
11: 3 T400
12: 3 B100
13: 3 T400
14: 3 T400
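Assume the data are already correctly ordered as given. Within each id group, I want to create a variable that indicates the group of each consecutive run of category (see here, for example).
For example, because "B100" has two consecutive runs within id == 1 (rows 1:2 and row 5), the new variable should take the value 1 in rows 1:2 and the value 2 in row 5, because row 5 is the "second" occurrence of category == "B100" within id == 1.
For the whole data.table, my desired output is: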
id category group
1: 1 B100 1 # The first run of
2: 1 B100 1 # B100 in id 1, length 2
3: 1 D300 1
4: 1 D300 1
5: 1 B100 2 # second run of "B100" within id 1, length 1
6: 2 B100 1
7: 2 F500 1
8: 2 F500 1
9: 2 E600 1
10: 2 E600 1 # no repeated category runs in id 2, so all 1
11: 3 T400 1
12: 3 B100 1
13: 3 T400 2 # The second run of
14: 3 T400 2 # "T400" within id 3, length 2
One way to solve the problem is to use data.table::rleid() twice (let the data be DT):
library(data.table)
DT[, group := rleid(category), by = id]
DT <- split(DT, by = "id")
DT <- lapply(DT,
             function(x) x[, group := rleid(group), by = category])
DT <- rbindlist(DT)
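To illustrate why a single rleid() call is not enough here, the following minimal sketch applies it to the category values of id == 1. The vector name x and the result name run are placeholders introduced only for this illustration.

```r
library(data.table)

# Category values for id == 1, in their given order (x is a hypothetical name).
x <- c("B100", "B100", "D300", "D300", "B100")

# rleid() numbers consecutive runs, regardless of which value each run holds.
run <- rleid(x)
print(run)
# 1 1 2 2 3
```

The second run of "B100" gets run id 3 rather than 2, which is why a second pass within each category is needed.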
Question: is there a way to avoid splitting by id in the second step?
Motivation for this question
Data for copy/paste
As a more general data.frame.
DT <- data.frame(id = c(rep(1,5), rep(2,5), rep(3,4)),
category = c("B100","B100","D300","D300","B100",
"B100","F500","F500","E600","E600",
"T400","B100","T400","T400"))
Output <- data.frame(id = c(rep(1,5), rep(2,5), rep(3,4)),
category = c("B100","B100","D300","D300","B100",
"B100","F500","F500","E600","E600",
"T400","B100","T400","T400"),
group = c(1,1,1,1,2,1,1,1,1,1,1,1,2,2))
Here is an approach that uses two grouped operations and does not require splitting:
Output <- data.frame(id = c(rep(1,5), rep(2,5), rep(3,3)),
category = c("B100","B100","D300","D300","B100",
"B100","F500","F500","E600","E600",
"T400","B100","T400"),
group = c(1,1,1,1,2,1,1,1,1,1,1,1,2))
library(data.table)
setDT(Output)  # convert the data.frame to a data.table by reference
Output[, temp := rleid(category), by = .(id)][, result := as.integer(factor(temp)), by = .(id, category)]
Output
# id category group temp result
# 1: 1 B100 1 1 1
# 2: 1 B100 1 1 1
# 3: 1 D300 1 2 1
# 4: 1 D300 1 2 1
# 5: 1 B100 2 3 2
# 6: 2 B100 1 1 1
# 7: 2 F500 1 2 1
# 8: 2 F500 1 2 1
# 9: 2 E600 1 3 1
# 10: 2 E600 1 3 1
# 11: 3 T400 1 1 1
# 12: 3 B100 1 2 1
# 13: 3 T400 2 3 2
Output[, all(group == result)]
# [1] TRUE
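As a possible variant of the approach above, the second step could also reuse rleid(), grouped by both id and category, instead of as.integer(factor(temp)). This is a minimal sketch rather than a definitive solution. It assumes the 14-row data shown in the question, and the names DT2 and run are placeholders introduced here.

```r
library(data.table)

# DT2 and run are hypothetical names used only for this illustration.
DT2 <- data.table(
  id = c(rep(1, 5), rep(2, 5), rep(3, 4)),
  category = c("B100", "B100", "D300", "D300", "B100",
               "B100", "F500", "F500", "E600", "E600",
               "T400", "B100", "T400", "T400")
)

# Step 1: number the consecutive runs of category within each id.
DT2[, run := rleid(category), by = id]

# Step 2: within each (id, category) pair, renumber those runs starting at 1.
DT2[, group := rleid(run), by = .(id, category)]

# Drop the helper column.
DT2[, run := NULL]
print(DT2)
```

Grouping by both id and category in the second step keeps everything in one data.table, so no split() or rbindlist() is needed.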