r - Group a data.table by consecutive runs of two id variables, without using split()



My data.table looks like this (see the bottom of the post for copy/paste data). The id and category variables are both grouping variables.

id category
1:  1     B100
2:  1     B100
3:  1     D300
4:  1     D300
5:  1     B100
6:  2     B100
7:  2     F500
8:  2     F500
9:  2     E600
10:  2     E600
11:  3     T400
12:  3     B100
13:  3     T400
14:  3     T400

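Assume the data are already sorted in the order shown. Within each id group, I want to create a variable that numbers each consecutive run of category (for example, see here).

For example, because "B100" has two consecutive runs within id == 1 (rows 1:2 and row 5), the new variable should take the value 1 in rows 1:2 and the value 2 in row 5, since row 5 is the "second" run of category == "B100" within id == 1.

For the whole data.table, my desired output is: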

id category group
1:  1     B100     1 # The first run of 
2:  1     B100     1 # B100 in id 1, length 2
3:  1     D300     1
4:  1     D300     1
5:  1     B100     2 # second run of "B100" within id 1, length 1
6:  2     B100     1
7:  2     F500     1
8:  2     F500     1
9:  2     E600     1
10:  2     E600     1 # no repeated category runs in id 2, so all 1
11:  3     T400     1
12:  3     B100     1
13:  3     T400     2 # The second run of 
14:  3     T400     2 # "T400" within id 3, length 2

One way to solve this is to use data.table::rleid() twice (let the data be DT):

library(data.table)
DT[, group := rleid(category), by = id]
DT <- split(DT, by = "id")
DT <- lapply(DT,
             function(x) x[, group := rleid(group), by = category])
DT <- rbindlist(DT)
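
To see why calling rleid() twice gives the desired numbering, it may help to trace the two passes on a single id. The following is only a minimal sketch: the table one_id and the column run are illustrative names, not part of the original post, and the values are the category entries of id 1 from the table above.

```r
library(data.table)

# Category values of id == 1, in row order (illustrative copy of the data).
one_id <- data.table(category = c("B100", "B100", "D300", "D300", "B100"))

# First pass: start a new run number every time category changes.
one_id[, run := rleid(category)]               # run is 1 1 2 2 3

# Second pass: within each category, renumber its runs starting from 1.
one_id[, group := rleid(run), by = category]   # group is 1 1 1 1 2
```

The second pass is the step that has to stay inside one id, which is why the code above splits the data first.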

Question: Is there a way to avoid splitting by id in the second step?

Motivation for this question

Data for copy/paste

As a more generic data.frame

DT <- data.frame(id = c(rep(1,5), rep(2,5), rep(3,4)),
category = c("B100","B100","D300","D300","B100",
"B100","F500","F500","E600","E600",
"T400","B100","T400","T400"))

Output <- data.frame(id = c(rep(1,5), rep(2,5), rep(3,4)),
category = c("B100","B100","D300","D300","B100",
"B100","F500","F500","E600","E600",
"T400","B100","T400","T400"),
group = c(1,1,1,1,2,1,1,1,1,1,1,1,2,2))

Here is an approach that uses two grouping operations and does not require splitting:

Output <- data.frame(id = c(rep(1,5), rep(2,5), rep(3,3)),
category = c("B100","B100","D300","D300","B100",
"B100","F500","F500","E600","E600",
"T400","B100","T400"),
group = c(1,1,1,1,2,1,1,1,1,1,1,1,2))
library(data.table)
setDT(Output)  # convert the data.frame to a data.table by reference
Output[, temp := rleid(category), by = .(id)][, result := as.integer(factor(temp)), by = .(id, category)]
Output
#     id category group temp result
#  1:  1     B100     1    1      1
#  2:  1     B100     1    1      1
#  3:  1     D300     1    2      1
#  4:  1     D300     1    2      1
#  5:  1     B100     2    3      2
#  6:  2     B100     1    1      1
#  7:  2     F500     1    2      1
#  8:  2     F500     1    2      1
#  9:  2     E600     1    3      1
# 10:  2     E600     1    3      1
# 11:  3     T400     1    1      1
# 12:  3     B100     1    2      1
# 13:  3     T400     2    3      2
Output[, all(group == result)]
# [1] TRUE
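
Because the question already relies on rleid(), the second step can presumably be written with rleid() as well, grouped by both id and category, in place of as.integer(factor()). The sketch below applies this to the full 14-row data from the question; the names DT2 and run are illustrative and do not come from the original posts.

```r
library(data.table)

# The 14-row example data from the question (DT2 is an illustrative name).
DT2 <- data.table(
  id = c(rep(1, 5), rep(2, 5), rep(3, 4)),
  category = c("B100", "B100", "D300", "D300", "B100",
               "B100", "F500", "F500", "E600", "E600",
               "T400", "B100", "T400", "T400")
)

# Step 1: number the consecutive runs of category within each id.
DT2[, run := rleid(category), by = id]

# Step 2: within each (id, category) pair, renumber those runs from 1.
DT2[, group := rleid(run), by = .(id, category)]

# Drop the helper column once group has been computed.
DT2[, run := NULL]
```

Within an (id, category) group the run numbers never decrease in row order, so rleid() and as.integer(factor()) should give the same result here.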
