I have a large table with the following structure:
library(data.table)

DT <- data.table(Sample = c(LETTERS[1:3]),
                 ColA = c(1:3),
                 ColB = c(3:1),
                 ColC = 1,
                 ColD = 2)
> DT
   Sample ColA ColB ColC ColD
1:      A    1    3    1    2
2:      B    2    2    1    2
3:      C    3    1    1    2
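I want to find all possible non-repeating combinations of 3 columns and then, for each sample, the sum of the values in those columns.
The desired output looks like this: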
    Sample FirstCol SecondCol ThirdCol SumOfCols
 1:      A     ColA      ColB     ColC         5
 2:      A     ColA      ColB     ColD         6
 3:      A     ColA      ColC     ColD         4
 4:      A     ColB      ColC     ColD         6
 5:      B     ColA      ColB     ColC         5
 6:      B     ColA      ColB     ColD         6
 7:      B     ColA      ColC     ColD         5
 8:      B     ColB      ColC     ColD         5
 9:      C     ColA      ColB     ColC         5
10:      C     ColA      ColB     ColD         6
11:      C     ColA      ColC     ColD         6
12:      C     ColB      ColC     ColD         4
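Currently, I use combn() to find all combinations of the columns, and then attach the values by calling merge() three times (once for each column in the combination).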
# All 3-column combinations of the value columns, one combination per row
Combinations <- combn(colnames(DT[,!1]), 3)
Combinations <- as.data.table(t(Combinations))
lapply(DT$Sample, function(i){
  # Long format for this sample: one row per (variable, value)
  Melted <- as.data.table(melt(DT[Sample == i], id.vars = "Sample"))[,!1]
  # Look up the value of each of the three columns in turn
  Output <- merge(Combinations, Melted, by.x="V1", by.y="variable")
  Output <- merge(Output, Melted, by.x="V2", by.y="variable")
  Output <- merge(Output, Melted, by.x="V3", by.y="variable")
  colnames(Output) <- c("ThirdCol", "SecondCol", "FirstCol", "FirstValue", "SecondValue","ThirdValue")
  Output$SumOfCols <- Output$FirstValue + Output$SecondValue + Output$ThirdValue
  Output$Sample <- i
  # Reorder to Sample, FirstCol, SecondCol, ThirdCol, SumOfCols
  Output <- Output[,c(8,3,2,1,7)]
})
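This does what I want, but it is quite slow because my actual dataset is fairly large. Can anyone tell me how to speed it up?
Any help would be appreciated, thank you!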
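library(dplyr)
library(tidyr)

# \(x) is the lambda shorthand from R >= 4.1; function(x) works on older versions.
# Note: dplyr >= 1.1.0 deprecates cur_data() in favour of pick(), and warns when
# summarise() returns more than one row per group (reframe() is the replacement).
DT %>%
  group_by(Sample) %>%
  summarise(s = combn(cur_data(), 3, \(x) c(nms = names(x), Sum = rowSums(x)),
                      simplify = FALSE), .groups = 'drop') %>%
  unnest_wider(s) %>%
  type.convert(as.is = TRUE)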
# A tibble: 12 x 5
   Sample nms1  nms2  nms3    Sum
   <chr>  <chr> <chr> <chr> <int>
 1 A      ColA  ColB  ColC      5
 2 A      ColA  ColB  ColD      6
 3 A      ColA  ColC  ColD      4
 4 A      ColB  ColC  ColD      6
 5 B      ColA  ColB  ColC      5
 6 B      ColA  ColB  ColD      6
 7 B      ColA  ColC  ColD      5
 8 B      ColB  ColC  ColD      5
 9 C      ColA  ColB  ColC      5
10 C      ColA  ColB  ColD      6
11 C      ColA  ColC  ColD      6
12 C      ColB  ColC  ColD      4
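For comparison with the per-sample melt() and merge() loop in the question, here is a minimal sketch that stays in data.table and base R. It is not taken from the answer above. It computes the combinations once and then sums each combination over all samples in a single rowSums() call, which may help on a larger table, though no timings were measured here. The names value_cols, combos, vals, sums and result are illustrative only.

```r
library(data.table)

DT <- data.table(Sample = LETTERS[1:3],
                 ColA = 1:3, ColB = 3:1, ColC = 1, ColD = 2)

# Every column except the identifier takes part in the combinations
value_cols <- setdiff(names(DT), "Sample")

# 3 x k character matrix: each column holds one combination of three names
combos <- combn(value_cols, 3)

# Numeric matrix of the values, one row per sample
vals <- as.matrix(DT[, ..value_cols])

# For each combination, sum its three columns for all samples at once.
# With more than one sample this yields an (n samples) x (k combinations) matrix.
sums <- vapply(seq_len(ncol(combos)),
               function(j) rowSums(vals[, combos[, j], drop = FALSE]),
               numeric(nrow(vals)))

n_samples <- nrow(DT)
n_combos  <- ncol(combos)

# Lay the result out sample by sample, matching the desired output order
result <- data.table(
  Sample    = rep(DT$Sample, each = n_combos),
  FirstCol  = rep(combos[1, ], times = n_samples),
  SecondCol = rep(combos[2, ], times = n_samples),
  ThirdCol  = rep(combos[3, ], times = n_samples),
  SumOfCols = as.vector(t(sums))
)

print(result)
```

The transpose in as.vector(t(sums)) is what makes the sums line up with the rep() calls: it flattens the matrix so that all combinations for the first sample come first, then all combinations for the second, and so on.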