R - Sum of all possible non-repeating combinations of columns



I have a large table structured like this:

library(data.table)

DT <- data.table(Sample = LETTERS[1:3],
                 ColA = 1:3,
                 ColB = 3:1,
                 ColC = 1,
                 ColD = 2)
> DT
Sample ColA ColB ColC ColD
1:      A    1    3    1    2
2:      B    2    2    1    2
3:      C    3    1    1    2

I want to find every possible non-repeating combination of 3 columns, and then, for each sample, the sum of the values in those columns.

The desired output looks like this:

Sample FirstCol SecondCol ThirdCol SumOfCols
1:      A     ColA      ColB     ColC         5
2:      A     ColA      ColB     ColD         6
3:      A     ColA      ColC     ColD         4
4:      A     ColB      ColC     ColD         6
5:      B     ColA      ColB     ColC         5
6:      B     ColA      ColB     ColD         6
7:      B     ColA      ColC     ColD         5
8:      B     ColB      ColC     ColD         5
9:      C     ColA      ColB     ColC         5
10:      C     ColA      ColB     ColD         6
11:      C     ColA      ColC     ColD         6
12:      C     ColB      ColC     ColD         4
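As a quick sanity check on the size of that output, here is a minimal base R sketch. It assumes the four value columns from the example table; the names `value_cols`, `combos`, `n_samples`, and `n_rows` are only illustrative.

```r
# Value columns from the example table (everything except Sample).
value_cols <- c("ColA", "ColB", "ColC", "ColD")

# combn() returns one combination per matrix column;
# transpose so that each row is one combination.
combos <- t(combn(value_cols, 3))
print(combos)

# Expected number of result rows: combinations times samples.
n_samples <- 3
n_rows <- choose(length(value_cols), 3) * n_samples
print(n_rows)
```

With 4 value columns there are choose(4, 3) = 4 combinations, so 3 samples give the 12 rows shown above.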

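Currently I use combn() to find all combinations of the columns, and then attach the values by calling merge() three times (once for each column in the combination).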


# All 3-column combinations of the value columns, one combination per row
Combinations <- combn(colnames(DT[, !1]), 3)
Combinations <- as.data.table(t(Combinations))

lapply(DT$Sample, function(i) {
  # Long format for the current sample: one row per (variable, value)
  Melted <- as.data.table(melt(DT[Sample == i], id.vars = "Sample"))[, !1]
  # Look up the value for each of the three columns in turn
  Output <- merge(Combinations, Melted, by.x = "V1", by.y = "variable")
  Output <- merge(Output, Melted, by.x = "V2", by.y = "variable")
  Output <- merge(Output, Melted, by.x = "V3", by.y = "variable")
  colnames(Output) <- c("ThirdCol", "SecondCol", "FirstCol",
                        "FirstValue", "SecondValue", "ThirdValue")
  Output$SumOfCols <- Output$FirstValue + Output$SecondValue + Output$ThirdValue
  Output$Sample <- i
  # Reorder to Sample, FirstCol, SecondCol, ThirdCol, SumOfCols
  Output <- Output[, c(8, 3, 2, 1, 7)]
})

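This does what I want, but it is quite slow because my actual dataset is fairly large. Can anyone tell me how to speed it up?

Any help would be appreciated, thank you!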

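# Answer: a dplyr/tidyr approach
library(dplyr)
library(tidyr)

DT %>%
  group_by(Sample) %>%
  # cur_data() is deprecated as of dplyr 1.1.0; pick(everything()) is its replacement
  summarise(s = combn(cur_data(), 3,
                      function(x) c(nms = names(x), Sum = rowSums(x)),
                      simplify = FALSE),
            .groups = 'drop') %>%
  unnest_wider(s) %>%
  type.convert(as.is = TRUE)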
# A tibble: 12 x 5
Sample nms1  nms2  nms3    Sum
<chr>  <chr> <chr> <chr> <int>
1 A      ColA  ColB  ColC      5
2 A      ColA  ColB  ColD      6
3 A      ColA  ColC  ColD      4
4 A      ColB  ColC  ColD      6
5 B      ColA  ColB  ColC      5
6 B      ColA  ColB  ColD      6
7 B      ColA  ColC  ColD      5
8 B      ColB  ColC  ColD      5
9 C      ColA  ColB  ColC      5
10 C      ColA  ColB  ColD      6
11 C      ColA  ColC  ColD      6
12 C      ColB  ColC  ColD      4
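Since the question already uses data.table, here is a minimal sketch of an alternative that stays in data.table and avoids the repeated merge() calls. It is an assumption on my part rather than part of the answer above, and I have not benchmarked it on large data. The names `value_cols`, `combos`, and `result` are illustrative.

```r
library(data.table)

DT <- data.table(Sample = LETTERS[1:3],
                 ColA = 1:3, ColB = 3:1, ColC = 1, ColD = 2)

# Every column except the identifier takes part in the combinations.
value_cols <- setdiff(names(DT), "Sample")
combos <- combn(value_cols, 3, simplify = FALSE)

# One vectorised rowSums() per combination, covering all samples at once.
result <- rbindlist(lapply(combos, function(cols) {
  data.table(Sample    = DT$Sample,
             FirstCol  = cols[1],
             SecondCol = cols[2],
             ThirdCol  = cols[3],
             SumOfCols = rowSums(DT[, cols, with = FALSE]))
}))

# Sort so the rows come out grouped by sample, as in the desired output.
setorder(result, Sample, FirstCol, SecondCol, ThirdCol)
print(result)
```

The loop here runs over combinations rather than samples, so each step handles all rows of the table in one call. Whether that helps depends on how many columns and rows the real data has.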
