I have two tibbles, like this:
library(dplyr)
my_tib1 <- tibble(feature1 = c("A", "A", "B", "B", "C", "C"), feature2 = c("AA", "BB", "AA", "BB", "AA", "BB"), number = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))
my_tib2 <- tibble(feature3 = c("TT", "TT", "FF", "FF"), feature2 = c("AA", "BB", "AA", "BB"), number = c(0.6, 0.4, 0.3, 0.8))
They look like this:
# A tibble: 6 × 3
  feature1 feature2 number
  <chr>    <chr>     <dbl>
1 A        AA          0.1
2 A        BB          0.2
3 B        AA          0.3
4 B        BB          0.4
5 C        AA          0.5
6 C        BB          0.6
# A tibble: 4 × 3
  feature3 feature2 number
  <chr>    <chr>     <dbl>
1 TT       AA          0.6
2 TT       BB          0.4
3 FF       AA          0.3
4 FF       BB          0.8
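Note that feature2 has the same categories in both tibbles. There is exactly one number for each combination of feature1 and feature2 in my_tib1, and for each combination of feature2 and feature3 in my_tib2.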
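For context: the number column holds marginal probabilities, and I want to multiply the marginal distributions to get a joint distribution (I am aware of the assumptions this involves).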
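I think this requires building all possible combinations of feature1, feature2 and feature3 and multiplying their number values into a new column. The resulting tibble should have 3 (feature1) x 2 (feature2) x 2 (feature3) = 12 rows.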
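The combination step described above can be sketched as follows. This is only an illustration, assuming the tidyr package is available (expand_grid() comes from tidyr, not dplyr); the name combos is hypothetical. It enumerates the combinations and does not yet attach any number values.

```r
library(dplyr)
library(tidyr)  # assumption: tidyr is installed; it provides expand_grid()

my_tib1 <- tibble(feature1 = c("A", "A", "B", "B", "C", "C"),
                  feature2 = c("AA", "BB", "AA", "BB", "AA", "BB"),
                  number   = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))
my_tib2 <- tibble(feature3 = c("TT", "TT", "FF", "FF"),
                  feature2 = c("AA", "BB", "AA", "BB"),
                  number   = c(0.6, 0.4, 0.3, 0.8))

# One row per combination of the distinct levels of each feature
combos <- expand_grid(feature1 = unique(my_tib1$feature1),
                      feature2 = unique(my_tib1$feature2),
                      feature3 = unique(my_tib2$feature3))

nrow(combos)  # 3 * 2 * 2 = 12
```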
The final tibble should look like this:
# A tibble: 12 × 6
   feature1 feature2 feature3 number.x number.y number.mult
   <chr>    <chr>    <chr>       <dbl>    <dbl>       <dbl>
 1 A        AA       TT            0.1      0.6        0.06
 2 A        AA       FF            0.1      0.3        0.03
...
where number.mult is the product of number.x and number.y.
I tried the following, and I think I am close, but it does not quite work:
my_tib1 %>% full_join(my_tib2, by = "feature2") %>% mutate(number.mult = number.x*number.y)
This gives me a 12 x 6 tibble like the one I am looking for, but the values in the number columns are not what I expect.
library(data.table)
# convert to data.table format
setDT(my_tib1); setDT(my_tib2)
# create all unique combinations
DT <- CJ(ft1 = my_tib1$feature1,
ft2 = my_tib1$feature2,
ft3 = my_tib2$feature3, unique = TRUE)
# join relevant data
DT[my_tib1, `:=`(number.x = i.number), on = .(ft1 = feature1, ft2 = feature2)]
DT[my_tib2, `:=`(number.y = i.number), on = .(ft3 = feature3, ft2 = feature2)]
# final computation
DT[, number.mult := number.x * number.y][]
#    ft1 ft2 ft3 number.x number.y number.mult
# 1:   A  AA  FF      0.1      0.3        0.03
# 2:   A  AA  TT      0.1      0.6        0.06
# 3:   A  BB  FF      0.2      0.8        0.16
# 4:   A  BB  TT      0.2      0.4        0.08
# 5:   B  AA  FF      0.3      0.3        0.09
# 6:   B  AA  TT      0.3      0.6        0.18
# 7:   B  BB  FF      0.4      0.8        0.32
# 8:   B  BB  TT      0.4      0.4        0.16
# 9:   C  AA  FF      0.5      0.3        0.15
#10:   C  AA  TT      0.5      0.6        0.30
#11:   C  BB  FF      0.6      0.8        0.48
#12:   C  BB  TT      0.6      0.4        0.24
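
For readers who would rather stay in dplyr, the same three steps (build the combinations, join in both sets of numbers, multiply) can probably be sketched with tidyr::expand_grid() and two left joins. This is a minimal sketch, not part of the data.table solution above; it assumes tidyr is installed, and the names combos and result are only illustrative. It relies on left_join()'s default suffixes to produce number.x and number.y.

```r
library(dplyr)
library(tidyr)  # assumption: tidyr is installed; it provides expand_grid()

my_tib1 <- tibble(feature1 = c("A", "A", "B", "B", "C", "C"),
                  feature2 = c("AA", "BB", "AA", "BB", "AA", "BB"),
                  number   = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))
my_tib2 <- tibble(feature3 = c("TT", "TT", "FF", "FF"),
                  feature2 = c("AA", "BB", "AA", "BB"),
                  number   = c(0.6, 0.4, 0.3, 0.8))

# Step 1: all unique combinations of the three features
combos <- expand_grid(feature1 = unique(my_tib1$feature1),
                      feature2 = unique(my_tib1$feature2),
                      feature3 = unique(my_tib2$feature3))

# Step 2: join in the numbers; the second join finds `number` on both sides,
# so the default suffixes rename them to number.x (my_tib1) and number.y (my_tib2)
# Step 3: multiply the two number columns
result <- combos %>%
  left_join(my_tib1, by = c("feature1", "feature2")) %>%
  left_join(my_tib2, by = c("feature2", "feature3")) %>%
  mutate(number.mult = number.x * number.y)

result
```

The rows come out in the order the levels first appear (TT before FF) rather than sorted as in the CJ() output, but the number.mult values should agree combination by combination.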