如何在R data.table中做特殊类型的查找连接?



如何在R数据中执行一种特殊类型的查找连接。

表吗?假设R中有两个表,如下所示:

library(data.table)
dt1 <- data.table(a = c("p", "q", "r"),
b = c("1,2", "1,2,3", "4,5"))
dt2 <- data.table(code = 1:5,
desc = c("good", "better", "best", "bad", "worst"))

它们看起来像:

> dt1
a     b
1: p   1,2
2: q 1,2,3
3: r   4,5
> dt2
code   desc
1:    1   good
2:    2 better
3:    3   best
4:    4    bad
5:    5  worst

目标是将dt1dt2连接起来,使result看起来像

> result
a     b             desc
1: p   1,2      good,better
2: q 1,2,3 good,better,best
3: r   4,5        bad,worst

谁能告诉我如何在R中实现这种类型的连接?

这不是一个真正的连接,但dt1$b包含复杂的值无论如何这是我的丑陋的hack:

dt2[, code := as.character(code)] 
dt1[, desc := b]
for (i in seq_along(dt2$code)) 
dt1[, desc := stringr::str_replace_all(desc, dt2$code[i], dt2$desc[i])]
dt1[]
a     b             desc
1: p   1,2      good,better
2: q 1,2,3 good,better,best
3: r   4,5        bad,worst

编辑:

替换必须从最长到最短的code(字符串长度或字符数)和desc不能包含任何数字

因此,将setorder(dt2, -code)添加到代码中,并在注释中提供OP提供的新用例:

dt1 <- data.table(a = c("p", "q", "r"), b = c("1,21", "23,11,36", "11,36"))
dt2 <- data.table(code = c(1,11,21,23,36), desc = c("good", "better", "best", "bad", "worst"))
setorder(dt2, -code) # set order first (descending numeric value)
dt2[, code := as.character(code)] # then convert to character
dt1[, desc := b]
for (i in seq_along(dt2$code)) 
dt1[, desc := stringr::str_replace_all(desc, dt2$code[i], dt2$desc[i])]
dt1[]
a        b             desc
1: p     1,21        good,best
2: q 23,11,36 bad,better,worst
3: r    11,36     better,worst

编辑2:

根据OP的评论,在生产数据中没有满足desc中没有数字的丑陋hack的要求。(这几乎总是发生,当一个快速的&肮脏的解决方案满足真实世界的数据:-)).

所以这里有一个简洁的data.table解决方案,它做了所有其他答案做的事情:拆分列b,连接或查找匹配的desc,然后重新组合:

dt2[, code := as.character(code)][
dt1[, strsplit(b, ","), by = .(a, b)], on = "code==V1"][
, .(desc = paste(desc, collapse = ",")), by = .(a, b)]

使用OP的新用例

a        b             desc
1: p     1,21        good,best
2: q 23,11,36 bad,better,worst
3: r    11,36     better,worst

注意,分组同时使用ab列有两个原因:1)方便(在最终结果中保留这两个列),2)以防a不是唯一标识符

想法是将列b作为整数列表,然后在dt2中获得列desc的子集(注意code只是行号,否则使用match函数)。

library(purrr)
library(stringr)
dt1[, b := map(b, ~str_split(.x, ",") %>% unlist() %>% as.integer())]
dt1[, desc := map(b, ~dt2$desc[match(.x, dt2$code)])]
library(data.table)
library(magrittr)
dt1 <- data.table(a = c("p", "q", "r"),
b = c("1,2", "1,2,3", "4,5"))
dt2 <- data.table(code = 1:5,
desc = c("good", "better", "best", "bad", "worst"))
dt1 <- dt1[, list(b = unlist(strsplit(x = b, split = ","))), by = "a"] %>% 
.[, b := type.convert(b)]
dt2[dt1, on = c("code == b")] %>% 
.[, lapply(.SD, toString), by = "a"]
#>    a    code               desc
#> 1: p    1, 2       good, better
#> 2: q 1, 2, 3 good, better, best
#> 3: r    4, 5         bad, worst

由reprex包(v2.0.0)在2021-07-27创建

您可以使用逗号分隔字符串并进行连接。

library(dplyr)
library(tidyr)
dt1 %>%
separate_rows(b, sep = ',\s*', convert = TRUE) %>%
left_join(dt2, by = c('b' = 'code')) %>%
group_by(a) %>%
summarise(desc = toString(desc))
#   a     desc              
#  <chr> <chr>             
#1 p     good, better      
#2 q     good, better, best
#3 r     bad, worst        

最新更新