如何在R数据中执行一种特殊类型的查找连接。
表吗?假设R中有两个表,如下所示:
library(data.table)
dt1 <- data.table(a = c("p", "q", "r"),
b = c("1,2", "1,2,3", "4,5"))
dt2 <- data.table(code = 1:5,
desc = c("good", "better", "best", "bad", "worst"))
它们看起来像:
> dt1
a b
1: p 1,2
2: q 1,2,3
3: r 4,5
> dt2
code desc
1: 1 good
2: 2 better
3: 3 best
4: 4 bad
5: 5 worst
目标是将dt1
和dt2
连接起来,使result
看起来像
> result
a b desc
1: p 1,2 good,better
2: q 1,2,3 good,better,best
3: r 4,5 bad,worst
谁能告诉我如何在R中实现这种类型的连接?
这不是一个真正的连接,但dt1$b
包含复杂的值无论如何这是我的丑陋的hack:
dt2[, code := as.character(code)]
dt1[, desc := b]
for (i in seq_along(dt2$code))
dt1[, desc := stringr::str_replace_all(desc, dt2$code[i], dt2$desc[i])]
dt1[]
a b desc 1: p 1,2 good,better 2: q 1,2,3 good,better,best 3: r 4,5 bad,worst
编辑:
替换必须从最长到最短的code
(字符串长度或字符数)和desc
不能包含任何数字。
因此,将setorder(dt2, -code)
添加到代码中,并在注释中提供OP提供的新用例:
dt1 <- data.table(a = c("p", "q", "r"), b = c("1,21", "23,11,36", "11,36"))
dt2 <- data.table(code = c(1,11,21,23,36), desc = c("good", "better", "best", "bad", "worst"))
setorder(dt2, -code) # set order first (descending numeric value)
dt2[, code := as.character(code)] # then convert to character
dt1[, desc := b]
for (i in seq_along(dt2$code))
dt1[, desc := stringr::str_replace_all(desc, dt2$code[i], dt2$desc[i])]
dt1[]
a b desc 1: p 1,21 good,best 2: q 23,11,36 bad,better,worst 3: r 11,36 better,worst
编辑2:
根据OP的评论,在生产数据中没有满足desc
中没有数字的丑陋hack的要求。(这几乎总是发生,当一个快速的&肮脏的解决方案满足真实世界的数据:-)).
所以这里有一个简洁的data.table
解决方案,它做了所有其他答案做的事情:拆分列b
,连接或查找匹配的desc
,然后重新组合:
dt2[, code := as.character(code)][
dt1[, strsplit(b, ","), by = .(a, b)], on = "code==V1"][
, .(desc = paste(desc, collapse = ",")), by = .(a, b)]
使用OP的新用例
a b desc 1: p 1,21 good,best 2: q 23,11,36 bad,better,worst 3: r 11,36 better,worst
注意,分组同时使用a
和b
列有两个原因:1)方便(在最终结果中保留这两个列),2)以防a
是而不是唯一标识符
想法是将列b
作为整数列表,然后在dt2
中获得列desc
的子集(注意code
只是行号,否则使用match
函数)。
library(purrr)
library(stringr)
dt1[, b := map(b, ~str_split(.x, ",") %>% unlist() %>% as.integer())]
dt1[, desc := map(b, ~dt2$desc[match(.x, dt2$code)])]
library(data.table)
library(magrittr)
dt1 <- data.table(a = c("p", "q", "r"),
b = c("1,2", "1,2,3", "4,5"))
dt2 <- data.table(code = 1:5,
desc = c("good", "better", "best", "bad", "worst"))
dt1 <- dt1[, list(b = unlist(strsplit(x = b, split = ","))), by = "a"] %>%
.[, b := type.convert(b)]
dt2[dt1, on = c("code == b")] %>%
.[, lapply(.SD, toString), by = "a"]
#> a code desc
#> 1: p 1, 2 good, better
#> 2: q 1, 2, 3 good, better, best
#> 3: r 4, 5 bad, worst
由reprex包(v2.0.0)在2021-07-27创建
您可以使用逗号分隔字符串并进行连接。
library(dplyr)
library(tidyr)
dt1 %>%
separate_rows(b, sep = ',\s*', convert = TRUE) %>%
left_join(dt2, by = c('b' = 'code')) %>%
group_by(a) %>%
summarise(desc = toString(desc))
# a desc
# <chr> <chr>
#1 p good, better
#2 q good, better, best
#3 r bad, worst