I have two data frames containing text:
Df1: "hello world", "dark world thor", "hello there"
Df2: "world hello", "thor dark world", "there hello"
I want to check whether the values in df1 are in df2 and, if so, add a column showing TRUE/FALSE.
Thanks :(
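Effectively, what you want is:
library(dplyr)    # provides mutate() and %>%
library(purrr)
library(stringr)

df1 <- data.frame(
  matrix(
    c(
      "hello world",
      "dark world thor",
      "hello there"
    ),
    nrow = 3,
    ncol = 1,
    byrow = TRUE,
    dimnames = list(NULL,
                    c("word1"))
  ),
  stringsAsFactors = FALSE
)

# Each alternative uses lookaheads so the words may appear in any order.
# Backslashes must be doubled inside an R string ("\\b" is the regex \b).
df1_match <- df1 %>%
  mutate(matched =
    str_detect(word1, "^(?=.*\\bhello\\b)(?=.*\\bworld\\b).*$|^(?=.*\\bthor\\b)(?=.*\\bdark\\b)(?=.*\\bworld\\b).*$|^(?=.*\\bhello\\b)(?=.*\\bthere\\b).*$")
  )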
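To see why the lookaheads make the match independent of word order, here is a minimal sketch that applies just the first alternative of that pattern to a few strings. The `candidates` vector is made up for illustration and is not part of the question's data.

```r
library(stringr)

# First alternative of the pattern above: "hello" and "world" must both
# appear as whole words, in any order.
pattern <- "^(?=.*\\bhello\\b)(?=.*\\bworld\\b).*$"

# Hypothetical test strings (an assumption, not from the question).
candidates <- c("hello world", "world hello", "hello there", "hello big world")

result <- str_detect(candidates, pattern)
print(result)
```

Note that the last string also matches: the lookaheads only require that every listed word is present, so extra words are allowed. If you need the two strings to contain exactly the same words, this pattern alone is not enough.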
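Ideally, you would build that huge regular expression in str_detect as an object instead of typing it by hand. For example, here is your second data frame: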
df2 <- data.frame(
  matrix(
    c(
      "world hello",
      "thor dark world",
      "there hello"
    ),
    nrow = 3,
    ncol = 1,
    byrow = TRUE,
    dimnames = list(NULL,
                    c("word2"))
  ),
  stringsAsFactors = FALSE
)
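Converting the second data frame into an object that can be used in the regular expression is the hard part. I thought something like the following would work, but apparently it does not. Not sure what I am missing here...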
df2_mod <- df2 %>%
  mutate(word2 = strsplit(x = word2, split = " ") %>%
    map(~paste0("(?=.*\\b", .x, "\\b)")) %>%
    paste0("^", ., ".*$")
  ) %>%
  .$word2 %>%
  paste0(., collapse = "|")

df1_match_alt <- df1 %>%
  mutate(matched =
    str_detect(word1, df2_mod)
  )
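One likely reason the attempt above fails is that map() returns a list, and paste0() turns each list element into a deparsed string such as c("(?=...)", "(?=...)") rather than concatenating the lookaheads. Below is a minimal sketch of one way around that, using map_chr() with collapse = "" so that each row becomes a single string. The name `df2_pattern` is mine, and the sketch assumes the words contain no regex metacharacters (otherwise they would need escaping first).

```r
library(dplyr)
library(purrr)
library(stringr)

df1 <- data.frame(word1 = c("hello world", "dark world thor", "hello there"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(word2 = c("world hello", "thor dark world", "there hello"),
                  stringsAsFactors = FALSE)

# For every row of df2: split into words, wrap each word in a lookahead,
# glue the lookaheads into one string, then join the rows with "|".
df2_pattern <- df2$word2 %>%
  str_split(" ") %>%
  map_chr(~ paste0("^", paste0("(?=.*\\b", .x, "\\b)", collapse = ""), ".*$")) %>%
  paste0(collapse = "|")

# Flag each value in df1 that contains all words of at least one df2 row.
df1_match_alt <- df1 %>%
  mutate(matched = str_detect(word1, df2_pattern))

print(df1_match_alt)
```

With the data from the question, every row of df1 should come out as matched, because each one has a reordered counterpart in df2. A string such as "hello thor" would not match, since no single row of df2 has all of its words in it.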