R - How do I filter rows that share 2 or more words with another column?



I want to filter the rows in which one column contains 2 or more of the words found in another column.

I have a data frame like this:

df <- data.frame(name1 = c("Carlos Lopez Rey", "Monica Naranjo Garcia", "Antonio Perez Reverte", "Alejandro Martinez Amor", "Iñigo Muruzabal"), 
name2 = c("Lopez, Carlos", "Monica de Naranjo", "Garcia, Antonio", "Alejandro Martinez de Amor", "Muruzabal, Javier"))

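I want to build a condition that keeps the rows where the first column (name1) and the second column (name2) have 2 or more words in common. The result I want is:

name1                    name2
Carlos Lopez Rey         Lopez, Carlos
Monica Naranjo Garcia    Monica de Naranjo
Alejandro Martinez Amor  Alejandro Martinez de Amor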

Split the strings into words, use length(intersect(...)) to find the words they have in common, and keep only the rows with at least 2 shared words.

result <- subset(df, mapply(function(x, y) length(intersect(x, y)), 
                            strsplit(name1, ',|\\s+'), strsplit(name2, ',|\\s+')) >= 2)
result
#                    name1                      name2
#1        Carlos Lopez Rey              Lopez, Carlos
#2   Monica Naranjo Garcia          Monica de Naranjo
#4 Alejandro Martinez Amor Alejandro Martinez de Amor
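
To see why this works, it can help to look at the intermediate values for individual rows. The following is a small sketch using the first and third rows of the example data; the variable names (a1, b1, shared1 and so on) are only illustrative and are not part of the answer above.

```r
# Split one name from each column into words (same regex as in the answer).
a1 <- strsplit("Carlos Lopez Rey", ",|\\s+")[[1]]
b1 <- strsplit("Lopez, Carlos", ",|\\s+")[[1]]

a1  # "Carlos" "Lopez" "Rey"
b1  # "Lopez" "" "Carlos"  (the empty string comes from the ", " pair)

# Words present in both vectors.
shared1 <- intersect(a1, b1)  # "Carlos" "Lopez"
length(shared1) >= 2          # TRUE, so row 1 is kept

a3 <- strsplit("Antonio Perez Reverte", ",|\\s+")[[1]]
b3 <- strsplit("Garcia, Antonio", ",|\\s+")[[1]]
shared3 <- intersect(a3, b3)  # "Antonio"
length(shared3) >= 2          # FALSE, so row 3 is dropped
```

The empty string produced by splitting on ", " is harmless here because name1 never contains an empty token, so it cannot end up in the intersection.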

Update: thanks to Martin Gal for the valuable suggestion:

Consider combining the two mutate calls and the filter into a single line:

library(dplyr)
library(stringr)
df %>%
filter(str_detect(name2, str_replace_all(name1," ", "|")))

First answer: we can create a pattern column with str_replace_all, then use str_detect to flag every row in which any word of name1 occurs in name2, and filter on that flag:

library(dplyr)
library(stringr)
df %>% 
mutate(pattern_name1 = str_replace_all(name1," ", "|")) %>% 
mutate(flag = str_detect(name2, pattern_name1)) %>% 
filter(flag == TRUE) %>% 
select(1,2)

Output:

                    name1                      name2
1        Carlos Lopez Rey              Lopez, Carlos
2   Monica Naranjo Garcia          Monica de Naranjo
3   Antonio Perez Reverte            Garcia, Antonio
4 Alejandro Martinez Amor Alejandro Martinez de Amor
5         Iñigo Muruzabal          Muruzabal, Javier
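
The output above keeps all five rows, because a single shared word is enough for str_detect to return TRUE. One possible way to require 2 or more shared words while staying with stringr is to count the matches with str_count instead. This is only a sketch: n_matches and result_2plus are hypothetical names, and str_count counts regex matches as substrings, so a word repeated in name2, or a short word that occurs inside a longer one, would also be counted.

```r
library(stringr)

df <- data.frame(
  name1 = c("Carlos Lopez Rey", "Monica Naranjo Garcia", "Antonio Perez Reverte",
            "Alejandro Martinez Amor", "Iñigo Muruzabal"),
  name2 = c("Lopez, Carlos", "Monica de Naranjo", "Garcia, Antonio",
            "Alejandro Martinez de Amor", "Muruzabal, Javier")
)

# Turn "Carlos Lopez Rey" into the regex "Carlos|Lopez|Rey".
pattern_name1 <- str_replace_all(df$name1, " ", "|")

# Row by row, count how often any of those words occurs in name2.
n_matches <- str_count(df$name2, pattern_name1)

# Keep only the rows with at least two matches.
result_2plus <- df[n_matches >= 2, ]
result_2plus
```

With the example data this keeps rows 1, 2 and 4, the same rows as the desired result in the question.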

This can also be solved with a for loop.

Your dataset:

df <- data.frame(name1 = c("Carlos Lopez Rey", "Monica Naranjo Garcia", "Antonio Perez Reverte", "Alejandro Martinez Amor", "Iñigo Muruzabal"), 
name2 = c("Lopez, Carlos", "Monica de Naranjo", "Garcia, Antonio", "Alejandro Martinez de Amor", "Muruzabal, Javier"))

Remove the commas and split on spaces:

name1 <- strsplit(gsub(",","",df$name1),  ' ')
name2 <- strsplit(gsub(",","",df$name2), ' ')

Now use a loop to find the rows:

rows <- c()
for (i in seq_len(nrow(df))) {
  # keep row i when more than one word of name1 also appears in name2
  if (sum(name1[[i]] %in% name2[[i]]) > 1) {
    rows <- append(rows, i)
  }
}
df_select <- df[rows, ]
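
The loop grows rows one element at a time with append. As a possible alternative, the same per-row count can be computed in one pass with vapply and then used as a logical index. This is a minimal sketch of that idea, not a change to the answer's logic; shared_count is a hypothetical name.

```r
df <- data.frame(
  name1 = c("Carlos Lopez Rey", "Monica Naranjo Garcia", "Antonio Perez Reverte",
            "Alejandro Martinez Amor", "Iñigo Muruzabal"),
  name2 = c("Lopez, Carlos", "Monica de Naranjo", "Garcia, Antonio",
            "Alejandro Martinez de Amor", "Muruzabal, Javier")
)

# Same preprocessing as in the loop version: drop commas, split on spaces.
name1 <- strsplit(gsub(",", "", df$name1), " ")
name2 <- strsplit(gsub(",", "", df$name2), " ")

# For each row, count the words of name1 that also appear in name2.
shared_count <- vapply(
  seq_along(name1),
  function(i) sum(name1[[i]] %in% name2[[i]]),
  integer(1)
)

# Keep the rows with more than one shared word.
df_select <- df[shared_count > 1, ]
df_select
```

Because vapply declares the return type up front (integer(1)), it fails loudly if the function ever returns something else, which is one reason it may be preferred over sapply here.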
