I want to filter rows in which two or more words from one column also appear in another column.
I have a data frame like this:
df <- data.frame(name1 = c("Carlos Lopez Rey", "Monica Naranjo Garcia", "Antonio Perez Reverte", "Alejandro Martinez Amor", "Iñigo Muruzabal"),
name2 = c("Lopez, Carlos", "Monica de Naranjo", "Garcia, Antonio", "Alejandro Martinez de Amor", "Muruzabal, Javier"))
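I want to create a condition that keeps only the rows where the first column (name1) and the second column (name2) share two or more words. The result I want is:
name1                    name2
Carlos Lopez Rey         Lopez, Carlos
Monica Naranjo Garcia    Monica de Naranjo
Alejandro Martinez Amor  Alejandro Martinez de Amor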
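Split each string into words, use length(intersect(...)) to count the words the two columns have in common, and keep only the rows that share at least 2 words: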
result <- subset(df, mapply(function(x, y) length(intersect(x, y)),
                            strsplit(name1, ',|\\s+'), strsplit(name2, ',|\\s+')) >= 2)
result
#                    name1                      name2
#1        Carlos Lopez Rey              Lopez, Carlos
#2   Monica Naranjo Garcia          Monica de Naranjo
#4 Alejandro Martinez Amor Alejandro Martinez de Amor
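To see why this works, it may help to walk through a single row by hand. The following is only a minimal sketch for the first row of the example data, using the same split pattern; the names w1, w2 and common are illustrative and not part of the answer above.

```r
# Words of the first row, split on a comma or on a run of whitespace
w1 <- strsplit("Carlos Lopez Rey", ",|\\s+")[[1]]
w2 <- strsplit("Lopez, Carlos", ",|\\s+")[[1]]

w1  # "Carlos" "Lopez" "Rey"
w2  # "Lopez" "" "Carlos" (the empty string sits between "," and " ")

# intersect() keeps each shared word once, in the order of its first argument
common <- intersect(w1, w2)
common          # "Carlos" "Lopez"
length(common)  # 2, so this row passes the >= 2 test
```

The empty string produced by ", " is harmless here because name1 contains no empty word, so it never ends up in the intersection.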
Update: thanks to Martin Gal for the valuable suggestion. Consider combining the two mutate calls and the filter into a single line:
library(dplyr)
library(stringr)
df %>%
  filter(str_detect(name2, str_replace_all(name1, " ", "|")))
First answer: we can build a pattern column with str_replace_all, then use str_detect and filter to flag every row in which a word from name1 occurs in name2:
library(dplyr)
library(stringr)
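df %>%
  # build a regex alternation such as "Carlos|Lopez|Rey" from name1
  mutate(pattern_name1 = str_replace_all(name1, " ", "|")) %>%
  # TRUE when at least one word of name1 occurs in name2
  mutate(flag = str_detect(name2, pattern_name1)) %>%
  filter(flag) %>%
  select(name1, name2)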
Output (all five rows are kept, because every name1 shares at least one word with its name2):
name1 name2
1 Carlos Lopez Rey Lopez, Carlos
2 Monica Naranjo Garcia Monica de Naranjo
3 Antonio Perez Reverte Garcia, Antonio
4 Alejandro Martinez Amor Alejandro Martinez de Amor
5 Iñigo Muruzabal Muruzabal, Javier
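Because str_detect only asks whether the pattern matches at all, the pipeline above keeps a row as soon as a single word is shared. One possible way to require two or more shared words with the same pattern idea is to count the matches with str_count instead. This is a sketch and not part of the original answer; the name result is illustrative, and the approach assumes the names contain no regex metacharacters and that no word is a substring of another or repeated within name2.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  name1 = c("Carlos Lopez Rey", "Monica Naranjo Garcia", "Antonio Perez Reverte",
            "Alejandro Martinez Amor", "Iñigo Muruzabal"),
  name2 = c("Lopez, Carlos", "Monica de Naranjo", "Garcia, Antonio",
            "Alejandro Martinez de Amor", "Muruzabal, Javier")
)

result <- df %>%
  # turn "Carlos Lopez Rey" into the regex "Carlos|Lopez|Rey",
  # then count how many times that regex matches inside name2
  filter(str_count(name2, str_replace_all(name1, " ", "|")) >= 2)

result
```

On the example data this keeps the rows for Carlos, Monica and Alejandro. Since it counts regex matches rather than distinct whole words, the intersect-based approach is the safer choice when names can overlap.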
You can solve this with a for loop. Your dataset:
df <- data.frame(name1 = c("Carlos Lopez Rey", "Monica Naranjo Garcia", "Antonio Perez Reverte", "Alejandro Martinez Amor", "Iñigo Muruzabal"),
name2 = c("Lopez, Carlos", "Monica de Naranjo", "Garcia, Antonio", "Alejandro Martinez de Amor", "Muruzabal, Javier"))
Remove the commas and split each string into words:
name1 <- strsplit(gsub(",","",df$name1), ' ')
name2 <- strsplit(gsub(",","",df$name2), ' ')
Now use a loop to find the rows:
rows <- c()
for (i in seq_len(nrow(df))) {
  # keep row i when more than one word of name1 also appears in name2
  if (sum(name1[[i]] %in% name2[[i]]) > 1) {
    rows <- append(rows, i)
  }
}
df_select <- df[rows, ]
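The loop grows rows one element at a time. As a variation on the same idea, the per-row test can be collected into a logical vector with vapply and used to index the data frame directly. This is a sketch under the same preprocessing as above, not the answerer's code; the name keep is illustrative.

```r
df <- data.frame(
  name1 = c("Carlos Lopez Rey", "Monica Naranjo Garcia", "Antonio Perez Reverte",
            "Alejandro Martinez Amor", "Iñigo Muruzabal"),
  name2 = c("Lopez, Carlos", "Monica de Naranjo", "Garcia, Antonio",
            "Alejandro Martinez de Amor", "Muruzabal, Javier")
)

# Same preprocessing as above: drop commas, then split on single spaces
name1 <- strsplit(gsub(",", "", df$name1), " ")
name2 <- strsplit(gsub(",", "", df$name2), " ")

# One logical value per row: TRUE when more than one word of name1 occurs in name2
keep <- vapply(seq_len(nrow(df)),
               function(i) sum(name1[[i]] %in% name2[[i]]) > 1,
               logical(1))

df_select <- df[keep, ]
df_select
```

For the example data, keep is TRUE for rows 1, 2 and 4, which are the same rows the loop selects.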