r - Match and replace similar words with a common word



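I have a vector of unique districts (dist) and a second vector (dist_plus) in which each district name carries some extra text. My goal is to create "result", in which each of those similar district names is replaced by the matching unique district.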

dist <- c("Bengaluru", "Andaman","South 24 Parganas")
dist_plus <- c("Bengaluru Rural", "Bengaluru Urban", "South Andaman","North Andaman","South 24 Parganas")

result <- c("Bengaluru", "Bengaluru", "Andaman","Andaman","South 24 Parganas")

What is the simplest way to do this? Thanks.

dist <- c("Bengaluru", "Andaman","South 24 Parganas")
dist_plus <- c("Bengaluru Rural", "Bengaluru Urban", "South Andaman","North Andaman","South 24 Parganas")
library(tidyverse)
# vectorised function to spot matches
f <- function(x, y) grepl(x, y)
f <- Vectorize(f)
# create a look up table of matches
look_up <- expand.grid(dist_plus = dist_plus, dist = dist, stringsAsFactors = FALSE) %>%
  filter(f(dist, dist_plus))
# join dist_plus values with their matches
data.frame(dist_plus, stringsAsFactors = FALSE) %>%
  left_join(look_up, by = "dist_plus") %>%
  pull(dist)
#[1] "Bengaluru"         "Bengaluru"         "Andaman"           "Andaman"           "South 24 Parganas"

The best approach is the one you understand well. There are many ways to do this; here is one that uses a for loop.

# create an empty result with NAs
# if your final result has any NAs it means no dist matched that element
result <- rep(NA_character_, length(dist_plus))
# for each dist_plus check if it contains any of the dist
for (d in seq_along(dist_plus)) {
  # d is an integer and it will span from 1 to how many elements dist_plus has
  # traverse all elements of dist (sapply =~ for ()) and see if
  # any element appears in your subsetted dist_plus[d]
  incl <- sapply(dist, FUN = function(x, y) grepl(x, y), y = dist_plus[d])
  # write the first matching dist to your result (NA if none matched);
  # taking [1] avoids a failure when zero or several dist match
  result[d] <- dist[incl][1]
}
[1] "Bengaluru"         "Bengaluru"         "Andaman"           "Andaman"          
[5] "South 24 Parganas"
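The same per-element check can be sketched as a small base R helper that returns exactly one value per input and NA where nothing matches. This is only an illustration: the name match_district is hypothetical, and picking the first matching district is an assumption about what is wanted when several districts match.

```r
dist <- c("Bengaluru", "Andaman", "South 24 Parganas")
dist_plus <- c("Bengaluru Rural", "Bengaluru Urban", "South Andaman",
               "North Andaman", "South 24 Parganas")

# match_district is a hypothetical helper, not part of any package
match_district <- function(x, keys) {
  # one column per key: TRUE where that key matches an element of x
  # (keys are treated as regular expressions, as in the loop above)
  hits <- vapply(keys, function(k) grepl(k, x), logical(length(x)))
  # keep a rows-by-keys matrix shape even when x has a single element
  hits <- matrix(hits, nrow = length(x))
  # index of the first matching key per row, NA when no key matches
  idx <- apply(hits, 1, function(row) which(row)[1])
  keys[idx]
}

result <- match_district(dist_plus, dist)
```

An element with no matching district, for example match_district("Mumbai", dist), comes back as NA rather than raising an error.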

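You can use str_detect to compare similar words. For each element of dist_plus, use str_detect to check whether any value of dist occurs in it; where one does, take the replacement word from dist.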

library(stringr)
c(na.omit(unlist(lapply(dist_plus, function(x) ifelse(str_detect(x, dist),dist,NA)))))

Output:

[1] "Bengaluru"         "Bengaluru"         "Andaman"           "Andaman"           "South 24 Parganas"

The following will do what you want.

inx <- lapply(dist, function(s) grep(s, dist_plus))
result2 <- character(length(dist_plus))
for (i in seq_along(inx)) {
  result2[inx[[i]]] <- dist[i]
}

In the test below, result is the vector from the question.

identical(result, result2)
#[1] TRUE
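One caveat worth keeping in mind: grep and grepl treat the pattern as a regular expression by default, so a district name containing characters such as "." can match more than intended. A minimal sketch of the difference, using hypothetical names that are not from the question:

```r
# Hypothetical data: the "." in the key acts as a regex wildcard
# unless fixed = TRUE is passed
key <- "St. Thomas"
names_plus <- c("St. Thomas East", "Stx Thomas West")

regex_hits   <- grepl(key, names_plus)                # TRUE TRUE
literal_hits <- grepl(key, names_plus, fixed = TRUE)  # TRUE FALSE
```

For the district names in the question both forms give the same answer, since none of them contains a regex metacharacter.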

Thank you all for offering so many different ways to solve this. I also came up with a solution of my own.

library(plyr)
dist <- c("Bengaluru", "Andaman","South 24 Parganas")
dist_plus <- c("Bengaluru Rural", "Bengaluru Urban", "South Andaman","North Andaman","South 24 Parganas")
result <- c("Bengaluru", "Bengaluru", "Andaman","Andaman","South 24 Parganas")
r <- dist_plus
l_ply(dist, function(x){
r[grepl(x, dist_plus)] <<- x
})
identical(r, result)
#[1] TRUE
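The same replacement can probably be done without plyr or the global assignment operator `<<-`. The following is a minimal base R sketch; passing fixed = TRUE is an assumption that the district names should be matched literally and not as regular expressions.

```r
dist <- c("Bengaluru", "Andaman", "South 24 Parganas")
dist_plus <- c("Bengaluru Rural", "Bengaluru Urban", "South Andaman",
               "North Andaman", "South 24 Parganas")

# start from the original names so unmatched elements stay unchanged
r <- dist_plus
for (x in dist) {
  # overwrite every element that contains the district name x
  r[grepl(x, dist_plus, fixed = TRUE)] <- x
}
```

If an element contains more than one district name, the district that comes later in dist overwrites the earlier one.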
