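I have a vector of unique districts (dist) and a second vector (dist_plus) in which each district name carries some extra text. My goal is to create "result", in which each of these similar district names is replaced by the matching unique district.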
dist <- c("Bengaluru", "Andaman","South 24 Parganas")
dist_plus <- c("Bengaluru Rural", "Bengaluru Urban", "South Andaman","North Andaman","South 24 Parganas")
result <- c("Bengaluru", "Bengaluru", "Andaman","Andaman","South 24 Parganas")
What is the simplest way to do this? Thanks.
dist <- c("Bengaluru", "Andaman","South 24 Parganas")
dist_plus <- c("Bengaluru Rural", "Bengaluru Urban", "South Andaman","North Andaman","South 24 Parganas")
library(tidyverse)
# vectorised function to spot matches
f = function(x,y) grepl(x, y)
f = Vectorize(f)
# create a look up table of matches
expand.grid(dist_plus = dist_plus, dist = dist, stringsAsFactors = FALSE) %>%
  filter(f(dist, dist_plus)) -> look_up
# join dist_plus values with their matches
data.frame(dist_plus, stringsAsFactors = FALSE) %>%
  left_join(look_up, by = "dist_plus") %>%
  pull(dist)
#[1] "Bengaluru" "Bengaluru" "Andaman" "Andaman" "South 24 Parganas"
The best approach is the one you understand well. There are many ways to do this; here is one that uses a for loop.
# create an empty result filled with NAs
# if the final result has any NAs, that dist_plus element matched nothing in dist
result <- rep(NA_character_, length(dist_plus))
# for each element of dist_plus, check whether it contains any element of dist
for (d in seq_along(dist_plus)) {
  # d is an integer index running from 1 to length(dist_plus)
  # go over all elements of dist (sapply is roughly a for loop) and test
  # whether each one appears in dist_plus[d]
  incl <- sapply(dist, FUN = function(x, y) grepl(x, y), y = dist_plus[d])
  # take the first matching element of dist (NA if none matched) and store it
  result[d] <- dist[incl][1]
}
[1] "Bengaluru" "Bengaluru" "Andaman" "Andaman"
[5] "South 24 Parganas"
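The loop above can also be sketched as a small reusable function. This is only an illustration under stated assumptions: the name match_district is hypothetical (it does not appear in the original answer), it uses fixed = TRUE so the district names are treated as literal text rather than regular expressions, and it keeps only the first match when several are found.

```r
# Hypothetical helper, not part of the original answer.
# For each element of x, return the first element of keys that occurs
# inside it as a literal substring, or NA if none does.
match_district <- function(x, keys) {
  vapply(x, function(one) {
    # which keys occur in this string?
    found <- vapply(keys, function(k) grepl(k, one, fixed = TRUE),
                    logical(1), USE.NAMES = FALSE)
    hit <- keys[found]
    if (length(hit) > 0) hit[1] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

dist <- c("Bengaluru", "Andaman", "South 24 Parganas")
dist_plus <- c("Bengaluru Rural", "Bengaluru Urban", "South Andaman",
               "North Andaman", "South 24 Parganas")

result_fn <- match_district(dist_plus, dist)
result_fn
```

Because unmatched elements come back as NA instead of being dropped, the result always has the same length as the input, which makes it easier to spot names that failed to match.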
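You can use str_detect to find the similar words. For each element of dist_plus, str_detect checks which entries of dist it contains; matching entries are replaced by the corresponding word from dist and the rest become NA, which na.omit then drops. lapply performs the loop over all elements of dist_plus.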
library(stringr)
c(na.omit(unlist(lapply(dist_plus, function(x) ifelse(str_detect(x, dist),dist,NA)))))
Output:
[1] "Bengaluru" "Bengaluru" "Andaman" "Andaman" "South 24 Parganas"
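The same detect-then-replace idea can also be done in one vectorised step in base R. The following is a minimal sketch of a different technique (a single alternation pattern with regexpr and regmatches), not the method of the answer above. It assumes that no district name contains regular-expression metacharacters; the names pattern and result_alt are made up for the illustration.

```r
dist <- c("Bengaluru", "Andaman", "South 24 Parganas")
dist_plus <- c("Bengaluru Rural", "Bengaluru Urban", "South Andaman",
               "North Andaman", "South 24 Parganas")

# Assumption: the names contain no regex metacharacters such as . or (
# Build one pattern that matches any of the unique district names.
pattern <- paste(dist, collapse = "|")

# regexpr() locates the first match in each string and
# regmatches() extracts the matched text.
result_alt <- regmatches(dist_plus, regexpr(pattern, dist_plus))
result_alt
```

Like the na.omit version, this drops elements of dist_plus that match nothing, so the result can be shorter than the input when some names have no counterpart in dist.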
The following will do what you want.
inx <- lapply(dist, function(s) grep(s, dist_plus))
result2 <- character(length(dist_plus))
for(i in seq_along(inx)){
  result2[ inx[[i]] ] <- dist[i]
}
In the test below, result is the vector from the question.
identical(result, result2)
#[1] TRUE
Thank you all for offering so many different ways to solve the problem. I also came up with a solution of my own.
library(plyr)
dist <- c("Bengaluru", "Andaman","South 24 Parganas")
dist_plus <- c("Bengaluru Rural", "Bengaluru Urban", "South Andaman","North Andaman","South 24 Parganas")
result <- c("Bengaluru", "Bengaluru", "Andaman","Andaman","South 24 Parganas")
r <- dist_plus
l_ply(dist, function(x){
  r[grepl(x, dist_plus)] <<- x
})
identical(r, result)
#[1] TRUE