Replace part of a string with the corresponding code in R



In a large dataset, I need to replace region names with their corresponding codes.

Here is a small reproducible example:

library(tidyverse)
library(stringr)
library(dplyr)
library(tidyr)
library(purrr)
library(stringi)
adrs_data <- data.frame(adress = c("6 Frien Street, Paris", "Toulouse, 7 Hospital street", "10 market avenue (Bordeaux)"))
dep_code <- data.frame(code = c("75", "31", "33"), names = c("Paris", "Toulouse", "Bordeaux"))

Here is what I tried:

d_search <- c(dep_code$names)
d_search <- paste(paste0(d_search[order(-nchar(d_search))]), collapse = "|")
c_search <- c(dep_code$code)
df <- adrs_data %>%
  dplyr::mutate(c_adress = case_when(adress %in% d_search ~
    str_replace_all(adress, d_search, c_search), TRUE ~ adress))

But it does not produce the desired output, which is:

df <- data.frame(adress = c("6 Frien Street, 75", "31, 7 Hospital street", "10 market avenue (33)"))

Thanks for your help,

Regards

After joining the two data frames, you can use pmap to replace the pattern in each row:

library(dplyr)
library(stringr)
library(purrr)
library(fuzzyjoin)
fuzzy_left_join(adrs_data, dep_code, match_fun = str_detect,
                by = c("adress" = "names")) %>%
  mutate(adress = pmap(., ~ str_replace(..1, ..3, ..2)))
#                  adress code    names
# 1    6 Frien Street, 75   75    Paris
# 2 31, 7 Hospital street   31 Toulouse
# 3 10 market avenue (33)   33 Bordeaux
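
A note on why the attempt in the question returns the addresses unchanged: `adress %in% d_search` compares each whole address against the single string "Toulouse|Bordeaux|Paris", so it is never TRUE and `case_when` always falls through to `TRUE ~ adress`. As a lighter alternative to the join, here is a minimal sketch that passes a named vector to `str_replace_all()`, where the names are the patterns and the values are the replacements. It assumes the place names contain no regex metacharacters; `lookup` is just an illustrative name.

```r
library(stringr)

adrs_data <- data.frame(adress = c("6 Frien Street, Paris",
                                   "Toulouse, 7 Hospital street",
                                   "10 market avenue (Bordeaux)"))
dep_code <- data.frame(code  = c("75", "31", "33"),
                       names = c("Paris", "Toulouse", "Bordeaux"))

# Named vector: names are the patterns, values are the replacements.
# Assumption: the place names contain no regex metacharacters.
lookup <- setNames(dep_code$code, dep_code$names)

# Every address is checked against every name in one vectorised call.
adrs_data$c_adress <- str_replace_all(adrs_data$adress, lookup)
adrs_data$c_adress
# "6 Frien Street, 75" "31, 7 Hospital street" "10 market avenue (33)"
```

Since the names are treated as regular expressions, a name containing characters such as "." or "(" would need escaping first.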

This also works, and it removes the parentheses:

I added one entry to adrs_data:

adrs_data <- data.frame(adress = c("6 Frien Street, Paris", "Toulouse, 7 Hospital street", "10 market avenue (Bordeaux)", "9 Test Street, Paris"))
dep_code <- data.frame(code = c("75", "31", "33"), names = c("Paris", "Toulouse", "Bordeaux"))
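sapply(seq(nrow(adrs_data)), function(i) {
  adrs_data[i, ] %>%
    # one candidate per name; only the matching name changes the string
    str_replace_all(dep_code$names, dep_code$code) %>%
    # keep the candidate that differs from the original address
    .[. != adrs_data[i, ]] %>%
    # drop the parentheses
    sub(pattern = "\\(", replacement = "", x = .) %>%
    sub(pattern = "\\)", replacement = "", x = .)
})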
[,1]                   
[1,] "6 Frien Street, 75"   
[2,] "31, 7 Hospital street"
[3,] "10 market avenue 33"  
[4,] "9 Test Street, 75"   

For very small data, this is even reasonably fast:

Unit: milliseconds
min        lq      mean    median        uq      max neval
1.508001  1.671950  2.481109  1.823151  2.534501  11.9309   100 # lapply
58.665500 67.577851 85.267257 76.948251 89.716950 231.3375   100 # fuzzy

It would be interesting to compare this with @Maël's solution on a large dataset. I assume the lapply solution gets slower as the data grows.
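
To check that assumption, one could time the row-by-row approach at a few input sizes. The sketch below is only an illustration: the larger inputs are made up by repeating the four sample addresses, `row_wise` is a hypothetical helper name, and `gsub("[()]", "", ...)` stands in for the two `sub()` calls. No timings are shown because they depend on the machine.

```r
library(stringr)

dep_code <- data.frame(code  = c("75", "31", "33"),
                       names = c("Paris", "Toulouse", "Bordeaux"))
small <- c("6 Frien Street, Paris", "Toulouse, 7 Hospital street",
           "10 market avenue (Bordeaux)", "9 Test Street, Paris")

# Row-by-row replacement, following the sapply() approach above.
row_wise <- function(d) {
  sapply(seq(nrow(d)), function(i) {
    out <- str_replace_all(d[i, ], dep_code$names, dep_code$code)
    out <- out[out != d[i, ]]   # keep the candidate that actually changed
    gsub("[()]", "", out)       # strip both parentheses
  })
}

# Hypothetical larger inputs: the sample addresses repeated 100 and 1000 times.
for (times in c(100, 1000)) {
  big <- data.frame(adress = rep(small, times = times))
  elapsed <- system.time(res <- row_wise(big))[["elapsed"]]
  cat(nrow(big), "rows:", elapsed, "seconds\n")
}
```

Running the same loop around the fuzzy join pipeline would give a like-for-like comparison.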
