In a large dataset, I need to replace region names with their corresponding codes.
Here is a small reproducible example:
library(tidyverse)
library(stringr)
library(dplyr)
library(tidyr)
library(purrr)
library(stringi)
adrs_data <- data.frame (adress = c("6 Frien Street, Paris", "Toulouse, 7 Hospital street", "10 market avenue (Bordeaux)") )
dep_code <- data.frame (code = c("75", "31", "33"), names = c("Paris", "Toulouse", "Bordeaux"))
This is what I tried:
d_search<-c(dep_code$names)
d_search <- paste(paste0(d_search[order(-nchar(d_search))]), collapse = "|")
c_search<-c(dep_code $code)
df<-adrs_data %>%
dplyr::mutate(c_adress = case_when(adress %in% d_search ~
str_replace_all(adress, d_search, c_search), TRUE ~ adress))
But it does not produce the desired output, which is:
df <- data.frame (adress = c("6 Frien Street, 75", "31, 7 Hospital street", "10 market avenue (33)"))
Thanks for your help,
Regards

After joining the two data frames, you can use pmap
to replace the pattern in each row:
library(dplyr)
library(stringr)
library(purrr)
library(fuzzyjoin)
fuzzy_left_join(adrs_data, dep_code, match_fun = str_detect,
by = c("adress" = "names")) %>%
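# ..1 = adress, ..2 = code, ..3 = names; pmap_chr returns a character column instead of a list column
mutate(adress = pmap_chr(., ~ str_replace(..1, ..3, ..2)))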
# adress code names
# 1 6 Frien Street, 75 75 Paris
# 2 31, 7 Hospital street 31 Toulouse
# 3 10 market avenue (33) 33 Bordeaux
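As a side note, the attempt in the question fails mainly because `adress %in% d_search` compares each whole address against the single alternation string, so the condition is never TRUE. A minimal sketch of an alternative that skips the join, assuming the region names contain no regex metacharacters: `str_replace_all` accepts a named character vector in which the names are the patterns and the values are the replacements. The `lookup` variable and the `c_adress` column are names chosen here for illustration.

```r
library(stringr)

adrs_data <- data.frame(adress = c("6 Frien Street, Paris",
                                   "Toulouse, 7 Hospital street",
                                   "10 market avenue (Bordeaux)"))
dep_code <- data.frame(code = c("75", "31", "33"),
                       names = c("Paris", "Toulouse", "Bordeaux"))

# Named vector: names are the patterns (treated as regex), values are the replacements
lookup <- setNames(dep_code$code, dep_code$names)

# Each pattern is applied in turn to every address
adrs_data$c_adress <- str_replace_all(adrs_data$adress, lookup)
adrs_data$c_adress
# [1] "6 Frien Street, 75"    "31, 7 Hospital street" "10 market avenue (33)"
```

Because the replacements are applied one after another, this sketch is only safe when no code can itself match a later region name.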
This also works, and it removes the parentheses.
I added a row to adrs_data:
adrs_data <- data.frame (adress = c("6 Frien Street, Paris", "Toulouse, 7 Hospital street", "10 market avenue (Bordeaux)", "9 Test Street, Paris") )
dep_code <- data.frame (code = c("75", "31", "33"), names = c("Paris", "Toulouse", "Bordeaux"))
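sapply(seq(nrow(adrs_data)), \(i) {
  adrs_data[i,] %>%
    # try every name/code pair against the current address
    str_replace_all(dep_code$names, dep_code$code) %>%
    # keep only the result in which a replacement actually happened
    .[. != adrs_data[i,]] %>%
    # drop the literal parentheses (escaped as \\( and \\) in an R string)
    sub(pattern = "\\(", replacement = "", x = .) %>%
    sub(pattern = "\\)", replacement = "", x = .)
})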
[,1]
[1,] "6 Frien Street, 75"
[2,] "31, 7 Hospital street"
[3,] "10 market avenue 33"
[4,] "9 Test Street, 75"
On very small data, this is even reasonably fast:
Unit: milliseconds
min lq mean median uq max neval
1.508001 1.671950 2.481109 1.823151 2.534501 11.9309 100 # lapply
58.665500 67.577851 85.267257 76.948251 89.716950 231.3375 100 # fuzzy
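It would be interesting to compare this with @Maël's solution on a large dataset. I assume that the larger the data, the slower the lapply solution becomes.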
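A rough sketch of how such a comparison could be set up, assuming the microbenchmark package (the timing table above looks like its output format). The helpers `row_wise()` and `fuzzy()`, the scale factor `n_rep`, and the data frame `big` are hypothetical names introduced here; the larger dataset is simply the four example addresses repeated, so real data may behave differently.

```r
library(dplyr)
library(stringr)
library(fuzzyjoin)
library(microbenchmark)

dep_code <- data.frame(code = c("75", "31", "33"),
                       names = c("Paris", "Toulouse", "Bordeaux"))
base_adr <- c("6 Frien Street, Paris", "Toulouse, 7 Hospital street",
              "10 market avenue (Bordeaux)", "9 Test Street, Paris")

# n_rep is an arbitrary scale factor (assumption); raise it to test larger data
n_rep <- 250
big <- data.frame(adress = rep(base_adr, times = n_rep))

# Row-by-row approach from this answer, wrapped in a function
row_wise <- function(d) {
  sapply(seq(nrow(d)), \(i) {
    d[i, ] %>%
      str_replace_all(dep_code$names, dep_code$code) %>%
      .[. != d[i, ]] %>%
      sub(pattern = "\\(", replacement = "", x = .) %>%
      sub(pattern = "\\)", replacement = "", x = .)
  })
}

# Join-based approach, using the vectorised str_replace on the joined columns
fuzzy <- function(d) {
  fuzzy_left_join(d, dep_code, match_fun = str_detect,
                  by = c("adress" = "names")) %>%
    mutate(adress = str_replace(adress, names, code))
}

# Time both approaches on the same data
microbenchmark(row_wise = row_wise(big), fuzzy = fuzzy(big), times = 10)
```

Running this for a few values of `n_rep` should show how each approach scales with the number of rows.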