R: Replacing multiple values using a reference table



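I am cleaning a database in which one of the fields is "country", but the country names in my database do not match the output I need.

I thought of using the str_replace function, but I have more than 50 countries to fix, so that is not the most efficient approach. I have already prepared a CSV file that holds the original country input alongside the output I need, to use as a reference.

This is what I have so far: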

library(stringr)
library(dplyr)
library(tidyr)
library(readxl)
database1 <- read_excel("database.xlsx")
## One str_replace call per country; this does not scale to 50+ names
database1$country <- str_replace(database1$country, "USA", "United States")
database1$country <- str_replace(database1$country, "UK", "United Kingdom")
database1$country <- str_replace(database1$country, "Bolivia", "Bolivia, Plurinational State of")
write.csv(database1, "test.csv", row.names = FALSE, fileEncoding = "UTF-8", na = "")

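Note: the levels passed to factor must be unique (no duplicates). Duplicated labels are accepted since R 3.4.0, where they merge the corresponding levels. Any value of country that is not listed in levels becomes NA.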

# database1 <- read_excel("database.xlsx")  ## read database excel book
old_names <- c("USA", "UGA", "CHL") ## country abbreviations
new_names <- c("United States", "Uganda", "Chile")  ## country full form

Base R

database1 <- within( database1, country <- factor( country, levels = old_names, labels = new_names ))

data.table

library('data.table')
setDT(database1)
database1[, country := factor(country, levels = old_names, labels = new_names)]
database1
#          country
# 1: United States
# 2:        Uganda
# 3:         Chile
# 4: United States
# 5:        Uganda
# 6:         Chile
# 7: United States
# 8:        Uganda
# 9:         Chile
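
To illustrate the uniqueness note on factor, here is a small sketch. It assumes R 3.4.0 or later, and the vector codes is made up for the example; it is not part of the original data.

```r
## Hypothetical input: two spellings that should map to the same country
codes <- c("USA", "US", "UGA")

## Duplicated labels are allowed (R >= 3.4.0) and merge the levels
f <- factor(codes,
            levels = c("USA", "US", "UGA"),
            labels = c("United States", "United States", "Uganda"))
merged_levels <- levels(f)
merged_values <- as.character(f)

## Duplicated levels are not allowed and raise an error
dup_result <- tryCatch(
  factor(codes, levels = c("USA", "USA", "UGA")),
  error = function(e) "error"
)
```

So several old spellings can share one new name, but each old spelling may appear only once in levels.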

Data

database1 <- data.frame(country = c("USA", "UGA", "CHL", "USA", "UGA", "CHL", "USA", "UGA", "CHL"))
#    country
# 1     USA
# 2     UGA
# 3     CHL
# 4     USA
# 5     UGA
# 6     CHL
# 7     USA
# 8     UGA
# 9     CHL

Edit: instead of two variables such as old_names and new_names, you can create a single named vector, countries.

countries <- c("USA", "UGA", "CHL")
names(countries) <- c("United States", "Uganda", "Chile")
within( database1, country <- factor( country, levels = countries, labels = names(countries) ))
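
One thing to watch with the factor approach is that any value missing from levels turns into NA. If unmatched names should be kept as they are, a lookup with match() is one possible alternative. The sketch below reuses the named vector countries from above; the "FRA" row and the country_full column are invented for the example.

```r
## Named vector as above: values are the old codes, names are the new labels
countries <- c("USA", "UGA", "CHL")
names(countries) <- c("United States", "Uganda", "Chile")

## Hypothetical data containing one code ("FRA") that is not in the lookup
database1 <- data.frame(country = c("USA", "UGA", "FRA", "CHL"),
                        stringsAsFactors = FALSE)

## factor() turns the unmatched code into NA
via_factor <- as.character(factor(database1$country,
                                  levels = countries,
                                  labels = names(countries)))

## match() returns NA for unmatched codes, so fall back to the original value
idx <- match(database1$country, countries)
database1$country_full <- ifelse(is.na(idx),
                                 database1$country,
                                 names(countries)[idx])
```

Here via_factor has NA in the third position, while country_full keeps "FRA" unchanged.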

I have solved this kind of problem before with a similar approach, using a .csv file to drive the bulk replacements.

Example .csv file format:

library(data.table)
## Generate example replacements csv file to see the format used
Replacements <- data.table(Old = c("USA","UGA","CHL"),
                           New = c("United States", "Uganda", "Chile"))
fwrite(Replacements,"Replacements.csv")

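Once you have "Replacements", you can use stringi::stri_replace_all_regex() to replace all of the names in one call. (For what it's worth, nearly the entire stringr package is essentially a wrapper around stringi. Since stringi runs slightly faster and has more functionality, I prefer to stick with stringi.)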

library(data.table)
library(readxl)
library(stringi)
## Read in list of replacements
Replacements <- fread("Replacements.csv")
## Read in file to be cleaned
database1<- read_excel("database.xlsx")
## Perform Replacements
## "^" and "$" anchor each pattern so that only whole-cell matches are replaced
database1$country <- stringi::stri_replace_all_regex(database1$country,
                                              "^" %s+% Replacements$Old %s+% "$",
                                              Replacements$New,
                                              vectorize_all = FALSE)
## Write CSV
write.csv(database1, "test.csv", row.names = FALSE, fileEncoding = "UTF-8", na = "")
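
To see why the patterns are wrapped in "^" and "$", here is a small sketch on an invented vector x. With vectorize_all = FALSE, every pattern is applied in turn to every element, so an unanchored "UK" would also match inside a longer name.

```r
library(stringi)

## Hypothetical inputs, not taken from the original database
x   <- c("UK", "UKRAINE", "USA")
old <- c("UK", "USA")
new <- c("United Kingdom", "United States")

## Unanchored: "UK" is also replaced inside "UKRAINE"
unanchored <- stri_replace_all_regex(x, old, new, vectorize_all = FALSE)

## Anchored: only cells that consist of exactly the old name are replaced
anchored <- stri_replace_all_regex(x, "^" %s+% old %s+% "$", new,
                                   vectorize_all = FALSE)
```

The unanchored call yields "United KingdomRAINE" for the second element, while the anchored call leaves "UKRAINE" alone. Note that the old names are treated as regular expressions, so names containing characters such as parentheses or dots would need escaping.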

I tried to use base R data.frame syntax where possible to avoid any confusion, but if I were doing this for myself I would stick with full data.table syntax:

library(data.table)
library(readxl)
library(stringi)
## Read in list of replacements
Replacements <- fread("Replacements.csv")
## Read in file to be cleaned
## read_excel() returns a tibble, so convert it before using := below
database1 <- as.data.table(read_excel("database.xlsx"))
## Perform Replacements
database1[, country := stri_replace_all_regex(country, "^" %s+% Replacements[, Old] %s+% "$",
                                              Replacements[, New],
                                              vectorize_all = FALSE)]
## Write CSV
fwrite(database1,"test.csv")
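
Since the replacements already live in a two-column table, another option is a data.table update join, which matches whole values and needs no regular expressions. This is a sketch of that alternative rather than the method used above; the example rows, including the unmatched "FRA", are invented.

```r
library(data.table)

## Reference table in the same Old/New format as Replacements.csv
Replacements <- data.table(Old = c("USA", "UGA", "CHL"),
                           New = c("United States", "Uganda", "Chile"))

## Hypothetical data to clean; "FRA" has no entry in the reference table
database1 <- data.table(country = c("USA", "UGA", "FRA", "CHL", "USA"))

## Update join: for rows whose country equals an Old value,
## overwrite country by reference with the matching New value
database1[Replacements, on = .(country = Old), country := i.New]
```

Rows without a match in Replacements are left untouched, so "FRA" stays as it is.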
