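I am cleaning demographic data submitted by more than 10 school districts, and the data is not standardized. I want to find patterns and recode them so that the data is clean and simple.
Say I have a variable called Race, and one of its categories is "Native Hawaiian - Pacific Islander".
School A submits this category as "Native Hawaiian or Other Pacific Islander". School B submits it as "Native Hawaiian/Pacific Islander". School C submits it as "Native Hawaiian or Pacific Islander".
How can I recode the variable so that whenever R sees the word "Pacific" in a value, that value becomes "Native Hawaiian - Pacific Islander"?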
Original data:
df_original <- data.frame(Race=c("Native Hawaiian or Other Pacific Islander",
"Native Hawaiian/Pacific Islander", "Native Hawaiian or Pacific Islander",
"Black or African American", "Black", "Black/African American"))
This is the desired cleaned data:
df_desired <- data.frame(Race=c("Native Hawaiian - Pacific Islander","Native Hawaiian - Pacific Islander",
"Native Hawaiian - Pacific Islander","Black - African American",
"Black - African American","Black - African American"))
grepl() returns TRUE for strings that contain "Pacific" and FALSE otherwise. Use it to subset your vector and replace the matching elements with the string you want:
df_original$Race[grepl("Pacific", df_original$Race)] <- "Native Hawaiian - Pacific Islander"
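The same idea can be extended to several categories at once. The following is a minimal sketch, assuming a hypothetical lookup vector named recode_map (not part of the original answer) whose names are the patterns and whose values are the clean labels. It also sets stringsAsFactors = FALSE, since on R versions before 4.0 the column would otherwise be a factor and assigning a new label would produce NA.

```r
# Example data from the question, kept as character rather than factor
df_original <- data.frame(Race = c(
  "Native Hawaiian or Other Pacific Islander",
  "Native Hawaiian/Pacific Islander",
  "Native Hawaiian or Pacific Islander",
  "Black or African American", "Black", "Black/African American"
), stringsAsFactors = FALSE)

# Hypothetical lookup: names are regex patterns, values are the clean labels
recode_map <- c(
  "Pacific" = "Native Hawaiian - Pacific Islander",
  "Black"   = "Black - African American"
)

# Work on a copy so the raw column stays available for auditing
df_clean <- df_original
for (pattern in names(recode_map)) {
  # Match against the raw values, ignoring case differences between districts
  hits <- grepl(pattern, df_original$Race, ignore.case = TRUE)
  df_clean$Race[hits] <- recode_map[[pattern]]
}

print(df_clean)
```

Adding a new category is then a matter of adding one entry to recode_map. If a value could match more than one pattern, the last matching pattern in the vector wins, so the order of the entries matters.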
Using str_detect and case_when:
library(dplyr)
library(stringr)
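df_original %>%
  # \\b marks a word boundary; the backslash must be doubled inside an R string
  mutate(Race2 = case_when(str_detect(Race, '\\bPacific\\b') ~
                             "Native Hawaiian - Pacific Islander",
                           # every value that does not contain "Pacific" falls through to here
                           TRUE ~ "Black - African American"))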
which gives the output:
                                       Race                              Race2
1 Native Hawaiian or Other Pacific Islander Native Hawaiian - Pacific Islander
2          Native Hawaiian/Pacific Islander Native Hawaiian - Pacific Islander
3       Native Hawaiian or Pacific Islander Native Hawaiian - Pacific Islander
4                 Black or African American           Black - African American
5                                     Black           Black - African American
6                    Black/African American           Black - African American
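With more than two categories, a catch-all TRUE branch would silently mislabel anything unexpected. Here is a hedged sketch of a variant that gives each category its own pattern and leaves unmatched values as NA so they can be reviewed. The data frame df_example and its "Asian" row are made up for illustration and are not from the original data.

```r
library(dplyr)
library(stringr)

# Hypothetical sample: two known categories plus one value with no rule yet
df_example <- data.frame(
  Race = c("Native Hawaiian/Pacific Islander", "Black", "Asian"),
  stringsAsFactors = FALSE
)

df_recoded <- df_example %>%
  mutate(Race2 = case_when(
    # regex(..., ignore_case = TRUE) tolerates capitalisation differences
    str_detect(Race, regex("\\bPacific\\b", ignore_case = TRUE)) ~
      "Native Hawaiian - Pacific Islander",
    str_detect(Race, regex("\\bBlack\\b", ignore_case = TRUE)) ~
      "Black - African American",
    # No rule matched: keep NA so the value is flagged instead of mislabeled
    TRUE ~ NA_character_
  ))

# List the raw values that still need a rule
df_recoded %>% filter(is.na(Race2))
```

case_when evaluates its conditions in order and uses the first one that is TRUE, so more specific patterns should come before more general ones.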
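Another option is to create a key/value dataset that holds each pattern to match and its replacement value, then join it to the original data with regex_left_join (from fuzzyjoin):
library(dplyr)      # provides %>%, tibble() and transmute()
library(fuzzyjoin)
# Race holds the regex pattern, Race2 the clean label
keydat <- tibble(Race = c("Pacific", "Black"),
                 Race2 = c("Native Hawaiian - Pacific Islander", "Black - African American"))
regex_left_join(df_original, keydat) %>%
  transmute(Race = Race2)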
#Joining by: "Race"
# Race
#1 Native Hawaiian - Pacific Islander
#2 Native Hawaiian - Pacific Islander
#3 Native Hawaiian - Pacific Islander
#4 Black - African American
#5 Black - African American
#6 Black - African American
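Because this is a join, a value that matches no pattern should come back as NA, and a value that matches more than one pattern should appear once per match. Before joining, it can help to check how many patterns each distinct raw value matches. This is a minimal base R sketch; the vectors patterns and values are illustrative assumptions, with "Asian" added as a made-up unmatched value.

```r
# Patterns as they would appear in the key dataset
patterns <- c("Pacific", "Black")

# Hypothetical distinct raw values to audit before joining
values <- c("Native Hawaiian or Other Pacific Islander", "Black", "Asian")

# One column per pattern, one row per value; TRUE where the pattern matches
match_matrix <- sapply(patterns, function(p) grepl(p, values))

# Number of patterns matched by each value
n_matches <- rowSums(match_matrix)

unmatched <- values[n_matches == 0]  # would become NA after the join
ambiguous <- values[n_matches > 1]   # would be duplicated by the join

print(unmatched)
print(ambiguous)
```

Running a check like this on unique(df_original$Race) keeps the audit small even when the full data has many rows.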