I need to recode a variable whenever it contains a specific piece of text. Here is a sample dataset:
df <- data.frame(id = c(1,2,3,4,5,6),
var1 = c("Discontinue", "Discontunie","discontinue", "disc","DISCONTINUE","NR"))
> df
id var1
1 1 Discontinue
2 2 Discontunie
3 3 discontinue
4 4 disc
5 5 DISCONTINUE
6 6 NR
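var1 holds discontinuation entries with typos and inconsistent capitalization (upper case, lower case, and so on). I believe the text disc identifies these values reliably. I need to recode var1 to discontinue for every such row. How can I get the following result?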
> df
id var1
1 1 discontinue
2 2 discontinue
3 3 discontinue
4 4 discontinue
5 5 discontinue
6 6 NR
df <- data.frame(id = c(1,2,3,4,5,6),
var1 = c("Discontinue", "Discontunie","discontinue", "disc","DISCONTINUE","NR"))
df$var1 <- ifelse(grepl("^disc", df$var1, ignore.case = TRUE), "discontinue", df$var1)
df
#> id var1
#> 1 1 discontinue
#> 2 2 discontinue
#> 3 3 discontinue
#> 4 4 discontinue
#> 5 5 discontinue
#> 6 6 NR
Created on 2022-10-04 with reprex v2.0.2
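One detail worth checking before reusing the pattern above: "^disc" is anchored to the start of the string, while a bare "disc" matches anywhere in the value. The sketch below compares the two on a small vector; the value "Not discontinued" is hypothetical and does not appear in the question's data.

```r
# Hypothetical values for illustration; only the first two resemble the question's data
x <- c("Discontinue", "disc", "Not discontinued", "NR")

# Anchored: matches only values that START with "disc" (case-insensitive)
anchored <- grepl("^disc", x, ignore.case = TRUE)

# Unanchored: matches "disc" anywhere in the value
unanchored <- grepl("disc", x, ignore.case = TRUE)

print(data.frame(x, anchored, unanchored))
```

For the six values in the question both patterns select the same rows, so the choice only matters if other values can contain disc in the middle.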
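The following should do it. It uses grep to find the rows where var1 contains the text disc, regardless of case (ignore.case = TRUE), and replaces those values with "discontinue":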
df[grep("disc", df$var1, ignore.case = TRUE), "var1"] <- "discontinue"
Output:
# id var1
# 1 1 discontinue
# 2 2 discontinue
# 3 3 discontinue
# 4 4 discontinue
# 5 5 discontinue
# 6 6 NR
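If var1 might be a factor rather than a character vector (the default for data.frame() before R 4.0.0), it may be safer to convert it first, since ifelse() on a factor returns the underlying integer codes. A minimal sketch under that assumption, where stringsAsFactors = TRUE is set only to simulate the older default and is_disc is an illustrative name:

```r
# Simulate the pre-4.0.0 default, where strings become factors (assumption for illustration)
df <- data.frame(id = c(1, 2, 3, 4, 5, 6),
                 var1 = c("Discontinue", "Discontunie", "discontinue",
                          "disc", "DISCONTINUE", "NR"),
                 stringsAsFactors = TRUE)

# Work on plain character values so the replacement is not tied to factor levels
df$var1 <- as.character(df$var1)

# Logical index of rows whose value contains "disc", ignoring case
is_disc <- grepl("disc", df$var1, ignore.case = TRUE)

# Overwrite only the matching rows; everything else (here "NR") is left untouched
df$var1[is_disc] <- "discontinue"
df
```

The logical index from grepl() has one entry per row, so it can also be reused for counting or inspecting the matched rows before overwriting them.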