Recoding a variable in R when it contains part of a given text



I need to recode a variable whenever it contains a specific piece of text. Here is a sample dataset:

df <- data.frame(id = c(1,2,3,4,5,6),
var1 = c("Discontinue", "Discontunie","discontinue", "disc","DISCONTINUE","NR"))
> df
id        var1
1  1 Discontinue
2  2 Discontunie
3  3 discontinue
4  4        disc
5  5 DISCONTINUE
6  6          NR

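var1 holds discontinuation information, but the values contain misspellings and a mix of upper and lower case. I believe the text disc is enough to identify these values. I need to recode every such value of var1 to discontinue. How can I get the following result?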

> df
id        var1
1  1 discontinue
2  2 discontinue
3  3 discontinue
4  4 discontinue
5  5 discontinue
6  6          NR
df <- data.frame(id = c(1,2,3,4,5,6),
var1 = c("Discontinue", "Discontunie","discontinue", "disc","DISCONTINUE","NR"))

df$var1 <- ifelse(grepl("^disc", df$var1, ignore.case = TRUE), "discontinue", df$var1)
df
#>   id        var1
#> 1  1 discontinue
#> 2  2 discontinue
#> 3  3 discontinue
#> 4  4 discontinue
#> 5  5 discontinue
#> 6  6          NR

Created on 2022-10-04 with reprex v2.0.2
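
The `^` in the pattern above anchors the match to the start of the string, so only values that begin with "disc" are recoded. A minimal sketch of the difference between an anchored and an unanchored pattern follows; the value "Not discontinued" is a hypothetical one that does not appear in the question's data and is included only for illustration:

```r
# Hypothetical values; "Not discontinued" is not in the original data.
x <- c("Discontinue", "disc", "Not discontinued", "NR")

# Anchored: matches only when the string starts with "disc" (any case).
anchored <- grepl("^disc", x, ignore.case = TRUE)

# Unanchored: matches when "disc" appears anywhere in the string.
anywhere <- grepl("disc", x, ignore.case = TRUE)

print(data.frame(x, anchored, anywhere))
```

Which form is appropriate depends on whether values that merely mention "disc" somewhere in the middle should also be recoded.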

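The following should do it. It uses grep to find the rows where var1 contains the text disc, regardless of case (ignore.case = TRUE), and replaces those values with "discontinue":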

df[grep("disc", df$var1, ignore.case = TRUE), "var1"] <- "discontinue"

Output:

#   id        var1
# 1  1 discontinue
# 2  2 discontinue
# 3  3 discontinue
# 4  4 discontinue
# 5  5 discontinue
# 6  6          NR
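
If the same recoding is needed for several columns or patterns, the indexing approach can be wrapped in a small helper. This is only a sketch: `recode_match` is a hypothetical name, not a function from base R or any package, and it assumes the column can safely be treated as character.

```r
# Hypothetical helper: replace every element of x that matches
# `pattern` (case-insensitively) with `replacement`.
recode_match <- function(x, pattern, replacement) {
  x <- as.character(x)                          # assumption: character data is wanted
  hit <- grepl(pattern, x, ignore.case = TRUE)  # FALSE for NA, so NA values are left alone
  x[hit] <- replacement
  x
}

df <- data.frame(id = c(1,2,3,4,5,6),
                 var1 = c("Discontinue", "Discontunie", "discontinue",
                          "disc", "DISCONTINUE", "NR"))

df$var1 <- recode_match(df$var1, "disc", "discontinue")
print(df)
```

Using grepl here yields a logical vector the same length as the column, which makes it easy to inspect or combine with other conditions before assigning.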
