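I am not sure how to write a function that replaces NA values in a series of categorical vectors.
Consider the following problem: I have a categorical vector containing NA values, and I want to replace the NAs according to the proportions of the existing data.
For example: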
a<-factor(c("yes","no","no","yes","yes","yes","no","yes","yes","yes","yes","yes",NA, NA))
I wrote the following code:
a[is.na(a)] <- sample(c("yes", "no"), sum(is.na(a)), replace = TRUE,
                      prob = c(sum(na.omit(a == "yes")) / sum(!is.na(a)),
                               sum(na.omit(a == "no")) / sum(!is.na(a))))
## replace NA with yes or no according to the proportion of yes/no in the non-NA data
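
For reference, the probabilities passed to `prob` can be inspected before any replacement is made. This is only a minimal sketch using `prop.table()` on the same example vector; the name `p` is introduced here for illustration.

```r
a <- factor(c("yes","no","no","yes","yes","yes","no","yes","yes","yes","yes","yes",NA,NA))

# table() drops NA by default, so these are proportions among the observed values only
p <- prop.table(table(a))
p
```

The entries of `p` follow the order of `levels(a)`, which is alphabetical here ("no" before "yes").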
The code above works well, but now I have a data frame with many categorical variables. For example:
a<-c("yes","no","no","yes","yes","yes","no","yes","yes","yes","yes","yes",NA, NA)
b<-c("red","blue","white","red","blue","red","blue","red","blue","red","blue",NA,NA,NA)
c<-c(1,3,2,1,2,3,1,2,3,1,2,3,NA,NA)
a<-as.factor(a) ## ensure the vectors are treated as categorical variable
b<-as.factor(b)
c<-as.factor(c)
df<-data.frame(a=a,b=b,c=c)
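I am struggling to write a function that replaces the NA values in all categorical variables of such a data frame. Note that each variable may have more than two categories.
I would create a helper function and do the following: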
helperFunc <- function(x){
    # draw one level per NA, weighted by each level's share of the non-NA values
    sample(levels(x), sum(is.na(x)), replace = TRUE,
           prob = as.numeric(table(x))/sum(!is.na(x)))
}
df[sapply(df, is.na)] <- unlist(sapply(df, helperFunc))
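
The assignment above relies on a logical matrix index being filled column by column, so the replacement values have to arrive in column order (which is what `unlist(sapply(...))` produces). A small sketch of that behaviour on a toy data frame; `m` and `mask` are made-up names used only for this illustration:

```r
m <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6))

mask <- is.na(m)          # logical matrix with the same shape as m
m[mask] <- c(10, 20, 30)  # values are consumed column by column: x first, then y
m
```

Here the single NA in `x` receives 10, and the two NAs in `y` receive 20 and 30 in row order.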
Testing with a random seed (e.g., 123) on the original `df`:
set.seed(123)
df[sapply(df, is.na)] <- unlist(sapply(df, helperFunc))
df
# a b c
# 1 yes red 1
# 2 no blue 3
# 3 no white 2
# 4 yes red 1
# 5 yes blue 2
# 6 yes red 3
# 7 no blue 1
# 8 yes red 2
# 9 yes blue 3
# 10 yes red 1
# 11 yes blue 2
# 12 yes red 3
# 13 yes blue 2
# 14 no white 3
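
An alternative is to fill each column separately with `lapply`, which avoids the matrix indexing and leaves columns without NAs untouched. This is a sketch under assumptions rather than a drop-in replacement: `fill_na_by_proportion` is a hypothetical name, every column is assumed to be a factor, and a column that is entirely NA is returned unchanged because there is nothing to sample from.

```r
# Hypothetical helper: fill the NAs of one factor by sampling its levels
# in proportion to their frequency among the non-missing values.
fill_na_by_proportion <- function(x) {
  miss <- is.na(x)
  if (!any(miss) || all(miss)) return(x)       # nothing to fill, or nothing to sample from
  probs <- as.numeric(prop.table(table(x)))    # table() ignores NA; order matches levels(x)
  x[miss] <- sample(levels(x), sum(miss), replace = TRUE, prob = probs)
  x
}

df <- data.frame(
  a = factor(c("yes","no","no","yes","yes","yes","no","yes","yes","yes","yes","yes",NA,NA)),
  b = factor(c("red","blue","white","red","blue","red","blue","red","blue","red","blue",NA,NA,NA)),
  c = factor(c(1,3,2,1,2,3,1,2,3,1,2,3,NA,NA))
)

set.seed(123)
df[] <- lapply(df, fill_na_by_proportion)      # df[] keeps the data.frame structure
```

Because the sampled values are always existing levels, each column stays a factor with its original levels.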