R - Randomly or proportionally assign categorical values to NA



I have a dataset:

df <- structure(list(gender = c("female", "male", NA, NA, "male", "male", 
"male"), Division = c("South Atlantic", "East North Central", 
"Pacific", "East North Central", "South Atlantic", "South Atlantic", 
"Pacific"), Median = c(57036.6262, 39917, 94060.208, 89822.1538, 
107683.9118, 56149.3217, 46237.265), first_name = c("Marilyn", 
"Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))

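I need to run an analysis that cannot have NA values in the gender variable. There are too few other columns, and none with known predictive value, so the values can't really be imputed.

I could run the analysis by dropping the incomplete observations entirely (they make up about 4% of the dataset), but I'd like to see the results with female or male randomly assigned to the missing cases.

Short of writing some really ugly code that filters to the incomplete cases, splits them in two, and replaces the NAs with female in one half and male in the other, I wonder whether there is an elegant way to assign values to the NAs randomly or proportionally?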

We can use ifelse and is.na to find the NA entries, and then use sample to randomly pick female or male for each of them:

# Draw one value per row so that each NA gets its own independent draw
df$gender <- ifelse(is.na(df$gender), sample(c("female", "male"), nrow(df), replace = TRUE), df$gender)
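
One thing to watch with ifelse is that a length-one `yes` argument is recycled, so a single draw would be reused for every NA. As a minimal sketch, assuming a plain character vector that mirrors the gender column from the question (the seed value is arbitrary and only there for reproducibility), you can instead draw exactly as many values as there are missing entries:

```r
# Toy vector mirroring df$gender from the question
gender <- c("female", "male", NA, NA, "male", "male", "male")

set.seed(42)  # arbitrary seed, only so the draws can be reproduced

missing <- is.na(gender)

# One independent draw per missing entry; observed values are left untouched
gender[missing] <- sample(c("female", "male"), sum(missing), replace = TRUE)

gender
```

The same indexing works directly on the data frame column, i.e. with df$gender in place of gender.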

How about this:

> df <- structure(list(gender = c("female", "male", NA, NA, "male", "male", 
+                                 "male"),
+                      Division = c("South Atlantic", "East North Central", 
+                                   "Pacific", "East North Central", "South Atlantic", "South Atlantic", 
+                                   "Pacific"),
+                      Median = c(57036.6262, 39917, 94060.208, 89822.1538,
+                                 107683.9118, 56149.3217, 46237.265),
+                      first_name = c("Marilyn", "Jeffery", "Yashvir", "Deyou", "John", "Jose", "Daniel")),
+                 row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))
> 
> Gender <- rbinom(length(df$gender), 1, 0.52)
> Gender <- factor(Gender, levels = c(0, 1), labels = c("female", "male"))
> 
> df$gender[is.na(df$gender)] <- as.character(Gender[is.na(df$gender)])
> 
> df$gender
[1] "female" "male"   "female" "female" "male"   "male"   "male"  
> 

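This is random with a given probability (here each draw is male with probability 0.52). You could also consider imputing the values with nearest neighbor, hot deck, or a similar method.

Hope this helps.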
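
For the "proportionally" part of the question, one possible sketch is to use the shares seen in the non-missing values as sampling probabilities, which amounts to a very simple random hot deck. This assumes the observed values are representative of the missing ones; the toy vector and the seed below are illustrative only.

```r
# Toy vector mirroring df$gender from the question
gender <- c("female", "male", NA, NA, "male", "male", "male")

set.seed(1)  # arbitrary seed for reproducibility

missing  <- is.na(gender)
observed <- gender[!missing]

# Shares among the observed values: female 0.2, male 0.8 for this vector
shares <- prop.table(table(observed))

# Draw replacements with probabilities equal to the observed shares
gender[missing] <- sample(names(shares), sum(missing),
                          replace = TRUE, prob = as.numeric(shares))

gender
```

Sampling directly from the observed values with sample(observed, sum(missing), replace = TRUE) should give the same distribution, since each observed value is then equally likely to be picked as the donor.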

Just assign:

df$gender[is.na(df$gender)]=sample(c("female", "male"), dim(df)[1], replace = TRUE)[is.na(df$gender)]
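
If you want the even split described in the question rather than independent draws, a rough sketch is to build a balanced vector of labels and shuffle it before assigning. The toy vector and seed are assumptions for illustration; with an odd number of NAs the first label (female here) gets the extra case.

```r
# Toy vector mirroring df$gender from the question
gender <- c("female", "male", NA, NA, "male", "male", "male")

set.seed(7)  # arbitrary seed for reproducibility

missing   <- is.na(gender)
n_missing <- sum(missing)

# Alternate the two labels to get an (almost) even split
fill <- rep(c("female", "male"), length.out = n_missing)

# Shuffle by index so which row gets which label is random
gender[missing] <- fill[sample.int(n_missing)]

gender
```

Unlike independent sampling, this fixes the counts in advance, so only the allocation of labels to rows is random.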
