Replace NA values across an entire data frame with random samples from a vector

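I have a data frame with missing values. Instead of filling the NA values with one specific number, I want to fill them with values drawn at random from another vector.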

ID <- c('A', 'B', 'C')
col1 <- c(5, 2, 1)
col2 <- c(8, 1, 6)
col3 <- c(NA, 2, 3)
col4 <- c(NA, 9, NA)
col5 <- c(NA, NA, NA)
replacementVals <- c(.1, .4, .7, .4, .3, .9, .4)
df <- data.frame(ID, col1, col2, col3, col4, col5)
df
  ID col1 col2 col3 col4 col5
1  A    5    8   NA   NA   NA
2  B    2    1    2    9   NA
3  C    1    6    3   NA   NA

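I tried combinations of is.na() and sample, but could not get anything to work. I know I could use two nested for loops to index every cell, check whether it is NA, and if so sample one value from the vector: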

for (row in 1:nrow(df)) {
  for (col in 2:ncol(df)) {
    if (is.na(df[row, col])) df[row, col] <- sample(replacementVals, 1)
  }
}
df
  ID col1 col2 col3 col4 col5
1  A    5    8  0.9  0.4  0.7
2  B    2    1  2.0  9.0  0.4
3  C    1    6  3.0  0.4  0.7

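But my actual data frame has hundreds of thousands of rows and hundreds of columns, so run time matters a lot. I am hoping there is a more efficient way to do this than brute-forcing it with for loops. Thanks!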

Using dplyr:

library(dplyr)
# replace() fills only the NA positions; columns without NA come back unchanged
df |>
  mutate(across(everything(), ~ replace(.x, is.na(.x),
    sample(replacementVals, sum(is.na(.x)), replace = TRUE))))

  ID col1 col2 col3 col4 col5
1  A    5    8  0.9  0.1  0.4
2  B    2    1  2.0  9.0  0.9
3  C    1    6  3.0  0.9  0.4

Here is a vectorized base R approach:

set.seed(3244)
inds <- is.na(df)
df[inds] <- sample(replacementVals, sum(inds), replace = TRUE)
df
#  ID col1 col2 col3 col4 col5
#1  A    5    8  0.4  0.1  0.3
#2  B    2    1  2.0  9.0  0.4
#3  C    1    6  3.0  0.9  0.9
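
The same idea can be wrapped in a small helper for reuse. This is only a sketch: the name fill_na_random and its arguments are made up for illustration and do not come from any package. It indexes the pool with sample.int instead of calling sample on the pool directly, because sample(x, ...) treats a single number x >= 1 as 1:x, which would matter if the pool ever had length one.

```r
# Hypothetical helper; the name and arguments are illustrative only.
fill_na_random <- function(data, pool) {
  inds <- is.na(data)   # logical matrix marking the NA cells
  n <- sum(inds)
  if (n > 0) {
    # Draw positions in the pool, so a length-one pool is not expanded to 1:pool
    data[inds] <- pool[sample.int(length(pool), n, replace = TRUE)]
  }
  data
}

# Same example data as in the question
df <- data.frame(
  ID   = c("A", "B", "C"),
  col1 = c(5, 2, 1),
  col2 = c(8, 1, 6),
  col3 = c(NA, 2, 3),
  col4 = c(NA, 9, NA),
  col5 = c(NA, NA, NA)
)
replacementVals <- c(.1, .4, .7, .4, .3, .9, .4)

set.seed(3244)
filled <- fill_na_random(df, replacementVals)
sum(is.na(filled))   # 0, every NA cell has been filled
```

Because all draws happen in a single call, the cost is one is.na pass plus one assignment, regardless of how many columns contain NA.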

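Loops are not necessarily bad, especially if you do not change the size of an object inside the loop. Even so, this answer loops over the columns with apply. If the replacementVals vector is shorter than the number of values that need replacing, you must pass replace = TRUE in the call to sample: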

set.seed(1111)  # for reproducibility
df[2:6] <- apply(df[2:6], 2, FUN = function(x) {
  hit <- which(is.na(x))  # positions of the NA values in this column
  x[hit] <- sample(x = replacementVals, size = length(hit), replace = TRUE)
  return(x)
})
df
#   ID col1 col2 col3 col4 col5
# 1  A    5    8  0.4  0.9  0.4
# 2  B    2    1  2.0  9.0  0.4
# 3  C    1    6  3.0  0.4  0.1
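
One thing to keep in mind with apply is that it converts its input to a matrix before looping. That is harmless here because columns 2 to 6 hold only numeric or logical values, but it would coerce everything to character if a text column were included. A possible alternative, sketched below under the assumption of the same df and replacementVals as in the question, is to loop over the columns with lapply so each column stays its own vector. The helper name fill_col is made up for illustration.

```r
# Same example data as in the question
df <- data.frame(
  ID   = c("A", "B", "C"),
  col1 = c(5, 2, 1),
  col2 = c(8, 1, 6),
  col3 = c(NA, 2, 3),
  col4 = c(NA, 9, NA),
  col5 = c(NA, NA, NA)
)
replacementVals <- c(.1, .4, .7, .4, .3, .9, .4)

# Hypothetical per-column helper: fill the NA positions of one vector
fill_col <- function(x) {
  hit <- which(is.na(x))
  x[hit] <- sample(replacementVals, size = length(hit), replace = TRUE)
  x
}

set.seed(1111)  # for reproducibility
# lapply returns a list of columns, which can be assigned back directly
df[2:6] <- lapply(df[2:6], fill_col)
str(df)
```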
