r-用平均值查找和填充缺失观测值(整行不是NA值)



我希望能得到一些帮助来识别:

  1. 当不存在NA值时,缺失观测值的位置(缺少整行(
  2. 为缺少的数据创建一行,该行的平均值基于某些类别的平均值

我的df:

  • 曲棍球运动员在本赛季训练中的数值负荷数据

  • 有时,即使选手进行了练习,加速度计在练习过程中也不起作用。因此,为了确保我们仍然可以跟踪他们在训练中所做的工作,我想插入他们在训练后位置(前锋、防守或守门员(的平均值。(即,如果守门员的加速度计不工作,我想取其他守门员的平均负荷,并将其插入该球员的观察行中进行练习(。

  • 如果在加速度计不工作的情况下,每次训练都有一行所有玩家和负载的NA值,这将是一项更简单的任务,但当我从在线云下载csv中的数据时,只有有工作单位的玩家的行。所以这部分是我无法控制的。

> head(DummyLoads)  
Name       Date Load Position    
1   Jim 2019-10-19  900    2.100  Forward 
2   Bob 2019-10-19  900    2.100  Forward 
3  Dave 2019-10-19  900    2.100  Forward 
4 Steve 2019-10-19  850    2.312  Forward 
5  Fred 2019-10-19  850    2.312  Defense 
6   Ray 2019-10-19  850    2.312  Defense
DummyLoads <- structure(list(Name = structure(c(4L, 1L, 2L, 6L, 3L, 5L, 4L, 1L, 2L, 3L, 5L, 4L, 1L, 2L, 6L, 3L, 5L, 2L, 6L, 3L, 5L),
.Label = c("Bob",  "Dave", "Fred", "Jim", "Ray", "Steve"), 
class = "factor"), 
Date = structure(c(1L,  1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,  4L, 4L, 4L, 4L), 
.Label = c("2019-10-19", "2019-10-20", "2019-10-21",  "2019-10-22"), class = "factor"), 
Load = c(900L, 900L, 900L,  850L, 850L, 850L, 789L, 789L, 789L, 960L, 960L, 909L, 909L, 909L,  991L, 991L, 991L, 720L, 717L, 717L, 717L), 
Load.Min = c(2.1,  2.1, 2.1, 2.312, 2.312, 2.312, 2.22, 2.22, 2.22, 2, 2, 1.88,  1.88, 1.88, 1.99, 1.99, 1.99, 2.1, 2.3, 2.3, 2.3), Position = structure(c(2L,  2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,  2L, 2L, 1L, 1L), 
.Label = c("Defense", "Forward"), class = "factor")), class = "data.frame", row.names = c(NA,  -21L))
ggplot(DummyLoads, aes(x = Name, y = Load, fill = Position))+
geom_bar(stat = "identity")+
facet_grid(~Date)

这是一张显示缺少数据的玩家的图表。

缺少玩家加载

理想情况下,我希望能够识别那些缺失的数据点,而不必首先绘制它。我还想避免每次都要手动计算平均值,然后再输入。我希望找到一个自动化的解决方案,因为我将有一整个赛季的练习来做这件事,但我知道这可能很棘手!

提前感谢您的任何建议。如果我没有把事情解释清楚,我深表歉意。

更新到当前问题:

DummyLoads <- DummyLoads %>% 
ungroup()
full_data <- expand.grid(
Athlete = DummyLoads %>% 
pull(Athlete) %>% 
unique(),
Date = DummyLoads %>% 
pull(Date) %>% 
unique())

full_data %>% 
# join incomplete data onto full data
left_join(DummyLoads, by = c("Athlete", "Date")) %>% 
# assign the position to each player
# in the example data, some players do ahve different positions
# if this is true, than it would be unclear which average should be
# considered. Therefore, I assumed their position is constant
left_join(DummyLoads %>% 
select(Athlete, Position) %>% 
distinct(Athlete, .keep_all=TRUE),
by = "Athlete") %>% 
# keep both to check the differences
rename(Position = Position.y) %>% 
group_by(Date, Position) %>% 
# if Load is missing, take the mean of Loads which is grouped
mutate(Load2 = coalesce(PL_Avg, 
mean(PL_Avg, na.rm = T)))

通用基本R解决方案:

# Mark out NA rows flatten in single observation (each element denoting a vector of the df):
is_val_na <- apply(data.frame(lapply(DummyLoads,
function(x){is.na(x)})), 1, paste, collapse = ", " )
# Split up using the grouping var "Name", and impute the mean where NA, coerce list to df: 
DummyLoads_imputed <- do.call("rbind", lapply(split(DummyLoads, DummyLoads$Position), 
function(x){
if(is.numeric(x)){
ifelse(is.na(x), mean(x, na.rm = TRUE), x)
}else{x}
}
)
)
# Bind the data.frame with a factor vector holding the T/F values: 
DummyLoads_imputed <- cbind(DummyLoads_imputed, row_na = as.factor(is_val_na))

使用的数据:

DummyLoads <- structure(list(Name = structure(c(4L, 1L, 2L, 6L, 3L, 5L, 4L, 1L, 2L, 3L, 5L, 4L, 1L, 2L, 6L, 3L, 5L, 2L, 6L, 3L, 5L), 
.Label = c("Bob",  "Dave", "Fred", "Jim", "Ray", "Steve"),
class = "factor"), Date = structure(c(1L,  1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
          3L, 3L, 3L,  4L, 4L, 4L, 4L), .Label = c("2019-10-19", "2019-10-20", "2019-10-21",  "2019-10-22"), 
        class = "factor"), Load = c(900L, 900L, 900L,  850L, 850L, 850L, 789L, 789L, 789L, 960L, 
                                    960L, 909L, 909L, 909L,  991L, 991L, 991L, 720L, 717L, 717L, 717L), 
Position = structure(c(2L,  2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,  1L, 2L, 1L, 1L), 
.Label = c("Defense", "Forward"), class = "factor")), row.names = c(NA,  -21L), class = "data.frame")

最新更新