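I am looking for help with R. I want to add three columns to an existing data frame that contains time-series data and has many NA values. The data are test scores. The first column I want to add is the first available test score. The second is the last available test score. For the third column, I want to compute a derivative for each row by dividing the difference between the first and last scores by the number of tests that have passed. Importantly, some of those past tests have NA values, and I still want to include them in the divisor. However, I do not want to count the NA values that come after the last available test score.
Some explanation of my data: I have several data frames, each holding test scores for different people. The people are the rows and each column represents a test score. Within a data frame, each person has multiple scores for the same test. Column T1 shows their first score, column T2 the second score collected a week later, and so on. Some people started earlier than others and therefore have more test scores. In addition, some scores at the beginning and in the middle are missing for various reasons. See the two example tables below, where the INDEX column is the actual index of the data frame, not a separate column. Some numbers are missing from the index (such as 3) because that person's row contained only NA values and I removed it. It is important to me that the index stays this way.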
Example 1 (Test A):
INDEX | T1 | T2 | T3 | T4 | T5 | T6
---|---|---|---|---|---|---
1 | NA | NA | NA | 3 | 4 | 5
2 | 57 | 57 | 57 | 57 | NA | NA
4 | 44 | NA | NA | NA | NA | NA
5 | 9 | 11 | 11 | 17 | 12 | NA
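To make the requested calculation concrete, here is a minimal base R sketch for a single row, using the scores of the person with index 5. The variable names (scores, observed, last_pos and so on) are made up for illustration. It assumes the divisor is the position of the last non-NA score, so that NA values before it count and NA values after it do not.

```r
# Scores T1..T6 for the person with index 5 (values from the example data)
scores <- c(T1 = 9, T2 = 11, T3 = 11, T4 = 17, T5 = 12, T6 = NA)

observed    <- which(!is.na(scores))      # positions that hold a score
first_score <- scores[[min(observed)]]    # first available score
last_score  <- scores[[max(observed)]]    # last available score
last_pos    <- max(observed)              # tests elapsed up to the last score

# Difference between last and first score, per elapsed test
derivative <- (last_score - first_score) / last_pos
print(derivative)  # (12 - 9) / 5 = 0.6
```

The trailing NA in T6 is ignored because last_pos stops at T5. An NA in the middle of the row would still be counted, since only positions are used for the divisor.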
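Using one of the pmap_* functions from purrr:
library(dplyr)
library(purrr)
# For each row: rebuild a one-row data frame, then add the summary columns.
# c(...)[-1] drops INDEX so that only the T columns are inspected.
pmap_dfr(df1, ~{c(...) %>% t %>% as.data.frame() %>%
    mutate(first_score = first(na.omit(c(...)[-1])),
           last_score = last(na.omit(c(...)[-1])),
           deriv = (last_score - first_score)/max(which(!is.na(c(...)[-1]))))})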
INDEX T1 T2 T3 T4 T5 T6 first_score last_score deriv
1 1 NA NA NA 3 4 5 3 5 0.3333333
2 2 57 57 57 57 NA NA 57 57 0.0000000
3 4 44 NA NA NA NA NA 44 44 0.0000000
4 5 9 11 11 17 12 NA 9 12 0.6000000
Or, in dplyr, using only cur_data() without rowwise(), which slows the operation down:
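# Grouping by INDEX gives one row per group, so cur_data() holds a single row.
# [1:6] limits the lookup to T1..T6 once first_score has been added.
# Note: cur_data() is deprecated as of dplyr 1.1.0 in favour of pick().
df1 %>% group_by(INDEX) %>%
  mutate(first_score = c_across(starts_with('T'))[min(which(!is.na(cur_data())))],
         last_score = c_across(starts_with('T'))[max(which(!is.na(cur_data()[1:6])))],
         deriv = (last_score - first_score)/max(which(!is.na(cur_data()[1:6]))))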
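If row-by-row evaluation is the concern, the same three columns can also be computed for all rows at once in base R. The following is only a sketch under assumptions: add_score_summary is a hypothetical helper name, the score columns are assumed to be named T1, T2 and so on, and rows that contain only NA are assumed to have been removed already, as described in the question. The data frame is the df1 sample data posted in this thread.

```r
# Sample data copied from the df1 definition posted in this thread
df1 <- data.frame(
  INDEX = c(1L, 2L, 4L, 5L),
  T1 = c(NA, 57L, 44L, 9L),
  T2 = c(NA, 57L, NA, 11L),
  T3 = c(NA, 57L, NA, 11L),
  T4 = c(3L, 57L, NA, 17L),
  T5 = c(4L, NA, NA, 12L),
  T6 = c(5L, NA, NA, NA)
)

# Hypothetical helper: adds first score, last score and slope for every row
add_score_summary <- function(df,
                              score_cols = grep("^T[0-9]+$", names(df), value = TRUE)) {
  m <- as.matrix(df[score_cols])
  has_score <- !is.na(m)
  # Assumption: every row has at least one score (all-NA rows were dropped)
  stopifnot(all(rowSums(has_score) > 0))

  rows <- seq_len(nrow(m))
  # Column position of the first / last TRUE in each row
  first_pos <- max.col(has_score, ties.method = "first")
  last_pos  <- max.col(has_score, ties.method = "last")

  # Matrix indexing with (row, column) pairs picks one value per row
  df$first_score <- m[cbind(rows, first_pos)]
  df$last_score  <- m[cbind(rows, last_pos)]
  df$deriv <- (df$last_score - df$first_score) / last_pos
  df
}

res <- add_score_summary(df1)
print(res)
```

Because the columns are appended with plain assignment, the row names of the data frame are left untouched, which matters if the index is stored there instead of in an INDEX column. The score columns are found by name pattern, so the same call should also work for a data frame with only T1 to T4.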
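I think you can use the following solution. It turned out, surprisingly, a bit verbose and convoluted, but I think it is quite effective. I assumed that if the Last available score is not actually in the last T column, then I need to detect its index and divide the difference by that index, which means the NA values after the last score are not counted. Otherwise, the difference is divided by the total number of T columns.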
library(dplyr)
library(purrr)
df1 %>%
  select(T1:T6) %>%
  # For each row, keep the non-NA scores and return the first and last of them
  pmap(., ~ {x <- c(...)[!is.na(c(...))]; c(x[1], x[length(x)])}) %>%
  exec(rbind, !!!.) %>%
  as_tibble() %>%
  set_names(c("First", "Last")) %>%
  bind_cols(df1) %>%
  relocate(First, Last, .after = last_col()) %>%
  rowwise() %>%
  # length(df1) - 1 is the number of T columns (every column except INDEX)
  mutate(Derivative = ifelse(!is.na(T6) & T6 == Last, (Last - First)/(length(df1)-1),
                             (Last - First)/last(which(c_across(T1:T6) == Last))))
# First Sample Data
# A tibble: 4 x 10
# Rowwise:
INDEX T1 T2 T3 T4 T5 T6 First Last Derivative
<int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 NA NA NA 3 4 5 3 5 0.333
2 2 57 57 57 57 NA NA 57 57 0
3 4 44 NA NA NA NA NA 44 44 0
4 5 9 11 11 17 12 NA 9 12 0.6
Second sample data:
df2 %>%
  select(T1:T4) %>%
  pmap(., ~ {x <- c(...)[!is.na(c(...))]; c(x[1], x[length(x)])}) %>%
  exec(rbind, !!!.) %>%
  as_tibble() %>%
  set_names(c("First", "Last")) %>%
  bind_cols(df2) %>%
  relocate(First, Last, .after = last_col()) %>%
  rowwise() %>%
  mutate(Derivative = ifelse(!is.na(T4) & T4 == Last, (Last - First)/(length(df2)-1),
                             (Last - First)/last(which(c_across(T1:T4) == Last))))
# A tibble: 4 x 8
# Rowwise:
INDEX T1 T2 T3 T4 First Last Derivative
<int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 NA NA NA 17 17 17 0
2 2 11 16 20 20 11 20 2.25
3 4 1 20 NA NA 1 20 9.5
4 5 20 20 20 20 20 20 0
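As far as I can tell, the two branches of the ifelse() above reduce to a single rule: divide by the position of the last non-NA score. The sketch below checks that reading in base R on the T1 to T4 values of the second sample data. The function names divisor_two_branch and divisor_last_pos are hypothetical, and the first one is my paraphrase of the answer's logic, not the answer's own code.

```r
# T1..T4 values copied from the second sample data (df2), one person per row
scores <- rbind(
  c(NA, NA, NA, 17),
  c(11, 16, 20, 20),
  c( 1, 20, NA, NA),
  c(20, 20, 20, 20)
)

# Paraphrase of the answer's rule: use the number of T columns when the final
# column holds the last score, otherwise the last position equal to that score
divisor_two_branch <- function(x) {
  last_score <- tail(x[!is.na(x)], 1)
  n <- length(x)
  if (!is.na(x[n]) && x[n] == last_score) n else tail(which(x == last_score), 1)
}

# Simplified reading: position of the last non-NA score
divisor_last_pos <- function(x) max(which(!is.na(x)))

two_branch <- apply(scores, 1, divisor_two_branch)
last_pos   <- apply(scores, 1, divisor_last_pos)

# Both rules should agree for every row
print(identical(as.integer(two_branch), as.integer(last_pos)))
```

If the two rules agree on your own data as well, the single-rule form is easier to generalise, because it does not need to name the final T column.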
Here is a tidyverse solution without hard-coding. First I pivot the data to long format, then extract the statistics for each INDEX.
library(tidyverse)
df1 %>%
  pivot_longer(-INDEX, names_to = "time", names_prefix = "T", names_transform = list(time = as.integer)) %>%
  filter(!is.na(value)) %>%
  group_by(INDEX) %>%
  summarize(FirstScore = first(value), LastScore = last(value), divisor = max(time)) %>%
  mutate(Derivative = (LastScore - FirstScore) / divisor) %>%
  right_join(df1) %>%
  select(INDEX, T1:T6, FirstScore, LastScore, Derivative)
This gives the following output:
# A tibble: 4 x 10
INDEX T1 T2 T3 T4 T5 T6 FirstScore LastScore Derivative
<int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 NA NA NA 3 4 5 3 5 0.333
2 2 57 57 57 57 NA NA 57 57 0
3 4 44 NA NA NA NA NA 44 44 0
4 5 9 11 11 17 12 NA 9 12 0.6
Output for the second data set, with df2 substituted for df1 (in both the pipeline and the right_join() call) and T1:T4 selected instead of T1:T6:
# A tibble: 4 x 8
INDEX T1 T2 T3 T4 FirstScore LastScore Derivative
<int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 NA NA NA 17 17 17 0
2 2 11 16 20 20 11 20 2.25
3 4 1 20 NA NA 1 20 9.5
4 5 20 20 20 20 20 20 0
Sample data:
df1 <- data.frame(
INDEX = c(1L, 2L, 4L, 5L),
T1 = c(NA, 57L, 44L, 9L),
T2 = c(NA, 57L, NA, 11L),
T3 = c(NA, 57L, NA, 11L),
T4 = c(3L, 57L, NA, 17L),
T5 = c(4L, NA, NA, 12L),
T6 = c(5L, NA, NA, NA)
)
df2 <- data.frame(
INDEX = c(1L, 2L, 4L, 5L),
T1 = c(NA, 11L, 1L, 20L),
T2 = c(NA, 16L, 20L, 20L),
T3 = c(NA, 20L, NA, 20L),
T4 = c(17L, 20L, NA, 20L)
)
You could also do the following:
df1 %>%
  rowwise() %>%
  mutate(firstScore = first(na.omit(c_across(T1:T6))),
         lastScore = last(na.omit(c_across(T1:T6))),
         Derivative = (lastScore-firstScore)/max(which(!is.na(c_across(T1:T6)))))
# A tibble: 4 x 10
# Rowwise:
INDEX T1 T2 T3 T4 T5 T6 firstScore lastScore Derivative
<int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 NA NA NA 3 4 5 3 5 0.333
2 2 57 57 57 57 NA NA 57 57 0
3 4 44 NA NA NA NA NA 44 44 0
4 5 9 11 11 17 12 NA 9 12 0.6