R - Calculating change since a base year

I have a dataset that looks like this:

df1 <- data.frame(id = c(rep("A1",4), rep("A2",4)),
                  time = rep(c(0,2:4), 2),
                  y1 = rnorm(8),
                  y2 = rnorm(8))

For each y variable, I want to calculate its change since time == 0. Basically, I want to do this:

calc_chage <- function(id, data){
  #y1
  y1_0 <- data$y1[which(data$time==0 & data$id==id)]
  D2y1 <- data$y1[which(data$time==2 & data$id==id)] - y1_0
  D3y1 <- data$y1[which(data$time==3 & data$id==id)] - y1_0
  D4y1 <- data$y1[which(data$time==4 & data$id==id)] - y1_0
  #y2
  y2_0 <- data$y2[which(data$time==0 & data$id==id)]
  D2y2 <- data$y2[which(data$time==2 & data$id==id)] - y2_0
  D3y2 <- data$y2[which(data$time==3 & data$id==id)] - y2_0
  D4y2 <- data$y2[which(data$time==4 & data$id==id)] - y2_0
  #Output
  out <- data.frame(id=id, delta=rep(2:4, 2), 
           outcome=c(rep("y1",3), rep("y2",3)),
           change = c(D2y1, D3y1, D4y1,
                      D2y2, D3y2, D4y2))
}
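library(purrr)
library(dplyr)  # provides bind_rows()
# Apply calc_chage to each id, then stack the per-id data frames
changes <- map(.x = unique(df1$id), .f = calc_chage, data=df1) %>% 
  map_df(bind_rows)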

My guess is that there is a more efficient way to do this. Alas, I can't think of one. Suggestions?

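To calculate the change since time == 0, you can use cumsum + diff: diff() gives the step-to-step differences within each id, and their running sum is the difference from the first row (this relies on time == 0 being the first row of each group). Because the summarised result has a length other than one, first wrap it in a list, then unnest it, and finally use gather to convert the result to long format: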

library(tidyverse)
df1 %>% 
    group_by(id) %>% 
    summarise_all(~ list(cumsum(diff(.)))) %>% 
    unnest() %>% rename(delta = time) %>% 
    gather(outcome, change, y1:y2) %>% 
    arrange(id) -> changes2
changes2
# A tibble: 12 x 4
#       id delta outcome     change
#   <fctr> <dbl>   <chr>      <dbl>
# 1     A1     2      y1  2.2827244
# 2     A1     3      y1  2.2070326
# 3     A1     4      y1  1.9530212
# 4     A1     2      y2 -2.1263046
# 5     A1     3      y2 -0.5430784
# 6     A1     4      y2 -0.3109535
# 7     A2     2      y1 -1.8587070
# 8     A2     3      y1 -1.1399270
# 9     A2     4      y1  1.5667202
#10     A2     2      y2 -2.0047108
#11     A2     3      y2 -3.4414667
#12     A2     4      y2 -1.3662450

changes$delta <- as.numeric(changes$delta)
changes$outcome <- as.character(changes$outcome)
all.equal(as.data.frame(changes2), changes)
# [1] TRUE
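
For comparison, here is a minimal sketch of the same idea using newer tidyverse verbs, assuming dplyr >= 1.0 and tidyr >= 1.0 are installed. It subtracts each id's time == 0 value directly instead of going through cumsum(diff()), so it does not depend on the rows being sorted by time. The seed and the name changes3 are illustrative assumptions, not part of the original question.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumed seed, only to make the example reproducible
df1 <- data.frame(id = c(rep("A1", 4), rep("A2", 4)),
                  time = rep(c(0, 2:4), 2),
                  y1 = rnorm(8),
                  y2 = rnorm(8))

changes3 <- df1 %>%
  group_by(id) %>%
  # subtract each id's baseline (time == 0) value from every y column
  mutate(across(y1:y2, ~ .x - .x[time == 0])) %>%
  ungroup() %>%
  filter(time != 0) %>%            # the baseline rows are all zero, drop them
  rename(delta = time) %>%
  pivot_longer(y1:y2, names_to = "outcome", values_to = "change") %>%
  arrange(id, outcome, delta)
```

Adding more outcome columns only requires widening the y1:y2 selection in across() and pivot_longer().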

If you would rather rely on base R functions, I find aggregate() a good alternative to the other solutions posted:

res <- aggregate(x = df1$y2, by = list(df1$id), FUN = function(x) x - x[1], 
                 simplify = TRUE)[-1]
data.frame(df1, delta = c(t(res)))
#   id time         y1          y2      delta
# 1 A1    0  0.9176567 -0.70469232  0.0000000
# 2 A1    2 -0.8258515  0.18032808  0.8850204
# 3 A1    3 -0.8144515 -0.39995370  0.3047386
# 4 A1    4  1.5171310 -0.97107643 -0.2663841
# 5 A2    0  0.1900048 -0.01022439  0.0000000
# 6 A2    2 -0.7181630  0.35408157  0.3643060
# 7 A2    3  0.1379936 -0.34336329 -0.3331389
# 8 A2    4  0.4773945  1.38467064  1.3948950
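
Another base R option, sketched here as an assumption rather than something from the answers above, is ave(), which returns a vector aligned with the original rows and so avoids the transpose step. The loop covers both y columns. Like the aggregate() version, it assumes the time == 0 row comes first within each id. The seed, y_cols, and the change_ prefix are illustrative choices.

```r
set.seed(1)  # assumed seed, only to make the example reproducible
df1 <- data.frame(id = c(rep("A1", 4), rep("A2", 4)),
                  time = rep(c(0, 2:4), 2),
                  y1 = rnorm(8),
                  y2 = rnorm(8))

y_cols <- c("y1", "y2")  # outcome columns to process
for (col in y_cols) {
  # within each id, subtract the first value (the time == 0 row)
  df1[[paste0("change_", col)]] <- ave(df1[[col]], df1$id,
                                       FUN = function(x) x - x[1])
}
df1
```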

What if you just pull out the values at t = 0 and join them back? This can be generalized further to handle more y variables.

For example:

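library(dplyr)
# Baseline (time == 0) values, one row per id
t0 <- df1 %>%
  filter(time == 0) %>%
  mutate(t0_y1 = y1,
         t0_y2 = y2) %>%
  select(-time, -y1, -y2)
# Attach the baseline to every row and subtract it
df1 <- df1 %>%
  left_join(t0, by = "id") %>%
  mutate(change.y1 = y1 - t0_y1,
         change.y2 = y2 - t0_y2)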
