等于R中的max和over(除以),用于平坦化



我的数据为:

ID time0 obs_num recorded_dt day0 day1 day2 day3 day4 day5…day311 2009-01-01 a 2009-01-01 a null null null null null null零1 2009-01-01 d 2009-01-31 null null null null null null null null…D1 2009-01-01 b 2009-01-05 null null null null null b…零b null null null null null null…NULL

数据可以复制为:

example = data.frame(
ID = c(1,1,1,2),
time0 = c('2009-01-01','2009-01-01','2009-01-01','2005-02-02'),
obs_num = c('A','D','B','B'),
recorded_dt = c('2009-01-01','2009-01-31','2009-01-05','2005-02-03')
)
library(tidyverse)
df <- example %>% 
mutate(difs_days = floor(difftime(recorded_dt, time0, units="days"))) %>% 
arrange(difs_days) %>% 
pivot_wider(names_from = difs_days, values_from = obs_num, names_prefix = 'day') %>% 
arrange(ID, recorded_dt)
df
# # A tibble: 4 × 7
#      ID time0      recorded_dt day0  day1  day4  day30
#   <dbl> <chr>      <chr>       <chr> <chr> <chr> <chr>
# 1     1 2009-01-01 2009-01-01  A     NA    NA    NA   
# 2     1 2009-01-01 2009-01-05  NA    NA    B     NA   
# 3     1 2009-01-01 2009-01-31  NA    NA    NA    D    
# 4     2 2005-02-02 2005-02-03  NA    B     NA    NA

我想把数据平铺成:

ID time0 day0 day1 day2 day3 day4 day5…day311 2009-01-01 a null null null b…D2 2005-02-02 null b null null null null…NULL

在SQL中,我会使用max(dayX) over(partition by ID)作为XYZ,然后保持不同的值。我想r一定有有效的方法,你能帮帮我吗?

您可以使用across():

汇总多个列
df %>%
group_by(ID, time0) %>%
summarise(across(day0:day30, ~ if(all(is.na(.x))) NA else max(.x, na.rm = TRUE))) %>%
ungroup()
# # A tibble: 2 × 6
#      ID time0      day0  day1  day4  day30
#   <dbl> <chr>      <chr> <chr> <chr> <chr>
# 1     1 2009-01-01 A     NA    B     D    
# 2     2 2005-02-02 NA    B     NA    NA

更新:

对于给定的示例,我们可以使用:summarise(across(everything(), ~trimws(paste(., collapse = ''))))

要将""替换为NA,只需在代码末尾添加na_if(""):

library(dplyr)
example %>%
select(-recorded_dt) %>% 
mutate(across(everything(), ~ifelse(is.na(.), "", .))) %>% 
group_by(ID, time0) %>% 
summarise(across(everything(), ~trimws(paste(., collapse = '')))) %>% 
na_if("")
ID time0      day0  day1  day4  day30
<dbl> <chr>      <chr> <chr> <chr> <chr>
1     1 2009-01-01 "A"   ""    "B"   "D"  
2     2 2005-02-02 ""    "B"   ""    ""   

使用data.table:

library(data.table)
setDT(df)[
, recorded_dt:=NULL][
, lapply(.SD, (x) sort(x, na.last = TRUE, decreasing = TRUE)[1])
, by=.(ID, time0)]
##    ID      time0 day0 day1 day4 day30
## 1:  1 2009-01-01    A <NA>    B     D
## 2:  2 2005-02-02 <NA>    B <NA>  <NA>

内部变量.SD表示data.table的一个子集,包括除了by=...子句中包含的所有列。这就是为什么我们必须先删除recorded_dt列。

最新更新