r语言 - 有条件的组的最后日期



这个问题是这个问题的后续问题,但每个idPerson可以有多个decision == "d"。有多种idPerson,但有一个足以解释这个问题。idAppt嵌套在idPerson中。考虑此数据框。

idPerson idAppt decision date      
1 A             1 a        2021-09-10
2 A             1 b        2021-09-11
3 A             1 c        2021-09-12
4 A             1 d        2021-09-13
5 A             2 a        2021-09-20
6 A             2 b        2021-09-21
7 A             3 a        2021-09-10
8 A             3 b        2021-09-11
9 A             4 a        2021-09-21
10 A             4 b        2021-09-22
11 A             4 c        2021-09-23
12 A             4 d        2021-09-24
13 A             5 a        2021-09-10
14 A             5 b        2021-09-11
15 A             6 a        2021-10-10
16 A             6 b        2021-10-11

我想构建一个date2列来回复这些条件:

  • 对于给定的idAppt,如果decision == "a"晚于任何其他日期,当decision == "d"同一idPerson时,报告该idPersondecision == "d"时的最新值date(最接近的之前)。例如,在组idAppt == 2中,decision == "a"的日期晚于组idAppt == 1decision == "d"日期,因此date22021-09-13。这同样适用于组idAppt == 6,但这里有两个更早的decision == "d"(第 4 行和第 12 行)。在这种情况下,date2应该是2021-10-10之前最接近的,即2021-09-23.
  • 当给定idAppt没有早于decision == "a"datedecision == "d"date时,取给定idPerson中最早的。

这给出了以下所需的输出:

idPerson idAppt decision date       date2       
1 A             1 a        2021-09-10 2021-09-10
2 A             1 b        2021-09-11 2021-09-10
3 A             1 c        2021-09-12 2021-09-10
4 A             1 d        2021-09-13 2021-09-10
5 A             2 a        2021-09-20 2021-09-13 #<- correspond to value of row 4
6 A             2 b        2021-09-21 2021-09-13  
7 A             3 a        2021-09-10 2021-09-10 
8 A             3 b        2021-09-11 2021-09-10
9 A             4 a        2021-09-21 2021-09-13
10 A             4 b        2021-09-22 2021-09-13
11 A             4 c        2021-09-23 2021-09-13
12 A             4 d        2021-09-24 2021-09-13
13 A             5 a        2021-09-11 2021-09-10 #<- earliest value because 2021-09-10 is earlier than 2021-09-13
14 A             5 b        2021-09-12 2021-09-10
15 A             6 a        2021-10-10 2021-09-24 #<- correspond to value of row 12
16 A             6 b        2021-10-11 2021-09-24
<小时 />

数据

df <- structure(list(idPerson = c("A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A"), idAppt = c(1L, 
1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L), 
decision = c("a", "b", "c", "d", "a", "b", "a", "b", "a", 
"b", "c", "d", "a", "b", "a", "b"), date = structure(c(18880, 
18881, 18882, 18883, 18890, 18891, 18880, 18881, 18891, 18892, 
18893, 18894, 18881, 18882, 18910, 18911), class = "Date")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -16L))
EO <- structure(list(idPerson = c("A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A"), idAppt = c(1L, 
1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L), 
decision = c("a", "b", "c", "d", "a", "b", "a", "b", "a", 
"b", "c", "d", "a", "b", "a", "b"), date = structure(c(18880, 
18881, 18882, 18883, 18890, 18891, 18880, 18881, 18891, 18892, 
18893, 18894, 18881, 18882, 18910, 18911), class = "Date"), 
date2 = c("2021-09-10", "2021-09-10", "2021-09-10", "2021-09-10", 
"2021-09-13", "2021-09-13", "2021-09-10", "2021-09-10", "2021-09-13", 
"2021-09-13", "2021-09-13", "2021-09-13", "2021-09-10", "2021-09-10", 
"2021-09-24", "2021-09-24")), row.names = c(NA, -16L), class = c("tbl_df", 
"tbl", "data.frame"))

使用data.table滚动连接:

library(data.table)
setDT(df)
# rolling join between decision "d" and "a"
df[decision == "a", date2 := df[decision == "d"][.SD, on = .(idPerson, date), x.date, roll = Inf]]
# set non-matching rows for decision "a" to min(date)
df[decision == "a" & is.na(date2), date2 := min(date), by = idPerson]
# replace other NA by last observation carried forward
setnafill(df, type = "locf", cols = "date2")
idPerson idAppt decision       date      date2
1:        A      1        a 2021-09-10 2021-09-10
2:        A      1        b 2021-09-11 2021-09-10
3:        A      1        c 2021-09-12 2021-09-10
4:        A      1        d 2021-09-13 2021-09-10
5:        A      2        a 2021-09-20 2021-09-13
6:        A      2        b 2021-09-21 2021-09-13
7:        A      3        a 2021-09-10 2021-09-10
8:        A      3        b 2021-09-11 2021-09-10
9:        A      4        a 2021-09-21 2021-09-13
10:        A      4        b 2021-09-22 2021-09-13
11:        A      4        c 2021-09-23 2021-09-13
12:        A      4        d 2021-09-24 2021-09-13
13:        A      5        a 2021-09-11 2021-09-10
14:        A      5        b 2021-09-12 2021-09-10
15:        A      6        a 2021-10-10 2021-09-24
16:        A      6        b 2021-10-11 2021-09-24

"idAppt"的相关性并不完全清楚,因为日期的比较似乎是在idPerson中进行的。

这是我解决问题的方法,尽管它看起来有点复杂:

library(dplyr)
df %>%
group_by(idPerson) %>%
mutate(d_date = list(date[decision == "d"]), min_date_person = min(date)) %>% 
group_by(idPerson, idAppt) %>%
mutate(date3 = unlist(map(d_date, (x){
dates <- date[decision == "a"] - x
w <- which.min(dates[dates > 0])
ifelse(is.null(w), NA, w)
})),
date2 = if_else(is.na(date3), min_date_person, do.call("c", map(d_date, ~ unique(.x[date3]))))) %>% 
ungroup() %>% 
select(1:4, date2)
# A tibble: 16 × 5
idPerson idAppt decision date       date2     
<chr>     <int> <chr>    <date>     <date>    
1 A             1 a        2021-09-10 2021-09-10
2 A             1 b        2021-09-11 2021-09-10
3 A             1 c        2021-09-12 2021-09-10
4 A             1 d        2021-09-13 2021-09-10
5 A             2 a        2021-09-20 2021-09-13
6 A             2 b        2021-09-21 2021-09-13
7 A             3 a        2021-09-10 2021-09-10
8 A             3 b        2021-09-11 2021-09-10
9 A             4 a        2021-09-21 2021-09-13
10 A             4 b        2021-09-22 2021-09-13
11 A             4 c        2021-09-23 2021-09-13
12 A             4 d        2021-09-24 2021-09-13
13 A             5 a        2021-09-11 2021-09-10
14 A             5 b        2021-09-12 2021-09-10
15 A             6 a        2021-10-10 2021-09-24
16 A             6 b        2021-10-11 2021-09-24

最新更新