I am trying to calculate the date difference between the second row and the last row for each group id. The data looks like this:
data<- data.frame(pid= c(1, 1, 1,1, 2, 2, 2, 3, 3, 3,3 ,3), day = c("25/07/2018", "19/10/2018", "17/01/2019", "19/03/2019", "10/09/2018","29/11/2018", "26/03/2019", "17/06/2016", "25/04/2018", "17/07/2018","05/04/2019", "09/02/2021"), catt=c(1,1,2,1,1,1,2,2,2,1,1,2))
data

   pid        day
1    1 25/07/2018
2    1 19/10/2018
3    1 17/01/2019
4    1 19/03/2019
5    2 10/09/2018
6    2 29/11/2018
7    2 26/03/2019
8    3 17/06/2016
9    3 25/04/2018
10   3 17/07/2018
11   3 05/04/2019
12   3 09/02/2021
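Convert day to a date object and, for each pid, compute the difference between the last and the second date. Dividing the number of days by 30 expresses the gap in approximate months.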
library(dplyr)
library(lubridate)
data %>%
mutate(day = dmy(day)) %>%
arrange(pid, day) %>%
group_by(pid) %>%
summarise(difference = (last(day) - day[2])/30)
# pid difference
# <dbl> <dbl>
#1 1 5.03
#2 2 3.9
#3 3 34.0
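As a cross-check, the same numbers can be reproduced in base R without dplyr or lubridate. This is only a sketch of an alternative, not the approach used above; it assumes the dates are always written as day/month/year, and diff_days is a name introduced here for illustration.

```r
# Same data as in the question (only the columns needed here).
data <- data.frame(
  pid = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  day = c("25/07/2018", "19/10/2018", "17/01/2019", "19/03/2019",
          "10/09/2018", "29/11/2018", "26/03/2019",
          "17/06/2016", "25/04/2018", "17/07/2018", "05/04/2019", "09/02/2021")
)

# Parse the day/month/year strings explicitly.
data$day <- as.Date(data$day, format = "%d/%m/%Y")

# Sort by pid and date, then take last minus second date per pid, in days.
data <- data[order(data$pid, data$day), ]
diff_days <- sapply(split(data$day, data$pid), function(d) {
  as.numeric(d[length(d)] - d[2], units = "days")
})

diff_days        # 151, 117 and 1021 days for pid 1, 2 and 3
diff_days / 30   # the same gaps in 30-day units
```

Asking as.numeric() for units = "days" makes the unit explicit instead of relying on whatever unit the difftime object happens to carry.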
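If you want to keep the number of rows in the data frame, use mutate and fill in difference only on the last row of each pid, leaving NA everywhere else.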
data %>%
mutate(day = dmy(day)) %>%
arrange(pid, day) %>%
group_by(pid) %>%
mutate(difference = ifelse(row_number() == n(), (last(day) - day[2])/30, NA))
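One thing to keep in mind is that day[2] is NA for a pid with a single row, so the difference for such a group comes out as NA. The sketch below shows that behaviour in isolation with base R; second_to_last_gap is a hypothetical helper written for this illustration and is not part of dplyr or lubridate.

```r
# Hypothetical helper: gap in days between the second and the last date,
# or NA when the group has fewer than two dates.
second_to_last_gap <- function(d) {
  d <- sort(d)
  if (length(d) < 2) return(NA_real_)
  as.numeric(d[length(d)] - d[2], units = "days")
}

one_date  <- as.Date("25/07/2018", format = "%d/%m/%Y")
two_dates <- as.Date(c("25/07/2018", "19/10/2018"), format = "%d/%m/%Y")
four_dates <- as.Date(c("25/07/2018", "19/10/2018", "17/01/2019", "19/03/2019"),
                      format = "%d/%m/%Y")

second_to_last_gap(one_date)    # NA: there is no second date
second_to_last_gap(two_dates)   # 0: the second date is also the last one
second_to_last_gap(four_dates)  # 151
```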
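Note that the difftime output in the question is incorrect, because the character strings are passed in without being parsed as day/month/year dates first.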
#Wrong output
difftime("19/10/2018","19/03/2019 ", units = "days")
#Time difference of 214 days
#Correct output
difftime(dmy("19/03/2019"), dmy("19/10/2018"), units = "days")
#Time difference of 151 days
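The reason for the wrong result can be seen by parsing one of the strings on its own. As far as I know, when no format is given base R tries "%Y-%m-%d" and then "%Y/%m/%d", so the leading "19" is read as the year. A small sketch using only base R, where as.Date() with an explicit format stands in for dmy(); the variable names are made up for the example:

```r
first  <- "19/10/2018"
second <- "19/03/2019"

# Default parsing treats the string as year/month/day,
# so the year comes out as 19 instead of 2018.
misparsed_year <- as.POSIXlt(as.Date(first))$year + 1900
misparsed_year

# Explicit day/month/year format, a base R stand-in for lubridate::dmy().
gap <- difftime(as.Date(second, format = "%d/%m/%Y"),
                as.Date(first,  format = "%d/%m/%Y"),
                units = "days")
gap  # Time difference of 151 days
```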