r - Calculating the monthly retention rate using start and end dates



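I have a dataset that looks like the one below (but with thousands of rows). It has a patient ID plus a treatment start date and end date. I need to calculate the monthly retention rate.

I define the retention rate as: (total patients at the end of the month - total patients who started treatment during the month) / (total patients at the start of the month).

How can I do this in R, for example with dplyr?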

df <- data.frame(patient_ID= c("AA", "BB", "CC", "DD", "EE", "FF"),
treatment_start_date = as.Date(c("2004-01-01", "2007-01-01", "2012-04-01", "2014-04-01",
"2019-04-01", "2020-04-01")),
treatment_end_date = as.Date(c("2014-12-31", "2017-03-31", "2018-03-31", "2019-03-31", 
"2020-03-31", "2021-04-30")))

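If I understand correctly, you want to summarise the average treatment duration per month. If not, please be more specific about the desired output. See the code below.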

library(dplyr)      # %>%, group_by(), summarise()
library(lubridate)  # floor_date()
set.seed(2017)
options(digits=4)

df <- data.frame(patient_ID= c("AA", "BB", "CC", "DD", "EE", "FF"),
                 treatment_start_date = as.Date(c("2004-01-01", "2007-01-01", "2012-04-01", "2014-04-01",
                                                  "2019-04-01", "2020-04-01")),
                 treatment_end_date = as.Date(c("2014-12-31", "2017-03-31", "2018-03-31", "2019-03-31",
                                                "2020-03-31", "2021-04-30")))

# treatment duration in days (both columns are already Date objects)
df$days <- df$treatment_end_date - df$treatment_start_date
# mean duration, grouped by the month in which treatment started
df_per_month <- df %>%
  group_by(month = floor_date(treatment_start_date, "month")) %>%
  summarise(mean_month = mean(days))

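Is this what you had in mind? It computes the mean treatment duration in days, grouped by the month in which treatment started.

So, I think I understand what you want:

  • The number of patients in treatment at the end of each month
  • The number of patients who started treatment in each month
  • The number of patients in treatment at the start of each month*

*Does this include patients who start treatment on the first day of the month (as everyone in the sample data does)? I have assumed that it does in this example.

So, load your sample data and make sure the dates are in the proper Date format: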

library(dplyr)      # %>%, mutate(), across(), as_tibble(), summarise()
library(tidyr)      # unnest()
library(lubridate)  # ymd(), day()

df <- data.frame(patient_ID= c("AA", "BB", "CC", "DD", "EE", "FF"),
                 treatment_start_date = as.Date(c("2004-01-01", "2007-01-01", "2012-04-01", "2014-04-01",
                                                  "2019-04-01", "2020-04-01")),
                 treatment_end_date = as.Date(c("2014-12-31", "2017-03-31", "2018-03-31", "2019-03-31",
                                                "2020-03-31", "2021-04-30")))
# make sure the dates are date format
df %>%
  as_tibble %>%
  mutate(across(treatment_start_date:treatment_end_date, ~ymd(.))) %>%
  {. ->> df_1}
df_1
# # A tibble: 6 x 3
# patient_ID treatment_start_date treatment_end_date
# <chr>      <date>               <date>            
# AA         2004-01-01           2014-12-31        
# BB         2007-01-01           2017-03-31        
# CC         2012-04-01           2018-03-31        
# DD         2014-04-01           2019-03-31        
# EE         2019-04-01           2020-03-31        
# FF         2020-04-01           2021-04-30

Then we build a sequence of every date on which each patient was in treatment (start and end dates included).

# make a sequence of every date a patient was in the treatment
df_1 %>%
  rowwise %>%
  mutate(
    treatment_days = list(seq(treatment_start_date, treatment_end_date, by = 'day'))
  ) %>%
  select(patient_ID, treatment_days) %>%
  unnest(cols = c('treatment_days')) %>%
  {. ->> df_2}
df_2
# # A tibble: 12,539 x 2
# patient_ID treatment_days
# <chr>      <date>        
# AA         2004-01-01    
# AA         2004-01-02    
# AA         2004-01-03    
# AA         2004-01-04    
# AA         2004-01-05    
# AA         2004-01-06    
# AA         2004-01-07    
# AA         2004-01-08    
# AA         2004-01-09    
# AA         2004-01-10    
# # ... with 12,529 more rows

Then we work out how many patients were in treatment on each day, and keep only the first and last day of each month.

df_2 %>%

  # work out how many patients were in treatment for each day
  group_by(treatment_days) %>%
  summarise(
    n_patients = n_distinct(patient_ID)
  ) %>%

  # make month column
  mutate(
    month = format(treatment_days, format = '%Y-%m')
  ) %>%

  # keep only the first and last days of each month
  group_by(month) %>%
  filter(
    day(treatment_days) == 1 | day(treatment_days) == max(day(treatment_days))
  ) %>%

  # determine number of patients at the start and end of each month
  #    ensure the dates are in order
  arrange(month, treatment_days) %>%
  group_by(month) %>%
  summarise(
    n_patient_start = nth(n_patients, 1),
    n_patient_end = nth(n_patients, 2)
  ) %>%

  {. ->> df_3}
df_3
# # A tibble: 208 x 3
# month   n_patient_start n_patient_end
# <chr>             <int>         <int>
# 2004-01               1             1
# 2004-02               1             1
# 2004-03               1             1
# 2004-04               1             1
# 2004-05               1             1
# 2004-06               1             1
# 2004-07               1             1
# 2004-08               1             1
# 2004-09               1             1
# 2004-10               1             1
# # ... with 198 more rows

So now we have the total number of patients in treatment at the start and at the end of each month.
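Since the question mentions thousands of rows, expanding every patient to one row per day may get large. As a rough sketch only (not the method used in this answer), the same start-of-month and end-of-month counts can be taken straight from the date intervals in base R. The names `month_start`, `month_end`, `count_active` and `monthly` are made up for illustration, and the sketch assumes both the start and end dates count as days in treatment, as above.

```r
# Same sample data as in the question
df <- data.frame(
  patient_ID = c("AA", "BB", "CC", "DD", "EE", "FF"),
  treatment_start_date = as.Date(c("2004-01-01", "2007-01-01", "2012-04-01",
                                   "2014-04-01", "2019-04-01", "2020-04-01")),
  treatment_end_date = as.Date(c("2014-12-31", "2017-03-31", "2018-03-31",
                                 "2019-03-31", "2020-03-31", "2021-04-30"))
)

# First day of every month between the earliest start and the latest end
first_month <- as.Date(format(min(df$treatment_start_date), "%Y-%m-01"))
month_start <- seq(first_month, max(df$treatment_end_date), by = "month")

# Last day of each month = first day of the following month minus one day
month_end <- seq(first_month, by = "month",
                 length.out = length(month_start) + 1)[-1] - 1

# Number of patients whose treatment interval covers a given day
# (assumption: start and end dates are both inclusive)
count_active <- function(day) {
  sum(df$treatment_start_date <= day & df$treatment_end_date >= day)
}

monthly <- data.frame(
  month = format(month_start, "%Y-%m"),
  n_patient_start = vapply(seq_along(month_start),
                           function(i) count_active(month_start[i]), integer(1)),
  n_patient_end = vapply(seq_along(month_end),
                         function(i) count_active(month_end[i]), integer(1))
)
head(monthly)
```

On the sample data this should give the same counts as df_3. Unlike df_3, a month in which nobody is in treatment would show up here with counts of 0 rather than being missing.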

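Before calculating the retention rate, we also need to know how many patients started treatment in each month, since that count goes into the formula.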

# how many patients started each month?
df_1 %>%
  select(patient_ID, treatment_start_date) %>%
  mutate(
    month = format(treatment_start_date, format = '%Y-%m')
  ) %>%
  group_by(month) %>%
  summarise(
    n_starting_patients = n_distinct(patient_ID)
  ) %>%
  {. ->> n_new_per_month}
n_new_per_month
# # A tibble: 6 x 2
# month   n_starting_patients
# <chr>                 <int>
# 2004-01                   1
# 2007-01                   1
# 2012-04                   1
# 2014-04                   1
# 2019-04                   1
# 2020-04                   1

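We join the number of patients starting in each month onto the number of active patients at the start and end of each month. Then we can calculate the retention rate using the formula from your question.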

# now, we join in new patients per month
df_3 %>%
  left_join(n_new_per_month, by = 'month') %>%
  # months in which nobody started have no match, so set them to 0
  mutate(
    n_starting_patients = ifelse(is.na(n_starting_patients), 0, n_starting_patients)
  ) %>%

  # calculate retention rate
  mutate(
    ret_rate = (n_patient_end - n_starting_patients) / n_patient_start
  )
# # A tibble: 208 x 5
# month   n_patient_start n_patient_end n_starting_patients ret_rate
# <chr>             <int>         <int>               <dbl>    <dbl>
# 2004-01               1             1                   1        0
# 2004-02               1             1                   0        1
# 2004-03               1             1                   0        1
# 2004-04               1             1                   0        1
# 2004-05               1             1                   0        1
# 2004-06               1             1                   0        1
# 2004-07               1             1                   0        1
# 2004-08               1             1                   0        1
# 2004-09               1             1                   0        1
# 2004-10               1             1                   0        1
# # ... with 198 more rows

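Now, I'm not quite sure this is right. For example, in the preview above the retention rate for January 2004 is 0, even though no patients were lost that month (in fact, one started treatment). This happens because patient AA started on January 1st, so the retention rate is calculated as (number of patients at the end of the month - number of patients that started in that month) / number of patients at the start of the month, or (1 - 1) / 1 = 0 / 1 = 0.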
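To make that arithmetic easy to check, here is the formula from the question written as a tiny function. `retention_rate` is just an illustrative name, not something from dplyr or the original code.

```r
# Hypothetical helper: the retention-rate formula from the question
retention_rate <- function(n_end, n_started, n_start) {
  (n_end - n_started) / n_start
}

# January 2004: AA starts on the 1st, so AA is counted in all three terms
retention_rate(n_end = 1, n_started = 1, n_start = 1)  # 0

# February 2004: AA is still in treatment and nobody new starts
retention_rate(n_end = 1, n_started = 0, n_start = 1)  # 1
```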

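Things that affect the retention rate under the current formula:

  • Whether the start and end dates are included in the dates a patient counts as being in treatment
  • If a patient starts on the first day of the month, are they included in or excluded from the "patients at the start of the month"? I could understand it if you wanted to subtract the number of patients who started partway through the month, but with the current formula this doesn't make much sense to me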
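As a small illustration of the second point, here is a sketch for a single month (January 2007, when BB starts on the 1st while AA is already in treatment). It compares the two possible readings of "patients at the start of the month". The variable names are made up for this example, and which reading is right depends on your definition.

```r
# Same sample data as in the question
df <- data.frame(
  patient_ID = c("AA", "BB", "CC", "DD", "EE", "FF"),
  treatment_start_date = as.Date(c("2004-01-01", "2007-01-01", "2012-04-01",
                                   "2014-04-01", "2019-04-01", "2020-04-01")),
  treatment_end_date = as.Date(c("2014-12-31", "2017-03-31", "2018-03-31",
                                 "2019-03-31", "2020-03-31", "2021-04-30"))
)

month_first <- as.Date("2007-01-01")
month_last  <- as.Date("2007-01-31")

# Patients who started during the month (BB)
started <- sum(df$treatment_start_date >= month_first &
               df$treatment_start_date <= month_last)

# Patients in treatment on the last day of the month (AA and BB)
at_end <- sum(df$treatment_start_date <= month_last &
              df$treatment_end_date >= month_last)

# Reading A: a patient starting on the 1st counts as present at the start
at_start_incl <- sum(df$treatment_start_date <= month_first &
                     df$treatment_end_date >= month_first)

# Reading B: only patients who started before the 1st count
at_start_excl <- sum(df$treatment_start_date < month_first &
                     df$treatment_end_date >= month_first)

rate_incl <- (at_end - started) / at_start_incl  # (2 - 1) / 2 = 0.5
rate_excl <- (at_end - started) / at_start_excl  # (2 - 1) / 1 = 1
```

Under reading B, a month such as January 2004 would have nobody at the start of the month, so the formula divides by zero and would need separate handling.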
