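I have a dataset that looks like the one below (but with thousands of rows). For each patient I have an ID and a treatment start and end date. I need to calculate the monthly retention rate.
I define the retention rate as: (total patients at the end of the month - patients who started treatment during the month) / (total patients at the start of the month).
How can I do this in R, for example with dplyr?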
df <- data.frame(patient_ID= c("AA", "BB", "CC", "DD", "EE", "FF"),
treatment_start_date = as.Date(c("2004-01-01", "2007-01-01", "2012-04-01", "2014-04-01",
"2019-04-01", "2020-04-01")),
treatment_end_date = as.Date(c("2014-12-31", "2017-03-31", "2018-03-31", "2019-03-31",
"2020-03-31", "2021-04-30")))
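If I understand correctly, you want to summarise the mean treatment duration per month. If not, please be more specific about the desired output. See the code below.
library(dplyr)      # %>%, group_by(), summarise()
library(lubridate)  # floor_date()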
set.seed(2017)
options(digits=4)
df <- data.frame(patient_ID= c("AA", "BB", "CC", "DD", "EE", "FF"),
treatment_start_date = as.Date(c("2004-01-01", "2007-01-01", "2012-04-01", "2014-04-01",
"2019-04-01", "2020-04-01")),
treatment_end_date = as.Date(c("2014-12-31", "2017-03-31", "2018-03-31", "2019-03-31",
"2020-03-31", "2021-04-30")))
# both columns are already Date objects, so they can be subtracted directly
df$days <- df$treatment_end_date - df$treatment_start_date
df_per_month <- df %>% group_by(month=floor_date(treatment_start_date, "month")) %>%
summarise(mean_month=mean(days))
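Is this what you had in mind? It computes the mean treatment duration in days, grouped by the month in which treatment started.
So, I think I understand what you want:
- the number of patients in treatment at the end of each month
- the number of patients who started treatment in each month
- the number of patients in treatment at the start of each month*
*Does this include patients who start treatment on the first day of the month (as everyone in the sample data does)? I assume so in this example.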
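So, load your sample data and make sure the dates are in proper date format.
library(dplyr)      # pipes, mutate(), summarise(), left_join()
library(tidyr)      # unnest()
library(lubridate)  # ymd(), day()
df <- data.frame(patient_ID= c("AA", "BB", "CC", "DD", "EE", "FF"),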
treatment_start_date = as.Date(c("2004-01-01", "2007-01-01", "2012-04-01", "2014-04-01",
"2019-04-01", "2020-04-01")),
treatment_end_date = as.Date(c("2014-12-31", "2017-03-31", "2018-03-31", "2019-03-31",
"2020-03-31", "2021-04-30")))
# make sure the dates are date format
df %>%
as_tibble %>%
mutate(across(treatment_start_date:treatment_end_date, ~ymd(.))) %>%
{. ->> df_1}
df_1
# # A tibble: 6 x 3
# patient_ID treatment_start_date treatment_end_date
# <chr> <date> <date>
# AA 2004-01-01 2014-12-31
# BB 2007-01-01 2017-03-31
# CC 2012-04-01 2018-03-31
# DD 2014-04-01 2019-03-31
# EE 2019-04-01 2020-03-31
# FF 2020-04-01 2021-04-30
Then we expand each patient's treatment into a sequence of every date they were in treatment (including the start and end dates).
# make a sequence of every date a patient was in the treatment
df_1 %>%
rowwise %>%
mutate(
treatment_days = list(seq(treatment_start_date, treatment_end_date, by = 'day'))
) %>%
select(patient_ID, treatment_days) %>%
unnest(cols = c('treatment_days')) %>%
{. ->> df_2}
df_2
# # A tibble: 12,539 x 2
# patient_ID treatment_days
# <chr> <date>
# AA 2004-01-01
# AA 2004-01-02
# AA 2004-01-03
# AA 2004-01-04
# AA 2004-01-05
# AA 2004-01-06
# AA 2004-01-07
# AA 2004-01-08
# AA 2004-01-09
# AA 2004-01-10
# # ... with 12,529 more rows
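As a small illustration of the expansion step, the sketch below checks that `seq()` with `by = "day"` includes both endpoints, which is why the start and end dates both count as treatment days here. The three-day interval is made up for the example and is not part of the sample data.

```r
# hypothetical three-day treatment interval, for illustration only
start <- as.Date("2020-04-01")
end   <- as.Date("2020-04-03")

# seq() on Date objects includes both the first and the last date
treatment_days <- seq(start, end, by = "day")
length(treatment_days)

# if the end date should not count as a treatment day, stop one day earlier
treatment_days_excl_end <- seq(start, end - 1, by = "day")
length(treatment_days_excl_end)
```

Whether the end date should count is a definitional choice, and it changes the month-end counts for any patient whose treatment ends on the last day of a month.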
Then we work out how many patients were in treatment on each day, and keep only the first and last day of each month.
df_2 %>%
# work out how many patients were in treatment for each day
group_by(treatment_days) %>%
summarise(
n_patients = n_distinct(patient_ID)
) %>%
# make month column
mutate(
month = format(treatment_days, format = '%Y-%m')
) %>%
# keep only the first and last days of each month
group_by(month) %>%
filter(
day(treatment_days) == 1 | day(treatment_days) == max(day(treatment_days))
) %>%
# determine number of patients at the start and end of each month
# ensure the dates are in order
arrange(month, treatment_days) %>%
group_by(month) %>%
summarise(
n_patient_start = nth(n_patients, 1),
n_patient_end = nth(n_patients, 2),
) %>%
{. ->> df_3}
df_3
# # A tibble: 208 x 3
# month n_patient_start n_patient_end
# <chr> <int> <int>
# 2004-01 1 1
# 2004-02 1 1
# 2004-03 1 1
# 2004-04 1 1
# 2004-05 1 1
# 2004-06 1 1
# 2004-07 1 1
# 2004-08 1 1
# 2004-09 1 1
# 2004-10 1 1
# # ... with 198 more rows
So now we have the total number of patients in treatment at the start and end of each month.
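Since the real data has thousands of rows, expanding every patient into one row per day may get large. The sketch below is one possible alternative, not part of the approach above: it builds a grid of month start and end dates and counts, for each date, the patients whose treatment interval covers it. It uses base R only; `count_active` and `monthly_counts` are names made up for this example, and both endpoints of the interval are assumed to count as treatment days.

```r
df <- data.frame(
  patient_ID = c("AA", "BB", "CC", "DD", "EE", "FF"),
  treatment_start_date = as.Date(c("2004-01-01", "2007-01-01", "2012-04-01",
                                   "2014-04-01", "2019-04-01", "2020-04-01")),
  treatment_end_date = as.Date(c("2014-12-31", "2017-03-31", "2018-03-31",
                                 "2019-03-31", "2020-03-31", "2021-04-30"))
)

# first day of every calendar month covered by the data
first_month <- as.Date(format(min(df$treatment_start_date), "%Y-%m-01"))
last_month  <- as.Date(format(max(df$treatment_end_date), "%Y-%m-01"))
month_start <- seq(first_month, last_month, by = "month")

# last day of each month: first day of the following month minus one day
next_month_start <- seq(first_month, by = "month",
                        length.out = length(month_start) + 1)[-1]
month_end <- next_month_start - 1

# number of patients whose treatment interval (endpoints included) covers each date
count_active <- function(dates) {
  vapply(seq_along(dates), function(i) {
    sum(df$treatment_start_date <= dates[i] & df$treatment_end_date >= dates[i])
  }, integer(1))
}

monthly_counts <- data.frame(
  month = format(month_start, "%Y-%m"),
  n_patient_start = count_active(month_start),
  n_patient_end = count_active(month_end)
)

head(monthly_counts)
```

Unlike the day-level version, this grid also contains months in which nobody was in treatment (with counts of 0), so a later division by `n_patient_start` would need to handle zeros.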
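Before calculating the retention rate, we also need to know how many patients started treatment in each month, since that number goes into the formula.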
# how many patients started each month?
df_1 %>%
select(patient_ID, treatment_start_date) %>%
mutate(
month = format(treatment_start_date, format = '%Y-%m')
) %>%
group_by(month) %>%
summarise(
n_starting_patients = n_distinct(patient_ID)
) %>%
{. ->> n_new_per_month}
n_new_per_month
# # A tibble: 6 x 2
# month n_starting_patients
# <chr> <int>
# 2004-01 1
# 2007-01 1
# 2012-04 1
# 2014-04 1
# 2019-04 1
# 2020-04 1
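We join the number of patients starting each month onto the number of active patients at the start and end of each month. Then we can calculate the retention rate using the formula from your question.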
# now, we join in new patients per month
df_3 %>%
left_join(n_new_per_month, by = "month") %>%
mutate(
n_starting_patients = ifelse(is.na(n_starting_patients), 0, n_starting_patients)
) %>%
# calculate retention rate
mutate(
ret_rate = (n_patient_end - n_starting_patients) / n_patient_start
)
# # A tibble: 208 x 5
# month n_patient_start n_patient_end n_starting_patients ret_rate
# <chr> <int> <int> <dbl> <dbl>
# 2004-01 1 1 1 0
# 2004-02 1 1 0 1
# 2004-03 1 1 0 1
# 2004-04 1 1 0 1
# 2004-05 1 1 0 1
# 2004-06 1 1 0 1
# 2004-07 1 1 0 1
# 2004-08 1 1 0 1
# 2004-09 1 1 0 1
# 2004-10 1 1 0 1
# # ... with 198 more rows
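Now, I'm not entirely sure this is right, because, for example, in the preview above January 2004 has a retention rate of 0, even though no patients were lost that month (in fact, one started treatment). This is because patient AA started on 1 January, so the retention rate is calculated as (number of patients at the end of the month - number of patients that started in that month) / number of patients at the start of the month, or (1 - 1) / 1 = 0 / 1 = 0.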
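Things in the current formula that affect the retention rate:
- whether the start and end dates are counted among the days a patient was in treatment
- if a patient starts on the first day of the month, are they included in or excluded from the "patients at the start of the month"? I could understand wanting to subtract the patients who started partway through the month, but with the current formula this doesn't make much sense to me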
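To make the second point concrete, here is a minimal sketch of how the choice changes the January 2004 result. The function `retention_rate`, its argument `n_new_day1`, and the flag `count_day1_starters_as_new` are hypothetical names introduced for this illustration; the second variant is only one possible reading of the definition, not something stated in the question.

```r
# n_start:    patients in treatment on the first day of the month
# n_end:      patients in treatment on the last day of the month
# n_new:      patients who started treatment during the month
# n_new_day1: of those, patients who started on the first day (hypothetical input)
retention_rate <- function(n_start, n_end, n_new, n_new_day1 = 0,
                           count_day1_starters_as_new = TRUE) {
  # either subtract every new patient (the formula from the question), or only
  # those who started after the first day, since day-1 starters are already
  # counted in n_start
  new_to_subtract <- if (count_day1_starters_as_new) n_new else n_new - n_new_day1
  (n_end - new_to_subtract) / n_start
}

# January 2004: patient AA starts on the 1st and is still in treatment on the 31st
rate_as_defined <- retention_rate(1, 1, 1, n_new_day1 = 1)
rate_mid_month_only <- retention_rate(1, 1, 1, n_new_day1 = 1,
                                      count_day1_starters_as_new = FALSE)
rate_as_defined      # 0
rate_mid_month_only  # 1
```

With the formula as written, AA is counted at the start of the month and then subtracted again as a new patient, which gives 0. Subtracting only patients who started after the first day gives 1 for the same month.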