This is the data frame I am working with:
df <- tribble(
~Patient, ~date, ~Doctor,
"A", "2020-01-01", "A",
"A", "2020-03-01", "A",
"A", "2020-04-30", "B",
"A", "2020-06-29", "C",
"A", "2020-08-28", "A",
"B", "2020-01-01", "A",
"B", "2020-03-01","B",
"B", "2020-04-30","B",
"B", "2020-06-29","B",
"B", "2020-08-28","C",
"C", "2020-04-30","A",
"C", "2020-06-29","A",
"C", "2020-08-28","B",
"C", "2020-10-27","C",
"C", "2020-12-26","A",
)
As you can see, there are three columns: Patient, date, and Doctor.
This is the data frame I want:
desired_df <- tribble(
~Patient, ~Number_of_Diff_Doctors_within_180_days,
"A", "3",
"B", "2",
"C", "3",
)
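The logic is as follows: I am trying to return a data frame with one row per patient, giving the number of different doctors that patient has seen within a 180-day window. The 180 days act as a moving window, and the task is to find the maximum number of different doctors who saw the patient during any 180-day window.
In the example, Patient A sees three different doctors (Doctors A, B, and C) between 2020-03-01 and 2020-06-29, a span of less than 180 days, so that patient gets a count of three. Patient B also has three doctors overall, but sees Doctor A on 2020-01-01 and Doctor C on 2020-08-28, so no 180-day window contains more than two doctors. Patient C has the same visit intervals as Patient A, just shifted forward in time.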
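As a quick sanity check of the spans mentioned above, the day counts can be computed directly in base R. This is only a sketch; the variable names `span_a` and `span_b` are made up for illustration.

```r
# Patient A: first and last visit of the run covering Doctors A, B and C
span_a <- as.numeric(as.Date("2020-06-29") - as.Date("2020-03-01"))

# Patient B: visit to Doctor A versus visit to Doctor C
span_b <- as.numeric(as.Date("2020-08-28") - as.Date("2020-01-01"))

span_a  # 120 days, inside a 180-day window
span_b  # 240 days, too far apart to share a window
```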
Here is my attempt so far. It does nothing about the date logic, because I don't know how to approach that part.
attempt <- df %>%
dplyr::select(Patient, Doctor) %>%
dplyr::group_by(Patient, Doctor) %>%
distinct() %>%
dplyr::group_by(Patient) %>%
tally() %>%
filter(n > 1)
Use the runner package for rolling-window calculations like this. It is wonderful.
library(tidyverse)
library(lubridate)
library(runner)
df <- tribble(
~Patient, ~date, ~Doctor,
"A", "2020-01-01", "A",
"A", "2020-03-01", "A",
"A", "2020-04-30", "B",
"A", "2020-06-29", "C",
"A", "2020-08-28", "A",
"B", "2020-01-01", "A",
"B", "2020-03-01","B",
"B", "2020-04-30","B",
"B", "2020-06-29","B",
"B", "2020-08-28","C",
"C", "2020-04-30","A",
"C", "2020-06-29","A",
"C", "2020-08-28","B",
"C", "2020-10-27","C",
"C", "2020-12-26","A",
) %>%
mutate(date = ymd(date))
df %>%
group_by(Patient) %>%
mutate(num_docs = runner(Doctor, n_distinct, k = 180, idx = date)) %>%
summarize(num_docs = max(num_docs))
# A tibble: 3 × 2
Patient num_docs
<chr> <int>
1 A 3
2 B 2
3 C 3
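To make explicit what the rolling window is doing, here is a minimal base R sketch of the same calculation without the runner package. It assumes that each window covers the 180 days ending on the current visit (current date included, the date 180 days earlier excluded), which is my understanding of how `k` and `idx` interact; check the runner documentation if the boundary matters. The helper name `max_docs_in_window` and the `visits` data frame are made up for illustration.

```r
# Same visits as in the question, rebuilt in base R
visits <- data.frame(
  Patient = rep(c("A", "B", "C"), each = 5),
  date = as.Date(c(
    "2020-01-01", "2020-03-01", "2020-04-30", "2020-06-29", "2020-08-28",
    "2020-01-01", "2020-03-01", "2020-04-30", "2020-06-29", "2020-08-28",
    "2020-04-30", "2020-06-29", "2020-08-28", "2020-10-27", "2020-12-26"
  )),
  Doctor = c("A", "A", "B", "C", "A",
             "A", "B", "B", "B", "C",
             "A", "A", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each visit, count the distinct doctors seen in the
# k days ending on that visit, then return the largest of those counts.
max_docs_in_window <- function(dates, doctors, k = 180) {
  counts <- vapply(seq_along(dates), function(i) {
    in_window <- dates > dates[i] - k & dates <= dates[i]
    length(unique(doctors[in_window]))
  }, integer(1))
  max(counts)
}

# Apply the helper to each patient separately
res <- sapply(split(visits, visits$Patient), function(p) {
  max_docs_in_window(p$date, p$Doctor)
})
res
```

On the sample data this gives 3, 2, and 3 for patients A, B, and C, matching the runner result.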
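Solution updated following the OP's edit.
First, let's build a tidy data frame that holds the cumulative number of days across each patient's visits: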
df2 <- df %>%
mutate(date = as.Date(date)) %>%
group_by(Patient) %>%
mutate(days_btwn = replace_na(day(days(date - lag(date))), 0),
cum_days = cumsum(days_btwn)) %>%
ungroup()
df2
Example output:
# A tibble: 15 × 5
Patient date Doctor days_btwn cum_days
<chr> <date> <chr> <dbl> <dbl>
1 A 2020-01-01 A 0 0
2 A 2020-03-01 A 60 60
3 A 2020-04-30 B 60 120
4 A 2020-06-29 C 60 180
5 A 2020-08-28 A 60 240
6 B 2020-01-01 A 0 0
#...
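Next, we can loop over each Patient (essentially a grouped operation) and iteratively sample rolling windows of visits. For every window whose total span is <= 180 days, count the unique Doctor values, keep the maximum, and combine the results for all patients into one data frame.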
unique(df2$Patient) %>%
map_dfr(function(pat) {
this_pat <- df2 %>% filter(Patient == pat)
n_obs <- nrow(this_pat)
max_docs <- n_distinct(this_pat$Doctor)
n_docs <- 0
max_win_docs <- 0
for (i in 1:n_obs) {
for (j in 1:n_obs) {
win_days <- abs(this_pat$cum_days[j] - this_pat$cum_days[i])
if (win_days <= 180) {
n_docs <- n_distinct(this_pat %>% slice(i:j) %>% select(Doctor))
if (n_docs > max_win_docs) max_win_docs <- n_docs
if (max_win_docs == max_docs) break  # already at this patient's total doctor count; stop scanning j
}
}
}
list(patient = pat, n_diff_docs_within_180 = max_win_docs)
}
)
Output:
# A tibble: 3 × 2
patient n_diff_docs_within_180
<chr> <int>
1 A 3
2 B 2
3 C 3
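One detail worth noting: the loop above keeps any pair of visits whose gap is `<= 180` days, so two visits exactly 180 days apart count as being in the same window. Whether that is what you want depends on how you define "within 180 days". The sketch below uses a made-up two-visit patient to show the difference between the inclusive and the strict comparison; `max_docs` is a hypothetical helper, not part of any package.

```r
# Hypothetical patient: two visits exactly 180 days apart, different doctors
dates   <- as.Date(c("2020-01-01", "2020-06-29"))
doctors <- c("A", "B")

gap <- as.numeric(diff(dates))  # days between the two visits

# Hypothetical helper: largest distinct-doctor count over all visit ranges
# i..j whose first-to-last gap satisfies the supplied comparison
max_docs <- function(dates, doctors, within) {
  best <- 0L
  for (i in seq_along(dates)) {
    for (j in i:length(dates)) {
      if (within(as.numeric(dates[j] - dates[i]))) {
        best <- max(best, length(unique(doctors[i:j])))
      }
    }
  }
  best
}

inclusive <- max_docs(dates, doctors, function(d) d <= 180)  # both visits share a window
exclusive <- max_docs(dates, doctors, function(d) d < 180)   # each visit stands alone
```

For the sample data in the question the two definitions agree, because the three-doctor runs for patients A and C span only 120 days.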
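It is a bit ambiguous what you mean by "within 180 days". Within 180 days of what date?
The code below computes, for each patient, the number of different doctors seen within the 180 days following each visit.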
library(data.table)
setDT(df)[, date:=as.Date(date)]
df[, date.hi:=date+180]
result <- df[df, on=.(Patient, date>=date, date<=date.hi)]
result[, .(count=uniqueN(Doctor)), by=.(Patient, date)]
##     Patient       date count
## 1: A 2020-01-01 3
## 2: A 2020-03-01 3
## 3: A 2020-04-30 3
## 4: A 2020-06-29 2
## 5: A 2020-08-28 1
## 6: B 2020-01-01 2
## 7: B 2020-03-01 2
## 8: B 2020-04-30 2
## 9: B 2020-06-29 2
## 10: B 2020-08-28 1
## 11: C 2020-04-30 3
## 12: C 2020-06-29 3
## 13: C 2020-08-28 3
## 14: C 2020-10-27 2
## 15: C 2020-12-26 1
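So Patient A saw 3 doctors within 180 days of 2020-01-01 (row 1), but only 2 doctors within 180 days of 2020-06-29 (row 4). Obviously, if the data set ends less than 180 days after a given date, we cannot know how many visits would have occurred in the rest of that period.
The expected result in your question seems to be based on each patient's first visit. We can extract that as follows: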
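One way to make that caveat visible is to flag the visits whose 180-day look-ahead runs past the end of the data. This is a minimal base R sketch using Patient C's visits, and it assumes that observation ends on the last date in the sample data (2020-12-26), which the question does not actually state. The names `end_of_data` and `complete_window` are made up for illustration.

```r
# Patient C's visit dates from the sample data
visit_dates <- as.Date(c("2020-04-30", "2020-06-29", "2020-08-28",
                         "2020-10-27", "2020-12-26"))

# Assumption: observation stops at the last date present in the sample data
end_of_data <- as.Date("2020-12-26")

# TRUE when the full 180-day look-ahead fits inside the observed period
complete_window <- visit_dates + 180 <= end_of_data

data.frame(date = visit_dates, complete_window)
```

Counts for rows where `complete_window` is FALSE are lower bounds rather than final values.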
result[, .(count=uniqueN(Doctor)), by=.(Patient, date)][, .SD[1], by=.(Patient)]
## Patient date count
## 1: A 2020-01-01 3
## 2: B 2020-01-01 2
## 3: C 2020-04-30 3
Edit: based on the OP's comment. The maximum count per patient is given by:
result[, .(count=uniqueN(Doctor)), by=.(Patient, date)][
, .(maxCount=max(count)), by=.(Patient)]
## Patient maxCount
## 1: A 3
## 2: B 2
## 3: C 3