R: filter one variable by the count of another variable, but only count within a day interval



This is the data frame I'm working with:

df <- tribble(
~Patient, ~date, ~Doctor,
"A", "2020-01-01", "A",
"A", "2020-03-01", "A",
"A", "2020-04-30", "B",
"A", "2020-06-29", "C",
"A", "2020-08-28", "A",
"B", "2020-01-01", "A",
"B", "2020-03-01","B",
"B", "2020-04-30","B",
"B", "2020-06-29","B",
"B", "2020-08-28","C",
"C", "2020-04-30","A",
"C", "2020-06-29","A",
"C", "2020-08-28","B",
"C", "2020-10-27","C",
"C", "2020-12-26","A",
)

As you can see, there are three columns: Patient, date, and Doctor.

This is the data frame I want:

desired_df <- tribble(
~Patient, ~Number_of_Diff_Doctors_within_180_days, 
"A", "3", 
"B", "2", 
"C", "3", 
)

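The logic is as follows: I'm trying to return a data frame with one row per patient, together with the number of different doctors that patient saw within a 180-day window. The 180 days act as a moving window, and the task is to find the maximum number of distinct doctors who saw the patient within any 180-day window.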

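In the example, Patient A saw three different doctors (Doctors A, B, and C) between 2020-03-01 and 2020-06-29, which is less than a 180-day window, so that patient is coded as having three doctors. Patient B also has three doctors overall, but he saw Doctor A on 2020-01-01 and Doctor C on 2020-08-28, so any 180-day window contains only two doctors. Patient C has the same intervals between visits as Patient A, just shifted later in time.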
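To make the spans in that reasoning concrete, here is a small base R check of the day counts for the dates quoted above. The variable names `gap_a` and `gap_b` are only illustrative.

```r
# Span covering Patient A's visits to Doctors A, B and C
gap_a <- as.numeric(as.Date("2020-06-29") - as.Date("2020-03-01"))

# Span between Patient B's visit to Doctor A and the visit to Doctor C
gap_b <- as.numeric(as.Date("2020-08-28") - as.Date("2020-01-01"))

gap_a  # 120 days, fits inside a 180-day window
gap_b  # 240 days, too long for a single 180-day window
```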

This is my attempt so far. It doesn't do anything with the date logic, because I don't know how to approach that part.

attempt <- df %>%
dplyr::select(Patient, Doctor) %>%
dplyr::group_by(Patient, Doctor) %>%
distinct() %>%
dplyr::group_by(Patient) %>%
tally() %>%
filter(n > 1)

Use the runner package for rolling-window calculations like this. It's wonderful.

library(tidyverse)
library(lubridate)
library(runner)

df <- tribble(
~Patient, ~date, ~Doctor,
"A", "2020-01-01", "A",
"A", "2020-03-01", "A",
"A", "2020-04-30", "B",
"A", "2020-06-29", "C",
"A", "2020-08-28", "A",
"B", "2020-01-01", "A",
"B", "2020-03-01","B",
"B", "2020-04-30","B",
"B", "2020-06-29","B",
"B", "2020-08-28","C",
"C", "2020-04-30","A",
"C", "2020-06-29","A",
"C", "2020-08-28","B",
"C", "2020-10-27","C",
"C", "2020-12-26","A",
) %>% 
mutate(date = ymd(date))
df %>% 
group_by(Patient) %>% 
mutate(num_docs = runner(Doctor, n_distinct, k = 180, idx = date)) %>% 
summarize(num_docs = max(num_docs))
# A tibble: 3 × 2
Patient num_docs
<chr>      <int>
1 A              3
2 B              2
3 C              3
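For readers who want to see what such a rolling count does without the package, here is a minimal base R sketch. It assumes the window is the 180 days ending on (and including) each visit date, which is how I understand `k = 180` with `idx = date` to behave; the helper name `count_distinct_trailing` is hypothetical and not part of runner.

```r
# Count distinct doctors in the k days ending on each visit (inclusive).
# Assumption: the window for visit i covers dates in (date[i] - k, date[i]].
count_distinct_trailing <- function(doctor, date, k = 180) {
  vapply(seq_along(date), function(i) {
    in_window <- date > date[i] - k & date <= date[i]
    length(unique(doctor[in_window]))
  }, integer(1))
}

# Patient B's visits from the question
b_dates <- as.Date(c("2020-01-01", "2020-03-01", "2020-04-30",
                     "2020-06-29", "2020-08-28"))
b_docs  <- c("A", "B", "B", "B", "C")

b_counts <- count_distinct_trailing(b_docs, b_dates)
b_counts       # per-visit counts
max(b_counts)  # largest count for this patient
```

Taking the maximum of the per-visit counts mirrors the `summarize(num_docs = max(num_docs))` step.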

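Updated solution based on the OP's edit.

First, let's build a tidy data frame that contains the cumulative number of days across each patient's visits: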

df2 <- df %>% 
mutate(date = as.Date(date)) %>% 
group_by(Patient) %>% 
# days since the previous visit (0 for each patient's first visit)
mutate(days_btwn = replace_na(as.numeric(date - lag(date), units = "days"), 0),
cum_days = cumsum(days_btwn)) %>% 
ungroup

Sample df2 output:

# A tibble: 15 × 5
Patient date       Doctor days_btwn cum_days
<chr>   <date>     <chr>      <dbl>    <dbl>
1 A       2020-01-01 A              0        0
2 A       2020-03-01 A             60       60
3 A       2020-04-30 B             60      120
4 A       2020-06-29 C             60      180
5 A       2020-08-28 A             60      240
6 B       2020-01-01 A              0        0
#...

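Next, we can loop over each Patient (essentially a grouped operation) and iteratively sample rolling windows of visits. For every window whose total span is <= 180 days, count the unique Doctor values, keep the maximum per patient, and combine the results for all patients into one data frame.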


unique(df2$Patient) %>% 
map_dfr(function(pat) {
this_pat <- df2 %>% filter(Patient == pat)
n_obs <- nrow(this_pat)
max_docs <- n_distinct(this_pat$Doctor)
n_docs <- 0
max_win_docs <- 0
for (i in 1:n_obs) {
for (j in 1:n_obs) {
win_days <- abs(this_pat$cum_days[j] - this_pat$cum_days[i])
if (win_days <= 180) {
n_docs <- n_distinct(this_pat %>% slice(i:j) %>% select(Doctor))
if (n_docs > max_win_docs) max_win_docs <- n_docs
if (max_win_docs == max_docs) break  # already at this patient's maximum, stop the inner loop
}
}
}
list(patient = pat, n_diff_docs_within_180 = max_win_docs)
}
)

Output:
# A tibble: 3 × 2
patient n_diff_docs_within_180
<chr>                    <int>
1 A                            3
2 B                            2
3 C                            3
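The same windowing idea can be sketched in base R, working directly on the dates instead of cumulative day counts. This is only an illustrative variant under two assumptions: each patient's visits are already sorted by date, and a window counts when its first and last visit are at most 180 days apart. The names `visits` and `max_docs_within` are hypothetical.

```r
# The question's data as a plain data frame
visits <- data.frame(
  Patient = rep(c("A", "B", "C"), each = 5),
  date = as.Date(c(
    "2020-01-01", "2020-03-01", "2020-04-30", "2020-06-29", "2020-08-28",
    "2020-01-01", "2020-03-01", "2020-04-30", "2020-06-29", "2020-08-28",
    "2020-04-30", "2020-06-29", "2020-08-28", "2020-10-27", "2020-12-26")),
  Doctor = c("A", "A", "B", "C", "A",
             "A", "B", "B", "B", "C",
             "A", "A", "B", "C", "A")
)

# Largest number of distinct doctors in any run of consecutive visits
# spanning at most max_span days (assumes dates are sorted ascending).
max_docs_within <- function(doctor, date, max_span = 180) {
  best <- 0L
  n <- length(date)
  for (i in seq_len(n)) {
    for (j in i:n) {
      if (as.numeric(date[j] - date[i]) <= max_span) {
        best <- max(best, length(unique(doctor[i:j])))
      }
    }
  }
  best
}

# One value per patient
res <- sapply(split(visits, visits$Patient),
              function(p) max_docs_within(p$Doctor, p$date))
res
```

Starting the inner loop at `i` avoids checking each pair of visits twice.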

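It's a bit ambiguous what you mean by "within 180 days". Within 180 days of which date?

The following computes, for each patient, the number of different doctors seen within the 180 days after each visit.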

library(data.table)
setDT(df)[, date:=as.Date(date)]
df[, date.hi:=date+180]
result <- df[df, on=.(Patient, date>=date, date<=date.hi)]
result[, .(count=uniqueN(Doctor)), by=.(Patient, date)]
##     Patient       date count
##  1:       A 2020-01-01     3
##  2:       A 2020-03-01     3
##  3:       A 2020-04-30     3
##  4:       A 2020-06-29     2
##  5:       A 2020-08-28     1
##  6:       B 2020-01-01     2
##  7:       B 2020-03-01     2
##  8:       B 2020-04-30     2
##  9:       B 2020-06-29     2
## 10:       B 2020-08-28     1
## 11:       C 2020-04-30     3
## 12:       C 2020-06-29     3
## 13:       C 2020-08-28     3
## 14:       C 2020-10-27     2
## 15:       C 2020-12-26     1

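So Patient A saw 3 doctors within 180 days of 2020-01-01 (row 1), but only 2 within 180 days of 2020-06-29 (row 4). Obviously, if the dataset ends less than 180 days after a given date, we can't really know how many visits would have occurred in that period.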
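One way to make that caveat visible is to flag visits whose 180-day follow-up window runs past the end of the data. The sketch below uses Patient A's dates and treats the last recorded date as the end of observation, which is an assumption; `data_end` and `follow_up_complete` are illustrative names.

```r
# Patient A's visit dates from the question
visit_dates <- as.Date(c("2020-01-01", "2020-03-01", "2020-04-30",
                         "2020-06-29", "2020-08-28"))

# Assumption: observation ends at the last recorded visit
data_end <- max(visit_dates)

# TRUE when the full 180 days after the visit lie inside the data
follow_up_complete <- visit_dates + 180 <= data_end

data.frame(date = visit_dates, follow_up_complete)
```

Counts for visits flagged `FALSE` should be read as lower bounds rather than final values.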

The expected result in your question seems to be based on each patient's first visit. We can extract that as follows:

result[, .(count=uniqueN(Doctor)), by=.(Patient, date)][, .SD[1], by=.(Patient)]
##    Patient       date count
## 1:       A 2020-01-01     3
## 2:       B 2020-01-01     2
## 3:       C 2020-04-30     3

Edit: based on the OP's comment, the maximum count per patient is given by:
result[, .(count=uniqueN(Doctor)), by=.(Patient, date)][
, .(maxCount=max(count)),   by=.(Patient)]
##    Patient maxCount
## 1:       A        3
## 2:       B        2
## 3:       C        3
