R (dplyr): find all rows in a row-specific range *WITH RESTRICTION*



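I have a dataset in which each row is identified by a hospitalization id. Each row contains the hospitalization id, the admission and discharge dates, and identifiers for the hospital where the hospitalization took place and the physician responsible for it.

For each hospitalization, I would like to know the ids of all other hospitalizations that were completed by other physicians at the same hospital during the 30 days before the given hospitalization started.

Below is a simple example with 8 hospitalizations performed by 2 physicians at 2 hospitals (a physician may work at more than one hospital).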

library("tidyverse")

df <- data.frame(
  hospitalization_id = c(1, 2, 3,
                         4, 5,
                         6, 7, 8),
  hospital_id = c("A", "A", "A",
                  "A", "A",
                  "B", "B", "B"),
  physician_id = c(1, 1, 1,
                   2, 2,
                   2, 2, 2),
  date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                         "2000-01-12", "2000-01-20",
                         "2000-02-10", "2000-02-11", "2000-02-12")),
  date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                       "2000-01-18", "2000-01-22",
                       "2000-02-11", "2000-02-14", "2000-02-17"))
)

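I adapted the solution posted in "R (dplyr): find all rows in row-specific range". First, I find all other hospitalizations at the given hospital that ended during the 30 days before the given hospitalization started; then I drop those performed by the same physician.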

# Look-back window for each hospitalization
df_with_date_range <- df %>%
  mutate(date_range1 = date_start - 31,
         date_range2 = date_start - 1)

# For each row, collect the ids of hospitalizations at the same hospital
# whose end date falls inside the window
df_semifinal <- df_with_date_range %>%
  rowwise() %>%
  mutate(hospital_id_in_range = pmap(list(date_range1, date_range2, hospital_id),
                                     function(x, y, z) ungroup(filter(rowwise(df_with_date_range),
                                                                      between(date_end, x, y),
                                                                      hospital_id == z))$hospitalization_id)) %>%
  unnest(hospital_id_in_range, keep_empty = TRUE)

# Look up the physician of each match and blank out same-physician matches
df_final <- df_semifinal %>%
  left_join(select(df, hospitalization_id, physician_id),
            by = c('hospital_id_in_range' = 'hospitalization_id')) %>%
  mutate(hospital_id_in_range = ifelse(physician_id.x == physician_id.y, NA, hospital_id_in_range)) %>%
  select(-physician_id.y) %>%
  rename(physician_id = physician_id.x) %>%
  distinct()

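Because my dataset is large, I am trying to write more efficient code. Ideally, I would like to avoid first adding all hospitalizations and then dropping the ones performed by the given physician.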

How about adding another filter criterion for the physician to my previous solution:

df_with_date_range %>%
  mutate(hospital_id_in_range = pmap(list(date_range1, date_range2, hospital_id, physician_id),
                                     function(x, y, z, p) filter(df_with_date_range,
                                                                 # date_end, as in the question: completed inside the window
                                                                 date_end >= x & date_end <= y,
                                                                 hospital_id == z,
                                                                 physician_id != p)$hospitalization_id)) %>%
  unnest(hospital_id_in_range, keep_empty = TRUE)
# # A tibble: 9 × 8
#   hospitalization_id hospital_id physician_id date_start date_end   date_range1 date_range2 hospital_id_in_range
#                <dbl> <chr>              <dbl> <date>     <date>     <date>      <date>                     <dbl>
# 1                  1 A                      1 2000-01-01 2000-01-03 1999-12-01  1999-12-31                    NA
# 2                  2 A                      1 2000-01-12 2000-01-18 1999-12-12  2000-01-11                    NA
# 3                  3 A                      1 2000-01-20 2000-01-22 1999-12-20  2000-01-19                     4
# 4                  4 A                      2 2000-01-12 2000-01-18 1999-12-12  2000-01-11                     1
# 5                  5 A                      2 2000-01-20 2000-01-22 1999-12-20  2000-01-19                     1
# 6                  5 A                      2 2000-01-20 2000-01-22 1999-12-20  2000-01-19                     2
# 7                  6 B                      2 2000-02-10 2000-02-11 2000-01-10  2000-02-09                    NA
# 8                  7 B                      2 2000-02-11 2000-02-14 2000-01-11  2000-02-10                    NA
# 9                  8 B                      2 2000-02-12 2000-02-17 2000-01-12  2000-02-11                    NA

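In addition, I replaced between() with plain vectorized comparisons, which removes the need for the second rowwise(). Every extra rowwise() tends to slow things down, so this version should be somewhat faster.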

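The output may differ slightly from yours, because I filter out same-physician hospitalizations during the lookup instead of replacing them with NA afterwards. If you would rather have the same-physician matches replaced with NA, let me know and I can code that as a mutate on the filtered data frame.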
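
For large data, the per-row filter can probably be replaced by a single non-equi join. The following is only a sketch, assuming dplyr >= 1.1.0 (for join_by()) and matching on date_end as described in the question. The names windows, candidates, matches, result, other_physician_id and other_date_end are made up for illustration. Since join_by() does not accept !=, the physician restriction is applied as a vectorized filter right after the join, so matches are still created and then dropped, but without any row-by-row work.

```r
library(dplyr)

df <- data.frame(
  hospitalization_id = c(1, 2, 3, 4, 5, 6, 7, 8),
  hospital_id = c("A", "A", "A", "A", "A", "B", "B", "B"),
  physician_id = c(1, 1, 1, 2, 2, 2, 2, 2),
  date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                         "2000-01-12", "2000-01-20",
                         "2000-02-10", "2000-02-11", "2000-02-12")),
  date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                       "2000-01-18", "2000-01-22",
                       "2000-02-11", "2000-02-14", "2000-02-17"))
)

# Left side: every hospitalization with its look-back window
windows <- df %>%
  mutate(date_range1 = date_start - 31,
         date_range2 = date_start - 1)

# Right side: candidate hospitalizations, renamed to avoid name clashes
candidates <- df %>%
  transmute(hospital_id,
            other_physician_id = physician_id,
            other_date_end = date_end,
            hospital_id_in_range = hospitalization_id)

# Same hospital and end date inside the window, then drop same-physician pairs
matches <- windows %>%
  inner_join(candidates,
             by = join_by(hospital_id,
                          date_range1 <= other_date_end,
                          date_range2 >= other_date_end)) %>%
  filter(physician_id != other_physician_id) %>%
  select(hospitalization_id, hospital_id_in_range)

# Join back so hospitalizations without any match are kept with NA
result <- windows %>%
  left_join(matches, by = "hospitalization_id")

print(result)
```

On the example data this should give the same nine rows as the pmap() version. Whether it is actually faster on your data is something to benchmark, since the intermediate join can get large when a hospital has many hospitalizations per window.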
