R - Find a set of rows within a specific range of rows, with restrictions at different levels



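I have a dataset in which each row is identified by a hospitalization id and a physician id. Each row also contains the admission and discharge dates and the hospital where the hospitalization took place. A hospitalization may involve several physicians, and a physician may work in more than one hospital.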

I have another dataset with each physician's specialties (e.g. clinician, cardiologist). A physician may have more than one specialty.

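For each hospitalization-physician row, I want the ids of all other hospitalizations performed in the same hospital, by other physicians of the same specialty, in the 30 days before that hospitalization started.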

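Partly using the solution posted in "R (dplyr): find all rows in row range *WITH RESTRICTION*", I managed to write code that finds all hospitalizations performed by other physicians in the same hospital in the 30 days before a given hospitalization started. For each row, I first find the list of all hospitalizations in the given hospital within the 30-day window. Then I find the list that contains only the hospitalizations in which the ego physician took part. Finally, I keep the elements of the first list that are not in the second.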
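
The last of these three steps is a plain set difference. A minimal sketch with made-up ids (the values are illustrative only and are not taken from the data):

```r
# Illustrative ids only: hospitalizations found in the window for one row.
hospid_all <- c(1, 2, 4)   # everything in the hospital within the window
hospid_ego <- c(1, 2)      # the subset that involves the ego physician

# Keep the elements of the first vector that are absent from the second.
hospid_peer <- hospid_all[!(hospid_all %in% hospid_ego)]

# Base R's setdiff() returns the same ids (and also drops duplicates).
hospid_peer_alt <- setdiff(hospid_all, hospid_ego)
```

Either form leaves only the hospitalizations in which the ego physician was not involved.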

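I would like to adapt the code to find hospitalizations by other physicians who share at least one specialty with the ego physician. Ideally, I would change the first step of the code above so that it finds the list of all hospitalizations within the ego physician's specialties. I could then reuse the rest of the code to subtract the hospitalizations involving the ego physician from that list. The main difficulty is that a physician may have several specialties; otherwise it would only take one more variable in the filter call.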

Below is my current code; it does not take the ego physician's specialty into account.

library(dplyr)
library(purrr)
library(tidyr)
library(furrr)  # provides future_map2()

df <- data.frame(hospitalization_id = c(1, 2, 3,
                                        1, 2, 3,
                                        4, 5,
                                        6, 7, 8),
                 hospital_id = c("A", "A", "A",
                                 "A", "A", "A",
                                 "A", "A",
                                 "B", "B", "B"),
                 physician_id = c(1, 1, 1,
                                  2, 2, 2,
                                  3, 3,
                                  2, 2, 2),
                 date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-12", "2000-01-20",
                                        "2000-02-10", "2000-02-11", "2000-02-12")),
                 date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-18", "2000-01-22",
                                      "2000-02-11", "2000-02-14", "2000-02-17")))

df2 <- df %>%
  mutate(
    # Generates 30-day time interval before start of given hospitalization
    date_range1 = date_start - 30,
    date_range2 = date_start - 1,
    # List of all hospitalizations in given hospital, in time interval
    hospid_all = pmap(list(date_range1, date_range2, hospital_id),
                      function(x, y, z) filter(df,
                                               date_end >= x & date_end <= y,
                                               hospital_id == z)$hospitalization_id),
    hospid_all = lapply(hospid_all, unique),
    # List of ego's hospitalizations in given hospital, in time interval
    hospid_ego = pmap(list(date_range1, date_range2, hospital_id, physician_id),
                      function(x, y, z, p) filter(df,
                                                  date_end >= x & date_end <= y,
                                                  hospital_id == z,
                                                  physician_id == p)$hospitalization_id),
    # List of peers' hospitalizations in given hospital, in time interval
    hospid_peer = future_map2(hospid_all, hospid_ego, ~ .x[!(.x %in% .y)])) %>%
  select(-starts_with('date_'), -hospid_all, -hospid_ego) %>% # only keep peers' list of hospitalization
  rename('ego' = 'physician_id')

# One row per (ego row, peer hospitalization)
df3 <- df2 %>%
  select(hospitalization_id, hospital_id, ego, hospid_peer) %>%
  unnest(hospid_peer, keep_empty = TRUE)

# Attach the physician(s) of each peer hospitalization as "alter"
df4 <- df3 %>%
  left_join(select(df, hospitalization_id, physician_id),
            by = c('hospid_peer' = 'hospitalization_id')) %>%
  rename(alter = physician_id)

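Each physician's specialties are given in another data frame. In this example, physician 2 shares a specialty with physician 1 and with physician 3, but physicians 1 and 3 have no specialty in common.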

physician_spec <- data.frame(physician_id = c(1, 2, 2, 3),
specialty_code = c(100, 100, 200, 200))
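
The sharing relationships in this example can be checked with a self-join on specialty_code. This is only a base R sketch; the names pairs and peer_pairs and the _ego/_peer suffixes are introduced here for illustration:

```r
physician_spec <- data.frame(physician_id = c(1, 2, 2, 3),
                             specialty_code = c(100, 100, 200, 200))

# Self-join on specialty: one row per (physician, peer, shared specialty).
pairs <- merge(physician_spec, physician_spec, by = "specialty_code",
               suffixes = c("_ego", "_peer"))

# Drop self-matches, then collapse pairs that share more than one specialty.
pairs <- pairs[pairs$physician_id_ego != pairs$physician_id_peer, ]
peer_pairs <- unique(pairs[, c("physician_id_ego", "physician_id_peer")])
peer_pairs <- peer_pairs[order(peer_pairs$physician_id_ego,
                               peer_pairs$physician_id_peer), ]
```

With the data above, peer_pairs holds the pairs (1, 2), (2, 1), (2, 3) and (3, 2), and no pair links physicians 1 and 3.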

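You can create two helper functions, other_mds and f. The first takes a physician id and returns the ids of the physicians with a matching specialty. The second takes a hospital id, a physician id and a start date (i.e. one particular row of df) and returns the list of hospitalizations that ended within the previous 30 days, in the same hospital, and were performed by physicians with a matching specialty.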

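# Both helpers use data.table syntax, so df and physician_spec must be
# converted with setDT() before the helpers are called.

# Ids of the other physicians who share at least one specialty with pid
other_mds <- function(pid) {
  physician_spec[
    physician_id != pid & specialty_code %in% physician_spec[physician_id == pid, specialty_code],
    physician_id]
}

# Hospitalizations in hospital hid that ended less than 30 days before s,
# were performed by a physician sharing a specialty with pid,
# and did not involve pid; returned as a comma-separated string
f <- function(hid, pid, s) {
  other_phys = other_mds(pid)
  exclude_hosps = df[physician_id == pid, unique(hospitalization_id)]
  df[hospital_id == hid &
       physician_id %in% other_phys &
       s > date_end &
       (s - date_end) < 30 &
       !hospitalization_id %in% exclude_hosps,
     paste0(hospitalization_id, collapse = ",")]
}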

Now we simply apply the function f to each row:

library(data.table)
setDT(df)
setDT(physician_spec)
df[, matches:=f(hospital_id, physician_id,date_start), 1:nrow(df)]

Output:

    hospitalization_id hospital_id physician_id date_start   date_end matches
                 <num>      <char>        <num>     <Date>     <Date>  <char>
 1:                  1           A            1 2000-01-01 2000-01-03
 2:                  2           A            1 2000-01-12 2000-01-18
 3:                  3           A            1 2000-01-20 2000-01-22
 4:                  1           A            2 2000-01-01 2000-01-03
 5:                  2           A            2 2000-01-12 2000-01-18
 6:                  3           A            2 2000-01-20 2000-01-22       4
 7:                  4           A            3 2000-01-12 2000-01-18       1
 8:                  5           A            3 2000-01-20 2000-01-22     1,2
 9:                  6           B            2 2000-02-10 2000-02-11
10:                  7           B            2 2000-02-11 2000-02-14
11:                  8           B            2 2000-02-12 2000-02-17
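
The date condition inside f can be tried on its own. The sketch below uses made-up discharge dates (not rows from df) and compares the condition in f with the date_start - 30 to date_start - 1 bounds used in the question; the names in_window_f and in_window_q are introduced here for illustration:

```r
s <- as.Date("2000-01-20")  # start date of the row being examined
date_end <- as.Date(c("2000-01-18", "2000-01-22", "1999-12-21", "1999-12-22"))

# Condition used in f: ended strictly before s, less than 30 days earlier.
in_window_f <- s > date_end & (s - date_end) < 30

# Bounds used in the question: date_end between s - 30 and s - 1, inclusive.
in_window_q <- date_end >= s - 30 & date_end <= s - 1
```

The two conditions differ only for a stay that ended exactly 30 days before s (1999-12-21 here): f leaves it out, while the question's bounds would keep it. If the inclusive window is the intended one, the comparison in f would presumably need to be <= 30.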

Update: return a vector of matches, then merge:

  1. Change f so that it returns a vector:

f <- function(hid, pid, s) {
  other_phys = other_mds(pid)
  exclude_hosps = df[physician_id == pid, unique(hospitalization_id)]
  df[hospital_id == hid &
       physician_id %in% other_phys &
       s > date_end &
       (s - date_end) < 30 &
       !hospitalization_id %in% exclude_hosps]$hospitalization_id
}

  2. Now, when we run the function, we do it by hospitalization_id and physician_id, so that it returns a three-column data.table (the two by columns plus a new column named match). That result is then merged onto the original df:

df[, .(match = f(hospital_id, physician_id, date_start)), .(hospitalization_id, physician_id)][
  df,
  on = .(hospitalization_id, physician_id)
]

Output:

    hospitalization_id physician_id match hospital_id date_start   date_end
                 <num>        <num> <num>      <char>     <Date>     <Date>
 1:                  1            1    NA           A 2000-01-01 2000-01-03
 2:                  2            1    NA           A 2000-01-12 2000-01-18
 3:                  3            1    NA           A 2000-01-20 2000-01-22
 4:                  1            2    NA           A 2000-01-01 2000-01-03
 5:                  2            2    NA           A 2000-01-12 2000-01-18
 6:                  3            2     4           A 2000-01-20 2000-01-22
 7:                  4            3     1           A 2000-01-12 2000-01-18
 8:                  5            3     1           A 2000-01-20 2000-01-22
 9:                  5            3     2           A 2000-01-20 2000-01-22
10:                  6            2    NA           B 2000-02-10 2000-02-11
11:                  7            2    NA           B 2000-02-11 2000-02-14
12:                  8            2    NA           B 2000-02-12 2000-02-17
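
As a cross-check that does not depend on data.table, the same logic can be sketched in base R. The helper names peers_of and peer_hosps are made up for this sketch, and the example data are repeated so that the block runs on its own:

```r
df <- data.frame(
  hospitalization_id = c(1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8),
  hospital_id = c(rep("A", 8), rep("B", 3)),
  physician_id = c(1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2),
  date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                         "2000-01-01", "2000-01-12", "2000-01-20",
                         "2000-01-12", "2000-01-20",
                         "2000-02-10", "2000-02-11", "2000-02-12")),
  date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                       "2000-01-03", "2000-01-18", "2000-01-22",
                       "2000-01-18", "2000-01-22",
                       "2000-02-11", "2000-02-14", "2000-02-17")))
physician_spec <- data.frame(physician_id = c(1, 2, 2, 3),
                             specialty_code = c(100, 100, 200, 200))

# Other physicians sharing at least one specialty with pid (hypothetical helper).
peers_of <- function(pid) {
  own <- physician_spec$specialty_code[physician_spec$physician_id == pid]
  shares <- physician_spec$specialty_code %in% own &
    physician_spec$physician_id != pid
  unique(physician_spec$physician_id[shares])
}

# Same conditions as f, written with plain logical indexing (hypothetical helper).
peer_hosps <- function(hid, pid, s) {
  own_hosps <- unique(df$hospitalization_id[df$physician_id == pid])
  keep <- df$hospital_id == hid &
    df$physician_id %in% peers_of(pid) &
    s > df$date_end & (s - df$date_end) < 30 &
    !df$hospitalization_id %in% own_hosps
  unique(df$hospitalization_id[keep])
}

# Loop over row indices so that date_start keeps its Date class.
matches <- lapply(seq_len(nrow(df)), function(i) {
  peer_hosps(df$hospital_id[i], df$physician_id[i], df$date_start[i])
})
matches_chr <- vapply(matches, paste, character(1), collapse = ",")
```

For the example data, matches_chr is empty everywhere except rows 6, 7 and 8, where it is "4", "1" and "1,2", in line with the data.table output.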

This is what I have so far; please experiment with other examples and other datasets to see whether it does what you want:

library(dplyr)
library(purrr)
library(tidyr)

df <- data.frame(hospitalization_id = c(1, 2, 3,
                                        1, 2, 3,
                                        4, 5,
                                        6, 7, 8),
                 hospital_id = c("A", "A", "A",
                                 "A", "A", "A",
                                 "A", "A",
                                 "B", "B", "B"),
                 physician_id = c(1, 1, 1,
                                  2, 2, 2,
                                  3, 3,
                                  2, 2, 2),
                 date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-12", "2000-01-20",
                                        "2000-02-10", "2000-02-11", "2000-02-12")),
                 date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-18", "2000-01-22",
                                      "2000-02-11", "2000-02-14", "2000-02-17")))

# One row per physician, with all of that physician's specialties in a list column
physician_spec <- data.frame(physician_id = c(1, 2, 2, 3),
                             specialty_code = c(100, 100, 200, 200)) %>%
  group_by(physician_id) %>%
  summarise(specialties = list(specialty_code))

df_with_date_range <- df %>%
  mutate(date_range1 = date_start - 31,
         date_range2 = date_start - 1) %>%
  as_tibble() %>%
  left_join(physician_spec, by = "physician_id")

# Below uses specialty
df_with_date_range %>%
  mutate(hospital_id_in_range = pmap(list(date_range1, date_range2, hospital_id, physician_id, specialties),
                                     function(x, y, z, p, s) filter(df_with_date_range,
                                                                    date_start >= x & date_start <= y,
                                                                    hospital_id == z,
                                                                    physician_id != p,
                                                                    any(specialties %in% s))$hospitalization_id)) %>%
  unnest(hospital_id_in_range, keep_empty = TRUE)

# Below does not use specialty
df_with_date_range %>%
  mutate(hospital_id_in_range = pmap(list(date_range1, date_range2, hospital_id, physician_id),
                                     function(x, y, z, p) filter(df_with_date_range,
                                                                 date_start >= x & date_start <= y,
                                                                 hospital_id == z,
                                                                 physician_id != p)$hospitalization_id)) %>%
  unnest(hospital_id_in_range, keep_empty = TRUE)

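For this dataset, the version that uses specialty and the version that does not appear to give the same result, so you will need to look into it and check whether I made a mistake or whether that is really the case. I basically just joined the data frames and then added one more filter condition to pmap.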
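
One possible reason the two versions agree: any() reduces its whole argument to a single TRUE or FALSE, and filter() recycles a length-one condition to every row, so the specialty condition keeps or drops all rows at once instead of testing each physician. The base R sketch below illustrates the difference on a small list of specialties; the vector name shares is introduced here for illustration:

```r
# One list element per physician 1, 2, 3, as in the specialties list column.
specialties <- list(100, c(100, 200), 200)
s <- 100  # specialties of the ego physician (physician 1)

# any() over the whole column collapses to a single logical value.
collapsed <- any(specialties %in% s)

# A per-physician check: does each element share a specialty with s?
shares <- vapply(specialties, function(x) any(x %in% s), logical(1))
```

Here shares is TRUE for physicians 1 and 2 and FALSE for physician 3, whereas collapsed is a single value. Inside the filter call, an element-wise version of the condition might look like purrr::map_lgl(specialties, ~ any(.x %in% s)); that variant has not been run against the data above.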
