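I have a dataset in which each row is identified by a hospitalization id and physician id pair. Each row also contains the admission and discharge dates and the hospital where the hospitalization took place. A hospitalization may involve several physicians, and a physician may work at more than one hospital.
I have another dataset with each physician's specialties (e.g. clinician, cardiologist). A physician may have more than one specialty.
For each hospitalization-physician row, I want the ids of all other hospitalizations performed at the same hospital, by other physicians of the same specialty, in the 30 days before that hospitalization started.
Partly using the solution posted in "R (dplyr): find all rows in row-specific range *WITH RESTRICTION*", I managed to write code that finds all hospitalizations performed by other physicians at the same hospital in the 30 days before a given hospitalization started. For each row, I first build the list of all hospitalizations at that hospital within the 30-day window. Then I build a list containing only the hospitalizations in which the ego physician took part. Finally, I keep the elements of the first list that are not in the second.
I would like to adapt the code to find hospitalizations by other physicians who share at least one specialty with the ego physician. Ideally, I would change the first step above so that it lists all hospitalizations within the ego physician's specialties. I could then reuse the rest of the code to subtract the hospitalizations involving the ego physician from that list. The main difficulty is that a physician may have several specialties; otherwise it would be enough to add one more variable to the filter call.
Below is my current code; it does not take the ego physician's specialty into account.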
library(dplyr)
library(purrr)
library(tidyr)
library(furrr)  # provides future_map2()

df <- data.frame(hospitalization_id = c(1, 2, 3,
                                        1, 2, 3,
                                        4, 5,
                                        6, 7, 8),
                 hospital_id = c("A", "A", "A",
                                 "A", "A", "A",
                                 "A", "A",
                                 "B", "B", "B"),
                 physician_id = c(1, 1, 1,
                                  2, 2, 2,
                                  3, 3,
                                  2, 2, 2),
                 date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-12", "2000-01-20",
                                        "2000-02-10", "2000-02-11", "2000-02-12")),
                 date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-18", "2000-01-22",
                                      "2000-02-11", "2000-02-14", "2000-02-17")))

df2 <- df %>%
  mutate(
    # Generates 30-day time interval before start of given hospitalization
    date_range1 = date_start - 30,
    date_range2 = date_start - 1,
    # List of all hospitalizations in given hospital, in time interval
    hospid_all = pmap(list(date_range1, date_range2, hospital_id),
                      function(x, y, z) filter(df,
                                               date_end >= x & date_end <= y,
                                               hospital_id == z)$hospitalization_id),
    hospid_all = lapply(hospid_all, unique),
    # List of ego's hospitalizations in given hospital, in time interval
    hospid_ego = pmap(list(date_range1, date_range2, hospital_id, physician_id),
                      function(x, y, z, p) filter(df,
                                                  date_end >= x & date_end <= y,
                                                  hospital_id == z,
                                                  physician_id == p)$hospitalization_id),
    # List of peers' hospitalizations in given hospital, in time interval
    hospid_peer = future_map2(hospid_all, hospid_ego, ~ .x[!(.x %in% .y)])) %>%
  select(-starts_with('date_'), -hospid_all, -hospid_ego) %>% # only keep peers' list of hospitalization
  rename('ego' = 'physician_id')

df3 <- df2 %>%
  select(hospitalization_id, hospital_id, ego, hospid_peer) %>%
  unnest(hospid_peer, keep_empty = TRUE)

df4 <- df3 %>%
  left_join(select(df, hospitalization_id, physician_id),
            by = c('hospid_peer' = 'hospitalization_id')) %>%
  rename(alter = physician_id)
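Each physician's specialties are given in a separate data frame. In this example, physician 2 shares a specialty with physician 1 and with physician 3, but physicians 1 and 3 have no specialty in common.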
physician_spec <- data.frame(physician_id = c(1, 2, 2, 3),
specialty_code = c(100, 100, 200, 200))
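The sharing pattern described above can be checked directly. The following is a minimal base R sketch; spec_by_md and shares_specialty are illustrative names that do not appear in the original code.

```r
# Same specialty table as in the example above
physician_spec <- data.frame(physician_id = c(1, 2, 2, 3),
                             specialty_code = c(100, 100, 200, 200))

# One vector of specialty codes per physician, named by physician id
spec_by_md <- split(physician_spec$specialty_code, physician_spec$physician_id)

# TRUE when physicians a and b have at least one specialty code in common
# (hypothetical helper, for illustration only)
shares_specialty <- function(a, b) {
  length(intersect(spec_by_md[[as.character(a)]],
                   spec_by_md[[as.character(b)]])) > 0
}

shares_specialty(1, 2)  # physician 2 overlaps with physician 1
shares_specialty(2, 3)  # and with physician 3
shares_specialty(1, 3)  # physicians 1 and 3 do not overlap
```

Because each physician's codes are held as a vector, the multi-specialty case needs no special handling: intersect() compares whole sets.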
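You can create two helper functions, other_mds and f. The first takes a physician id and returns the ids of the other physicians with a matching specialty. The second takes a hospital id, a physician id and a start date (i.e. a particular row of df) and returns the hospitalizations at the same hospital, by physicians with a matching specialty, that ended in the preceding 30 days. Both use data.table syntax, so df and physician_spec must be converted with setDT() before the functions are called. Note that the condition (s - date_end) < 30 is strict, so it covers gaps of 1 to 29 days; use <= 30 if a hospitalization that ended exactly 30 days earlier should count, as in the question's window.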
# Ids of other physicians sharing at least one specialty with physician pid
other_mds <- function(pid) {
  physician_spec[
    physician_id != pid & specialty_code %in% physician_spec[physician_id == pid, specialty_code],
    physician_id]
}

# Comma-separated ids of matching hospitalizations for one row of df
f <- function(hid, pid, s) {
  other_phys = other_mds(pid)
  # hospitalizations in which the ego physician took part
  exclude_hosps = df[physician_id == pid, unique(hospitalization_id)]
  df[hospital_id == hid &
       physician_id %in% other_phys &
       s > date_end &
       (s - date_end) < 30 &
       !hospitalization_id %in% exclude_hosps,
     paste0(hospitalization_id, collapse = ",")]
}
Now we simply apply the function f to each row:
library(data.table)
setDT(df)
setDT(physician_spec)
df[, matches:=f(hospital_id, physician_id,date_start), 1:nrow(df)]
Output:
    hospitalization_id hospital_id physician_id date_start   date_end matches
                 <num>      <char>        <num>     <Date>     <Date>  <char>
 1:                  1           A            1 2000-01-01 2000-01-03
 2:                  2           A            1 2000-01-12 2000-01-18
 3:                  3           A            1 2000-01-20 2000-01-22
 4:                  1           A            2 2000-01-01 2000-01-03
 5:                  2           A            2 2000-01-12 2000-01-18
 6:                  3           A            2 2000-01-20 2000-01-22       4
 7:                  4           A            3 2000-01-12 2000-01-18       1
 8:                  5           A            3 2000-01-20 2000-01-22     1,2
 9:                  6           B            2 2000-02-10 2000-02-11
10:                  7           B            2 2000-02-11 2000-02-14
11:                  8           B            2 2000-02-12 2000-02-17
Update: return a vector of matches, then merge:
- Change f so that it returns a vector:
f <- function(hid, pid, s) {
  other_phys = other_mds(pid)
  exclude_hosps = df[physician_id == pid, unique(hospitalization_id)]
  df[hospital_id == hid &
       physician_id %in% other_phys &
       s > date_end &
       (s - date_end) < 30 &
       !hospitalization_id %in% exclude_hosps]$hospitalization_id
}
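- Now, when we run the function, we do so by hospitalization_id and physician_id, so that it returns a three-column data.table (the two by columns plus a new column named match). We then join this back onto the original df: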
df[, .(match = f(hospital_id, physician_id,date_start)), .(hospitalization_id, physician_id)][
df,
on=.(hospitalization_id,physician_id)
]
Output:
    hospitalization_id physician_id match hospital_id date_start   date_end
                 <num>        <num> <num>      <char>     <Date>     <Date>
 1:                  1            1    NA           A 2000-01-01 2000-01-03
 2:                  2            1    NA           A 2000-01-12 2000-01-18
 3:                  3            1    NA           A 2000-01-20 2000-01-22
 4:                  1            2    NA           A 2000-01-01 2000-01-03
 5:                  2            2    NA           A 2000-01-12 2000-01-18
 6:                  3            2     4           A 2000-01-20 2000-01-22
 7:                  4            3     1           A 2000-01-12 2000-01-18
 8:                  5            3     1           A 2000-01-20 2000-01-22
 9:                  5            3     2           A 2000-01-20 2000-01-22
10:                  6            2    NA           B 2000-02-10 2000-02-11
11:                  7            2    NA           B 2000-02-11 2000-02-14
12:                  8            2    NA           B 2000-02-12 2000-02-17
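As a cross-check on the per-row approach, the same matches can be sketched with plain joins in base R, which avoids calling a function once per row. This is only a sketch under stated assumptions: the names spec_pairs, pairs, ego_rows, alter_rows, cand and result are illustrative, and the window is taken as 1 to 30 days between the other hospitalization's end and the ego hospitalization's start, following the question's date_start - 30 to date_start - 1 range.

```r
df <- data.frame(
  hospitalization_id = c(1, 2, 3, 1, 2, 3, 4, 5, 6, 7, 8),
  hospital_id = c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B"),
  physician_id = c(1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2),
  date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                         "2000-01-01", "2000-01-12", "2000-01-20",
                         "2000-01-12", "2000-01-20",
                         "2000-02-10", "2000-02-11", "2000-02-12")),
  date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                       "2000-01-03", "2000-01-18", "2000-01-22",
                       "2000-01-18", "2000-01-22",
                       "2000-02-11", "2000-02-14", "2000-02-17")))
physician_spec <- data.frame(physician_id = c(1, 2, 2, 3),
                             specialty_code = c(100, 100, 200, 200))

# Pairs of distinct physicians sharing at least one specialty
spec_pairs <- merge(physician_spec, physician_spec, by = "specialty_code",
                    suffixes = c("_ego", "_alter"))
pairs <- unique(data.frame(ego = spec_pairs$physician_id_ego,
                           alter = spec_pairs$physician_id_alter))
pairs <- pairs[pairs$ego != pairs$alter, ]

# The same rows seen once as the ego's hospitalization, once as a candidate
ego_rows <- data.frame(ego_hosp = df$hospitalization_id,
                       hospital_id = df$hospital_id,
                       ego = df$physician_id,
                       ego_start = df$date_start)
alter_rows <- data.frame(alter_hosp = df$hospitalization_id,
                         hospital_id = df$hospital_id,
                         alter = df$physician_id,
                         alter_end = df$date_end)

# Candidates: same hospital, physician sharing a specialty with the ego
cand <- merge(merge(ego_rows, pairs, by = "ego"), alter_rows,
              by = c("hospital_id", "alter"))

# Keep candidates that ended 1 to 30 days before the ego's start (assumed window)
gap <- as.numeric(cand$ego_start - cand$alter_end)
cand <- cand[gap >= 1 & gap <= 30, ]

# Drop hospitalizations in which the ego physician took part
own_keys <- paste(df$hospitalization_id, df$physician_id)
cand <- cand[!paste(cand$alter_hosp, cand$ego) %in% own_keys, ]

result <- unique(cand[order(cand$ego_hosp, cand$ego, cand$alter_hosp),
                      c("ego_hosp", "ego", "alter_hosp")])
result
```

For the example data this yields the same four non-missing pairs as the merged output above: hospitalization 3 of physician 2 matches 4, hospitalization 4 of physician 3 matches 1, and hospitalization 5 of physician 3 matches 1 and 2. Rows without a match are simply absent; a left join back onto df would restore them as NA.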
This is what I have so far; please experiment with other examples and other datasets to see whether it does what you want:
library(dplyr)
library(purrr)
library(tidyr)

df <- data.frame(hospitalization_id = c(1, 2, 3,
                                        1, 2, 3,
                                        4, 5,
                                        6, 7, 8),
                 hospital_id = c("A", "A", "A",
                                 "A", "A", "A",
                                 "A", "A",
                                 "B", "B", "B"),
                 physician_id = c(1, 1, 1,
                                  2, 2, 2,
                                  3, 3,
                                  2, 2, 2),
                 date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-12", "2000-01-20",
                                        "2000-02-10", "2000-02-11", "2000-02-12")),
                 date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-18", "2000-01-22",
                                      "2000-02-11", "2000-02-14", "2000-02-17")))

# One row per physician, with all of that physician's specialties in a list column
physician_spec <- data.frame(physician_id = c(1, 2, 2, 3),
                             specialty_code = c(100, 100, 200, 200)) %>%
  group_by(physician_id) %>%
  summarise(specialties = list(specialty_code))
# 30-day window before the start of each hospitalization, as in the question
df_with_date_range <- df %>%
  mutate(date_range1 = date_start - 30,
         date_range2 = date_start - 1) %>%
  as_tibble() %>%
  left_join(physician_spec, by = "physician_id")
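# Below uses specialty
# (the question's code compares date_end to the window; swap date_start for
# date_end in the filter if hospitalizations should be matched on discharge date)
df_with_date_range %>%
  mutate(hospital_id_in_range = pmap(
    list(date_range1, date_range2, hospital_id, physician_id, specialties),
    function(x, y, z, p, s) filter(df_with_date_range,
                                   date_start >= x & date_start <= y,
                                   hospital_id == z,
                                   physician_id != p,
                                   # row-wise test: does this row's physician share a specialty with the ego?
                                   map_lgl(specialties, function(sp) any(sp %in% s)))$hospitalization_id)) %>%
  unnest(hospital_id_in_range, keep_empty = TRUE)

# Below does not use specialty
df_with_date_range %>%
  mutate(hospital_id_in_range = pmap(
    list(date_range1, date_range2, hospital_id, physician_id),
    function(x, y, z, p) filter(df_with_date_range,
                                date_start >= x & date_start <= y,
                                hospital_id == z,
                                physician_id != p)$hospitalization_id)) %>%
  unnest(hospital_id_in_range, keep_empty = TRUE)
I basically just joined the data frames and then added one more filter condition to pmap. The specialty condition has to be evaluated per row: a bare any(specialties %in% s) collapses to a single TRUE or FALSE for the whole data frame, which makes the versions with and without specialty return the same rows, so map_lgl is used instead. Please check the result against your own data to see whether it is what you want.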
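The difference between testing a list column as a whole and testing it element by element can be seen in isolation. The sketch below uses base R only; the variable names are illustrative, and the three list elements stand for the specialties of physicians 1, 2 and 3 from the example.

```r
# Specialties of physicians 1, 2 and 3, held as a list column would hold them
specialties <- list(100, c(100, 200), 200)

# Specialties of the ego physician (physician 1 here)
s <- 100

# Whole-column test: any() reduces everything to one logical value,
# so inside filter() it keeps either every row or none
whole_column <- any(specialties %in% s)
length(whole_column)  # 1

# Element-wise test: one logical value per physician
per_row <- vapply(specialties, function(sp) any(sp %in% s), logical(1))
per_row  # TRUE TRUE FALSE
```

Only the element-wise form can tell physician 3 (no overlap with physician 1) apart from physician 2, which is what a specialty filter needs.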