我有来自几个诊所的入院和出院数据,需要确定同一诊所的 2+ 名患者之间是否具有相同病原体的 DNA 株的入院重叠。在 R 中工作。
只有 5 个变量:研究 ID、日期输入、日期输出、DNA 株和临床。每一行都是一次单独的访问,这意味着"ID"变量可以重复。我需要确定有多少"重叠",这意味着具有相同病原体DNA的患者同时在同一诊所。
这是一个(看似(简单的问题,在这个网站上讨论了很多。当每个ID是唯一的时,我能够识别重叠。我的具体问题是如何处理重复的ID。在下面的代码中,我提供了示例数据和我的代码,当每个 ID 是唯一时,这些代码有效(您可以通过在代码的第一行中将seq(1:20)),each=2)
更改为seq(1:40)),each=1)
来快速测试这一点(。如果我使用重复 ID 运行此代码,它会将所有访问标记为具有任何重叠的 ID 的重叠 = 1,无论它是否是重叠的实际访问。
我知道重叠的日期间隔是一个经常讨论的话题,所以请相信我已经彻底查看了我能找到的所有内容,除了最后一部分之外,几乎做到了。任何建议都非常感谢。
谢谢!
#Set globals
set.seed(8126)
library(lubridate); library(data.table)
#Example Data
have<-data.frame(rep(paste0("k",seq(1:20)),each=2),sample(seq(as.Date('2015/01/01'), as.Date('2020/01/01'), by="day"), 40))
names(have)<-c("id","datein")
have$dateout<-have$datein+40 #arbitrarily add 40 days to admission date
have$dnastrain<-as.vector(replicate(20,rep(sample(c("Type 1","Type 2","Type 3"),1),each=2)))
have$clinic<-sample(c("A","B","C","D"),40, replace=TRUE)
#Code that works if each ID is unique
setDT(have); setkey(have,datein, dateout) #Convert to DT and set date in/out as keys
overlaps<-unique(foverlaps(have, have)[id!=i.id & dnastrain==i.dnastrain & clinic==i.clinic, id]) #find overlaps
have[,`:=`(overlap=0)][id %in% overlaps, overlap:=1][order(datein)] #Identify overlaps
这是一个 dplyr 方法。
have$rownum <- 1:nrow(have)
crossover <- have %>%
inner_join(have, by = c("clinic", "dnastrain"), suffix=(c(".a",".b"))) %>%
filter(datein.a<=dateout.b &
dateout.a>=datein.b &
id.a!=id.b &
rownum.a < rownum.b
)
crossover
id.a datein.a dateout.a dnastrain clinic rownum.a id.b datein.b dateout.b rownum.b
1 k1 2017-11-02 2017-12-12 Type 3 B 1 k18 2017-10-03 2017-11-12 35
2 k10 2016-02-15 2016-03-26 Type 3 A 19 k13 2016-03-22 2016-05-01 26
3 k17 2017-08-06 2017-09-15 Type 2 C 33 k20 2017-09-02 2017-10-12 39
我想出了一种使用purrr
稍微不同的方法:
participant_in <- have$datein
participant_out <- have$dateout
# For each participant, create a vector of days at which they were in a clinic
days <- map2(participant_in, participant_out, ~ seq(from = ymd(.x), to = ymd(.y), by = 'days'))
ids <- map2(have$id, days, ~ rep(.x, length(.y)))
days <- days %>% reduce(c)
ids <- ids %>% unlist()
participant_tib <- tibble(id = ids, day = days) %>%
left_join(have, by = 'id') %>%
select(-datein, -dateout)
participant_tib %>%
group_by(day, dnastrain, clinic) %>%
count() %>%
arrange(desc(n))
使用上面的^,您可以检查在同一诊所中有多少名参与者具有相同的菌株。从那里,您可以filter
n> 1 的天数,并检查参与者 ID 中的这些天数。
您可以使用连接而不是%in%
:
setDT(have, key=c("datein","dateout"))
overlaps <- unique(foverlaps(have, have)[clinic==i.clinic, dnastrain==i.dnastrain & id!=i.id])
cols <- copy(names(have))
have[, overlap := 0][
overlaps, on=cols, overlap := 1][
order(datein)]
输出:
id datein dateout dnastrain clinic overlap
1: k6 2015-01-01 2015-02-10 Type 1 D 0
2: k6 2015-01-20 2015-03-01 Type 1 D 0
3: k9 2015-04-09 2015-05-19 Type 2 B 1
4: k11 2015-04-12 2015-05-22 Type 2 B 1
5: k10 2015-04-19 2015-05-29 Type 3 C 0
6: k10 2015-07-03 2015-08-12 Type 3 A 0
7: k4 2015-08-06 2015-09-15 Type 1 C 0
8: k16 2015-08-26 2015-10-05 Type 2 A 0
9: k13 2016-05-20 2016-06-29 Type 3 D 1
10: k19 2016-05-26 2016-07-05 Type 3 D 1
11: k15 2016-06-23 2016-08-02 Type 3 A 0
12: k7 2016-06-29 2016-08-08 Type 3 B 0
13: k18 2016-07-18 2016-08-27 Type 2 B 0
14: k13 2016-08-11 2016-09-20 Type 3 A 0
15: k1 2016-09-24 2016-11-03 Type 3 C 0
16: k18 2016-09-29 2016-11-08 Type 2 B 0
17: k12 2017-03-13 2017-04-22 Type 3 B 0
18: k7 2017-03-24 2017-05-03 Type 3 C 0
19: k14 2017-05-28 2017-07-07 Type 3 B 0
20: k3 2017-06-05 2017-07-15 Type 2 B 0
21: k17 2017-07-06 2017-08-15 Type 3 A 0
22: k17 2017-09-08 2017-10-18 Type 3 B 1
23: k2 2017-09-30 2017-11-09 Type 3 B 1
24: k15 2017-10-07 2017-11-16 Type 3 D 0
25: k8 2018-01-01 2018-02-10 Type 1 C 0
26: k5 2018-07-07 2018-08-16 Type 2 B 0
27: k20 2018-07-18 2018-08-27 Type 1 A 0
28: k5 2018-08-30 2018-10-09 Type 2 B 0
29: k16 2018-09-22 2018-11-01 Type 2 D 0
30: k14 2018-09-24 2018-11-03 Type 3 B 0
31: k2 2018-11-02 2018-12-12 Type 3 A 1
32: k12 2018-11-13 2018-12-23 Type 3 A 1
33: k19 2018-11-21 2018-12-31 Type 3 D 0
34: k3 2018-11-30 2019-01-09 Type 2 A 0
35: k1 2018-12-14 2019-01-23 Type 3 A 1
36: k9 2019-03-13 2019-04-22 Type 2 D 0
37: k8 2019-07-18 2019-08-27 Type 1 C 1
38: k20 2019-08-10 2019-09-19 Type 1 C 1
39: k11 2019-08-30 2019-10-09 Type 2 B 0
40: k4 2019-09-04 2019-10-14 Type 1 B 0
id datein dateout dnastrain clinic overlap
数据:
set.seed(8126)
library(data.table)
have<-data.frame(rep(paste0("k",seq(1:20)),each=2),sample(seq(as.Date('2015/01/01'), as.Date('2020/01/01'), by="day"), 40))
names(have)<-c("id","datein")
have$dateout<-have$datein+40 #arbitrarily add 40 days to admission date
have$dnastrain<-as.vector(replicate(20,rep(sample(c("Type 1","Type 2","Type 3"),1),each=2)))
have$clinic<-sample(c("A","B","C","D"),40, replace=TRUE)
这是我今天刚刚学到的解决方案,使用dplyr
和purrr
library(dplyr)
library(purrr)
library(lubridate)
have %>%
mutate(interval=interval(datein,dateout)) %>%
group_by(dnastrain,clinic) %>%
mutate(overlap = purrr::map_int(row_number(),
~+any(datein[.x] %within% interval[-.x] | dateout[.x] %within% interval[-.x])))