R - Clustering data based on case dates



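I have a dataset of 20,000 cases, and each case has an onset date ("onsetdate"). Each case lives in a group home, and I want to cluster the cases by their onset dates within each home.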

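So I want to identify the first case to occur in a home. If another case occurs within 14 days of the first case, I want to add it to the same cluster. If another case occurs within 14 days of any other case in the cluster, I want to add it to the same cluster as well. Once a case occurs more than 14 days after the previous case, I stop adding cases to the cluster; at that point a new cluster forms and the process restarts, until everyone has been sorted. The cluster's "start date" is the onset date of the first case added to the cluster, and the end date is 14 days after the last case added to the cluster.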

Here is some dummy data:

dummy <- data.frame(
  case = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19),
  onsetdate = as.Date(c("2012-08-30", "2012-09-03", "2012-09-09", "2012-09-17", "2012-11-01", "2012-11-05", "2012-11-30", "2012-08-30", "2012-09-03", "2012-10-09", "2012-10-17", "2012-10-30", "2020-12-26", "2020-12-23", "2020-12-30", "2020-12-25", "2021-04-22", "2021-05-03", "2021-05-10")),
  position = c("Resident", "Staff", "Resident", "Staff", "Staff", "Resident", "Resident", "Staff", "Resident", "Staff", "Staff", "Resident", "Resident", "Resident", "Staff", "Resident", "Staff", "Staff", "Resident"),
  grouphome = c("Group Home 1", "Group Home 1", "Group Home 1", "Group Home 1", "Group Home 1", "Group Home 1", "Group Home 1", "Group Home 1", "Group Home 2", "Group Home 2", "Group Home 2", "Group Home 2", "Group Home 3", "Group Home 3", "Group Home 3", "Group Home 3", "Group Home 3", "Group Home 3", "Group Home 3")
)

The output would look like this:

result <- data.frame(
  grouphome = c("Group Home 1", "Group Home 1", "Group Home 1", "Group Home 2", "Group Home 2", "Group Home 3", "Group Home 3"),
  clusterNumber = c("1", "2", "3", "1", "2", "1", "2"),
  clusterStart = as.Date(c("2012-08-30", "2012-11-01", "2012-11-30", "2012-09-03", "2012-10-09", "2020-12-23", "2021-04-22")),
  cases = c("5", "2", "1", "1", "3", "4", "3")
)

Thanks in advance.

It seems you first want to group_by grouphome.

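You can then also group_by clusterNumber, which is determined by looking for differences in onsetdate greater than 14 days. Using cumsum, the cumulative sum, provides a counter for this: it increases by one each time such a gap occurs.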
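As a minimal base R sketch of how that counter behaves, here are the four Group Home 2 onset dates from the dummy data, which are already in date order. The intermediate names gaps, new_cluster and cluster_number are illustrative only and do not appear in the pipeline further down.

```r
# Onset dates for "Group Home 2" from the dummy data (already in date order)
onset <- as.Date(c("2012-09-03", "2012-10-09", "2012-10-17", "2012-10-30"))

# Gap in days between each case and the one before it
gaps <- as.numeric(diff(onset))        # 36 8 13

# TRUE wherever a gap exceeds 14 days, i.e. where a new cluster starts;
# the first case has no predecessor, so it is padded with FALSE
new_cluster <- c(FALSE, gaps > 14)     # FALSE TRUE FALSE FALSE

# Running count of cluster starts, shifted so numbering begins at 1
cluster_number <- 1 + cumsum(new_cluster)
cluster_number                         # 1 2 2 2
```

Because each case is compared only with the case immediately before it, a chain of cases that are each within 14 days of the previous one stays in a single cluster, however long the chain gets.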

The final summarise takes the first date within each group home cluster as clusterStart, and cases is the number of rows in that cluster.

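This assumes the dates are already in chronological order within each home. If they are not (in the dummy data, Group Home 1 and Group Home 3 are out of order), you will need to arrange by onsetdate first.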
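A sketch of what sorting first might look like, using only the Group Home 3 rows from the dummy data, which are not in date order. The names gh3 and clusters are made up for this illustration, and .groups = "drop" is added only to return an ungrouped result.

```r
library(dplyr)

# Group Home 3 rows from the dummy data, in their original (unsorted) order
gh3 <- data.frame(
  grouphome = "Group Home 3",
  onsetdate = as.Date(c("2020-12-26", "2020-12-23", "2020-12-30", "2020-12-25",
                        "2021-04-22", "2021-05-03", "2021-05-10"))
)

clusters <- gh3 %>%
  # Sort by date within each home so diff() compares consecutive cases
  arrange(grouphome, onsetdate) %>%
  group_by(grouphome) %>%
  # New cluster whenever the gap to the previous case exceeds 14 days
  group_by(clusterNumber = 1 + cumsum(c(0, diff(onsetdate) > 14)), .add = TRUE) %>%
  summarise(clusterStart = first(onsetdate),
            cases = n(),
            .groups = "drop")

clusters
# Two clusters: one starting 2020-12-23 with 4 cases,
# one starting 2021-04-22 with 3 cases
```

Without the arrange step, first(onsetdate) for the first cluster would be 2020-12-26, the first row as entered, rather than the earliest onset date, 2020-12-23.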

Edit: To also get counts of "Resident" and "Staff" for each clusterNumber, you can sum over position for each of these two values.

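library(dplyr)

dummy %>%
  # Work within each group home
  group_by(grouphome) %>%
  # Start a new cluster whenever the gap to the previous row exceeds 14 days
  group_by(clusterNumber = 1 + cumsum(c(0, diff(onsetdate) > 14)), .add = TRUE) %>%
  summarise(clusterStart = first(onsetdate),          # first onset date in the cluster
            cases = n(),                              # number of rows in the cluster
            resident = sum(position == "Resident"),   # count of residents
            staff = sum(position == "Staff"))         # count of staff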

Output

  grouphome    clusterNumber clusterStart cases resident staff
  <chr>                <dbl> <date>       <int>    <int> <int>
1 Group Home 1             1 2012-08-30       4        2     2
2 Group Home 1             2 2012-11-01       2        1     1
3 Group Home 1             3 2012-11-30       2        1     1
4 Group Home 2             1 2012-09-03       1        1     0
5 Group Home 2             2 2012-10-09       3        1     2
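The question also defines an end date for each cluster: 14 days after the last case added. One possible way to include it, shown here as a sketch on the Group Home 2 rows only, is to add a further column in summarise. The column name clusterEnd and the object names gh2 and ends are assumptions made for this illustration.

```r
library(dplyr)

# Group Home 2 rows from the dummy data
gh2 <- data.frame(
  grouphome = "Group Home 2",
  onsetdate = as.Date(c("2012-09-03", "2012-10-09", "2012-10-17", "2012-10-30"))
)

ends <- gh2 %>%
  arrange(grouphome, onsetdate) %>%
  group_by(grouphome) %>%
  group_by(clusterNumber = 1 + cumsum(c(0, diff(onsetdate) > 14)), .add = TRUE) %>%
  summarise(clusterStart = first(onsetdate),
            clusterEnd = last(onsetdate) + 14,   # 14 days after the last case
            cases = n(),
            .groups = "drop")

ends
# Cluster 1: 2012-09-03 to 2012-09-17, 1 case
# Cluster 2: 2012-10-09 to 2012-11-13, 3 cases
```

Adding a number to a Date in R shifts it by that many days, so last(onsetdate) + 14 needs no extra date handling.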
