r - Count unique IDs per group within x days



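This is a follow-up to a question I posted here: Count occurrences of an ID in the past x days in R

I want to do another rolling count. I have the following data: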

date = c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03", "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18", "2014-04-23", "2014-04-01")
group = c("G","G","F","G","E","E","H","H","H","A")
ID = c(2, 3, 4, 2, 3, 1, 2, 4, 2, 1)
group       date ID
1:     G 2014-04-01  2
2:     G 2014-04-12  3
3:     F 2014-04-07  4
4:     G 2014-05-03  2
5:     E 2014-04-14  3
6:     E 2014-05-04  1
7:     H 2014-03-31  2
8:     H 2014-04-18  4
9:     H 2014-04-23  2
10:     A 2014-04-01  1

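For each group, I want to count the number of unique IDs within the 30 days up to and including the current date. The desired count column looks like this: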

group       date ID  count
1:     G 2014-04-01  2      1
2:     G 2014-04-12  3      2
3:     F 2014-04-07  4      1
4:     G 2014-05-03  2      2
5:     E 2014-04-14  3      1
6:     E 2014-05-04  1      2
7:     H 2014-03-31  2      1
8:     H 2014-04-18  4      2
9:     H 2014-04-23  2      2
10:     A 2014-04-01  1      1

In my previous thread, @ThomasIsCoding provided a solution. I tried to adapt his code to what I am now trying to do, as follows:

dt[date <= first(date) + 30, date := as.Date(date)][, count := uniqueN(ID), group]
group       date ID count
1:     G 2014-04-01  2     2
2:     G 2014-04-12  3     2
3:     F 2014-04-07  4     1
4:     G 2014-05-03  2     2
5:     E 2014-04-14  3     2
6:     E 2014-05-04  1     2
7:     H 2014-03-31  2     2
8:     H 2014-04-18  4     2
9:     H 2014-04-23  2     2
10:     A 2014-04-01  1     1

But it does not seem to take the time-window condition into account. Any suggestions would be much appreciated!
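A minimal sketch of why the attempt shows no window effect, assuming `dt` holds the data above with `date` already converted to `Date`: the second bracket has no date condition at all, so `uniqueN(ID)` is evaluated once per group over all of that group's rows, and every row in a group receives the same value.

```r
library(data.table)

dt <- data.table(
  group = c("G","G","F","G","E","E","H","H","H","A"),
  date  = as.Date(c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03",
                    "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18",
                    "2014-04-23", "2014-04-01")),
  ID    = c(2, 3, 4, 2, 3, 1, 2, 4, 2, 1)
)

# Same grouping step as the second bracket of the attempt:
# one distinct-ID count per group, with no reference to date
dt[, count := uniqueN(ID), by = group]
dt
```

A per-row window needs each row to be compared against the other rows of its group, which is what the answers below do.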

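An option using a non-equi join:

# start of the 30-day look-back window for each row
DT[, onemthago := date - 30L]
# for each row, join to all rows of the same group whose date lies in
# [onemthago, date] and count the distinct IDs among them
DT[, count := 
    DT[.SD, on=.(group, date>=onemthago, date<=date),
        by=.EACHI, length(unique(ID))]$V1
]

Output: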

group       date ID  onemthago count
1:     G 2014-04-01  2 2014-03-02     1
2:     G 2014-04-12  3 2014-03-13     2
3:     F 2014-04-07  4 2014-03-08     1
4:     G 2014-05-03  2 2014-04-03     2
5:     E 2014-04-14  3 2014-03-15     1
6:     E 2014-05-04  1 2014-04-04     2
7:     H 2014-03-31  2 2014-03-01     1
8:     H 2014-04-18  4 2014-03-19     2
9:     H 2014-04-23  2 2014-03-24     2
10:     A 2014-04-01  1 2014-03-02     1

Data:

date = as.Date(c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03", "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18", "2014-04-23", "2014-04-01"))
group = c("G","G","F","G","E","E","H","H","H","A")
ID = c(2, 3, 4, 2, 3, 1, 2, 4, 2, 1)
library(data.table)
DT <- data.table(group, date, ID)
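As a sanity check on the join result, the same rule can be written as a brute-force loop in base R. This is only a sketch for verification on small data, not part of the answer's method; the name `brute_count` is introduced here for illustration.

```r
date  <- as.Date(c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03",
                   "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18",
                   "2014-04-23", "2014-04-01"))
group <- c("G","G","F","G","E","E","H","H","H","A")
ID    <- c(2, 3, 4, 2, 3, 1, 2, 4, 2, 1)

# For each row i, look at rows of the same group dated within
# [date[i] - 30, date[i]] and count the distinct IDs among them.
brute_count <- vapply(seq_along(ID), function(i) {
  in_window <- group == group[i] & date >= date[i] - 30 & date <= date[i]
  length(unique(ID[in_window]))
}, integer(1))

brute_count
```

This compares every row with every other row, so it scales quadratically; the non-equi join is the better choice for larger tables.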

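Edited to address the comment about multiple look-back periods. You could try something like this:

for (x in c(30L, 90L)) {
    # start of the x-day look-back window for each row
    DT[, daysago := date - x]

    # adds one column per period: count30, count90
    DT[, paste0("count", x) := 
        .SD[.SD, on=.(group, date>=daysago, date<=date),
            by=.EACHI, length(unique(ID))]$V1
    ][]
}
DT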
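If the loop is needed in several places, it could be wrapped in a small helper. The sketch below is one possible arrangement; the function name `add_window_counts` and the temporary column `window_start` are hypothetical names introduced here. It uses the same non-equi join, with `uniqueN(ID)` in place of `length(unique(ID))`.

```r
library(data.table)

# Hypothetical helper: adds one column "count<days>" per look-back period.
# Modifies DT by reference, as := and set() do.
add_window_counts <- function(DT, days) {
  for (x in days) {
    DT[, window_start := date - x]
    counts <- DT[DT, on = .(group, date >= window_start, date <= date),
                 by = .EACHI, uniqueN(ID)]$V1
    set(DT, j = paste0("count", x), value = counts)
  }
  DT[, window_start := NULL]  # drop the temporary column
  invisible(DT)
}

DT <- data.table(
  group = c("G","G","F","G","E","E","H","H","H","A"),
  date  = as.Date(c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03",
                    "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18",
                    "2014-04-23", "2014-04-01")),
  ID    = c(2, 3, 4, 2, 3, 1, 2, 4, 2, 1)
)
add_window_counts(DT, c(30L, 90L))
DT
```

On this sample every date falls within a span of about five weeks, so the 90-day counts come out the same as the 30-day counts.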

If I understand your question correctly, an alternative in the tidyverse would be:

library(tidyverse)
tb <- dplyr::tibble(date = c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03", "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18", "2014-04-23", "2014-04-01"),
                    group = c("G","G","F","G","E","E","H","H","H","A"),
                    ID = c(2, 3, 4, 2, 3, 1, 2, 4, 2, 1))
tb %>% 
  dplyr::group_by(group) %>% 
  # note: this flag is computed but never used to filter, so the result
  # is the number of distinct IDs per group with no date window applied
  dplyr::mutate(as.numeric(difftime(Sys.Date(), date)) < 31) %>% 
  dplyr::distinct(ID) %>% 
  dplyr::count(group) %>% 
  dplyr::right_join(tb) %>% 
  dplyr::select(group, date, ID, Count = n)
group date          ID Count
<chr> <chr>      <dbl> <int>
1 A     2014-04-01     1     1
2 E     2014-04-14     3     2
3 E     2014-05-04     1     2
4 F     2014-04-07     4     1
5 G     2014-04-01     2     2
6 G     2014-04-12     3     2
7 G     2014-05-03     2     2
8 H     2014-03-31     2     2
9 H     2014-04-18     4     2
10 H     2014-04-23     2     2

For a rolling-window type of calculation, this should be a solution:

tb %>% 
  dplyr::full_join(tb, by = "group") %>% 
  # keep pairs where date.y falls within the 30 days up to and including date.x
  dplyr::filter(as.numeric(difftime(as.Date(date.x), as.Date(date.y), units = "days")) >= 0 &
                as.numeric(difftime(as.Date(date.x), as.Date(date.y), units = "days")) < 31) %>% 
  dplyr::distinct(group, date.x, ID.y) %>% 
  dplyr::count(group, date.x) %>% 
  # you may want to cut the pipe here and inspect the result (remember to delete the %>% on the line above if you remove the part below)
  dplyr::right_join(tb, by = c("group", "date.x" = "date")) %>% 
  dplyr::select(group, date = date.x, ID, count = n)
group date          ID count
<chr> <chr>      <dbl> <int>
1 A     2014-04-01     1     1
2 E     2014-04-14     3     1
3 E     2014-05-04     1     2
4 F     2014-04-07     4     1
5 G     2014-04-01     2     1
6 G     2014-04-12     3     2
7 G     2014-05-03     2     2
8 H     2014-03-31     2     1
9 H     2014-04-18     4     2
10 H     2014-04-23     2     2

See how well this proposed solution works in your case.

library(data.table)
date = c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03", "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18", "2014-04-23", "2014-04-01")
group = c("G","G","F","G","E","E","H","H","H","A")
ID = c(2, 3, 4, 2, 3, 1, 2, 4, 2, 1)
dt <- data.table( date=as.Date(date), group, ID )
## sort by group, then date, so earlier rows come first within each group
setkey( dt, group, date )
calc.id <- function(ID,date) {
    ## there is always going to be at least 1
    id.count <- 1
    if( length(ID) > 1 ) {
        v <- sapply( 2:length(ID), function(i) {
            ## rows up to i whose date falls within the last 30 days
            j <- date[ 1:i ] >= date[i] - 30
            ## subset ID to 1:i first so the shorter logical index is not recycled
            uniqueN( ID[ 1:i ][ j ] )
        })
        id.count <- c( id.count, v )
    }
    return( id.count )
}
dt[ , count := calc.id(ID,date), by=group ]
dt

I did not find any way to do the rolling calculation other than sapply. I doubt it will outperform the solutions above.

It produces this:

date group ID count
1: 2014-04-01     A  1     1
2: 2014-04-14     E  3     1
3: 2014-05-04     E  1     2
4: 2014-04-07     F  4     1
5: 2014-04-01     G  2     1
6: 2014-04-12     G  3     2
7: 2014-05-03     G  2     2
8: 2014-03-31     H  2     1
9: 2014-04-18     H  4     2
10: 2014-04-23     H  2     2
