How do I group by in R when some rows (subsets of others) belong to multiple groups?

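This is what my data looks like. I want the rows grouped by date, ID1 and ID2. Rows with <NA> in ID3 are a subset of every group whose date, first ID and second ID match, so they should be counted in each of those groups. In addition, the stats should be summed and a count n() generated.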

      date     ID1     ID2     ID3 stat1 stat2 stat3
1 12-03-07  abc123  wxy456  pqr123    10    20     30  
2 12-03-07  abc123  wxy456  pqr123    20    40     60  
3 10-04-07  bcd456  wxy456  hgf356    10    20     40  
4 12-03-07  abc123  wxy456  hfz123    30    60     90  
5 12-03-07  abc123  wxy456   <NA>     40    50     70  

Desired output

date,     ID1,    ID2,    ID3,    n, stat1, stat2, stat3
12-03-07, abc123, wxy456, pqr123, 3, 70,    110,   160
10-04-07, bcd456, wxy456, hgf356, 1, 10,    20,    40
12-03-07, abc123, wxy456, hfz123, 2, 40,    50,    70

There are probably more elegant solutions, but this solves it with dplyr's group_by/summarise (as in Adam Quek's code) and then computes the mean stats.

        library(dplyr)

        # Summarise each (date, ID1, ID2, ID3) group: row count and stat sums
        df <- df %>%
          group_by(date, ID1, ID2, ID3) %>%
          summarise(n = n(), stat1 = sum(stat1), stat2 = sum(stat2), stat3 = sum(stat3)) %>%
          ungroup()
        # Select the groups where ID3 is NA
        dfNA <- df %>% filter(is.na(ID3))
        # Select the groups where ID3 is not NA
        df1 <- df %>% filter(!is.na(ID3))
        # Join these on the shared keys (.x columns come from df1, .y from dfNA);
        # ID2 is included as a key and ID3 is kept so each group stays identifiable
        dfBig <- df1 %>% full_join(dfNA, by = c("date", "ID1", "ID2")) %>%
          select(date, ID1, ID2, ID3 = ID3.x, n.x, n.y,
                 stat1.x, stat1.y, stat2.x, stat2.y, stat3.x, stat3.y)
        # Replace NAs by 0 (each column is tested against its own NAs)
        dfBig$stat1.x[is.na(dfBig$stat1.x)] <- 0
        dfBig$stat1.y[is.na(dfBig$stat1.y)] <- 0
        dfBig$stat2.x[is.na(dfBig$stat2.x)] <- 0
        dfBig$stat2.y[is.na(dfBig$stat2.y)] <- 0
        dfBig$stat3.x[is.na(dfBig$stat3.x)] <- 0
        dfBig$stat3.y[is.na(dfBig$stat3.y)] <- 0
        dfBig$n.x[is.na(dfBig$n.x)] <- 0
        dfBig$n.y[is.na(dfBig$n.y)] <- 0
        # Compute mean stats. stat*.x and stat*.y are already sums,
        # so the pooled mean is the combined sum over the combined count
        dfBig$stat1Mean <- (dfBig$stat1.x + dfBig$stat1.y) / (dfBig$n.x + dfBig$n.y)
        dfBig$stat2Mean <- (dfBig$stat2.x + dfBig$stat2.y) / (dfBig$n.x + dfBig$n.y)
        dfBig$stat3Mean <- (dfBig$stat3.x + dfBig$stat3.y) / (dfBig$n.x + dfBig$n.y)
        dfBig$n2 <- dfBig$n.x + dfBig$n.y
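
If only the counts and sums from the question are needed, one possible alternative is to copy each <NA> row into every ID3 group that shares its date, ID1 and ID2, and then summarise once. The following is a minimal sketch under that assumption, not the answer's original method; the data frame is retyped from the question, and the names `known`, `unknown`, `groups`, `expanded` and `result` are illustrative only.

```r
library(dplyr)

# Example data retyped from the question; ID3 is NA in the last row
df <- data.frame(
  date  = c("12-03-07", "12-03-07", "10-04-07", "12-03-07", "12-03-07"),
  ID1   = c("abc123", "abc123", "bcd456", "abc123", "abc123"),
  ID2   = c("wxy456", "wxy456", "wxy456", "wxy456", "wxy456"),
  ID3   = c("pqr123", "pqr123", "hgf356", "hfz123", NA),
  stat1 = c(10, 20, 10, 30, 40),
  stat2 = c(20, 40, 20, 60, 50),
  stat3 = c(30, 60, 40, 90, 70),
  stringsAsFactors = FALSE
)

# Rows with a known ID3, and rows where ID3 is missing (ID3 column dropped)
known   <- df %>% filter(!is.na(ID3))
unknown <- df %>% filter(is.na(ID3)) %>% select(-ID3)

# Every distinct group that an NA row could belong to
groups <- known %>% distinct(date, ID1, ID2, ID3)

# Copy each NA row into every group with the same date, ID1 and ID2;
# NA rows without any matching group are dropped by this inner merge
expanded <- merge(unknown, groups, by = c("date", "ID1", "ID2"))

# Stack the original and copied rows, then count and sum per group
result <- bind_rows(known, expanded) %>%
  group_by(date, ID1, ID2, ID3) %>%
  summarise(n = n(), stat1 = sum(stat1), stat2 = sum(stat2), stat3 = sum(stat3)) %>%
  ungroup()

print(result)
```

With the example data this gives n = 3 and sums 70, 110, 160 for pqr123, and n = 1 with 10, 20, 40 for hgf356, matching the desired output. For hfz123 it gives n = 2 with sums 70, 110, 160, whereas the desired output lists 40, 50, 70, which are the values of the <NA> row alone.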
