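This is what my data looks like. I want the rows grouped by date, ID1, and ID2. Rows with NA in ID3 should be folded into every group they are a subset of, i.e., every group where the first and second IDs match. In addition, the stats should be summed and a count n() generated.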
      date    ID1    ID2    ID3 stat1 stat2 stat3
1 12-03-07 abc123 wxy456 pqr123    10    20    30
2 12-03-07 abc123 wxy456 pqr123    20    40    60
3 10-04-07 bcd456 wxy456 hgf356    10    20    40
4 12-03-07 abc123 wxy456 hfz123    30    60    90
5 12-03-07 abc123 wxy456   <NA>    40    50    70
Desired output
date, ID1, ID2, ID3, n, stat1, stat2, stat3
12-03-07, abc123, wxy456, pqr123, 3, 70, 110, 160
10-04-07, bcd456, wxy456, hgf356, 1, 10, 20, 40
12-03-07, abc123, wxy456, hfz123, 2, 40, 50, 70
There are probably more elegant solutions, but dplyr's group_by/summarise (as in Adam Quek's code) solves this; the mean stats are then computed afterwards.
# Summarize (requires dplyr)
library(dplyr)
df <- df %>% group_by(date, ID1, ID2, ID3) %>% summarise(n=n(), stat1=sum(stat1), stat2=sum(stat2), stat3=sum(stat3)) %>% ungroup()
# Select instances where NA
dfNA <- df %>% filter(is.na(ID3))
# Select instances where no NA
df1 <- df %>% filter(!is.na(ID3))
# Join these
dfBig <- df1 %>% full_join(dfNA, by = c("date", "ID1", "ID2")) %>%
  subset(select= c("date", "ID1", "ID2", "ID3.x", "n.x", "n.y", "stat1.x", "stat1.y", "stat2.x", "stat2.y", "stat3.x", "stat3.y"))
# Replace <NA>s by 0
dfBig$stat1.x[is.na(dfBig$stat1.x)] <- 0
dfBig$stat1.y[is.na(dfBig$stat1.y)] <- 0
dfBig$stat2.x[is.na(dfBig$stat2.x)] <- 0
dfBig$stat2.y[is.na(dfBig$stat2.y)] <- 0
dfBig$stat3.x[is.na(dfBig$stat3.x)] <- 0
dfBig$stat3.y[is.na(dfBig$stat3.y)] <- 0
dfBig$n.x[is.na(dfBig$n.x)] <- 0
dfBig$n.y[is.na(dfBig$n.y)] <- 0
# Compute Mean stats and Rename Columns
dfBig$stat1Mean <- (dfBig$stat1.x * dfBig$n.x + dfBig$stat1.y * dfBig$n.y) / (dfBig$n.x +dfBig$n.y)
dfBig$stat2Mean <- (dfBig$stat2.x * dfBig$n.x + dfBig$stat2.y * dfBig$n.y) / (dfBig$n.x +dfBig$n.y)
dfBig$stat3Mean <- (dfBig$stat3.x * dfBig$n.x + dfBig$stat3.y * dfBig$n.y) / (dfBig$n.x +dfBig$n.y)
dfBig$n2 <- dfBig$n.x + dfBig$n.y
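If the goal is the summed stats from the desired output rather than means, a more compact variant might look like the sketch below. It assumes the missing ID3 values are real NA (not the string "<NA>"), rebuilds the example data from the question, and uses illustrative names (known, unknown, result, and the *_na columns) that are not part of the original code. For the hfz123 row it gives 70, 110, 160, whereas the desired output lists 40, 50, 70; which of the two is intended is not clear from the question.

```r
library(dplyr)

# Example data rebuilt from the question; NA in ID3 marks the rows to fold in.
df <- data.frame(
  date  = c("12-03-07", "12-03-07", "10-04-07", "12-03-07", "12-03-07"),
  ID1   = c("abc123", "abc123", "bcd456", "abc123", "abc123"),
  ID2   = c("wxy456", "wxy456", "wxy456", "wxy456", "wxy456"),
  ID3   = c("pqr123", "pqr123", "hgf356", "hfz123", NA),
  stat1 = c(10, 20, 10, 30, 40),
  stat2 = c(20, 40, 20, 60, 50),
  stat3 = c(30, 60, 40, 90, 70),
  stringsAsFactors = FALSE
)

# Totals for rows with a known ID3: one row per (date, ID1, ID2, ID3).
known <- df %>%
  filter(!is.na(ID3)) %>%
  group_by(date, ID1, ID2, ID3) %>%
  summarise(n = n(), stat1 = sum(stat1), stat2 = sum(stat2), stat3 = sum(stat3)) %>%
  ungroup()

# Totals for rows with a missing ID3: one row per (date, ID1, ID2).
unknown <- df %>%
  filter(is.na(ID3)) %>%
  group_by(date, ID1, ID2) %>%
  summarise(n_na = n(), stat1_na = sum(stat1), stat2_na = sum(stat2), stat3_na = sum(stat3)) %>%
  ungroup()

# Add the NA totals to every matching known group; groups without a match add 0.
result <- known %>%
  left_join(unknown, by = c("date", "ID1", "ID2")) %>%
  mutate(
    n     = n + coalesce(n_na, 0L),
    stat1 = stat1 + coalesce(stat1_na, 0),
    stat2 = stat2 + coalesce(stat2_na, 0),
    stat3 = stat3 + coalesce(stat3_na, 0)
  ) %>%
  select(date, ID1, ID2, ID3, n, stat1, stat2, stat3)

print(result)
```

Using left_join here keeps only groups that have a known ID3, so NA rows with no matching group are dropped; if those should be kept as their own rows, full_join would be the closer fit.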