R - Averages over an adjustable, non-overlapping averaging period (4, 7, 30, or 42 days) while also aggregating (grouping) by multiple variables



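I would like to apply a distinct (not rolling) seven-day averaging period to a set of data. The seven-day window should start when a sample is "found" and should not be based on the calendar week.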

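I have tried the code below, but the problem is that it produces a rolling average for every sample in the dataset. Instead, I need all samples that fall within one averaging period to be aggregated into a single sample.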

library(plyr)
library(dplyr)
library(lubridate)

Analyte<-c("Copper", "Copper", "Copper", "Copper", "Nickel", "Nickel", "Nickel")
Date<-mdy(c("1/1/2015", "1/3/2015", "1/12/2015", "1/15/2015", "1/3/2015", "1/6/2015", "1/8/2015"))
Matrix<-c("Water", "Water", "Water", "Water", "Water", "Water", "Water")
Fraction<-c("Total", "Total", "Total", "Total", "Dissolved", "Dissolved", "Dissolved")
Result<-c(0.6, 0.3, 0.5, 0.6, 0.1, 0.9, 1.0)
d<-cbind.data.frame(Analyte, Date, Matrix, Fraction, Result)

d$Date2<-d$Date
d$dateinterval<-interval(d$Date2-days(7), d$Date2+days(7))
d2<-ddply(d, c("Analyte", "Matrix", "Fraction"),function(df){
  SevenDayResultMean<-rep(NA, length(df$Date))
  SevenDayN<-rep(NA, length(df$Date))
  for(i in 1:length(df$Date)){
    SevenDayResultMean[i]<-mean(df$Result[df$Date2%within%df$dateinterval[i]], na.rm=T)
    SevenDayN[i]<-length(df$Result[df$Date2%within%df$dateinterval[i]])
  }
  return(data.frame(SevenDayResultMean=SevenDayResultMean, Date=as.character(df$Date), SevenDayN=SevenDayN))
}
)

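The code above returns the table below, which is a rolling average and not what I need. In this table, the first nickel sample is averaged with the two nickel samples that follow it, the second sample is then averaged with the first and the last samples, and so on.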

Analyte     Matrix     Fraction    SevenDayResultMean   Date       SevenDayN        
Copper      Water      Total       0.45                 2015-01-01        2
Copper      Water      Total       0.3                  2015-01-03        2
Copper      Water      Total       0.55                 2015-01-12        2
Copper      Water      Total       0.6                  2015-01-15        2
Nickel      Water      Dissolved   0.67                 2015-01-03        3
Nickel      Water      Dissolved   0.95                 2015-01-06        3
Nickel      Water      Dissolved   1.0                  2015-01-08        3

Ideally, I would define an averaging period and then group by matching values of all the other variables. I need a table like the following:

Analyte    Date       Matrix     Fraction     Result
Copper     1/1/2015   Water      Total        0.45
Copper     1/12/2015  Water      Total        0.55
Nickel     1/3/2015   Water      Dissolved    0.67

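Here, the first two copper samples are averaged because they share the same fraction, matrix, and analyte and fall within seven days of the first sample; together they become the first entry in the result table. The same applies to the next two copper samples and to all of the nickel samples. Which date is attached to each entry in the result table does not matter, as long as it falls within that seven-day averaging period.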

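Using dplyr, we can start a new interval whenever the gap between consecutive samples within a group is seven days or more, and then average within each interval: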

library(dplyr)
d %>% 
  group_by(Analyte, Matrix, Fraction) %>% 
  mutate(interval = cumsum(Date - lag(Date, default = min(Date)) >= 7)) %>% 
  group_by(interval, .add = TRUE) %>%  # dplyr >= 1.0.0; older versions use add = TRUE
  summarise(Date = min(Date), Result = mean(Result)) %>% 
  select(Analyte, Date, Matrix, Fraction, Result)
#> Source: local data frame [3 x 5]
#> Groups: Analyte, Matrix, Fraction [2]
#> 
#>   Analyte       Date Matrix  Fraction    Result
#>    <fctr>     <date> <fctr>    <fctr>     <dbl>
#> 1  Copper 2015-01-01  Water     Total 0.4500000
#> 2  Copper 2015-01-12  Water     Total 0.5500000
#> 3  Nickel 2015-01-03  Water Dissolved 0.6666667
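
Since the question asks for an adjustable averaging period (4, 7, 30, or 42 days), the same gap-based grouping can be wrapped in a function. This is only a sketch: the name `average_by_gap` and its `period` argument are hypothetical and not part of the original answer, and it assumes dplyr 1.0.0 or later (for `.add` and `.groups`).

```r
library(dplyr)

# Same sample data as in the question, rebuilt here so the block runs on its own.
d <- data.frame(
  Analyte  = c(rep("Copper", 4), rep("Nickel", 3)),
  Date     = as.Date(c("2015-01-01", "2015-01-03", "2015-01-12", "2015-01-15",
                       "2015-01-03", "2015-01-06", "2015-01-08")),
  Matrix   = "Water",
  Fraction = c(rep("Total", 4), rep("Dissolved", 3)),
  Result   = c(0.6, 0.3, 0.5, 0.6, 0.1, 0.9, 1.0)
)

# Hypothetical wrapper: a new interval starts whenever the gap to the
# previous sample in the same group is `period` days or more.
average_by_gap <- function(data, period = 7) {
  data %>%
    arrange(Analyte, Matrix, Fraction, Date) %>%   # gaps only make sense on sorted dates
    group_by(Analyte, Matrix, Fraction) %>%
    mutate(
      gap_days = as.numeric(Date - lag(Date, default = first(Date))),
      interval = cumsum(gap_days >= period)
    ) %>%
    group_by(interval, .add = TRUE) %>%
    summarise(Date = min(Date), Result = mean(Result), .groups = "drop") %>%
    select(Analyte, Date, Matrix, Fraction, Result)
}

seven_day  <- average_by_gap(d, period = 7)   # 3 rows, as in the output above
thirty_day <- average_by_gap(d, period = 30)  # 2 rows: one per analyte
print(seven_day)
print(thirty_day)
```

With a 30-day period, all four copper samples collapse into one row and all three nickel samples into another, because no gap between consecutive samples reaches 30 days.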

Data:

library(lubridate)
Analyte <- c("Copper", "Copper", "Copper", "Copper", "Nickel", "Nickel", "Nickel")
Date <- mdy(c("1/1/2015", "1/3/2015", "1/12/2015", "1/15/2015", "1/3/2015", "1/6/2015", "1/8/2015"))
Matrix <- c("Water", "Water", "Water", "Water", "Water", "Water", "Water")
Fraction <- c("Total", "Total", "Total", "Total", "Dissolved", "Dissolved", "Dissolved")
Result <- c(0.6, 0.3, 0.5, 0.6, 0.1, 0.9, 1.0)
d <- data.frame(Analyte, Date, Matrix, Fraction, Result)
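
One caveat about the gap-based grouping in the dplyr answer: it compares each sample with the previous one, so a long run of samples spaced a few days apart would all land in one interval, even if the run spans more than seven days. If windows should instead be anchored at the first sample, as the question describes, a small loop can assign the window ids. The following is a minimal sketch under stated assumptions: `assign_window` is a hypothetical helper, a sample exactly `period` days after the anchor is assumed to open a new window, and dplyr 1.0.0 or later is assumed.

```r
library(dplyr)

# Hypothetical helper, not part of the original answer.
# Returns one window id per date. A new window opens at the first date that
# lies `period` days or more after the current window's anchor (its first date).
assign_window <- function(dates, period = 7) {
  days <- as.numeric(dates)        # Date -> days since 1970-01-01
  stopifnot(!is.unsorted(days))    # assumes dates are in ascending order
  id <- integer(length(days))
  anchor <- days[1]
  current <- 1L
  for (i in seq_along(days)) {
    if (days[i] - anchor >= period) {
      anchor <- days[i]            # this sample starts the next window
      current <- current + 1L
    }
    id[i] <- current
  }
  id
}

# Same sample data as above, rebuilt so the block runs on its own.
d <- data.frame(
  Analyte  = c(rep("Copper", 4), rep("Nickel", 3)),
  Date     = as.Date(c("2015-01-01", "2015-01-03", "2015-01-12", "2015-01-15",
                       "2015-01-03", "2015-01-06", "2015-01-08")),
  Matrix   = "Water",
  Fraction = c(rep("Total", 4), rep("Dissolved", 3)),
  Result   = c(0.6, 0.3, 0.5, 0.6, 0.1, 0.9, 1.0)
)

averaged <- d %>%
  arrange(Analyte, Matrix, Fraction, Date) %>%
  group_by(Analyte, Matrix, Fraction) %>%
  mutate(window = assign_window(Date, period = 7)) %>%
  group_by(window, .add = TRUE) %>%
  summarise(Date = min(Date), Result = mean(Result), N = n(), .groups = "drop")

print(averaged)

# Where the two approaches differ: samples 4 days apart never show a gap of
# 7 days, but the third one is 8 days after the anchor, so it opens window 2.
chained <- assign_window(as.Date(c("2015-01-01", "2015-01-05", "2015-01-09")))
print(chained)
```

On the sample data both approaches give the same three rows; they only diverge when samples are spaced closely enough to chain past the end of a window.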
