子集和汇总(总和、平均值和标准差)使用R中的data.table()和dplyr(),按月份对多年的多级数据进行汇总



问题

我有一个名为"FID"的数据框架(见下文(,其中包含三年内每月FID的频率。

我想通过使用data.table和dplyr软件包计算3年内FID的总频率来对我的数据帧进行子集,然后我想总结我的数据,使其包含:-

汇总数据帧

  1. 月份
  2. FID在3年内每月的总频率
  3. 3年内每月FID的平均频率
  4. FID在3年内每月的标准偏差
  5. FID在3年内每月的标准误差
  6. 3年内每月的置信度下限和上限

我无法通过添加3年内每月FID的总频率来计算数据帧的子集。例如,在下面的例子中,三年中一月的总频率为-86+66+56=208,我希望每个月都这样做。

###What a section of my data frame looks like      ##Desired outcome
Year     Month       FID                             Month       FID  
2018    January       86                             January     208
2019    January       66                             February    176
2020    January       56
2018    February      76
2019    February      55
2020    February      45

汇总数据帧所需列

Month   Total_FID    Mean_FID   SD_FID   S.E_FID  Ci_Lower  Ci_Upper

我不是一个高级R用户,我已经在线阅读了许多Stack Overflow页面和教程,我无法找到正确的过程(请参阅下面的R代码(来生成我想要的汇总数据帧。我也找不到使用包data.table来产生上置信区间和下置信区间的方法。知道如何使用data.table和dplyr来完成这个过程会非常方便,因为我最经常使用这两个包。

如果有人能帮忙,我将不胜感激。

提前感谢

R-代码

使用数据表

##Reformat into a data.table object
FID_Table<-data.table(FID)
##Summary statistics
FID.Summarised=FID_Table[, sum(FID), 
Month=.N,
Mean_FID=mean(FID),
SD_FID=sd(FID),
S.E = std.error(FID),
by=Month]
##Error message
Error in `[.data.table`(FID_Table, , sum(FID), Month = .N, Mean_FID = mean(FID),  : 
unused arguments (Month = .N, Mean_FID = mean(FID), SD_FID = sd(FID), S.E = std.error(FID))

使用dplyr

###Create a vector to ensure the dates are in the right order
library(tidyverse) #loads dplyr and ggplot2

month_levels = c('January', 'February', 'March', 'April', 'May', 'June', 
'July', 'August', 'September', 'October', 'November', 'December')

p<-FID.Counts %>%
dplyr::mutate(Month=ordered(Month, levels=month_levels)) %>%
group_by(Month) %>% 
dplyr::summarise(Mean.Month=mean(FID), 
sd.month = sd(FID, na.rm = TRUE),
n_FID=n(),
sem=sd(FID)/sqrt(n()), 
ci_low=Mean.Month-1.96*sem, 
ci_hi=Mean.Month+1.96*sem) %>% 
ungroup()       

##The frequency of FID per month over three years is missing (see FID.Count below)
This code produces the summary statistics but does not subset the 
data frame by the frequency counts of FID per month over the last 3 
years.
#This table below summed the FID frequencies for each month using 
this code using data.table. 
library(data.table)
FID.Count=FID_Table[, sum(FID), by=(Month)]
Month  V1
1:   January 165
2:  February 182
3:     March 179
4:     April 104
5:       May 124
6:      June  10
7:      July  15
8:    August 133
9: September  97
10:   October  82
11:  November  75
12:  December 102

使用dplyr的R代码生成的汇总统计信息

Month Mean.Month  sd.month n_FID       sem      ci_low     ci_hi
1    January  55.000000 10.535654     3  6.082763  43.0777854 66.922215
2   February  60.666667 29.737743     3 17.169094  27.0152431 94.318090
3      March  59.666667 33.291641     3 19.220938  21.9936289 97.339704
4      April  34.666667 16.862186     3  9.735388  15.5853064 53.748027
5        May  41.333333 49.571497     3 28.620117 -14.7620965 97.428763
6       June   3.333333  5.773503     3  3.333333  -3.2000000  9.866667
7       July   5.000000  4.358899     3  2.516611   0.0674415  9.932558
8     August  44.333333 21.007935     3 12.128937  20.5606169 68.106050
9  September  32.333333 21.548395     3 12.440972   7.9490287 56.717638
10   October  27.333333 13.051181     3  7.535103  12.5645314 42.102135
11  November  25.000000 19.000000     3 10.969655   3.4994760 46.500524
12  December  34.000000  4.582576     3  2.645751  28.8143274 39.185673

数据帧:"FID">

structure(list(Year = c(2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2016L, 2016L, 2016L, 
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 
2017L, 2017L, 2017L), Month = structure(c(5L, 4L, 8L, 1L, 9L, 
7L, 6L, 2L, 12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L, 9L, 7L, 6L, 2L, 
12L, 11L, 10L, 3L, 5L, 4L, 8L, 1L, 9L, 7L, 6L, 2L, 12L, 11L, 
10L, 3L), .Label = c("April", "August", "December", "February", 
"January", "July", "June", "March", "May", "November", "October", 
"September"), class = "factor"), FID = c(65L, 88L, 43L, 54L, 
98L, 0L, 0L, 23L, 10L, 15L, 6L, 33L, 56L, 29L, 98L, 23L, 6L, 
10L, 7L, 65L, 53L, 41L, 25L, 30L, 44L, 65L, 38L, 27L, 20L, 0L, 
8L, 45L, 34L, 26L, 44L, 39L)), class = "data.frame", row.names = c(NA, 
-36L))

答案:data.table

##Reformat into a data.table object
FID_Table<-data.table(FID)
##Summarise Data
Summarised.FID<-FID_Table[, .(FID.Freq=sum(FID),
mean = mean(FID),
sd=sd(FID),
median=median(FID)), 
by = .(Month)]
##data.table results
Month FID.Freq      mean        sd median
1:   January      165 55.000000 10.535654     56
2:  February      182 60.666667 29.737743     65
3:     March      179 59.666667 33.291641     43
4:     April      104 34.666667 16.862186     27
5:       May      124 41.333333 49.571497     20
6:      June       10  3.333333  5.773503      0
7:      July       15  5.000000  4.358899      7
8:    August      133 44.333333 21.007935     45
9: September       97 32.333333 21.548395     34
10:   October       82 27.333333 13.051181     26
11:  November       75 25.000000 19.000000     25
12:  December      102 34.000000  4.582576     33

答案:dplyr

#####Summarise Data###########
###Create a vector to ensure the dates are in the right order
library(tidyverse) #loads dplyr and ggplot2

month_levels = c('January', 'February', 'March', 'April', 'May', 'June', 
'July', 'August', 'September', 'October', 'November', 'December')
##Summarise data
p<-FID.Counts %>%
dplyr::mutate(Month=ordered(Month, levels=month_levels)) %>%
group_by(Month) %>% 
dplyr::summarise(Frequency = sum(FID),
Mean.Month=mean(FID), 
sd.month = sd(FID, na.rm = TRUE),
n_FID=n(),
sem=sd(FID)/sqrt(n()), 
ci_low=Mean.Month-1.96*sem, 
ci_hi=Mean.Month+1.96*sem) %>% 
ungroup()  
##dplyr Results
Month Frequency Mean.Month  sd.month n_FID       sem      ci_low     ci_hi
1    January       165  55.000000 10.535654     3  6.082763  43.0777854 66.922215
2   February       182  60.666667 29.737743     3 17.169094  27.0152431 94.318090
3      March       179  59.666667 33.291641     3 19.220938  21.9936289 97.339704
4      April       104  34.666667 16.862186     3  9.735388  15.5853064 53.748027
5        May       124  41.333333 49.571497     3 28.620117 -14.7620965 97.428763
6       June        10   3.333333  5.773503     3  3.333333  -3.2000000  9.866667
7       July        15   5.000000  4.358899     3  2.516611   0.0674415  9.932558
8     August       133  44.333333 21.007935     3 12.128937  20.5606169 68.106050
9  September        97  32.333333 21.548395     3 12.440972   7.9490287 56.717638
10   October        82  27.333333 13.051181     3  7.535103  12.5645314 42.102135
11  November        75  25.000000 19.000000     3 10.969655   3.4994760 46.500524
12  December       102  34.000000  4.582576     3  2.645751  28.8143274 39.185673

最新更新