R - multiple aggregation and variable calculation



I have data like this:

mydata=structure(list(id = c(15010124001, 15010153006, 15010169005, 
15010228019, 15010229028, 6010001012, 6010012023, 6010014015, 
6010015008, 6020001014, 6020002037), sqr = c("14", "9", "2", 
"21", "13", "26", "17,2", "21,7", "4,7", "32,2", "36,1"), por = c("alpin", 
"alpin", "alpin", "alpin", "alpin", "Yornik birch", "Yornik birch", 
"Yornik birch", "Yornik birch", "Yornik birch", "Yornik birch"
), zap = c("2100", "1100", "1700", "1000", "1300", "200", "197,6744186", 
"170,5069124", "212,7659574", "301,242236", "398,8919668"), zappor = c("1260", 
"330", "850", "1000", "910", "200", "197,6744186", "170,5069124", 
"212,7659574", "301,242236", "398,8919668"), zapvyd = c(2940L, 
990L, 340L, 2100L, 1690L, 520L, 340L, 370L, 100L, 970L, 1440L
), coef = c(6L, 3L, 5L, 10L, 7L, 10L, 10L, 10L, 10L, 10L, 10L
), age = c(130L, 100L, 130L, 150L, 120L, 15L, 15L, 10L, 15L, 
20L, 20L), vys = c(21L, 17L, 19L, 17L, 18L, 2L, 2L, 1L, 2L, 2L, 
2L), diam = c(26L, 18L, 24L, 28L, 22L, 2L, 2L, 2L, 2L, 2L, 2L
), polnot = c("0,6", "0,4", "0,6", "0,4", "0,5", "0,7", "0,8", 
"0,7", "0,7", "0,5", "0,6"), BON = c(4L, 4L, 4L, 5L, 4L, 4L, 
4L, 4L, 5L, 4L, 4L), clust = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 
2L, 2L, 2L)), class = "data.frame", row.names = c(NA, -11L))

I need to sum sqr for each cluster (clust) within each por (a categorical variable). Of course, I can do

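# sqr is stored as text with decimal commas, so convert it to numeric before summing
ag <- aggregate(sqr ~ clust + por, data = transform(mydata, sqr = as.numeric(sub(",", ".", sqr))), sum)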

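but it is not that simple, because I then need to calculate a percentage based on sqr for each cluster. For example, when I do the sums by hand: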

por            clust   sum
alpin          1       25 (14+9+2)
alpin          2       34 (21+13)
Yornik birch   1       43,2
Yornik birch   2       94,7

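But I need a more complex aggregation, and I don't know how to do it. For each cluster within a given por category, I need the number of observations divided by the cluster's total sqr, expressed as a percentage. For example, for the first cluster of por = alpin, the sum of sqr is 25 and the number of observations in cluster 1 is 3: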

3/25 = 0.12 (12%)

As an output table:

por     clust   sum           percent
alpin   1       25 (14+9+2)   12
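
The hand calculation above can be checked with a short base R sketch. The data frame `df` and the column names `sqr_sum`, `sqr_n` and `percent` are illustrative names chosen here, and the sqr values are retyped as numbers (decimal point instead of decimal comma) so the snippet runs on its own.

```r
# sqr values, por labels and cluster ids retyped from mydata (illustrative copy)
df <- data.frame(
  por   = rep(c("alpin", "Yornik birch"), times = c(5, 6)),
  clust = c(1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2),
  sqr   = c(14, 9, 2, 21, 13, 26, 17.2, 21.7, 4.7, 32.2, 36.1)
)

# Sum of sqr and number of observations for every por/clust combination
sums   <- aggregate(sqr ~ clust + por, data = df, FUN = sum)
counts <- aggregate(sqr ~ clust + por, data = df, FUN = length)

# Join the two summaries; the sqr columns become sqr_sum and sqr_n
out <- merge(sums, counts, by = c("clust", "por"), suffixes = c("_sum", "_n"))

# Observations divided by the group's sqr total, as a percentage
out$percent <- 100 * out$sqr_n / out$sqr_sum
print(out)
```

For the first cluster of alpin this reproduces 3 / 25, that is 12%.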

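After that, I need to calculate a new variable. First, sum sqr over all por categories and all clusters:

14
9
2
21
13
26
17,2
21,7
4,7
32,2
36,1
sum 196,9

Then divide the number of observations in each cluster of each por by this sum. For example, for the first cluster of the alpin category: 3 (observations in the first cluster) / 196.9 = 0.0152 (1.5%). The final table should look like this (shown for the first cluster of por = alpin); this is the desired output:

por     clust   sum   percent   percent1
alpin   1       25    12        1.5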

How can I perform such a transformation?

I think it becomes easier if you break the problem down into steps and code each step with dplyr.

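  1. Convert sqr to a numeric value
  2. Group by por and clust
  3. Perform the calculation (first percentage)
  4. Count the observations per group
  5. Ungroup, because the overall sum has to be taken over all rows
  6. Calculate the second percentage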

library(dplyr)

mydata %>%
  mutate(sqr = as.numeric(gsub(",", ".", sqr))) %>% # convert to numeric, as it is a string with a decimal comma
  group_by(por, clust) %>% # group by what you want
  mutate(
    pct = length(sqr) / sum(sqr), # first percentage (as a fraction): group count / group sum
    pct2 = length(id) # second percentage, incomplete for now: only the group count
  ) %>%
  ungroup() %>% # no need to have anything grouped now
  mutate(pct2 = pct2 / sum(sqr)) # finish the second percentage: group count / overall sum
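
The pipeline above keeps all 11 rows and returns fractions. If one row per por/clust is wanted, as in the desired output table, a variant with `summarise()` might look like the sketch below. It is an alternative, not the code from the answer: `df`, `result`, `n`, `sqr_sum`, `percent` and `percent1` are names assumed here, and the data is retyped from `mydata` so the snippet is self-contained.

```r
library(dplyr)

# sqr strings, por labels and cluster ids retyped from mydata (illustrative copy)
df <- data.frame(
  por   = rep(c("alpin", "Yornik birch"), times = c(5, 6)),
  clust = c(1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2),
  sqr   = c("14", "9", "2", "21", "13", "26", "17,2", "21,7", "4,7", "32,2", "36,1")
)

result <- df %>%
  # decimal comma -> decimal point, then numeric
  mutate(sqr = as.numeric(sub(",", ".", sqr, fixed = TRUE))) %>%
  group_by(por, clust) %>%
  # collapse to one row per por/clust: observation count and sqr total
  summarise(n = n(), sqr_sum = sum(sqr), .groups = "drop") %>%
  mutate(
    percent  = 100 * n / sqr_sum,      # count relative to the group's own sqr total
    percent1 = 100 * n / sum(sqr_sum)  # count relative to the sqr total over all groups
  )

print(result)
```

Because `.groups = "drop"` removes the grouping, `sum(sqr_sum)` in the last step runs over all four summary rows and therefore equals the overall sum of sqr. Multiplying by 100 gives percentages rather than fractions, so the first alpin cluster shows 12 in `percent`.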
