I have data like this:
mydata=structure(list(id = c(15010124001, 15010153006, 15010169005,
15010228019, 15010229028, 6010001012, 6010012023, 6010014015,
6010015008, 6020001014, 6020002037), sqr = c("14", "9", "2",
"21", "13", "26", "17,2", "21,7", "4,7", "32,2", "36,1"), por = c("alpin",
"alpin", "alpin", "alpin", "alpin", "Yornik birch", "Yornik birch",
"Yornik birch", "Yornik birch", "Yornik birch", "Yornik birch"
), zap = c("2100", "1100", "1700", "1000", "1300", "200", "197,6744186",
"170,5069124", "212,7659574", "301,242236", "398,8919668"), zappor = c("1260",
"330", "850", "1000", "910", "200", "197,6744186", "170,5069124",
"212,7659574", "301,242236", "398,8919668"), zapvyd = c(2940L,
990L, 340L, 2100L, 1690L, 520L, 340L, 370L, 100L, 970L, 1440L
), coef = c(6L, 3L, 5L, 10L, 7L, 10L, 10L, 10L, 10L, 10L, 10L
), age = c(130L, 100L, 130L, 150L, 120L, 15L, 15L, 10L, 15L,
20L, 20L), vys = c(21L, 17L, 19L, 17L, 18L, 2L, 2L, 1L, 2L, 2L,
2L), diam = c(26L, 18L, 24L, 28L, 22L, 2L, 2L, 2L, 2L, 2L, 2L
), polnot = c("0,6", "0,4", "0,6", "0,4", "0,5", "0,7", "0,8",
"0,7", "0,7", "0,5", "0,6"), BON = c(4L, 4L, 4L, 5L, 4L, 4L,
4L, 4L, 5L, 4L, 4L), clust = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L,
2L, 2L, 2L)), class = "data.frame", row.names = c(NA, -11L))
I need to sum sqr for each cluster (clust) within each por (a categorical variable). Of course, I can do
ag <- aggregate(sqr ~ clust + por, data = mydata, sum)
but it is not that simple, because I then need to compute a percentage from sqr for each cluster. For example, when I do the sums by hand:
por           clust  sum
alpin         1      25 (14+9+2)
alpin         2      34 (21+13)
Yornik birch  1      43,2
Yornik birch  2      94,7
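But I need a more complex aggregation, and I don't know how to do it. For each cluster within a given category of por, I need to compute the number of observations as a percentage of that cluster's total sqr. For example, take the first cluster of por = alpin: sqr = 25 and the number of observations in cluster 1 is 3, so
3/25 = 0.12 (12%)
As an output table: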
por    clust  sum          percent
alpin  1      25 (14+9+2)  12
After that, I need to compute a new variable. First, sum sqr over all por categories and all clusters:
14
9
2
21
13
26
17,2
21,7
4,7
32,2
36,1
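sum 196,9
Then divide the number of observations in each cluster of each por by this sum. For example, for the first cluster of the alpin category: 3 (obs. in the first cluster) / 196.9 = 0.0152 (1.5%). The final table should look like this (shown for the first cluster of por = alpin); this is the desired output:
por    clust  sum  percent  percent1
alpin  1      25   12       1.5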
How can I perform such a transformation?
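I think it gets easier if you break the problem into steps and code each step with dplyr.
- sqr needs to be converted to a numeric value
- the calculations need to be done by group
- compute the first percentage within each group
- to compute the overall sum, we need to ungroup
- compute the second percentage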
library(dplyr)

mydata %>%
  mutate(sqr = as.numeric(gsub(",", ".", sqr))) %>% # sqr is a string with decimal commas, so convert it to numeric
  group_by(por, clust) %>%                          # group by the variables of interest
  mutate(
    pct = length(sqr) / sum(sqr), # first percentage (as a proportion): group observations / group sum of sqr
    pct2 = length(id)             # numerator of the second percentage: group observations
  ) %>%
  ungroup() %>%                   # drop the grouping so that sum(sqr) below covers all rows
  mutate(pct2 = pct2 / sum(sqr))  # second percentage (as a proportion): group observations / overall sum of sqr
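If you would rather get one row per por/clust combination, as in the desired output, one possible variant is to replace the grouped mutate with summarise. This is only a sketch under a few assumptions of mine: the data frame dat holds just the three columns the calculation needs (copied from the question's data), the column names sum_sqr, n_obs and total_sqr are my own, and the results are scaled by 100 so they read as percentages. It also assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)

# Only the columns used in the calculation, copied from the question's data.
dat <- data.frame(
  sqr   = c("14", "9", "2", "21", "13", "26", "17,2", "21,7", "4,7", "32,2", "36,1"),
  por   = c(rep("alpin", 5), rep("Yornik birch", 6)),
  clust = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L)
)

result <- dat %>%
  # Replace the decimal comma, then convert the string to numeric.
  mutate(sqr = as.numeric(sub(",", ".", sqr, fixed = TRUE))) %>%
  # Overall sum of sqr, computed before grouping so it covers every row.
  mutate(total_sqr = sum(sqr)) %>%
  group_by(por, clust) %>%
  summarise(
    sum_sqr  = sum(sqr),                        # group sum of sqr
    n_obs    = n(),                             # number of observations in the group
    percent  = 100 * n_obs / sum_sqr,           # observations relative to the group sum
    percent1 = 100 * n_obs / first(total_sqr),  # observations relative to the overall sum
    .groups  = "drop"                           # return an ungrouped result
  )

print(result)
```

For the first alpin cluster this gives sum_sqr = 25, percent = 12 and percent1 = 100 * 3 / 196.9, about 1.5, where 196.9 is the total of the eleven sqr values.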