I have the following data:
df <- data.frame("group1" = c("A","B","B","C","D","D","C","E","E","A","B","B","C","D","D","C","E","E"),
"group2" = c("X","Y","Z","Z","W","F","Z","N","M","D","F","U","T","R","R","S","S","O"),
"val" = c(232,200,3321,400,600,500,22,33,1200,555,200,888,43,600,500,800,900,3213))
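I want to add two new columns:
- Column 1 should show the group2 value that has the highest total "val" within each group1 value.
- Column 2 should do the same, but show that maximum as a share of the group1 total.
So, first I group the data: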
df %>% group_by(group1,group2) %>% summarise("totalval" = sum(val)) %>% arrange(group1, desc(totalval))
# A tibble: 16 x 3
# Groups: group1 [5]
group1 group2 totalval
<fct> <fct> <dbl>
1 A D 555
2 A X 232
3 B Z 3321
4 B U 888
5 B F 200
6 B Y 200
7 C S 800
8 C Z 422
9 C T 43
10 D R 1100
11 D W 600
12 D F 500
13 E O 3213
14 E M 1200
15 E S 900
16 E N 33
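So for all rows where group1 is "A", I want column 1 to be "D", because "D" is the group2 value with the highest total in that group. Column 2 would show 555/(555 + 232) = 0.70 for all rows where group1 is "A".
I found a way to get column 1 by building a temporary table with the highest values and joining it back to the main table, but it feels rather convoluted, and I'm sure there is a cleaner way. I'm also not sure how to add the percentage (column 2 as described above).
My solution so far: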
#add in overall val to use for percentages
df <- df %>% group_by(group1) %>% mutate("g1_total_val" = sum(val)) %>% ungroup()
#create temp table with selected values
t2 <- df %>% group_by(group1,group2) %>% summarise("totalval" = sum(val)) %>% arrange(group1, desc(totalval)) %>%
slice(1:1) %>% mutate("highest_g2" = group2) %>% select(group1, highest_g2)
df <- df %>% left_join(t2, by = "group1")
Any help on how to get column 2, and a simpler way to add column 1, would be great.
You can use which.max to get the index of the maximum for the first column, and divide max by sum for the second column, like this:
library(tidyverse)
df %>%
group_by(group1, group2) %>%
summarise(totalval = sum(val)) %>%
arrange(group1, desc(totalval)) %>%
mutate(col1 = group2[which.max(totalval)],
col2 = max(totalval) / sum(totalval))
This gives:
group1 group2 totalval col1 col2
<fct> <fct> <dbl> <fct> <dbl>
1 A D 555 D 0.705
2 A X 232 D 0.705
3 B Z 3321 Z 0.721
4 B U 888 Z 0.721
5 B F 200 Z 0.721
6 B Y 200 Z 0.721
7 C S 800 S 0.632
8 C Z 422 S 0.632
9 C T 43 S 0.632
10 D R 1100 R 0.5
11 D W 600 R 0.5
12 D F 500 R 0.5
13 E O 3213 O 0.601
14 E M 1200 O 0.601
15 E S 900 O 0.601
16 E N 33 O 0.601
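To make the two building blocks explicit, here is a minimal base R sketch on a plain named vector. The first vector reuses the group B totals from the output above; the `tied` vector is a made-up example, added only to show that which.max returns the first position when several values share the maximum.

```r
# Totals for group B, taken from the output above (names are the group2 labels)
totalval <- c(Z = 3321, U = 888, F = 200, Y = 200)

idx <- which.max(totalval)                  # position of the first maximum
top_label <- names(totalval)[idx]           # label at that position
top_share <- max(totalval) / sum(totalval)  # maximum as a share of the total

# Hypothetical tie: which.max() keeps only the first of the tied maxima
tied <- c(R = 500, W = 500, F = 100)
first_of_tie <- names(tied)[which.max(tied)]
```

If ties at the top matter for your data, you would need a different rule than which.max, since it silently picks the first one.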
Edit: if you want to keep the original number of rows, replace summarise with mutate, as follows:
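df %>%
  group_by(group1, group2) %>%
  mutate(totalval = sum(val)) %>%   # per-(group1, group2) total, repeated on every row
  group_by(group1) %>%
  arrange(group1, desc(totalval)) %>%
  # divide by sum(val): totalval repeats on rows sharing a group2, so sum(totalval) would double-count
  mutate(col1 = group2[which.max(totalval)],
         col2 = max(totalval) / sum(val))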
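Note that summarise automatically "peels off" the last grouping variable, but mutate does not, so I regroup by group1 manually.
This returns the original 18 rows with col1 and col2 added (plus the helper totalval column).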
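Another way to keep all 18 rows, sketched here as an alternative rather than as the method above, is to compute one row per group1 and join it back to the data. Each (group1, group2) total appears only once in the summary, so the share is taken over distinct totals. This assumes dplyr 1.0 or later for the `.groups` argument; the names `top` and `result` are only illustrative.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  group1 = c("A","B","B","C","D","D","C","E","E","A","B","B","C","D","D","C","E","E"),
  group2 = c("X","Y","Z","Z","W","F","Z","N","M","D","F","U","T","R","R","S","S","O"),
  val = c(232,200,3321,400,600,500,22,33,1200,555,200,888,43,600,500,800,900,3213)
)

# One row per group1: the top group2 and its share of the group1 total
top <- df %>%
  group_by(group1, group2) %>%
  summarise(totalval = sum(val), .groups = "drop_last") %>%  # still grouped by group1
  summarise(col1 = group2[which.max(totalval)],
            col2 = max(totalval) / sum(totalval))

# Attach the two columns to the original 18 rows
result <- left_join(df, top, by = "group1")
```

This is close to the temporary-table approach in the question, just with both columns computed in the same summary step.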