R - Group by 2 variables and set the max value as a new column?


I have the following data:

df <- data.frame("group1" = c("A","B","B","C","D","D","C","E","E","A","B","B","C","D","D","C","E","E"),
"group2" = c("X","Y","Z","Z","W","F","Z","N","M","D","F","U","T","R","R","S","S","O"),
"val" = c(232,200,3321,400,600,500,22,33,1200,555,200,888,43,600,500,800,900,3213))

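I would like to add two new columns:

  • Column 1 shows the group2 value that has the highest total "val" within each group1 value
  • Column 2 does the same, but shows that maximum as a share of the group1 total

So, first, I group the data: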

library(dplyr)

df %>%
  group_by(group1, group2) %>%
  summarise(totalval = sum(val)) %>%
  arrange(group1, desc(totalval))
# A tibble: 16 x 3
# Groups:   group1 [5]
group1 group2 totalval
<fct>  <fct>     <dbl>
1 A      D           555
2 A      X           232
3 B      Z          3321
4 B      U           888
5 B      F           200
6 B      Y           200
7 C      S           800
8 C      Z           422
9 C      T            43
10 D      R          1100
11 D      W           600
12 D      F           500
13 E      O          3213
14 E      M          1200
15 E      S           900
16 E      N            33

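So for every row where group1 is "A", I want column 1 to be "D", because "D" is the group2 value with the highest total within group1 "A". For those same rows, column 2 would show 555/(555 + 232) = 0.70.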
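As a quick check of that arithmetic, here is a minimal base-R sketch for group "A" only. The small data frame `a` holds just the two group1 == "A" rows from the example data, and the names `a` and `totals` are purely illustrative.

```r
# The two rows of the example data where group1 == "A"
a <- data.frame(group2 = c("X", "D"), val = c(232, 555))

# Total val per group2 level within group "A"
totals <- tapply(a$val, a$group2, sum)

# Column 1: the group2 level with the largest total
names(totals)[which.max(totals)]   # "D"

# Column 2: the largest total as a share of the overall total
max(totals) / sum(totals)          # 555 / 787, about 0.705
```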

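I found a way to get column 1 by building a temporary table of the highest values and joining it back to the main table, but it feels rather convoluted, and I'm sure there is a cleaner way. I'm also not sure how to add the percentage (column 2 as described above).

My solution so far: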

# add in overall val per group1 to use for percentages
df <- df %>% group_by(group1) %>% mutate("g1_total_val" = sum(val)) %>% ungroup()
# create temp table with the top group2 value for each group1
t2 <- df %>% group_by(group1,group2) %>% summarise("totalval" = sum(val)) %>% arrange(group1, desc(totalval)) %>% 
slice(1:1) %>% mutate("highest_g2" = group2) %>% select(group1, highest_g2)
# join back to the main table (dplyr joins take `by`, not `on`)
df <- df %>% left_join(t2, by = "group1")

Any help on how to get column 2, and a simpler way to add column 1, would be great.

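For the first column, you can use which.max to get the index of the maximum and use it to pick the matching group2 value; for the second column, divide max by sum, like so: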

library(tidyverse)
df %>%
group_by(group1, group2) %>%
summarise(totalval = sum(val)) %>%
arrange(group1, desc(totalval)) %>% 
mutate(col1 = group2[which.max(totalval)],
col2 = max(totalval) / sum(totalval))

This gives:

group1 group2 totalval col1   col2
<fct>  <fct>     <dbl> <fct> <dbl>
1 A      D           555 D     0.705
2 A      X           232 D     0.705
3 B      Z          3321 Z     0.721
4 B      U           888 Z     0.721
5 B      F           200 Z     0.721
6 B      Y           200 Z     0.721
7 C      S           800 S     0.632
8 C      Z           422 S     0.632
9 C      T            43 S     0.632
10 D      R          1100 R     0.5  
11 D      W           600 R     0.5  
12 D      F           500 R     0.5  
13 E      O          3213 O     0.601
14 E      M          1200 O     0.601
15 E      S           900 O     0.601
16 E      N            33 O     0.601
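One detail worth knowing about which.max is how it handles ties: it returns the position of the first maximum only. The sketch below uses made-up values (not taken from the question's data) in which two entries tie for the largest total.

```r
# Hypothetical totals with a tie for the maximum (illustrative values only)
totalval <- c(200, 900, 900)
group2   <- c("F", "S", "O")

which.max(totalval)           # 2, the position of the first maximum
group2[which.max(totalval)]   # "S"; the tied "O" is not returned
```

Because the pipeline sorts by desc(totalval) before calling which.max, a tie would be resolved in favour of whichever tied row arrange happens to place first.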

Edit: if you want to keep the original number of rows, replace summarise with mutate, like so:

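df %>%
group_by(group1, group2) %>%
mutate(totalval = sum(val)) %>%
group_by(group1) %>% 
arrange(group1, desc(totalval)) %>% 
# totalval is repeated on every row of a (group1, group2) pair, so divide by
# sum(val) rather than sum(totalval) to avoid counting those pairs twice
mutate(col1 = group2[which.max(totalval)],
col2 = max(totalval) / sum(val))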

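Note that summarise automatically "peels off" the last grouping variable (here group2), whereas mutate does not, so I regroup by group1 manually.
This gives the original 18 rows, with col1 and col2 added (plus the helper column totalval).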
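The difference in grouping behaviour can be checked directly with group_vars(). This is a minimal sketch that rebuilds the question's data; the names `by_both`, `after_summarise` and `after_mutate` are illustrative, and the `.groups` argument assumes dplyr 1.0.0 or later (it only makes the default behaviour explicit and silences the message).

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  group1 = c("A","B","B","C","D","D","C","E","E","A","B","B","C","D","D","C","E","E"),
  group2 = c("X","Y","Z","Z","W","F","Z","N","M","D","F","U","T","R","R","S","S","O"),
  val    = c(232,200,3321,400,600,500,22,33,1200,555,200,888,43,600,500,800,900,3213)
)

by_both <- df %>% group_by(group1, group2)

# summarise() drops the last grouping variable, leaving only group1
after_summarise <- by_both %>%
  summarise(totalval = sum(val), .groups = "drop_last")
group_vars(after_summarise)   # "group1"

# mutate() keeps both grouping variables and all 18 rows
after_mutate <- by_both %>% mutate(totalval = sum(val))
group_vars(after_mutate)      # "group1" "group2"
```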