R: nesting tibbles and using group_by to run a calculation on each nested tibble



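I want to know the dominant class for each id_b. To calculate it, I need the sum of size per class for each id_b. Whichever class has the largest total becomes the new dominant class assigned to that id_b.

The script below does what I want, but it feels clunky and overcomplicated. I have rarely worked with nested data before, so I am not sure I am using the best approach. Can anyone suggest a tidier way to get the same output in tidyverse or data.table?

Thanks!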

library(tidyverse)
# sample data
set.seed(123)
input <- tibble(id_a = c(letters[seq(1, 10)]),
                size = runif(10, min = 10, max = 50),
                class = c("x", "x", "y", "x", "y",
                          "y", "x", "y", "x", "x"),
                id_b = c("A1", "A1", "B1", "B1", "B1",
                         "C1", "C1", "C1", "D1", "E1"))
print(input)
id_a   size class id_b 
<chr> <dbl> <chr> <chr>
1 a      23.6 x     A1   
2 b      43.6 x     A1   
3 c      23.9 y     B1   
4 d      23.4 x     B1   
5 e      29.1 y     B1   
6 f      45.7 y     C1   
7 g      44.6 x     C1   
8 h      25.6 y     C1   
9 i      41.1 x     D1   
10 j      48.4 x     E1 
# nest input to create a nested tibble for each id_b
input_nest <- input %>% group_by(id_b) %>% nest()
# calculate dominant class
input_nest_dominant <- input_nest %>%
  mutate(DOMINANT_CLASS = lapply(data, function(x) {
    # group each nested tibble by class, and calculate total size. Then find the biggest size and extract
    # the class value
    output <- x %>%
      group_by(class) %>%
      summarise(total_size = sum(size)) %>%
      top_n(total_size, n = 1) %>%
      pull(class)
    return(output)
  }))
# unnest to end up with a tibble
input_nest_dominant_clean <- input_nest_dominant %>%
  unnest(cols = c(DOMINANT_CLASS)) %>%
  select(-data) %>%
  ungroup()

print(input_nest_dominant_clean)
id_b  DOMINANT_CLASS
<chr> <chr>         
1 A1    x             
2 B1    y             
3 C1    y             
4 D1    x             
5 E1    x 

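In this example you don't need nest at all; group_by and summarize are enough. First total size per id_b and class, then pick the class with the largest total within each id_b.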


input %>%
  group_by(id_b, class) %>%
  summarize(size = sum(size)) %>%
  group_by(id_b) %>%
  summarize(DOMINANT_CLASS = class[which.max(size)])
#> # A tibble: 5 x 2
#>   id_b  DOMINANT_CLASS
#>   <chr> <chr>         
#> 1 A1    x             
#> 2 B1    y             
#> 3 C1    y             
#> 4 D1    x             
#> 5 E1    x
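
The question also asks about data.table, so here is a minimal sketch of the same two-step grouping in that package. It assumes the sample data from the question; the names `totals` and `dominant` are illustrative and do not come from the original posts.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
set.seed(123)
input <- data.table(
  id_a  = letters[1:10],
  size  = runif(10, min = 10, max = 50),
  class = c("x", "x", "y", "x", "y", "y", "x", "y", "x", "x"),
  id_b  = c("A1", "A1", "B1", "B1", "B1", "C1", "C1", "C1", "D1", "E1")
)

# Step 1: total size per (id_b, class)
totals <- input[, .(total_size = sum(size)), by = .(id_b, class)]

# Step 2: within each id_b, keep the class with the largest total
# (which.max returns the first maximum, so ties go to the first class listed)
dominant <- totals[, .(DOMINANT_CLASS = class[which.max(total_size)]), by = id_b]

print(dominant)
```

As with the dplyr version, ties are broken silently by `which.max`; if ties matter, they would need explicit handling.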

Here is a base R solution that uses aggregate twice, i.e.

agg <- aggregate(size ~ class + id_b, input, FUN = sum)
output <- aggregate(agg[-2], agg[2], FUN = max)[-3]

or, more compactly,

output <- aggregate(. ~ id_b,
                    aggregate(size ~ class + id_b,
                              input,
                              FUN = function(v) sum(v)),
                    FUN = function(v) tail(sort(v), 1))[-3]

which gives

> output
id_b class
1   A1     x
2   B1     y
3   C1     y
4   D1     x
5   E1     x
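
One caveat about the two-aggregate approach: the second aggregate call summarises the class and size columns separately, so the class it returns is the alphabetically last one present in each id_b, not necessarily the one with the largest total. On the sample data the two happen to coincide. The sketch below is one way to keep class and total tied together in base R; the `toy` data frame is hypothetical and chosen so that the alphabetically last class (y) is not the dominant one.

```r
# Hypothetical toy data (not from the question): x totals 60, y totals 10
toy <- data.frame(
  id_b  = c("A1", "A1", "A1"),
  class = c("x", "x", "y"),
  size  = c(30, 30, 10)
)

# Total size per (class, id_b), as in the answer above
agg <- aggregate(size ~ class + id_b, toy, FUN = sum)

# Sort so the largest total comes first within each id_b,
# then keep the first row per id_b so class and size stay paired
agg <- agg[order(agg$id_b, -agg$size), ]
output <- agg[!duplicated(agg$id_b), c("id_b", "class")]
rownames(output) <- NULL

print(output)
```

Applying max (or tail(sort(v), 1)) column by column to the same `agg` would report class y for A1, because "y" sorts after "x" regardless of the totals.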

You only need to sort once and then drop all the duplicates. Something like:

input %>% arrange(desc(size)) %>% filter(!duplicated(id_b)) %>% arrange(id_b)
# A tibble: 5 x 4
id_a   size class id_b 
<chr> <dbl> <chr> <chr>
1 b      41.5 x     A1   
2 e      47.6 y     B1   
3 h      45.7 y     C1   
4 i      32.1 x     D1   
5 j      28.3 x     E1  

If the order of id_b doesn't matter, you can omit the final arrange.

Or in base R:

input = input[order(-input$size),]
input[!duplicated(input$id_b),]
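
Note that sorting the raw rows keeps the single largest row per id_b, which matches the question's definition (largest summed size per class) only when the class of that row also has the largest total. A minimal sketch of the difference, and of a sum-then-sort variant, follows. The `toy` data and the names `largest_row` and `dominant` are hypothetical, introduced only for illustration.

```r
library(dplyr)

# Hypothetical toy data (not from the question):
# class x has two medium rows (total 60), class y has one large row (total 40)
toy <- tibble(
  id_b  = c("A1", "A1", "A1"),
  class = c("x", "x", "y"),
  size  = c(30, 30, 40)
)

# Sort-and-deduplicate on the raw rows: keeps the single largest row (class y)
largest_row <- toy %>%
  arrange(desc(size)) %>%
  filter(!duplicated(id_b))

# Sum per (id_b, class) first, then sort once and deduplicate:
# keeps the class with the largest total (class x)
dominant <- toy %>%
  count(id_b, class, wt = size, name = "total_size") %>%
  arrange(desc(total_size)) %>%
  distinct(id_b, .keep_all = TRUE) %>%
  arrange(id_b)

print(largest_row)
print(dominant)
```

On the question's sample data both variants give the same classes, so the distinction only matters for data where a class wins on total without owning the largest single row.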
