r - Sample rows from a grouped data frame conditional on a group-level summary statistic, using dplyr

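In this post about sampling a proportion of rows subject to a minimum row count, I wrote a function (see below) that takes a data frame with a group identifier, splits the data frame into a list by group, and then samples whichever is larger: the proportion or the minimum number of rows.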

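While this works, I'm wondering whether there is an efficient way to do it with summarise(), or otherwise without splitting the output of group_by() into a list and then iterating over the list elements with a map/lapply-style function. The idea is to pass the data to group_by() and then to summarise(), where I would count the rows in each group and then use an if_else approach to sample either the proportion or the minimum number accordingly. However, I found that this leads to all sorts of scoping problems or type conflicts. For example, cur_group or cur_data seem useful for counting and subsetting within the same summarise call, but I'm not sure how to use them properly.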

Does anyone know how to do this within summarise(), or how to otherwise avoid using split() and processing the data outside of summarise()?

library(dplyr)

# Example data: 10 rows in group a, 100 in group b
df <- data.frame(x = 1:110,
                 y = rnorm(110),
                 group = c(rep("a", 10), rep("b", 100)))

# Proportion and minimum number of rows to sample
sample_prop <- 0.5
sample_min <- 8

# Group the data and split each group into a list of tibbles
df_list <- df %>% group_by(group) %>% group_split()

# Checks if the number of rows that would be sampled is below the minimum. If so,
# sample the minimum number of rows, otherwise sample the proportion. This is
# what I'm trying to do within a summarise call.
conditional_sample <- function(dat, sample_min, sample_prop) {
  if (nrow(dat) * sample_prop < sample_min) {
    slice_sample(dat, n = sample_min)
  } else {
    slice_sample(dat, prop = sample_prop)
  }
}

# Apply the function to our list -- ideally this would be unnecessary
# within summarise
sampled <- df_list %>%
  lapply(., function(x) {
    conditional_sample(x, sample_min, sample_prop)
  })

bind_rows(sampled) # check out data

A simple approach is to use the max() of sample_min and sample_prop * n() as the sample size:

With slice():

library(dplyr)

sample_prop <- 0.5
sample_min <- 8

df %>%
  group_by(group) %>%
  slice(sample(n(), max(sample_min, floor(sample_prop * n())))) %>%
  ungroup()
# A tibble: 58 × 3
       x      y group
   <int>  <dbl> <chr>
 1     1  1.01  a
 2     3 -0.389 a
 3     4  0.559 a
 4     5 -0.594 a
 5     7 -0.415 a
 6     8 -1.63  a
 7     9 -2.27  a
 8    10 -0.422 a
 9    11  0.673 b
10    12 -1.23  b
# … with 48 more rows
# ℹ Use `print(n = ...)` to see more rows
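
To see where the 58 rows come from, the per-group sample size can be computed on its own. This is only a sketch for checking the arithmetic; it rebuilds the question's example data, and the names sizes, n_rows and n_sampled are made up for illustration:

```r
library(dplyr)

# Same example data as in the question: 10 rows in group a, 100 in group b
df <- data.frame(x = 1:110,
                 y = rnorm(110),
                 group = c(rep("a", 10), rep("b", 100)))
sample_prop <- 0.5
sample_min <- 8

# Rows per group and the sample size that max() selects for each group
sizes <- df %>%
  group_by(group) %>%
  summarise(n_rows = n(),
            n_sampled = max(sample_min, floor(sample_prop * n())))

sizes
sum(sizes$n_sampled)
```

Group a has 10 rows, so floor(0.5 * 10) = 5 falls below the minimum and 8 rows are drawn. Group b has 100 rows, so 50 are drawn, giving 8 + 50 = 58 rows in total.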

Or, equivalently, with filter():

df %>%
  group_by(group) %>%
  filter(row_number() %in% sample(n(), max(sample_min, floor(sample_prop * n())))) %>%
  ungroup()
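
Two possible variations, offered as a sketch rather than as part of the approach above. First, sample() raises an error when asked for more values than exist, so if a group could have fewer than sample_min rows, capping the size with min(n(), ...) seems a reasonable guard. Second, if you would rather keep the conditional_sample() helper from the question, group_modify() can apply it to each group without an explicit split(). The names capped and modified are just illustrative:

```r
library(dplyr)

# Same example data and settings as in the question
df <- data.frame(x = 1:110,
                 y = rnorm(110),
                 group = c(rep("a", 10), rep("b", 100)))
sample_prop <- 0.5
sample_min <- 8

# Variation 1: cap the sample size at the group size, so a group with fewer
# than sample_min rows is returned whole instead of causing an error
capped <- df %>%
  group_by(group) %>%
  slice(sample(n(), min(n(), max(sample_min, floor(sample_prop * n()))))) %>%
  ungroup()

# Variation 2: reuse the helper from the question and let group_modify()
# call it once per group (.x is that group's rows without the grouping column)
conditional_sample <- function(dat, sample_min, sample_prop) {
  if (nrow(dat) * sample_prop < sample_min) {
    slice_sample(dat, n = sample_min)
  } else {
    slice_sample(dat, prop = sample_prop)
  }
}

modified <- df %>%
  group_by(group) %>%
  group_modify(~ conditional_sample(.x, sample_min, sample_prop)) %>%
  ungroup()

# Both should give 8 rows for group a and 50 for group b
count(capped, group)
count(modified, group)
```

On the example data both variations return the same group sizes as the slice() version; they only differ in how they would treat a group smaller than sample_min and in whether the sampling logic lives in a helper function.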
