R - What is the correct way to use dplyr's slice_sample() inside my apply function?



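In the code below, I simulate dice rolls at increasing sample sizes and compute the mean roll for each sample size. My lapply call works, but I'm uneasy about it because I know sample_n() has been superseded in dplyr by slice_sample(). I'd like to improve the code with a current dplyr solution instead of calling sample_n() inside lapply. I suspect I may have other syntax errors in the apply call as well. Here is the code: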

library(dplyr)

# Dice
dice <- c(1,2,3,4,5,6) # the set of possible outcomes of a die roll
dice_probs <- c(1/6,1/6,1/6,1/6,1/6,1/6) # the probability of each outcome per roll
dice_df <- data.frame(dice,dice_probs)
# Simulate dice rolls for each of these sample sizes and record the average of the rolls
sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) # compute at each sample size

output <- lapply(X=sample_sizes, FUN = function(var){
  obs = sample_n(dice_df,var,replace=TRUE)
  sample_mean = mean(obs$dice)
  new.df <- data.frame(sample_mean, var)
  return(new.df)
})

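The last step is to compute the difference from the expected value, 3.5. I want a column holding the difference between 3.5 and the sample mean. We should see the difference shrink as the sample size grows.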

output <- output %>%
mutate(difference = across(sample_mean, ~3.5 - .x))

When I run this, it throws the following error:

Error in UseMethod("mutate") : 
no applicable method for 'mutate' applied to an object of class "list"

I also tried sapply, but I get a similar error: no applicable method for 'mutate' applied to an object of class "c('matrix', 'array', 'list')"


In case it helps, here is my failed attempt at using slice_sample:

output <- lapply(X=sample_sizes, FUN = function(...){ 
obs = slice_sample(dice_df, ..., .preserve=TRUE) 
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, ...)
return(new.df)
})

I got this error: Error: '...' used in an incorrect context

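The output is just a list of single-row data.frame elements. We can combine them with bind_rows and do the subtraction once, instead of doing it separately for each element: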

library(dplyr)
bind_rows(output) %>% 
mutate(difference = 3.5 - sample_mean )
  sample_mean       var  difference
1    3.500000        10  0.00000000
2    2.800000        25  0.70000000
3    3.440000        50  0.06000000
4    3.510000       100 -0.01000000
5    3.495000      1000  0.00500000
6    3.502200     10000 -0.00220000
7    3.502410    100000 -0.00241000
8    3.498094   1000000  0.00190600
9    3.500183 100000000 -0.00018332
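To see why the original mutate() call failed and why bind_rows() fixes it, it can help to inspect the class of each object. The following is a minimal sketch, not the asker's exact code: `small_sizes` and the seed are assumptions chosen only so that it runs quickly.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only for reproducibility (assumption)
dice_df <- data.frame(dice = 1:6)
small_sizes <- c(10, 25, 50)  # hypothetical sizes, kept small for speed

# lapply() always returns a list, here a list of one-row data frames
output <- lapply(small_sizes, function(var) {
  obs <- slice_sample(dice_df, n = var, replace = TRUE)
  data.frame(sample_mean = mean(obs$dice), var = var)
})

class(output)                  # "list": mutate() has no method for this
combined <- bind_rows(output)  # stack the one-row data frames
class(combined)                # "data.frame": mutate() works here

result <- mutate(combined, difference = 3.5 - sample_mean)
```

Because mutate() is vectorised over columns, a plain `3.5 - sample_mean` is enough once the rows are in one data frame; across() is not needed for a single column.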

The n argument of slice_sample corresponds to the size argument of sample_n.
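As a small illustration of that correspondence, the two calls below request the same number of rows. This is only a sketch: the value 20 is an arbitrary choice, and the draws themselves are random, so only the row counts are compared.

```r
library(dplyr)

dice_df <- data.frame(dice = 1:6)

# Superseded interface: the sample size is passed as `size`
old_style <- sample_n(dice_df, size = 20, replace = TRUE)

# Current interface: the sample size must be passed by name as `n`
new_style <- slice_sample(dice_df, n = 20, replace = TRUE)

nrow(old_style) == nrow(new_style)
```

Note that in slice_sample() the size has to be named (`n = ...`), which is why passing it positionally or through `...` does not work the way it did with sample_n().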

To compute the difference for each element of the output list, we can use purrr::map instead of dplyr::across:

library(dplyr)
library(purrr)
set.seed(123)
# Dice
dice <- c(1,2,3,4,5,6) # the set of possible outcomes of a die roll
dice_probs <- c(1/6,1/6,1/6,1/6,1/6,1/6) # the probability of each outcome per roll
dice_df <- data.frame(dice,dice_probs)
# Simulate dice rolls for each of these sample sizes and record the average of the rolls
sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) # compute at each sample size
output <- lapply(X=sample_sizes, FUN = function(var){
  obs = slice_sample(dice_df,n  = var,replace=TRUE)
  sample_mean = mean(obs$dice)
  new.df <- data.frame(sample_mean, var)
  return(new.df)
})
output %>%
  map(~ 3.5 - .x$sample_mean)
#> [[1]]
#> [1] -0.5
#> 
#> [[2]]
#> [1] 0.42
#> 
#> [[3]]
#> [1] -0.04
#> 
#> [[4]]
#> [1] -0.34
#> 
#> [[5]]
#> [1] 0.025
#> 
#> [[6]]
#> [1] 0.0317
#> 
#> [[7]]
#> [1] 0.00416
#> 
#> [[8]]
#> [1] -2.6e-05
#> 
#> [[9]]
#> [1] -4.405e-05

Created on 2021-08-02 by the reprex package (v0.3.0)

Alternatively, we can use purrr::map_df and add a diff column to each tibble, as Martin Gal suggested in the comments:

output %>%
map_df(~ tibble(.x, diff = 3.5 - .x$sample_mean))
#> # A tibble: 9 x 3
#>   sample_mean       var       diff
#>         <dbl>     <dbl>      <dbl>
#> 1        2.6         10  0.9      
#> 2        3.28        25  0.220    
#> 3        3.66        50 -0.160    
#> 4        3.5        100  0        
#> 5        3.53      1000 -0.0270   
#> 6        3.50     10000 -0.00180  
#> 7        3.50    100000 -0.00444  
#> 8        3.50   1000000 -0.000226 
#> 9        3.50 100000000 -0.0000669
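In purrr 1.0.0 and later, map_df() is marked as superseded in favour of map() followed by list_rbind(). A sketch of the same idea in that style is shown below; it assumes purrr >= 1.0.0 is installed, and it rebuilds a small stand-in `output` list (with hypothetical sample sizes) so that it runs on its own.

```r
library(dplyr)
library(purrr)  # list_rbind() assumes purrr >= 1.0.0

set.seed(123)  # arbitrary seed (assumption)

# Small stand-in for the `output` list from above: one-row data frames
output <- lapply(c(10, 100), function(var) {
  rolls <- sample(1:6, size = var, replace = TRUE)
  data.frame(sample_mean = mean(rolls), var = var)
})

result <- output %>%
  map(~ mutate(.x, diff = 3.5 - sample_mean)) %>%  # add the column per element
  list_rbind()                                      # then row-bind into one data frame
```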

Here is a base R approach:

transform(do.call(rbind, output), difference = 3.5 - sample_mean)
#  sample_mean       var difference
#1        3.80        10  -0.300000
#2        3.44        25   0.060000
#3        3.78        50  -0.280000
#4        3.30       100   0.200000
#5        3.52      1000  -0.015000
#6        3.50     10000  -0.004200
#7        3.50    100000  -0.004370
#8        3.50   1000000   0.002696
#9        3.50 100000000   0.000356

If you only need the difference values, you can do:

3.5 - sapply(output, `[[`, 'sample_mean')
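If the intermediate data frames are not needed at all, the whole simulation can be written in base R by sampling the die faces directly. This is a sketch under assumptions rather than a replacement for the answers above: the seed and the shorter `sample_sizes` vector are chosen here only to keep the run fast.

```r
set.seed(42)  # arbitrary seed (assumption)
sample_sizes <- c(10, 100, 1000, 10000)  # shortened, hypothetical set of sizes

# Draw n fair die rolls and return their mean, one value per sample size
sample_means <- vapply(
  sample_sizes,
  function(n) mean(sample(1:6, size = n, replace = TRUE)),
  numeric(1)
)

result <- data.frame(
  var = sample_sizes,
  sample_mean = sample_means,
  difference = 3.5 - sample_means
)
```

Sampling a plain integer vector avoids materialising a data frame with one row per roll, which matters for the largest sample sizes in the question.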
