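In the code below, I simulate dice rolls at increasing sample sizes and compute the mean roll at each sample size. My lapply call works, but I am uncomfortable with it because sample_n() has been superseded by slice_sample() in dplyr. I would like a more idiomatic dplyr solution instead of calling sample_n() inside lapply. I suspect I have other syntax errors in the lapply call as well. Here is the code: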
library(dplyr)
#Dice
dice <- c(1,2,3,4,5,6) #the set of possible outcomes of a die roll
dice_probs <- c(1/6,1/6,1/6,1/6,1/6,1/6) #the probability of each option per roll
dice_df <- data.frame(dice,dice_probs)
#Simulate dice rolls for each of these sample sizes and record the average of the rolls
sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) #compute at each sample size
output <- lapply(X=sample_sizes, FUN = function(var){
obs = sample_n(dice_df,var,replace=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, var)
return(new.df)
})
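The last step is to compute the difference from the expected value, 3.5. I want a column holding the difference between 3.5 and the sample mean. We should see the difference tend to shrink as the sample size grows.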
output <- output %>%
mutate(difference = across(sample_mean, ~3.5 - .x))
When I run this, it throws the following error:
Error in UseMethod("mutate") :
no applicable method for 'mutate' applied to an object of class "list"
I also tried sapply, but I get a similar error: no applicable method for 'mutate' applied to an object of class "c('matrix', 'array', 'list')"
In case it helps, here is my failed attempt with slice_sample:
output <- lapply(X=sample_sizes, FUN = function(...){
obs = slice_sample(dice_df, ..., .preserve=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, ...)
return(new.df)
})
I got this error: Error: '...' used in an incorrect context
The output is just a list of single-row data.frame elements. We can combine them with bind_rows and do the subtraction once, rather than once per element.
library(dplyr)
bind_rows(output) %>%
mutate(difference = 3.5 - sample_mean )
sample_mean var difference
1 3.500000 10 0.00000000
2 2.800000 25 0.70000000
3 3.440000 50 0.06000000
4 3.510000 100 -0.01000000
5 3.495000 1000 0.00500000
6 3.502200 10000 -0.00220000
7 3.502410 100000 -0.00241000
8 3.498094 1000000 0.00190600
9 3.500183 100000000 -0.00018332
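The shrinking difference column matches what the standard error of the mean predicts: for a fair die the variance of a single roll is 35/12, so the typical size of the gap between the sample mean and 3.5 scales as 1/sqrt(n). The following is a minimal base R sketch of that calculation; the names sigma and std_error are introduced here for illustration, and the sample sizes are a shortened version of the ones in the question.

```r
# Possible outcomes of one roll of a fair die
dice <- 1:6

# A shortened set of the sample sizes used above (assumption: trimmed for brevity)
sample_sizes <- c(10, 25, 50, 100, 1000, 10000)

# Population standard deviation of a single roll: sqrt(35/12)
sigma <- sqrt(mean((dice - mean(dice))^2))

# Standard error of the sample mean at each sample size
se_table <- data.frame(var = sample_sizes,
                       std_error = sigma / sqrt(sample_sizes))
se_table
```

Each simulated difference is a random draw, so individual rows can fall above or below these values, but their magnitude should follow the same downward trend.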
The n argument of slice_sample corresponds to the size argument of sample_n.
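As a small illustration of that correspondence, the sketch below draws the same number of rows through both interfaces. It assumes a dplyr version that ships slice_sample() (1.0.0 or later); old_way and new_way are names made up for this example. Note that slice_sample() expects the sample size to be passed by name as n.

```r
library(dplyr)

dice_df <- data.frame(dice = 1:6)

# Superseded interface: the sample size is the `size` argument
old_way <- sample_n(dice_df, size = 25, replace = TRUE)

# Current interface: the sample size is passed by name as `n`
new_way <- slice_sample(dice_df, n = 25, replace = TRUE)

# Both calls return a data frame with 25 resampled rows
nrow(old_way)
nrow(new_way)
```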
To compute the differences for the output list, we can use purrr::map instead of dplyr::across.
library(dplyr)
library(purrr)
set.seed(123)
#Dice
dice <- c(1,2,3,4,5,6) #the set of possible outcomes of a die roll
dice_probs <- c(1/6,1/6,1/6,1/6,1/6,1/6) #the probability of each option per roll
dice_df <- data.frame(dice,dice_probs)
#Simulate dice rolls for each of these sample sizes and record the average of the rolls
sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) #compute at each sample size
output <- lapply(X=sample_sizes, FUN = function(var){
obs = slice_sample(dice_df,n = var,replace=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, var)
return(new.df)
})
output %>%
map(~ 3.5 - .x$sample_mean)
#> [[1]]
#> [1] -0.5
#>
#> [[2]]
#> [1] 0.42
#>
#> [[3]]
#> [1] -0.04
#>
#> [[4]]
#> [1] -0.34
#>
#> [[5]]
#> [1] 0.025
#>
#> [[6]]
#> [1] 0.0317
#>
#> [[7]]
#> [1] 0.00416
#>
#> [[8]]
#> [1] -2.6e-05
#>
#> [[9]]
#> [1] -4.405e-05
Created on 2021-08-02 by the reprex package (v0.3.0)
Alternatively, we can use purrr::map_df and add a diff column to each tibble, as Martin Gal suggested in the comments:
output %>%
map_df(~ tibble(.x, diff = 3.5 - .x$sample_mean))
#> # A tibble: 9 x 3
#> sample_mean var diff
#> <dbl> <dbl> <dbl>
#> 1 2.6 10 0.9
#> 2 3.28 25 0.220
#> 3 3.66 50 -0.160
#> 4 3.5 100 0
#> 5 3.53 1000 -0.0270
#> 6 3.50 10000 -0.00180
#> 7 3.50 100000 -0.00444
#> 8 3.50 1000000 -0.000226
#> 9 3.50 100000000 -0.0000669
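In purrr 1.0.0 and later, map_df() is marked as superseded in favor of map() followed by list_rbind(). The sketch below shows the same idea in that style, assuming such a purrr version is installed; the two-element output list is stand-in data invented for the example, not the simulation result.

```r
library(purrr)
library(tibble)

# Stand-in for the simulation output: a list of single-row data frames
# (hypothetical values, chosen only to keep the example small)
output <- list(
  data.frame(sample_mean = 3.2, var = 10),
  data.frame(sample_mean = 3.6, var = 25)
)

# Add a diff column to each element, then row-bind the results
result <- list_rbind(
  map(output, function(df) tibble(df, diff = 3.5 - df$sample_mean))
)
result
```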
Here is a base R approach:
transform(do.call(rbind, output), difference = 3.5 - sample_mean)
# sample_mean var difference
#1 3.80 10 -0.300000
#2 3.44 25 0.060000
#3 3.78 50 -0.280000
#4 3.30 100 0.200000
#5 3.52 1000 -0.015000
#6 3.50 10000 -0.004200
#7 3.50 100000 -0.004370
#8 3.50 1000000 0.002696
#9 3.50 100000000 0.000356
If you only need the difference values, you can do:
3.5 - sapply(output, `[[`, 'sample_mean')
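A possible variation on the line above is vapply(), which checks that every element yields exactly one numeric value instead of letting sapply() choose the return type. This is only a sketch: it rebuilds a small output list with base R's sample() so that it runs on its own, and the shortened sample sizes are an assumption made for brevity.

```r
set.seed(123)

dice <- 1:6
sample_sizes <- c(10, 25, 50, 100, 1000)  # shortened for illustration

# Rebuild a small `output` list: one single-row data frame per sample size
output <- lapply(sample_sizes, function(var) {
  data.frame(sample_mean = mean(sample(dice, var, replace = TRUE)), var)
})

# vapply() fails loudly if any element is not a single numeric value
difference <- 3.5 - vapply(output, function(df) df$sample_mean, numeric(1))
difference
```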