More RAM-efficient t-tests and data wrangling in R


  • My reprex works fine, but on very large datasets the step that reshapes the wide data frame into a long one takes a lot of time and, in real life, pushes my RAM usage above 32 GB.

  • In addition, I run the t-test function twice.

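I am looking for a way to get the same output with better RAM/CPU efficiency. Any suggestions? I would prefer a tidyverse solution, but data.table is fine too if it helps reduce the load.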

library(tidyverse)
library(stringi)
library(broom)
set.seed(101)
## creating dataset:
## 15 mln users, split evenly across three variants: "off" (control), "a" and "b"
## two variables / measures
w <- tibble(
id=stri_rand_strings(15000000, 10),
variant=rep(c("off", "a", "b"), each=5000000),
variable_a=c(rnorm(n=5000000, mean = 2, sd=1),rnorm(n=5000000, mean = 3, sd=1), rnorm(n=5000000, mean = 3, sd=2)),
variable_b=c(rnorm(n=5000000, mean = 10, sd=2),rnorm(n=5000000, mean = 10, sd=2), rnorm(n=5000000, mean = 10, sd=5))
)
## creating the long data format
## costs RAM (+ 50 %) and time
## Q: is there a way to improve this?
w <- w%>%
gather(variable, values, 3:4)

## creating a t.test function that runs on long data format
p_values <- function(data, control="off", treatment="on"){
data%>% 
## grouping by variable allows to run t.test for each variable
group_by(variable)%>%
do(tidy(with(data = ., t.test(values[variant == control], values[variant == treatment]))))%>%
select(variable, p.value)%>%
mutate(p.value=round(p.value,3))%>%
mutate(variant = treatment)
}
## running the function
## Q: is there a way to improve this?
p_a <- p_values(w, control = "off", treatment = "a")
p_b <- p_values(w, control = "off", treatment = "b")
p <- rbind(p_a, p_b)

## displaying the results and adding the p values
w %>%
group_by(variant, variable)%>%
summarise(avg=mean(values, na.rm=TRUE))%>%
group_by(variable)%>%
mutate(lift=round((avg/avg[variant=="off"]-1)*100,3))%>%
left_join(p, by = c("variant", "variable"))%>%
pivot_wider(names_from = variant, values_from = c(avg, lift, p.value))%>%
select(-c(lift_off, p.value_off))%>%
relocate(variable, ends_with(c("off","a", "b")))
#> `summarise()` has grouped output by 'variant'. You can override using the
#> `.groups` argument.
#> # A tibble: 2 × 8
#> # Groups:   variable [2]
#>   variable   avg_off avg_a lift_a p.value_a avg_b lift_b p.value_b
#>   <chr>        <dbl> <dbl>  <dbl>     <dbl> <dbl>  <dbl>     <dbl>
#> 1 variable_a    2.00  3.00 50.1       0      3.00 50.2       0    
#> 2 variable_b   10.0  10.0  -0.024     0.053 10.0  -0.012     0.624

Created on 2022-08-31 by the reprex package (v2.0.1)

If the long format is the problem, then I would just work with your existing wide-format data.

Here, I have rewritten your p_values function so that it works on your initial data format, before the call to gather:

p_values <- function(data, control="off", treatment="on") {

## run one t.test per column whose name starts with "variable"
p.val <- sapply(grep("^variable", names(data), value=TRUE), function(var) {
t.test(data[[var]][data$variant==control],
data[[var]][data$variant==treatment])$p.value
})

tibble(variable=names(p.val), p.value=round(p.val, 3), variant=treatment)
}
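
The rest of the pipeline can probably stay wide as well. The following is only a sketch under a few assumptions: it uses a small stand-in for `w` (1,000 rows per variant instead of 5 million, without the `id` column), and the names `avg` and `result` are mine, not from the question. The idea is to compute the group means on the wide data and pivot only that tiny summary table to long, so a full-size long copy is never created. Both treatments are handled in a single `lapply()` call instead of two separate assignments.

```r
library(dplyr)
library(tidyr)

set.seed(101)
# Small stand-in for the wide data `w` from the question (before gather());
# n is an assumption chosen so the example runs quickly.
n <- 1000
w <- tibble(
  variant    = rep(c("off", "a", "b"), each = n),
  variable_a = c(rnorm(n, 2, 1), rnorm(n, 3, 1), rnorm(n, 3, 2)),
  variable_b = c(rnorm(n, 10, 2), rnorm(n, 10, 2), rnorm(n, 10, 5))
)

# Wide-format p_values(), repeated from above so the block is self-contained.
p_values <- function(data, control = "off", treatment = "on") {
  p.val <- sapply(grep("^variable", names(data), value = TRUE), function(var) {
    t.test(data[[var]][data$variant == control],
           data[[var]][data$variant == treatment])$p.value
  })
  tibble(variable = names(p.val), p.value = round(p.val, 3), variant = treatment)
}

# One call per treatment, collected into a single table.
p <- bind_rows(lapply(c("a", "b"), function(trt) p_values(w, treatment = trt)))

# Means are computed on the wide data; only this 3-row summary is made long.
avg <- w %>%
  group_by(variant) %>%
  summarise(across(starts_with("variable"), ~ mean(.x, na.rm = TRUE))) %>%
  pivot_longer(-variant, names_to = "variable", values_to = "avg")

# Same lift / join / reshape steps as in the question, on the small table.
result <- avg %>%
  group_by(variable) %>%
  mutate(lift = round((avg / avg[variant == "off"] - 1) * 100, 3)) %>%
  ungroup() %>%
  left_join(p, by = c("variant", "variable")) %>%
  pivot_wider(names_from = variant, values_from = c(avg, lift, p.value)) %>%
  select(-c(lift_off, p.value_off)) %>%
  relocate(variable, ends_with(c("off", "a", "b")))

result
```

On the real data you would skip the stand-in and run the same steps on your original `w`, leaving out the `gather()` call entirely. I have not benchmarked this at 15 million rows, so treat any memory saving as something to verify on your side.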
