How can I improve the performance of a Linux-based Docker Desktop container running an R script on Windows 10?



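I would like to get the same performance in Docker that I get in RStudio. I have Docker Desktop installed on Windows 10 and am using Linux containers. The goal is to containerize R scripts for routine use. An R script, dtbenchmark.R (adapted from Matt Dowle's data.table benchmark script), that reproduces the problem I am having is: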

library(data.table)
K <- 100L
rows <- c(1e7L, 1:7*1e8L)
for (i in seq_along(rows)) {
  tme <- proc.time()
  N <- rows[i]
  set.seed(1)
  DT <- data.table(
    id1 = sample(sprintf("id%03d",1:K), N, TRUE),       # large groups (char)
    id2 = sample(sprintf("id%03d",1:K), N, TRUE),       # large groups (char)
    id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE),  # small groups (char)
    id4 = sample(K, N, TRUE),                           # large groups (int)
    id5 = sample(K, N, TRUE),                           # large groups (int)
    id6 = sample(N/K, N, TRUE),                         # small groups (int)
    v1 =  sample(5, N, TRUE),                           # int in range [1,5]
    v2 =  sample(5, N, TRUE),                           # int in range [1,5]
    v3 =  sample(round(runif(100,max=100),4), N, TRUE)) # numeric e.g. 23.5749
  GB <- round(sum(gc()[,2])/1024, 3)                    # memory in use as reported by gc(), in GB
  rt <- round(proc.time() - tme, 2)                     # rt[3] is elapsed wall-clock time
  print(paste0('i = ', i, ' N = ', N, ' K = ', K, ' GB = ', GB, ' seconds = ', rt[3]), quote = FALSE)
  rm(N, DT, GB, rt)
}

Dockerfile

FROM rocker/r-ver:3.4.3
RUN Rscript -e "install.packages('https://cran.r-project.org/src/contrib/Archive/data.table/data.table_1.12.0.tar.gz', repos = NULL, type = 'source')"
COPY . /root
WORKDIR /root
CMD ["Rscript", "dtbenchmark.R"]

In RStudio, the script dtbenchmark.R completes five iterations of the loop and then exits with an error message:

[1] i = 1 N = 10000000 K = 100 GB = 0.532 seconds = 2.64
[1] i = 2 N = 100000000 K = 100 GB = 4.954 seconds = 44.58
[1] i = 3 N = 200000000 K = 100 GB = 9.868 seconds = 170.53
[1] i = 4 N = 300000000 K = 100 GB = 14.778 seconds = 426.42
[1] i = 5 N = 400000000 K = 100 GB = 19.688 seconds = 1013.77
Error: cannot allocate vector of size 3.7 Gb

With the Dockerfile and dtbenchmark.R in the same folder, the docker command to build the image, run from that folder in Windows PowerShell, is

docker build -t dtbenchmark .

The docker command to run the container in Windows PowerShell is then

docker run --rm dtbenchmark:latest

In PowerShell, the container completes only three iterations of the loop and then exits without any message:

[1] i = 1 N = 10000000 K = 100 GB = 0.515 seconds = 2.08
[1] i = 2 N = 100000000 K = 100 GB = 4.937 seconds = 41.3
[1] i = 3 N = 200000000 K = 100 GB = 9.851 seconds = 91.81

My laptop runs 64-bit Windows 10 Enterprise and has 48 GB of RAM. I cannot run as administrator.
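
Whether the container actually sees all 48 GB can be checked from inside R. The following is only a minimal sketch, assuming a Linux environment where /proc/meminfo is readable; the helper name mem_total_gb is hypothetical and not part of the original script. Under Docker Desktop this file reflects the memory of the Linux VM that hosts the containers, not a per-container limit.

```r
# Hypothetical helper (not from the original script): total memory visible
# to this R process, in GB. Assumes Linux exposes /proc/meminfo.
mem_total_gb <- function(lines = readLines("/proc/meminfo")) {
  # The relevant line looks like "MemTotal:       16384000 kB"
  total_line <- grep("^MemTotal:", lines, value = TRUE)
  kb <- as.numeric(gsub("[^0-9]", "", total_line))
  round(kb / 1024^2, 3)  # kB -> GB
}

# Only report when the file exists, so the snippet is harmless on Windows.
if (file.exists("/proc/meminfo")) {
  print(paste0("Memory visible to R: ", mem_total_gb(), " GB"), quote = FALSE)
}
```

Placed at the top of dtbenchmark.R, this would print the figure once before the loop starts, so it can be compared with the GB column in the output above.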

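I'm completely unfamiliar with this process, but from a PowerShell perspective, when I need a job to finish quickly I always run foreach loops and processes in parallel. By default, PowerShell processes 5 loop iterations in parallel at a time, but you can try raising that number.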

Maybe:

# Note: foreach -parallel is only valid inside a Windows PowerShell workflow block
foreach -parallel -throttlelimit 5 ($container in $containers) {
    # do something
}
