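I'm fairly new to parallel R. Quick question: I have a compute-intensive algorithm. Fortunately, it can easily be broken into pieces that can make use of multicore or snow. What I'd like to know is whether it is considered fine in practice to use multicore in conjunction with snow.
What I want to do is split my load across multiple machines in a cluster and, on each machine, use all of its cores. For this type of processing, is it reasonable to mix snow with multicore?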
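To make the question concrete, here is a minimal sketch of the two-level layout being asked about, using the parallel package (which bundles both the snow-style and the multicore-style interfaces). The host list, `slow_square`, and the argument name `work_fn` are hypothetical placeholders; "localhost" is listed twice only so the sketch can run on a single computer.

```r
library(parallel)

# Hypothetical host list: one entry per machine. "localhost" twice stands in
# for two real machines so the sketch runs on one computer.
hosts <- c("localhost", "localhost")

# Hypothetical per-item work; replace with the real computation.
slow_square <- function(x) x^2

# Outer level (snow-style): one worker process per machine.
cl <- makeCluster(hosts)

# Split the items into one chunk per machine.
items <- 1:20
chunks <- lapply(splitIndices(length(items), length(hosts)),
                 function(idx) items[idx])

# Inner level (multicore-style): each worker forks across its own cores.
# mclapply relies on fork, so it is limited to one core on Windows.
per_machine <- parLapply(cl, chunks, function(chunk, work_fn) {
  cores <- parallel::detectCores()
  if (is.na(cores) || .Platform$OS.type == "windows") cores <- 1L
  parallel::mclapply(chunk, work_fn, mc.cores = cores)
}, work_fn = slow_square)

stopCluster(cl)

# Flatten the per-machine lists back into one vector.
result <- unlist(per_machine)
```

With real machines, `hosts` would hold their host names, and the workers would be started over ssh.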
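I have used the approach suggested above by lockedoff, that is, using the parallel package to distribute an embarrassingly parallel workload over multiple machines with multiple cores. The workload is first distributed over all machines, and then each machine's share is distributed over all of its cores. The disadvantage of this approach is that there is no load balancing between machines (at least I don't know how to achieve it).
All loaded R code should be identical and stored in the same location on all machines (svn). Because initializing a cluster takes quite some time, the code below can be improved by reusing the clusters once they are created.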
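On the load-balancing point, the parallel package does provide `clusterApplyLB` (and `parLapplyLB`), which hand the next task to whichever worker becomes free first. Whether that suits the two-level setup described here is an open question. The following is only a sketch, with two local workers standing in for two machines and made-up task costs:

```r
library(parallel)

# Two local workers stand in for two machines (assumption for this sketch).
cl <- makeCluster(2)

# Hypothetical tasks of uneven cost: each value is a sleep time in seconds.
task_cost <- c(0.3, 0.05, 0.05, 0.05, 0.05, 0.05)

run_task <- function(cost) {
  Sys.sleep(cost)   # stand-in for real work
  Sys.getpid()      # report which worker handled the task
}

# clusterApplyLB gives the next task to the first worker that becomes free,
# rather than fixing the assignment of tasks to workers in advance.
pids <- unlist(clusterApplyLB(cl, task_cost, run_task))

stopCluster(cl)

# Number of tasks each worker ended up handling.
table(pids)
```

The trade-off is more communication, since tasks are dispatched one at a time rather than in pre-split chunks.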
foo <- function(workload, otherArgumentsForFoo) {
    source("/home/user/workspace/mycode.R")
    ...
}
distributedFooOnCores <- function(workload) {
    # Somehow assign a batch number to every record
    workload$ParBatchNumber = NA
    # Split the assigned workload into batches according to ParBatchNumber
    batches = by(workload, workload$ParBatchNumber, function(x) x)
    # Create a cluster with one worker per core on this machine
    library("parallel")
    cluster = makeCluster(detectCores(), outfile="distributedFooOnCores.log")
    batches = parLapply(cluster, batches, foo, otherArgumentsForFoo)
    stopCluster(cluster)
    # Merge the resulting batches
    results = someEmptyDataframe
    p = 1
    for(i in seq_along(batches)){
        results[p:(p + nrow(batches[[i]]) - 1), ] = batches[[i]]
        p = p + nrow(batches[[i]])
    }
    # Clean up
    workload$ParBatchNumber = NULL
    return(invisible(results))
}
distributedFooOnMachines <- function(workload) {
    # Somehow assign a batch number to every record
    workload$DistrBatchNumber = NA
    # Split the assigned workload into batches according to DistrBatchNumber
    batches = by(workload, workload$DistrBatchNumber, function(x) x)
    # Create a cluster with workers on all machines
    library("parallel")
    # If makeCluster hangs, please make sure passwordless ssh is configured on all machines
    cluster = makeCluster(c("machine1", "etc"), master="ub2", user="", outfile="distributedFooOnMachines.log")
    batches = parLapply(cluster, batches, foo, otherArgumentsForFoo)
    stopCluster(cluster)
    # Merge the resulting batches
    results = someEmptyDataframe
    p = 1
    for(i in seq_along(batches)){
        results[p:(p + nrow(batches[[i]]) - 1), ] = batches[[i]]
        p = p + nrow(batches[[i]])
    }
    # Clean up
    workload$DistrBatchNumber = NULL
    return(invisible(results))
}
I'd be interested to hear how the approach above could be improved.
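One possible direction, sketched under assumptions: create the cluster once, pass it into the function that does the work, and stop it only after all calls are finished. The names `process_batch` and `run_distributed`, the round-robin batch assignment, and the `Value` column are hypothetical stand-ins, and two local workers replace the real machines. The sketch also merges the batches with a single `rbind` call instead of a loop.

```r
library(parallel)

# Hypothetical per-batch function standing in for foo().
process_batch <- function(batch) {
  batch$Result <- batch$Value * 2
  batch
}

# Takes an existing cluster instead of creating one on every call.
run_distributed <- function(cluster, workload, n_batches = length(cluster)) {
  # Assign batch numbers round-robin and split into a list of data frames.
  batch_id <- rep_len(seq_len(n_batches), nrow(workload))
  batches <- split(workload, batch_id)
  processed <- parLapply(cluster, batches, process_batch)
  # Bind all processed batches back together in one step.
  results <- do.call(rbind, unname(processed))
  rownames(results) <- NULL
  results
}

cluster <- makeCluster(2)   # created once

first  <- run_distributed(cluster, data.frame(Value = 1:6))
second <- run_distributed(cluster, data.frame(Value = 7:10))

stopCluster(cluster)        # stopped once, after all work is done
```

Note that the rows come back grouped by batch rather than in their original order, so an explicit row identifier would be needed if the order matters.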