优化/并行化r-处理大型数据集，计算R中的SPI

我正在处理一个大型的全局降水数据集（具有非常精细的空间分辨率），以使用R。

中的SPEI软件包来计算标准化的降水指数

我一般我的问题是指使用非常大的数据对数据处理的优化。我在其他帖子中发现了一些讨论（在这里，在这里和此处fo实例），但似乎没有什么类似的。

我的输入是一个矩阵，对降水时间序列的每月观测超过20年（> 20*12行），> 1,000,000点（列）。SPI的计算对每个时间序列都采取一系列步骤，并将指数计算为与中位数的标准偏差。输出是一个列表，其结果矩阵（$ fit）具有相同的输入矩阵。

这里的代码示例：

require(SPEI)
#generating a random values matrix 
data<-replicate(200, rnorm(240))
# erasing negative values
data[data<=0]=0
spi6 <- spi(data, 6, kernel = list(type = 'rectangular', shift = 0), distribution = 'PearsonIII', fit = 'ub-pwm', na.rm = FALSE, ref.start=NULL, ref.end=NULL, x=FALSE, params=NULL)
#testing the results
plot(spi6$fitted[,67])
#taking my results out
results <- t(spi6$fitted)

此脚本效果很好，但是如果我增加了点数（在这种情况下为列），则处理时间呈指数增加。直到达到记忆短缺问题：

Warning messages:
1: In std[ff, s] <- qnorm(cdfpe3(acu.pred[ff], p3par)) :
  Reached total allocation of 16253Mb: see help(memory.size)
2: In std[ff, s] <- qnorm(cdfpe3(acu.pred[ff], p3par)) :
  Reached total allocation of 16253Mb: see help(memory.size)
3: In NextMethod("[<-") :
  Reached total allocation of 16253Mb: see help(memory.size)
4: In NextMethod("[<-") :
  Reached total allocation of 16253Mb: see help(memory.size)
5: In std[ff, s] <- qnorm(pze + (1 - pze) * pnorm(std[ff, s])) :
  Reached total allocation of 16253Mb: see help(memory.size)
6: In std[ff, s] <- qnorm(pze + (1 - pze) * pnorm(std[ff, s])) :
  Reached total allocation of 16253Mb: see help(memory.size)
7: In NextMethod("[<-") :
  Reached total allocation of 16253Mb: see help(memory.size)
8: In NextMethod("[<-") :
  Reached total allocation of 16253Mb: see help(memory.size)

如何将输入矩阵（或将过程分配在输入矩阵上）以平行列向量的过程组（每个过程都是特定点的全职序列），而不会丢失信息（或弄乱信息）我的数据周围）？谢谢。

我发现一个非常线性的，而不是指数，处理时间：

system.time(spi(data[,1], 6))
system.time(spi(data[,1:10], 6))
system.time(spi(data[,1:100], 6))

如果您看到指数增长，则可能是由RAM分配过多的问题引起的。

一种简单的解决方案是将计算分开在矩阵上：

spi6 <- data*NA
system.time(
  for (i in 1:100) spi6[,i] <- spi(data[,i], 6)$fitted
)

或以类似的效率：

system.time(
  spi6 <- apply(data[,1:100], 2, function(x) spi(x, 6)$fitted)
)

您可以看到，计算时间与您提供整个矩阵作为spi()函数的输入的原始选项非常相似。但，如果您遇到内存问题，这可能会解决它们。

另一方面，如果您可以访问多核计算机（如今很可能是这种情况），则可以通过平行计算来看到计算时间的改进：

library(snowfall)
sfInit(parallel=TRUE, cpus=2)
sfLibrary(SPEI)
system.time(
  spi6 <- sfApply(data[,1:100], 2, function(x) spi(x, 6)$fitted)
)
sfStop()

，您可能需要将更高的值设置为ncpu，以获得更高的速度增益，具体取决于计算机支持多少线程。但是， sfApply不会使用非常大的数据集解决您的内存问题。之所以如此，是因为该函数将数据集分配在分配的CPU数量之间。由于系统的总内存没有变化，因此您的内存将用完。

解决方案是合并这两种方法：拆分数据集然后并行化。这样的东西：

data <- replicate(10000, rnorm(240))
sfInit(parallel=TRUE, cpus=10)
sfLibrary(SPEI)
spi6 <- data*NA
for (i in 0:9) {
    chunk <- (i*1000)+(1:1000)
    spi6[,chunk] <- sfApply(data[,chunk], 2, function(x) spi(x, 6)$fitted)
} 
sfStop()

现在，您只需要找到块的最大尺寸，即您传递给sfApply的数据原始数量，以便不溢出可用的RAM。尝试和错误很容易。

相关内容

最新更新

热门标签：