r - Unexpectedly high memory usage in data.table::frollmean()



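I have a data table with 20M rows and 20 columns. I apply vectorized operations to it that return lists, which are themselves assigned by reference to additional columns in the data table.

Memory usage increases modestly and predictably throughout these operations, until I apply the (presumably efficient) frollmean() function with an adaptive window to a column containing lists of length 10. Running the smaller reprex below in R 4.1.2 on Windows 10 x64, with package data.table 1.14.2, memory usage spikes by about 17 GB while frollmean() executes and then drops back, as seen in the Windows Task Manager (Performance tab) and as measured in the Rprof memory profiling report.

I understand that frollmean() uses parallelism where possible, so I set setDTthreads(threads = 1L) to make sure the memory spike is not an artifact of copying the data table for additional cores.

My question: why does frollmean() use so much memory relative to the other operations, and can I avoid it?

Reprex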

library(data.table)
set.seed(1)
setDTthreads(threads = 1L)
obs   <- 10^3 # Number of rows in the data table
len   <- 10   # Length of each list to be stored in columns
width <- c(seq.int(2), rep(2, len - 2)) # Rolling mean window
# Generate representative data
DT <- data.table(
  V1 = sample(x =  1:10, size = obs, replace = TRUE),
  V2 = sample(x = 11:20, size = obs, replace = TRUE),
  V3 = sample(x = 21:30, size = obs, replace = TRUE)
)
# Apply representative vectorized operations, assigning by reference
DT[, V4 := Map(seq, from = V1, to = V2, length.out = len)] # This is a list
DT[, V5 := Map("*", V4, V3)] # This is a list
DT[, V6 := Map("*", V4, V5)] # This is a list
# Profile the memory usage
Rprof(memory.profiling = TRUE)
# Rolling mean
DT[, V7 := frollmean(x = V6, n = width, adaptive = TRUE)]
# Report the memory usage
Rprof(NULL)
summaryRprof(memory = "both")
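
As a minimal illustration of what the adaptive window in the reprex does, the sketch below applies the same `width` to a single plain numeric vector. The input `x` is an illustrative assumption and is not part of the reprex; the point is only that, with `adaptive = TRUE`, element i of the result averages the `width[i]` values ending at position i.

```r
library(data.table)

len   <- 10
width <- c(seq.int(2), rep(2, len - 2))  # 1 2 2 2 2 2 2 2 2 2, as in the reprex
x     <- as.numeric(1:len)               # illustrative input, not from the reprex

# With adaptive = TRUE, n must have the same length as x; element i of the
# result is the mean of the width[i] values of x that end at position i.
res <- frollmean(x, n = width, adaptive = TRUE)
print(res)
# 1.0 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
```

Because the first window has width 1, the first result equals the first input value and no leading NA is produced.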

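Consider avoiding lists embedded in columns. Recall that the data.frame and data.table classes are extensions of the list type, so typeof(DT) returns "list". Instead of running frollmean over nested lists, consider running it across atomic vector columns: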

obs   <- 10^3 # Number of rows in the data table
len   <- 10   # Length of each list to be stored in columns
width <- c(seq.int(2), rep(2, len - 2)) # Rolling mean window
# CALCULATE SEQ VECTOR (USING mapply, THE PARENT OF ITS WRAPPER Map)
set.seed(1)
V1 = sample(x =  1:10, size = obs, replace = TRUE)
V2 = sample(x = 11:20, size = obs, replace = TRUE)
V3 = sample(x = 21:30, size = obs, replace = TRUE)
seq_vec <- as.vector(mapply(seq, from = V1, to = V2, length.out = len))
# BUILD DATA.TABLE USING SEQ VECTOR FOR FLAT ATOMIC VECTOR COLUMNS
DT_ <- data.table(
  WIDTH = rep(width, obs),
  V1 = rep(V1, each = len),
  V2 = rep(V2, each = len),
  V3 = rep(V3, each = len),
  V4 = seq_vec
)[, V5 := V4 * V3][, V6 := V4 * V5]
DT_
       WIDTH V1 V2 V3       V4       V5       V6
    1:     1  9 20 29  9.00000 261.0000 2349.000
    2:     2  9 20 29 10.22222 296.4444 3030.321
    3:     2  9 20 29 11.44444 331.8889 3798.284
    4:     2  9 20 29 12.66667 367.3333 4652.889
    5:     2  9 20 29 13.88889 402.7778 5594.136
   ---
 9996:     2  5 16 26 11.11111 288.8889 3209.877
 9997:     2  5 16 26 12.33333 320.6667 3954.889
 9998:     2  5 16 26 13.55556 352.4444 4777.580
 9999:     2  5 16 26 14.77778 384.2222 5677.951
10000:     2  5 16 26 16.00000 416.0000 6656.000
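
To make the difference between the two layouts concrete, here is a small sketch on toy data. The table names `nested` and `flat` and their values are hypothetical, chosen only for illustration; the last step shows one possible way to go from a list column to a flat column.

```r
library(data.table)

# Nested layout: one row per observation, values held in a list column.
nested <- data.table(id = 1:2, vals = list(c(1, 2, 3), c(4, 5, 6)))

# Flat layout: one row per value, every column an atomic vector.
flat <- data.table(id = rep(1:2, each = 3), vals = c(1, 2, 3, 4, 5, 6))

typeof(nested)       # "list": a data.table is itself a list of columns
typeof(nested$vals)  # "list": each cell holds a separate R vector
typeof(flat$vals)    # "double": a single atomic vector

# One way to flatten an existing list column, one group per row id.
converted <- nested[, .(vals = unlist(vals)), by = id]
```

The flat layout costs extra rows for the repeated identifiers, but every column can then be processed as an ordinary vector.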

Then calculate frollmean grouped by V1 and V2:

DT_[, V7 := frollmean(x = V6, n = WIDTH, adaptive = TRUE),  by=.(V1, V2)]

The output should be equivalent to the nested list-valued columns:

identical(DT$V4[[1]], DT_$V4[1:len])
[1] TRUE
identical(DT$V5[[1]], DT_$V5[1:len])
[1] TRUE
identical(DT$V6[[1]], DT_$V6[1:len])
[1] TRUE
identical(DT$V7[[1]], DT_$V7[1:len])
[1] TRUE
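
The checks above compare only the first observation. A sketch of a check over every row follows, rebuilt on a smaller table so it runs on its own; `obs <- 50` and the name `all_match` are illustrative assumptions. It uses all.equal() rather than identical() on the assumption that grouped and per-list rolling means may differ by floating-point rounding in later rows.

```r
library(data.table)
set.seed(1)

obs   <- 50   # small illustrative size, not the size used elsewhere
len   <- 10
width <- c(seq.int(2), rep(2, len - 2))

V1 <- sample(x =  1:10, size = obs, replace = TRUE)
V2 <- sample(x = 11:20, size = obs, replace = TRUE)
V3 <- sample(x = 21:30, size = obs, replace = TRUE)

# List-column version
DT <- data.table(V1 = V1, V2 = V2, V3 = V3)
DT[, V4 := Map(seq, from = V1, to = V2, length.out = len)]
DT[, V5 := Map("*", V4, V3)]
DT[, V6 := Map("*", V4, V5)]
DT[, V7 := frollmean(x = V6, n = width, adaptive = TRUE)]

# Flat-vector version
DT_ <- data.table(
  WIDTH = rep(width, obs),
  V1 = rep(V1, each = len),
  V2 = rep(V2, each = len),
  V3 = rep(V3, each = len),
  V4 = as.vector(mapply(seq, from = V1, to = V2, length.out = len))
)[, V5 := V4 * V3][, V6 := V4 * V5]
DT_[, V7 := frollmean(x = V6, n = WIDTH, adaptive = TRUE), by = .(V1, V2)]

# Compare all rolling means, allowing for floating-point tolerance.
all_match <- isTRUE(all.equal(unlist(DT$V7), DT_$V7))
print(all_match)
```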

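With this layout, profiling shows fewer steps and far less memory than the list-column approach (mem.total for froll of 10.6 versus 1584.6). The runs below use obs <- 10^5.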

frollmean on nested list columns (using DT)

# Profile the memory usage
Rprof(memory.profiling = TRUE)
DT[, V7 := frollmean(x = V6, n = width, adaptive = TRUE)]
# Report the memory usage
Rprof(NULL)
summaryRprof(mem="both")
$by.self
               self.time self.pct total.time total.pct mem.total
"froll"             1.30    76.47       1.30     76.47    1584.6
"FUN"               0.14     8.24       0.30     17.65     161.3
"eval"              0.12     7.06       1.46     85.88    1670.9
"vapply"            0.10     5.88       0.40     23.53     181.3
"parent.frame"      0.04     2.35       0.04      2.35      24.8
$by.total
               total.time total.pct mem.total self.time self.pct
"[.data.table"       1.70    100.00    1765.9      0.00     0.00
"["                  1.70    100.00    1765.9      0.00     0.00
"eval"               1.46     85.88    1670.9      0.12     7.06
"froll"              1.30     76.47    1584.6      1.30    76.47
"frollmean"          1.30     76.47    1584.6      0.00     0.00
"vapply"             0.40     23.53     181.3      0.10     5.88
"%chin%"             0.40     23.53     181.3      0.00     0.00
"vapply_1c"          0.40     23.53     181.3      0.00     0.00
"which"              0.40     23.53     181.3      0.00     0.00
"FUN"                0.30     17.65     161.3      0.14     8.24
"parent.frame"       0.04      2.35      24.8      0.04     2.35
$sample.interval
[1] 0.02
$sampling.time
[1] 1.7

Grouped frollmean on atomic vector columns (using DT_)

# Profile the memory usage
Rprof(memory.profiling = TRUE)
DT_[, V7 := frollmean(x = V6, n = WIDTH, adaptive = TRUE),  by=.(V1, V2)]
# Report the memory usage
Rprof(NULL)
summaryRprof(mem="both")
$by.self
               self.time self.pct total.time total.pct mem.total
"[.data.table"      0.02    33.33       0.06    100.00      18.7
"forderv"           0.02    33.33       0.02     33.33       0.0
"froll"             0.02    33.33       0.02     33.33      10.6
$by.total
               total.time total.pct mem.total self.time self.pct
"[.data.table"       0.06    100.00      18.7      0.02    33.33
"["                  0.06    100.00      18.7      0.00     0.00
"forderv"            0.02     33.33       0.0      0.02    33.33
"froll"              0.02     33.33      10.6      0.02    33.33
"frollmean"          0.02     33.33      10.6      0.00     0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 0.06

(Interestingly, on my Linux laptop with 8 GB of RAM, with obs at 10^6, the list-column approach, but not the vector-column approach, raised Error: cannot allocate vector of size 15.3 Gb.)
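
As a rough back-of-the-envelope check on the two figures reported in this thread, the sketch below divides each by the number of list elements. It assumes that mem.total from summaryRprof is in megabytes of 2^20 bytes and that "Gb" in R's allocation error means 2^30 bytes; the variable names are illustrative. It shows only how the numbers scale and does not establish the cause.

```r
# Figures reported in this thread
mem_total_mb <- 1584.6  # froll mem.total at obs = 10^5 (assumed MB)
err_gb       <- 15.3    # failed allocation at obs = 10^6 (assumed GiB)

# Bytes per list element implied by each figure
per_elem_profile <- mem_total_mb * 1024^2 / 1e5
per_elem_error   <- err_gb * 1024^3 / 1e6

# Bytes actually needed for the values of one list element:
# 10 doubles of 8 bytes each
payload <- 10 * 8

round(c(profile = per_elem_profile, error = per_elem_error))
round(per_elem_profile / payload)  # overhead relative to the payload
```

Both figures come out at roughly 16 KB per list element, a couple of hundred times the 80 bytes of values each element holds. That is consistent with a fixed cost per list element rather than per value, which would explain why the flat layout, where each group is a single vector, avoids the spike.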
