Julia - 超高性能矩阵计算 - >高效的内存管理 - Julia - Very high-performance matrix computation -> efficient memory management 小贝子编程网

我有一个关于Julia中矩阵乘法的计算问题。事实上，我今天就开始学习茱莉亚了。以前我经常使用Python、numpy等。

然而，我有一个6x6矩阵a，我想将其与我已经拥有的初始向量x₁相乘，然后将另一个向量a₁添加到乘积中。然后，我使用它的输出向量x₂作为同一操作的输入，该操作具有相同的矩阵A，但不同的向量A₂等等

即：

迭代：x₁
迭代：x₂=Ax₁+A₁
迭代x₃=Ax₂+A₂
迭代

所涉及的对象非常小，但问题是，我想做很多次这种操作：大约是~10¹¹到~10<sup+13>的次数。此外，我需要非常高的数值精度，远远高于Float64给出的精度。

我能够写出一种直截了当的方法，它做到了我想要的，而且精确到了我想要。在良好/成熟的python实践中，我预定义了我的数组，并在执行过程中用计算值填充它们。然而，我有一种感觉，我被低效的内存管理所束缚，正如分配的数量可能显示的那样。

每隔一段时间(比如每1000次迭代)，我想将当前迭代的向量x_I保存到另一个数组m，这是函数的输出和我的结果。

我认为将向量x的值替换为第I次迭代Ax+A_I计算的结果是有效的。显然，这个就地操作可以由LinearAlgebra包的函数mul!来处理，它使用BLAS来执行mul!(Y, A, B) -> Y。我也研究了StaticArrays包，但我不确定这是否会有所帮助。

我没有看到任何计算速度的加快或内存分配的减少。我能以某种方式优化我的代码吗？注意：为了简单起见，在本例中，我在每次迭代中使用相同的向量a。


using CSV
using DataFrames
using Arblib
using LinearAlgebra
setprecision(350) # this is required for the high accuracy
function compute() # global function wrapper, I thought this would be better
dff = rand(BigFloat,(6,6)); # I would usually import a specific external matrix
df2 = Matrix(dff)
vecci = Array{BigFloat}([0,0,0,0,1e-21,1e-23]); # this will be my offset vector a
x1 = Array{BigFloat}([0,0,0,0,1e-21,1e-23]); # this will be my input vector x0

function fold(mat, x, a) # here it goes
numevals = Int(1e6); #here I would put 1e11

numsaves = 10^(Int(log10(numevals))-3) #takes three orders of magnitude less than numevals, this is the number of iterations I want to save in my array
m = zeros(BigFloat,6, Int(numevals/numsaves)); # my output array where I store the x_i

lastone = zeros(BigFloat,6);
newone = zeros(BigFloat,6);
m[1:6] .= x; # first iteration
lastone .= x; # used as input for the computation
for k in 1:(numevals-2)
mul!(newone, mat, lastone); #replaces current newone with the result of matrix multiplication
newone .+= a; #adds the offset vector
lastone .= newone; # prepares for next iteration

if mod(k,numsaves)==0 # every 1e3 iterations it stores in output array
m[1+6*Int(k/numsaves):6+6*Int(k/numsaves)] .= newone
end

end
return m
end
@time out2 = fold(df2, x1, vecci)
end
compute()

输出例如

12.948946 seconds (204.00 M allocations: 11.399 GiB, 13.52% gc time)
6×1000 Matrix{BigFloat}:
0.0      7.03814e+465  1.93614e+953  5.3262e+1440   1.4652e+1928   4.03066e+2415  1.10881e+2903  …  4.9002e+484980   1.34801e+485468  3.70828e+485955  1.02012e+486443  2.80629e+486930
0.0      6.45989e+465  1.77707e+953  4.8886e+1440   1.34482e+1928  3.6995e+2415   1.01771e+2903     4.4976e+484980   1.23726e+485468  3.40361e+485955  9.3631e+486442   2.57572e+486930
0.0      8.17893e+465  2.24997e+953  6.1895e+1440   1.70269e+1928  4.68398e+2415  1.28853e+2903     5.69445e+484980  1.5665e+485468   4.30935e+485955  1.18547e+486443  3.26115e+486930
0.0      6.71808e+465  1.8481e+953   5.08399e+1440  1.39857e+1928  3.84737e+2415  1.05838e+2903     4.67736e+484980  1.28671e+485468  3.53965e+485955  9.73733e+486442  2.67867e+486930
1.0e-21  7.55747e+465  2.07901e+953  5.7192e+1440   1.57331e+1928  4.32808e+2415  1.19062e+2903     5.26177e+484980  1.44748e+485468  3.98191e+485955  1.0954e+486443   3.01336e+486930
1.0e-23  8.44558e+465  2.32332e+953  6.39129e+1440  1.7582e+1928   4.83669e+2415  1.33054e+2903  …  5.88011e+484980  1.61758e+485468  4.44984e+485955  1.22412e+486443  3.36747e+486930

编辑：

多亏了这些评论，我成功地用类型转换和MultiFloat包而不是BigFloat更新了代码。这使速度提高了3倍左右，但也损失了很多精度。但是分配的数量和所需的内存减少了很多，这已经是一个很大的改进了！

using CSV
using DataFrames
using MultiFloats
using LinearAlgebra
using Arblib
setprecision(350) # this is required for the high accuracy OF THE IMPORT ONLY
function compute() # global function wrapper, I thought this would be better
dff = rand(Float64x5,(6,6)); # I would usually import a specific external matrix
df1 = Array{Float64x5}(dff)
df2 = Matrix(df1)

vecci = Array{Float64x5}([0,0,0,0,1e-21,1e-23]); # this will be my offset vector a
x1 = Array{Float64x5}([0,0,0,0,1e-21,1e-23]); # this will be my input vector x0

function fold(mat::Matrix{MultiFloat{Float64, 5}}, x::Vector{MultiFloat{Float64, 5}}, a::Vector{MultiFloat{Float64, 5}}) # here it goes
numevals = Int(1e6); #here I would put 1e11

numsaves = 10^(Int(log10(numevals))-3) #takes three orders of magnitude less than numevals, this is the number of iterations I want to save in my array
m = zeros(Float64x5,6, Int(numevals/numsaves)); # my output array where I store the x_i

lastone = zeros(Float64x5,6);
newone = zeros(Float64x5,6);
m[1:6] .= x; # first iteration
lastone .= x; # used as input for the computation
for k in 1:(numevals-2)
newone .= mat*lastone + a;
lastone .= newone; # prepares for next iteration

if mod(k,numsaves)==0 # every 1e3 iterations it stores in output array
#print(newone)
m[1+6*Int(k/numsaves):6+6*Int(k/numsaves)] .= newone
end

end
return m
end
@time out2 = fold(df2, x1, vecci)
end
compute()

输出现在是

2.898294 seconds (2.00 M allocations: 580.139 MiB, 4.80% gc time)

编辑2:

经过几次尝试和与同事的讨论，我能够通过使用静态数组大大减少分配的数量。我还注意到了类型转换和预分配。然而，与第二个代码示例相比，这并没有导致速度的提高。相反，代码慢了25%。然而，由于内存消耗要低得多，我会坚持使用它。

此外，对于我的用例，我可能会尝试进行收敛研究，以在使用高精度浮点时找到精度和速度的最佳折衷方案。对于常规浮点运算，单个矩阵乘法和向量加法大约需要140ns，这可能与本例中的速度一样快。只有当精度高出几倍(比如Float64x5)时，操作大约需要4µs，这太疯狂了。

我强烈推荐使用TimerOutputs包来查找代码中的瓶颈。

您的代码只返回了一堆NaN，但如果我减少保存之间的间隔，我可以判断它们在返回NaN之前返回了相同的值。

最重要的是，我认为只要使用MultiFloats，性能就不会有太大的提高。以下代码显著减少了分配，但仅略微提高了速度。我尝试了StaticArrays，它让它运行得更慢，我尝试了Octavian.jl/LoopVectorization.jsl，它在MultiFloats中不太好用。在我的笔记本电脑上，叠加两个Float64x5大约需要32ns，而一个6x6矩阵向量乘积大约需要6^2倍的时间。显然没有并行加速，没有simd效应，也没有可以利用的多线程。基本上，你需要一个更高精度的浮动，这是模拟友好，我不知道任何。

但我对你的代码做了一些改进。评论：

dff是一个矩阵。我不知道你为什么要创建df1和df2，它们看起来是多余的
Float64x5(1e-23)是不准确的，因为在转换之前存在精度损失。Float64x5("1e-23")避免了这个问题
我使fold更加通用，现在它将接受不同的数组类型和元素。(另外，我不喜欢嵌套函数，所以我把它移走了。)
使用整数算术而不是将浮点转换为Int，例如div(a, b)而不是Int(a/b)。并写入10^6而不是Int(1e6)
当索引到矩阵m时，使用矩阵索引而不是笨拙的线性索引
使用mul!来避免分配
线路lastone, newone = newone, lastone以零成本切换两个矢量，无需复制数据

。。。

function compute()
dff = rand(Float64x5, 6, 6)
vecci = Float64x5.(["0", "0", "0", "0", "1e-21", "1e-23"])
x1 = copy(vecci) # this will be my input vector x0
return fold(dff, x1, vecci)
end
function fold(mat::AbstractMatrix{T}, x::AbstractVector{T}, a::AbstractVector{T}) where {T<:Number}
numevals = 10^6
saveinterval = 1000
numsaves = div(numevals, saveinterval)
m = zeros(T, length(a), numsaves)

lastone = copy(x)
newone = zero(x)
m[:, 1] .= x

for k in 1:(numevals-2)
# >95% of the time is spent on the next two lines, though they have zero allocations.
mul!(newone, mat, lastone)
newone .+= a
(d, r) = divrem(k, saveinterval)
if r == 0
m[:, d+1] .= newone
end
lastone, newone = newone, lastone # prepares for next iteration
end
return m
end

Julia - 超高性能矩阵计算 - >高效的内存管理

相关内容

最新更新

热门标签：