Is it possible to speed up this gpuArray calculation with arrayfun() (or otherwise)?

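I have a complex matrix A that I want to update Nt times according to A = exp( -1i*(A + abs(A).^2) ). A is typically 1000x1000, and the number of iterations is around 10000.

I want to reduce the time these operations take. For 1000 iterations on the CPU, I measure about 6.4 seconds. Following the Matlab documentation, I was able to move the calculation to the GPU, which cut the time to 0.07 seconds (an incredible x91 improvement!). So far so good.

However, I have now also read this link in the docs, which describes how we can sometimes get a further improvement for element-wise calculations by using arrayfun(). When I try to follow the tutorial, the time taken is actually worse, at 0.47 seconds. My test is shown below: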

Nt = 1000; % Number of times to run each method
test_functionFcn = @test_function;
A = rand( 500, 600, 'double' ) + rand( 500, 600, 'double' )*1i; % Define an initial complex matrix

gpu_A = gpuArray(A); % Transfer matrix to a GPU array
%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%%%%%%%%%
cpu_data_out = A;
tic
for k = 1:Nt
    cpu_data_out = test_function( cpu_data_out );
end
tcpu = toc;
%%%%%%%%%%%%%%%%% Run the calculation Nt times on GPU directly %%%%%%%%%%%%%%%%%%%%
gpu_data_out = gpu_A;
tic
for k = 1:Nt
    gpu_data_out = test_function(gpu_data_out);
end
tgpu = toc;
%%%%%%%%%%%%%% Run the calculation Nt times on GPU using arrayfun() %%%%%%%%%%%%%%
gpuarrayfun_data_out = gpu_A;
tic
for k = 1:Nt
    gpuarrayfun_data_out = arrayfun( test_functionFcn, gpuarrayfun_data_out );
end
tgpu_arrayfun = toc;
%%% Print results %%%
fprintf( 'Time taken using only CPU: %g\n', tcpu );
fprintf( 'Time taken using gpuArray directly: %g\n', tgpu );
fprintf( 'Time taken using GPU + arrayfun(): %g\n', tgpu_arrayfun );
%%% Function to operate on matrices %%%
function y = test_function(x)
    y = exp(-1i*(x + abs(x).^2));
end

and the results are:

Time taken using only CPU: 6.38785
Time taken using gpuArray directly: 0.0680587
Time taken using GPU + arrayfun(): 0.474612

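My questions are:

Am I using arrayfun() correctly in this case, and is it expected that arrayfun() should be worse?
If so, and it really is just expected to be slower than the direct gpuArray approach, is there any easy (i.e. non-MEX) way to speed up such a calculation? (I see they also mention using pagefun, for example.)

Thanks in advance for any advice.

(The graphics card is an Nvidia Quadro M4000, and I am running Matlab R2017a.)

EDIT

After reading @Edric's answer, I think it is important to show a bit more of the wider code. One thing I did not mention in the OP is that, in my actual main code, the k=1:Nt loop contains an additional operation: a matrix multiplication with the transpose of a sparse, tridiagonal matrix. Here is a more fleshed-out MWE of what is really going on: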

Nt = 1000; % Number of times to run each method
N_rows = 500;
N_cols = 600;
test_functionFcn = @test_function;
A = rand( N_rows, N_cols, 'double' ) + rand( N_rows, N_cols, 'double' )*1i; % Define an initial complex matrix
%%% Generate a sparse, tridiagonal, square transformation matrix %%%%%%%%
mm = 10*ones(N_cols,1); % Subdiagonal elements
dd = 20*ones(N_cols,1); % Main diagonal elements
pp = 30*ones(N_cols,1); % Superdiagonal elements
M = spdiags([mm dd pp],-1:1,N_cols,N_cols);
M(1,1) = 6; % Set a couple of other entries
M(2,1) = 3;
%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%
cpu_data_out = A;
for k = 1:Nt
    cpu_data_out = test_function( cpu_data_out );
    cpu_data_out = cpu_data_out*M.';
end
%%% Function to operate on matrices %%%
function y = test_function(x)
    y = exp(-1i*(x + abs(x).^2));
end

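I apologise for not including this in the OP; I did not realise at the time that it might be relevant to the solution. Does this change things? Are there still gains to be had from using arrayfun() on the GPU, or is this now unsuitable for conversion to arrayfun()?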

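There are a few points here. Firstly (and most importantly), to time code on the GPU you need either to use gputimeit, or to inject a call to wait(gpuDevice) before calling toc. This is because work is launched asynchronously on the GPU, and you only get accurate timings by waiting for it to finish. With those minor modifications, on my GPU, I see 0.09 seconds for the gpuArray method and 0.18 seconds for the arrayfun version.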
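As a rough illustration of the two timing approaches mentioned above, here is a minimal sketch. It assumes Parallel Computing Toolbox and a supported GPU are available; the variable names (dev, step_fcn, t_tictoc, t_step) are chosen only for this example and do not come from the question.

```matlab
% Sketch only: two ways to get trustworthy GPU timings.
Nt = 1000;
A = rand(500, 600) + 1i*rand(500, 600);
gpu_A = gpuArray(A);
dev = gpuDevice;              % handle to the currently selected GPU

% Option 1: tic/toc with an explicit synchronisation before toc
x = gpu_A;
wait(dev);                    % make sure earlier GPU work has finished
tic
for k = 1:Nt
    x = exp(-1i*(x + abs(x).^2));
end
wait(dev);                    % block until all queued GPU work is done
t_tictoc = toc;

% Option 2: gputimeit synchronises internally and repeats the measurement.
% Here it times a single update step rather than the whole loop.
step_fcn = @() exp(-1i*(gpu_A + abs(gpu_A).^2));
t_step = gputimeit(step_fcn);

fprintf('Loop of %d steps, tic/toc + wait(gpuDevice): %g s\n', Nt, t_tictoc);
fprintf('Single step, gputimeit: %g s\n', t_step);
```

Without the second wait(dev), toc may return while kernels are still queued, which would make the GPU loop look faster than it really is.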

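Running a loop of GPU operations is generally inefficient, so the main gain you can get here is by pushing the loop inside the arrayfun function body, so that the loop runs directly on the GPU. Like this: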

%%% Function to operate on matrices %%%
function x = test_function(x,Nt)
    for ii = 1:Nt
        x = exp(-1i*(x + abs(x).^2));
    end
end

You need to invoke it like A = arrayfun(@test_function, A, Nt). On my GPU, this brings the arrayfun time down to 0.05 seconds, so about twice as fast as the plain gpuArray version.
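To check that the looped arrayfun version computes the same thing as the per-iteration gpuArray loop, something like the following sketch could be used. It assumes Parallel Computing Toolbox and a supported GPU; the helper name iterate_element and the small iteration count are choices made for this illustration, not part of the original code.

```matlab
% Sketch only: compare the in-kernel loop with the per-iteration loop.
Nt = 3;                                   % small count, for a quick check
A = rand(500, 600) + 1i*rand(500, 600);
gpu_A = gpuArray(A);

% Reference: one set of GPU operations per iteration
ref = gpu_A;
for k = 1:Nt
    ref = exp(-1i*(ref + abs(ref).^2));
end

% Looped version: the scalar Nt is expanded to every element by arrayfun
out = arrayfun(@iterate_element, gpu_A, Nt);

% Largest deviation, relative to the largest reference magnitude
rel_err = gather(max(abs(out(:) - ref(:))) / max(abs(ref(:))));
fprintf('Relative difference between the two versions: %g\n', rel_err);

% Hypothetical helper: same body as the looped test_function above
function x = iterate_element(x, Nt)
    for ii = 1:Nt
        x = exp(-1i*(x + abs(x).^2));
    end
end
```

Small floating-point differences between the two versions are to be expected, so the comparison uses a relative error rather than exact equality.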
