This question is related to this question, and possibly to another question as well.
Suppose you have two matrices A and B, where A is M-by-N and B is N-by-K. I want to obtain an M-by-K matrix C such that C(i, j) = 1 - prod_{n=1}^{N} (1 - A(i, n) * B(n, j)). I tried several solutions in Matlab, and here I compare their computational performance.
% Size of matrices:
M = 4e3;
N = 5e2;
K = 5e1;
GG = 50; % GG instances
rntm1 = zeros(GG, 1); % running time of first algorithm
rntm2 = zeros(GG, 1); % running time of second algorithm
rntm3 = zeros(GG, 1); % running time of third algorithm
rntm4 = zeros(GG, 1); % running time of fourth algorithm
rntm5 = zeros(GG, 1); % running time of fifth algorithm
for gg = 1:GG
A = rand(M, N); % M-by-N matrix of random numbers
A = A ./ repmat(sum(A, 2), 1, N); % M-by-N matrix of probabilities (?)
B = rand(N, K); % N-by-K matrix of random numbers
B = B ./ repmat(sum(B), N, 1); % N-by-K matrix of probabilities (?)
%% First solution
% One-liner solution:
tic
C = squeeze(1 - prod(1 - repmat(A, [1 1 K]) .* permute(repmat(B, [1 1 M]), [3 1 2]), 2));
rntm1(gg) = toc;
%% Second solution
% Full vectorization, using meshgrid, arrayfun and reshape (from Luis Mendo, second link above)
tic
[ii jj] = meshgrid(1:size(A, 1), 1:size(B, 2));
D = arrayfun(@(n) 1 - prod(1 - A(ii(n), :)' .* B(:, jj(n))), 1:numel(ii));
D = reshape(D, size(B, 2), size(A, 1)).';
rntm2(gg) = toc;
clear ii jj
%% Third solution
% Partial vectorization 1
tic
E = zeros(M, K);
for hh = 1:M
tmp = repmat(A(hh, :)', 1, K);
E(hh, :) = 1 - prod((1 - tmp .* B), 1);
end
rntm3(gg) = toc;
clear tmp hh
%% Fourth solution
% Partial vectorization 2
tic
F = zeros(M, K);
for hh = 1:M
for ii = 1:K
F(hh, ii) = 1 - prod(1 - A(hh, :)' .* B(:, ii));
end
end
rntm4(gg) = toc;
clear hh ii
%% Fifth solution
% No vectorization at all
tic
G = ones(M, K);
for hh = 1:M
for ii = 1:K
for jj = 1:N
G(hh, ii) = G(hh, ii) * prod(1 - A(hh, jj) .* B(jj, ii));
end
G(hh, ii) = 1 - G(hh, ii);
end
end
rntm5(gg) = toc;
clear hh ii jj C D E F G
end
prctile([rntm1 rntm2 rntm3 rntm4 rntm5], [2.5 25 50 75 97.5])
% 3.6519 3.5261 0.5912 1.9508 2.7576
% 5.3449 6.8688 1.1973 3.3744 3.9940
% 8.1094 8.7016 1.4116 4.9678 7.0312
% 8.8124 10.5170 1.9874 6.1656 8.8227
% 9.5881 12.0150 2.1529 6.6445 9.5115
mean([rntm1 rntm2 rntm3 rntm4 rntm5])
% 7.2420 8.3068 1.4522 4.5865 6.4423
std([rntm1 rntm2 rntm3 rntm4 rntm5])
% 2.1070 2.5868 0.5261 1.6122 2.4900
These solutions are equivalent, but the partially vectorized algorithms are far more efficient in terms of both memory and execution time. Even the triple loop seems to outperform arrayfun! Is there a better approach than the third solution, which is only partially vectorized?
EDIT: Dan's solutions are the best so far. Let rntm6, rntm7 and rntm8 be the running times of his first, second and third solutions, respectively. Then:
prctile(rntm6, [2.5 25 50 75 97.5])
% 0.6337 0.6377 0.6480 0.7110 1.2932
mean(rntm6)
% 0.7440
std(rntm6)
% 0.1970
prctile(rntm7, [2.5 25 50 75 97.5])
% 0.6898 0.7130 0.9050 1.1505 1.4041
mean(rntm7)
% 0.9313
std(rntm7)
% 0.2276
prctile(rntm8, [2.5 25 50 75 97.5])
% 0.5949 0.6005 0.6036 0.6370 1.3529
mean(rntm8)
% 0.6753
std(rntm8)
% 0.1890
You can get a small performance boost by using bsxfun:
E = zeros(M, K);
for hh = 1:M
E(hh, :) = 1 - prod((1 - bsxfun(@times, A(hh,:)', B)), 1);
end
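As a quick sanity check, the bsxfun version can be compared against a naive triple loop on a small standalone instance (the sizes below are arbitrary, not the benchmark sizes):
M = 3; N = 4; K = 2;
A = rand(M, N); B = rand(N, K);
E = zeros(M, K);
for hh = 1:M
    E(hh, :) = 1 - prod(1 - bsxfun(@times, A(hh, :)', B), 1);
end
G = ones(M, K);                 % reference result from the scalar definition
for hh = 1:M
    for ii = 1:K
        for jj = 1:N
            G(hh, ii) = G(hh, ii) * (1 - A(hh, jj) * B(jj, ii));
        end
        G(hh, ii) = 1 - G(hh, ii);
    end
end
max(abs(E(:) - G(:)))           % should be on the order of eps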
And you can squeeze (pun intended) a little more performance out of it with this:
E = squeeze(1 - prod((1-bsxfun(@times, permute(B, [3 1 2]), A)),2));
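To see why this works, it helps to track the intermediate array sizes through the expression:
% permute(B, [3 1 2])  -> 1-by-N-by-K
% A                    -> M-by-N (implicitly M-by-N-by-1)
% bsxfun(@times, ...)  -> M-by-N-by-K (both singleton dimensions are expanded)
% prod(..., 2)         -> M-by-1-by-K
% squeeze(...)         -> M-by-K
Note that the intermediate M-by-N-by-K array can be large for the sizes in the question, so the row-by-row loop above may still be preferable in terms of memory.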
Or you can try pre-computing the transpose for my first suggestion (MATLAB stores arrays column-major, so At(:, hh) reads a contiguous column, whereas A(hh, :)' extracts a strided row first):
E = zeros(M, K);
At = A';
for hh = 1:M
E(hh, :) = 1 - prod((1 - bsxfun(@times, At(:,hh), B)), 1);
end
One case where using arrayfun or bsxfun definitely pays off is when you have the Parallel Computing Toolbox and a compatible NVIDIA GPU available. In that case, both functions are very fast, because the body of the computation can be shipped to the GPU and executed there. For an example, see: http://www.mathworks.co.uk/help/distcomp/examples/improve-performance-of-element-wise-matlab-functions-on-the-gpu-using-arrayfun.html
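As a minimal sketch, assuming the Parallel Computing Toolbox and a supported GPU are available (and that the M-by-N-by-K intermediate fits in GPU memory; otherwise loop over rows as above), the fully vectorized bsxfun version could be run on the GPU like this:
Ag = gpuArray(A);               % copy the inputs to GPU memory
Bg = gpuArray(B);
Eg = squeeze(1 - prod(1 - bsxfun(@times, permute(Bg, [3 1 2]), Ag), 2));
E  = gather(Eg);                % copy the M-by-K result back to host memory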