通过 std::vector 的矩阵乘法比 numpy 慢 10 倍

尽管众所周知使用嵌套std::vector来表示矩阵是一个坏主意，但让我们现在使用它，因为它很灵活，许多现有的函数可以处理std::vector。

我想，在小情况下，速度差异可以忽略不计。但事实证明，vector<vector<double>>比numpy.dot()慢 10+ 倍。

设A和B是大小为sizexsize的矩阵。假设方形矩阵只是为了简单起见。(我们不打算将讨论限制在平方矩阵的情况下。我们以确定性的方式初始化每个矩阵，最后计算C = A * B。

我们将"计算时间"定义为仅计算C = A * B所经过的时间。换句话说，不包括各种开销。

蟒蛇3代码

import numpy as np
import time
import sys
if (len(sys.argv) != 2):
print("Pass `size` as an argument.", file = sys.stderr);
sys.exit(1);
size = int(sys.argv[1]);
A = np.ndarray((size, size));
B = np.ndarray((size, size));
for i in range(size):
for j in range(size):
A[i][j] = i * 3.14 + j
B[i][j] = i * 3.14 - j
start = time.time()
C = np.dot(A, B);
print("{:.3e}".format(time.time() - start), file = sys.stderr);

C++代码

using namespace std;
#include <iostream>
#include <vector>
#include <chrono>
int main(int argc, char **argv) {
if (argc != 2) {
cerr << "Pass `size` as an argument.n";
return 1;
}
const unsigned size = atoi(argv[1]);
vector<vector<double>> A(size, vector<double>(size));
vector<vector<double>> B(size, vector<double>(size));
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
A[i][j] = i * 3.14 + j;
B[i][j] = i * 3.14 - j;
}
}
auto start = chrono::system_clock::now();
vector<vector<double>> C(size, vector<double>(size, /* initial_value = */ 0));
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
for (int k = 0; k < size; ++k) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
cerr << scientific;
cerr.precision(3);
cerr << chrono::duration<double>(chrono::system_clock::now() - start).count() << "n";
}

C++代码(多线程)

我们还编写了C++代码的多线程版本，因为numpy.dot()是自动并行计算的。

您可以从GitHub获取所有代码。

结果

C++版本比Python 3(numpy)版本慢10+倍。

matrix_size: 200x200
--------------- Time in seconds ---------------
C++ (not multithreaded): 8.45e-03
C++ (1 thread): 8.66e-03
C++ (2 threads): 4.68e-03
C++ (3 threads): 3.14e-03
C++ (4 threads): 2.43e-03
Python 3: 4.07e-04
-----------------------------------------------
matrix_size: 400x400
--------------- Time in seconds ---------------
C++ (not multithreaded): 7.011e-02
C++ (1 thread): 6.985e-02
C++ (2 threads): 3.647e-02
C++ (3 threads): 2.462e-02
C++ (4 threads): 1.915e-02
Python 3: 1.466e-03
-----------------------------------------------

问题

有没有办法使C++实施更快？

我尝试过的优化

隔夜利息计算订单 ->最多快 3.5 倍(不比numpy代码快，但比C++代码快
)
优化 1 加上部分展开 ->最多快 4.5 倍，~~但这只有在事先知道sizeNo~~.正如本评论所指出的，size不需要知道。我们可以只限制展开循环的循环变量的最大值，并使用正常循环处理剩余元素。例如，请参阅我的实现。
优化 2，加上通过引入一个简单的变量sum-> 来最小化C[i][j]调用，速度最多快 5.2 倍。实现在这里。这个结果意味着std::vector::operator[]是不可忽视的缓慢。
优化 3，加上 g++-march=native标志 ->最多快 6.2 倍(顺便说一下，我们当然使用-O3。
优化 3，加上通过引入指向A元素的指针来减少运算符[]的调用，因为A的元素在展开的循环中是按顺序访问的。 -> 最多快 6.2 倍，比优化 4 快一点。代码如下所示。
g++-funroll-loops标志来展开for循环 ->没有变化
G++#pragma GCC unroll n-> 没有变化
g++-flto标志以打开链接时间优化 ->没有变化
块算法 ->没有变化
转置B以避免缓存未命中 ->无变化
长线性std::vector代替嵌套std::vector<std::vector>、掉期计算顺序、区块算法和部分展开 ->最多快 2.2 倍
1，加上 PGO(按配置文件优化)->速度提高 4.7 倍
优化 3，加上 PGO ->与优化 3 相同
优化 3，加上 g++ 特定__builtin_prefetch()->与优化 3 相同

现状

(最初)慢13.06倍->(目前)慢2.10倍

同样，您可以在GitHub上获取所有代码。但是，让我们引用一些代码，所有这些代码都是从C++代码的多线程版本调用的函数。

原始代码(GitHub)

void f(const vector<vector<double>> &A, const vector<vector<double>> &B, vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
const unsigned j_max = B[0].size();
const unsigned k_max = B.size();
for (int i = row_start; i < row_end; ++i) {
for (int j = 0; j < j_max; ++j) {
for (int k = 0; k < k_max; ++k) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
}

当前最佳代码(GitHub)

这是上述优化 5 的实现。

void f(const vector<vector<double>> &A, const vector<vector<double>> &B, vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
static const unsigned num_unroll = 5;
const unsigned j_max = B[0].size();
const unsigned k_max_for_unrolled_loop = B.size() / num_unroll * num_unroll;
const unsigned k_max = B.size();
for (int i = row_start; i < row_end; ++i) {
for (int k = 0; k < k_max_for_unrolled_loop; k += num_unroll) {
for (int j = 0; j < j_max; ++j) {
const double *p = A[i].data() + k;
double sum;
sum = *p++ * B[k][j];
sum += *p++ * B[k+1][j];
sum += *p++ * B[k+2][j];
sum += *p++ * B[k+3][j];
sum += *p++ * B[k+4][j];
C[i][j] += sum;
}
}
for (int k = k_max_for_unrolled_loop; k < k_max; ++k) {
const double a = A[i][k];
for (int j = 0; j < j_max; ++j) {
C[i][j] += a * B[k][j];
}
}
}
}

自从我们首次发布此问题以来，我们已经尝试了许多优化。我们花了整整两天的时间在这个问题上苦苦挣扎，最后到了不知道如何优化当前最佳代码的地步。我们怀疑像 Strassen 这样更复杂的算法会做得更好，因为我们处理的情况并不大，而且std::vector上的每个操作都非常昂贵，正如我们所看到的，仅仅减少[]调用就可以很好地提高性能。

不过，我们(希望)相信我们可以让它变得更好。

矩阵乘法相对容易优化。但是，如果您想获得体面的 CPU 利用率，它会变得棘手，因为您需要深入了解您正在使用的硬件。实现快速矩阵内核的步骤如下：

使用模拟结构
使用寄存器阻止并一次获取多个数据
针对您的 chache 生产线(主要是 L2 和 L3)进行优化
并行化代码以使用多个线程

在这个链接下是一个非常好的资源，它解释了所有令人讨厌的细节： https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0

如果您想更深入的建议，请发表评论。

我尝试过的优化

现状

相关内容

最新更新

热门标签：