并行与串行结果无法理解



下面的测试结果有些奇怪。我做了并行循环和串行循环,并将它们相互比较。我将测试作为4种形状之一并行串行并行循环首先然后串行循环,最后串行循环首先然后并行循环。我分别将它们编码为p、s、pfsl和sfpl。

的结果如表

tbody> <<tr>直到= 5
测试类型和# 所花费的毫秒数
直到= 4
p # 19472
s # 113459
s # 211323
p # 28854
p # 39253
s # 310669
pfsl # 11421
8299
sfpl # 11708
6280
sfpl # 11657
6334
pfsl # 21400
8191
pfsl # 31443
8488
sfpl # 31784
6475

我对大矩阵乘法的结果非常不同。在6核机器上,正确的Parallel.For可以使性能提高9倍。这是核心的6倍,另外3倍是由于超线程:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22621
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.302
[Host]     : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
DefaultJob : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT

Method |      Mean |    Error |   StdDev | Ratio |
----------------- |----------:|---------:|---------:|------:|
SerialMult | 288.63 ms | 5.736 ms | 8.585 ms |  1.00 |
ParallelMult |  75.92 ms | 1.311 ms | 1.162 ms |  0.26 |
ParallelMultTemp |  49.63 ms | 0.873 ms | 0.817 ms |  0.17 |
ParallelMultVect |  33.04 ms | 0.588 ms | 0.521 ms |  0.11 |

问题的代码是不完整的,很难阅读。除了向列表中添加项目之外,它似乎也没有做任何事情。不可能说出结果数字的含义或它们为什么是这样的。

链接的项目提到了神经网络,所以一个有意义的真正基准将是矩阵乘法。在MLP中,前馈阶段都是关于矩阵乘法的。

Stopwatch对于基准测试没有用处,因为执行可能会被其他程序延迟。即使多次执行相同的方法并对结果求平均值也是不够的,因为可能会出现峰值、预热和缓存影响。这就是为什么现在几乎每个基准测试都使用BenchmarkDotNet包的原因。BDN将运行基准测试,直到它收集到足够的测量来提供统计上正确的结果。

乘法代码是从这篇文章中借来的。

串行乘法法为:

static double[,] MatrixProduct(double[,] matrixA,
double[,] matrixB)
{
int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
if (aCols != bRows)
throw new Exception("xxx");
double[,] result = new double[aRows, bCols];
for (int i = 0; i < aRows; ++i) // each row of A
for (int j = 0; j < bCols; ++j) // each col of B
for (int k = 0; k < aCols; ++k) // could use k < bRows
result[i,j] += matrixA[i,k] * matrixB[k,j];
return result;
}

一个朴素的并行版本简单地用Parallel.For

替换外部循环
static double[,] MatrixProductP(double[,] matrixA,
double[,] matrixB)
{
int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
if (aCols != bRows)
throw new Exception("xxx");
double[,] result = new double[aRows, bCols];
Parallel.For(0, aRows, i =>
{
// each row of A
for (int j = 0; j < bCols; ++j) // each col of B
for (int k = 0; k < aCols; ++k) // could use k < bRows
result[i, j] += matrixA[i, k] * matrixB[k, j];
});
return result;
}

一个小小的调整是为内循环使用一个临时变量,而不是直接写入结果数组:

static double[,] MatrixProductPTemp(double[,] matrixA,
double[,] matrixB)
{
int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
if (aCols != bRows)
throw new Exception("xxx");
double[,] result = new double[aRows, bCols];
Parallel.For(0, aRows, i =>
{
// each row of A
for (int j = 0; j < bCols; ++j) // each col of B
{
double temp=0;
for (int k = 0; k < aCols; ++k) // could use k < bRows
temp += matrixA[i, k] * matrixB[k, j];
result[i, j] = temp;
}
});
return result;
} 

更进一步,在每个并行循环开始时将matrixA中的行复制到本地向量中,以进一步减少缓存丢失和非本地访问的机会:

static double[,] MatrixProductPVect(double[,] matrixA,
double[,] matrixB)
{
int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
if (aCols != bRows)
throw new Exception("xxx");
double[,] result = new double[aRows, bCols];
Parallel.For(0, aRows, i =>
{
double[] row = new double[aCols];
System.Buffer.BlockCopy(matrixA, i*aCols*sizeof(double), row, 0, aCols*sizeof(double));
for (int j = 0; j < bCols; ++j) // each col of B
{
double temp=0;

for (int k = 0; k < aCols; ++k) // could use k < bRows
temp += row[k] * matrixB[k, j];
result[i, j] = temp;
}
});
return result;
}
要对矩阵乘法进行基准测试,使用以下. net 6代码。这是整个Program.cs文件
// See https://aka.ms/new-console-template for more information
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
var summary = BenchmarkRunner.Run(typeof(Program).Assembly);
[MarkdownExporterAttribute.StackOverflow]
public class SerialVsParallel
{
private const int L = 500;
private const int N = 1000;
private const int M = 200;
private double[,] _matrixA;
private double[,] _matrixB;

public SerialVsParallel()
{

_matrixA = new double[L,M];
_matrixB = new double[M,N];
var random = new Random(42);
for (var i = 0; i < L; i++)
{
for (var j = 0; j < M; j++)
{
_matrixA[i, j] = random.NextDouble();
}
}
for (var i = 0; i < M; i++)
{
for (var j = 0; j < N; j++)
{
_matrixB[i, j] = random.NextDouble();
}
}
}
static double[,] MatrixProduct(double[,] matrixA,
double[,] matrixB)
{
int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
if (aCols != bRows)
throw new Exception("xxx");
double[,] result = new double[aRows, bCols];
for (int i = 0; i < aRows; ++i) // each row of A
for (int j = 0; j < bCols; ++j) // each col of B
for (int k = 0; k < aCols; ++k) // could use k < bRows
result[i,j] += matrixA[i,k] * matrixB[k,j];
return result;
}

static double[,] MatrixProductP(double[,] matrixA,
double[,] matrixB)
{
int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
if (aCols != bRows)
throw new Exception("xxx");
double[,] result = new double[aRows, bCols];
Parallel.For(0, aRows, i =>
{
// each row of A
for (int j = 0; j < bCols; ++j) // each col of B
for (int k = 0; k < aCols; ++k) // could use k < bRows
result[i, j] += matrixA[i, k] * matrixB[k, j];
});
return result;
}
static double[,] MatrixProductPTemp(double[,] matrixA,
double[,] matrixB)
{
int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
if (aCols != bRows)
throw new Exception("xxx");
double[,] result = new double[aRows, bCols];
Parallel.For(0, aRows, i =>
{
// each row of A
for (int j = 0; j < bCols; ++j) // each col of B
{
double temp=0;
for (int k = 0; k < aCols; ++k) // could use k < bRows
temp += matrixA[i, k] * matrixB[k, j];
result[i, j] = temp;
}
});
return result;
} 

static double[,] MatrixProductPVect(double[,] matrixA,
double[,] matrixB)
{
int aRows = matrixA.GetLength(0); int aCols = matrixA.GetLength(1);
int bRows = matrixB.GetLength(0); int bCols = matrixB.GetLength(1);
if (aCols != bRows)
throw new Exception("xxx");
double[,] result = new double[aRows, bCols];
Parallel.For(0, aRows, i =>
{
double[] row = new double[aCols];
System.Buffer.BlockCopy(matrixA, i*aCols*sizeof(double), row, 0, aCols*sizeof(double));
// each row of A
for (int j = 0; j < bCols; ++j) // each col of B
{
double temp=0;

for (int k = 0; k < aCols; ++k) // could use k < bRows
temp += row[k] * matrixB[k, j];
result[i, j] = temp;
}
});
return result;
}

[Benchmark(Baseline = true)]
public double[,] SerialMult() => MatrixProduct(_matrixA,_matrixB);
[Benchmark]
public double[,] ParallelMult() => MatrixProductP(_matrixA,_matrixB);

[Benchmark]
public double[,] ParallelMultTemp() => MatrixProductPTemp(_matrixA,_matrixB);
[Benchmark]
public double[,] ParallelMultVect() => MatrixProductPVect(_matrixA,_matrixB);

}

运行该命令会产生以下输出

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22621
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.302
[Host]     : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
DefaultJob : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT

Method |      Mean |    Error |   StdDev | Ratio |
----------------- |----------:|---------:|---------:|------:|
SerialMult | 288.63 ms | 5.736 ms | 8.585 ms |  1.00 |
ParallelMult |  75.92 ms | 1.311 ms | 1.162 ms |  0.26 |
ParallelMultTemp |  49.63 ms | 0.873 ms | 0.817 ms |  0.17 |
ParallelMultVect |  33.04 ms | 0.588 ms | 0.521 ms |  0.11 |

在6核机器上,朴素并行方法比串行方法快4倍。这显然没有那么快。然而,仅仅使用temp变量,执行速度就比串行方法快6倍。将行复制到本地缓冲区的性能是串行版本的9倍

这是因为在内循环中跨行读取和写入会导致大量缓存丢失。使用temp变量消除了其中的一些问题。使用行副本可以进一步减少缓存丢失,并且允许处理器利用超线程

最新更新