SIMD-vectorized C# code using Vector<T> runs slower than a classic loop

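I've read several articles describing how Vector<T> enables SIMD and is implemented with JIT intrinsics, so that the compiler emits the proper AVX/SSE/... instructions when it is used, allowing much faster code than a classic linear loop (example here).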

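I decided to try rewriting one of my methods to see whether I could get a speedup, but so far I have failed: the vectorized code runs 3 times slower than the original, and I'm not sure why. Below are two versions of a method that checks whether, at every position, the corresponding items of two Span<float> instances fall on the same side of a threshold.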

// Classic implementation
public static unsafe bool MatchElementwiseThreshold(this Span<float> x1, Span<float> x2, float threshold)
{
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
        for (int i = 0; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    return true;
}

// Vectorized
public static unsafe bool MatchElementwiseThresholdSIMD(this Span<float> x1, Span<float> x2, float threshold)
{
    // Setup the test vector
    int l = Vector<float>.Count;
    float* arr = stackalloc float[l];
    for (int i = 0; i < l; i++)
        arr[i] = threshold;
    Vector<float> cmp = Unsafe.Read<Vector<float>>(arr);
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
    {
        // Iterate in chunks
        int
            div = x1.Length / l,
            mod = x1.Length % l,
            i = 0,
            offset = 0;
        for (; i < div; i += 1, offset += l)
        {
            Vector<float>
                v1 = Unsafe.Read<Vector<float>>(px1 + offset),
                v1cmp = Vector.GreaterThan<float>(v1, cmp),
                v2 = Unsafe.Read<Vector<float>>(px2 + offset),
                v2cmp = Vector.GreaterThan<float>(v2, cmp);
            float*
                pcmp1 = (float*)Unsafe.AsPointer(ref v1cmp),
                pcmp2 = (float*)Unsafe.AsPointer(ref v2cmp);
            for (int j = 0; j < l; j++)
                if (pcmp1[j] == 0 != (pcmp2[j] == 0))
                    return false;
        }
        // Test the remaining items, if any
        if (mod == 0) return true;
        for (i = x1.Length - mod; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    }
    return true;
}

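As I said, I've tested both versions with BenchmarkDotNet, and the one using Vector<T> runs about 3 times slower than the other. I tried running the tests with spans of different lengths (from around 100 to over 2000), but the vectorized method is consistently much slower than the other one.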

Am I missing something obvious here?

Thanks!

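Edit: the reason I'm using unsafe code and trying to optimize this as much as possible without parallelizing it is that the method is already being called from within a Parallel.For iteration. Besides, being able to parallelize code across multiple threads is generally not a good reason to leave the individual parallel tasks unoptimized.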

I had the same problem. The solution was to uncheck the Prefer 32-bit option in the project properties.

SIMD is only enabled for 64-bit processes. So make sure your app either targets x64 directly, or is compiled as Any CPU and not marked as Prefer 32-bit. [Source]
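One quick way to check whether this applies to you is to print the relevant runtime facts from the same build configuration you are benchmarking. The following is only a minimal sketch; the class and method names (SimdDiagnostics, Describe) are made up for illustration, while Environment.Is64BitProcess, Vector.IsHardwareAccelerated and Vector<float>.Count are standard .NET members.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, not part of the original question or answer.
public static class SimdDiagnostics
{
    // Returns a one-line summary of the conditions that affect Vector<T> performance.
    public static string Describe()
    {
        return $"64-bit process: {Environment.Is64BitProcess}, " +
               $"hardware accelerated: {Vector.IsHardwareAccelerated}, " +
               $"Vector<float>.Count: {Vector<float>.Count}";
    }
}
```

If the first two values are not both True in your benchmark process, the Vector<T> code is likely running through a software fallback, which would be consistent with the slowdown described in the question.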

** Edit ** After reading Marc Gravell's blog post, I found that this can be implemented quite simply...

public static bool MatchElementwiseThresholdSIMD(ReadOnlySpan<float> x1, ReadOnlySpan<float> x2, float threshold)
{
    if (x1.Length != x2.Length) throw new ArgumentException("x1.Length != x2.Length");
    if (Vector.IsHardwareAccelerated)
    {
        var vx1 = x1.NonPortableCast<float, Vector<float>>();
        var vx2 = x2.NonPortableCast<float, Vector<float>>();
        var vthreshold = new Vector<float>(threshold);
        for (int i = 0; i < vx1.Length; ++i)
        {
            var v1cmp = Vector.GreaterThan(vx1[i], vthreshold);
            var v2cmp = Vector.GreaterThan(vx2[i], vthreshold);
            if (Vector.Xor(v1cmp, v2cmp) != Vector<int>.Zero)
                return false;
        }
        x1 = x1.Slice(Vector<float>.Count * vx1.Length);
        x2 = x2.Slice(Vector<float>.Count * vx2.Length);
    }
    for (var i = 0; i < x1.Length; i++)
        if (x1[i] > threshold != x2[i] > threshold)
            return false;
    return true;
}

Now, this isn't quite as fast as working with arrays directly (if you have them), but it is still much faster than the non-SIMD version...
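A note for anyone trying this on a current runtime: NonPortableCast came from a preview build of the Span<T> APIs, and as far as I know the released equivalent is MemoryMarshal.Cast in System.Runtime.InteropServices. Below is a sketch of the same idea under that assumption. The class name ThresholdMatcher is made up for illustration, and Vector.EqualsAll is used instead of the Xor test only because it does not depend on the element type of the comparison mask.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

// Hypothetical wrapper class; a sketch, not the original answer's code.
public static class ThresholdMatcher
{
    public static bool MatchElementwiseThreshold(ReadOnlySpan<float> x1, ReadOnlySpan<float> x2, float threshold)
    {
        if (x1.Length != x2.Length) throw new ArgumentException("x1.Length != x2.Length");
        if (Vector.IsHardwareAccelerated)
        {
            // Reinterpret the floats as whole vectors; leftover elements are not included.
            ReadOnlySpan<Vector<float>> vx1 = MemoryMarshal.Cast<float, Vector<float>>(x1);
            ReadOnlySpan<Vector<float>> vx2 = MemoryMarshal.Cast<float, Vector<float>>(x2);
            var vthreshold = new Vector<float>(threshold);
            for (int i = 0; i < vx1.Length; i++)
            {
                // Each lane of a mask is all ones where the element exceeds the threshold, else zero.
                var m1 = Vector.GreaterThan(vx1[i], vthreshold);
                var m2 = Vector.GreaterThan(vx2[i], vthreshold);
                if (!Vector.EqualsAll(m1, m2))
                    return false;
            }
            // Skip the elements already covered by the vector loop.
            int done = vx1.Length * Vector<float>.Count;
            x1 = x1.Slice(done);
            x2 = x2.Slice(done);
        }
        // Scalar loop for the tail (or for everything when SIMD is unavailable).
        for (int i = 0; i < x1.Length; i++)
            if ((x1[i] > threshold) != (x2[i] > threshold))
                return false;
        return true;
    }
}
```

Arrays convert implicitly to ReadOnlySpan<float>, so a call such as ThresholdMatcher.MatchElementwiseThreshold(a, b, 0.5f) with two float[] arguments should work as-is.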

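(Another edit...)

Just for fun, I thought I'd see how well this stuff handles being fully generic, and the answer is: very well. So you can write code like the following, and it is just as efficient as the specific version (except in the non-hardware-accelerated case, where it is about twice as slow, but not completely terrible...).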

public static bool MatchElementwiseThreshold<T>(ReadOnlySpan<T> x1, ReadOnlySpan<T> x2, T threshold)
    where T : struct
{
    if (x1.Length != x2.Length)
        throw new ArgumentException("x1.Length != x2.Length");
    if (Vector.IsHardwareAccelerated)
    {
        var vx1 = x1.NonPortableCast<T, Vector<T>>();
        var vx2 = x2.NonPortableCast<T, Vector<T>>();
        var vthreshold = new Vector<T>(threshold);
        for (int i = 0; i < vx1.Length; ++i)
        {
            var v1cmp = Vector.GreaterThan(vx1[i], vthreshold);
            var v2cmp = Vector.GreaterThan(vx2[i], vthreshold);
            if (Vector.AsVectorInt32(Vector.Xor(v1cmp, v2cmp)) != Vector<int>.Zero)
                return false;
        }
        // Slice the spans so the scalar loop handles the remaining elements
        x1 = x1.Slice(Vector<T>.Count * vx1.Length);
        x2 = x2.Slice(Vector<T>.Count * vx1.Length);
    }
    var comparer = System.Collections.Generic.Comparer<T>.Default;
    for (int i = 0; i < x1.Length; i++)
        if ((comparer.Compare(x1[i], threshold) > 0) != (comparer.Compare(x2[i], threshold) > 0))
            return false;
    return true;
}

A vector is just a vector. It does not claim or guarantee that SIMD extensions are used. Use System.Numerics.Vector2.

https://learn.microsoft.com/en-us/dotnet/standard/numerics#simd-enabled-vector-types
