libsimdpp slower than plain C++ in a GCC debug build

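I need cross-platform SIMD instructions that work on both ARM and x86. So I found a library called libsimdpp and ran this example.

I changed it slightly to compare it with the standard C++ way of adding two arrays, but the libsimdpp version is always slower.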

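Results

23 milliseconds - libsimdpp
1 millisecond - plain C++ addition

Is there something wrong with the way I am using the library, or with the way it is built?

My changes to the example: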

https://pastebin.com/l14dcrky

#define SIMDPP_ARCH_X86_SSE4_1 true
#include <simdpp/simd.h>
#include <iostream>
#include <chrono>
//example where i got this from
//https://github.com/p12tic/libsimdpp/tree/2e5c0464a8069310d7eb3048e1afa0e96e08f344
// Initializes vector to store values
void init_vector(float* a, float* b, size_t size) {
    for (size_t i=0; i<size; i++) {  // size_t index avoids a signed/unsigned comparison
        a[i] = i * 1.0;
        b[i] = (size * 1.0) - i - 1;
    }
}

using namespace simdpp;
int main() {
    //1048576
    const unsigned long SIZE = 4 * 150000;
    float vec_a[SIZE];
    float vec_b[SIZE];
    float result[SIZE];
    ///////////////////////////*/
    //LibSIMDpp
    //*
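    // Note: init_vector runs inside the timed region here,
    // but outside the timed region in the scalar test below.
    auto t1 = std::chrono::high_resolution_clock::now();
    init_vector(vec_a, vec_b, SIZE);
    for (unsigned long i=0; i<SIZE; i+=4) {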
        float32<4> xmmA = load(vec_a + i);  //loads 4 floats into xmmA
        float32<4> xmmB = load(vec_b + i);  //loads 4 floats into xmmB
        float32<4> xmmC = add(xmmA, xmmB);  //Vector add of xmmA and xmmB
        store(result + i, xmmC);            //Store result into the vector
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count()
              << " milliseconds\n";
    //*/

    ///////////////////////////*/
    //standard
    //*
    init_vector(vec_a, vec_b, SIZE);
    t1 = std::chrono::high_resolution_clock::now();
    for (unsigned long i = 0; i < SIZE; i++) {
        result[i] = vec_a[i]  + vec_b[i];
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count()
              << " milliseconds\n";
    //*/

    int i = 0;
    return 0;
}

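A normal debug build slows down manually vectorized code much more than it slows down scalar code, even if you use _mm_add_ps intrinsics directly. (Usually because you tend to use more separate statements, and debug code-gen compiles each statement separately.)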

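You are using a C++ wrapper library, so in debug mode that is a significant extra layer that does not optimize away, because you told the compiler not to optimize. So it is not surprising that the slowdown is large enough to make it worse than scalar. See "Why is this C++ wrapper class not being inlined away?" for an example. (Even __attribute__((always_inline)) does not help performance; passing args still results in a store/reload to make another copy.)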
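To make the wrapper overhead concrete, here is a minimal sketch using a hypothetical `Vec4` type with `load4`, `add4` and `store4` helpers. These names are invented for illustration and are not libsimdpp's actual implementation. In an unoptimized build each helper typically stays a real function call whose by-value arguments and return value are copied through memory; with optimization enabled the compiler can usually inline all of it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a SIMD wrapper type; NOT libsimdpp's real code.
struct Vec4 {
    float lane[4];
};

// Each helper is a separate function, like a wrapper library's load/add/store.
inline Vec4 load4(const float* p) {
    Vec4 v;
    for (int k = 0; k < 4; ++k) v.lane[k] = p[k];
    return v;
}

inline Vec4 add4(Vec4 a, Vec4 b) {  // by-value arguments: extra copies without optimization
    Vec4 r;
    for (int k = 0; k < 4; ++k) r.lane[k] = a.lane[k] + b.lane[k];
    return r;
}

inline void store4(float* p, Vec4 v) {
    for (int k = 0; k < 4; ++k) p[k] = v.lane[k];
}

// Same loop shape as the question; n is assumed to be a multiple of 4.
void add_wrapped(const float* a, const float* b, float* out, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        Vec4 va = load4(a + i);
        Vec4 vb = load4(b + i);
        Vec4 vc = add4(va, vb);
        store4(out + i, vc);
    }
}
```

Comparing the assembly from `g++ -O0 -S` and `g++ -O3 -S` for `add_wrapped` should show the difference: calls and stack traffic in the first case, a tight loop in the second.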

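Don't benchmark debug builds; it is useless and tells you very little about -O3 performance. (You may also want to use -O3 -march=native -ffast-math, depending on your use case.)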
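A fairer measurement could look something like the sketch below. It is only an outline under a few assumptions: `time_us`, `add_arrays` and `init_vectors` are hypothetical helper names, the buffers live in `std::vector` rather than on the stack, and initialization is kept outside the timed region for every variant being compared.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Plain scalar add; with -O3 GCC will usually auto-vectorize this loop.
void add_arrays(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& out) {
    assert(a.size() == b.size() && out.size() == a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
}

// Hypothetical helper: runs fn once and returns the elapsed time in microseconds.
template <class Fn>
long long time_us(Fn&& fn) {
    auto t1 = std::chrono::steady_clock::now();
    fn();
    auto t2 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
}

// Fills the inputs the same way as init_vector in the question.
void init_vectors(std::vector<float>& a, std::vector<float>& b) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = static_cast<float>(n - i - 1);
    }
}
```

In `main` you would size the three vectors (for example 4 * 150000 elements, as in the question), call `init_vectors` once, then time each variant with `time_us([&] { add_arrays(a, b, out); })` and print some element of `out` afterwards so the compiler cannot discard the work. Build with something like `g++ -O3 -march=native` before comparing numbers.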
