Performance of pow(x,3.0f) vs x*x*x?

The following program…

int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += x*x*x;
    }
    return t;
}

…takes about 900 ms to complete on my machine, whereas…

#include <cmath>
int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += std::pow(x,3.0f);
    }
    return t;
}

…takes about 6600 ms to complete.

I'm a bit surprised that the optimizer does not inline the std::pow call so that both programs compile to the same code and perform identically.

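Any insights? How do you explain the performance difference, roughly 7x going by the timings above (6600 ms vs 900 ms)?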

For reference, I'm using gcc -O3 on Linux x86.
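To make the comparison easier to reproduce, here is a minimal timing sketch. The helper name elapsed_ms and the two wrapper functions are hypothetical (they are not part of the original post); the loop bodies are the ones from the question, parameterised on the iteration count. It assumes wall-clock timing with std::chrono is good enough for a difference this large.

```cpp
#include <chrono>
#include <cmath>

// Hypothetical helper (not from the original post): runs `body` once and
// returns the elapsed wall-clock time in milliseconds.
template <typename F>
double elapsed_ms(F&& body) {
    const auto start = std::chrono::steady_clock::now();
    body();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Loop from the first program: two float multiplications per iteration.
float sum_cubes_mul(int n) {
    float t = 0;
    for (int i = 0; i < n; i++) {
        const float x = i;
        t += x * x * x;
    }
    return t;
}

// Loop from the second program: one call to std::pow per iteration.
float sum_cubes_pow(int n) {
    float t = 0;
    for (int i = 0; i < n; i++) {
        const float x = i;
        t += std::pow(x, 3.0f);
    }
    return t;
}
```

When timing, the returned sum should be used (printed, or stored in a volatile variable), otherwise the compiler may remove the loop entirely.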

Update: (C Version)

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += x*x*x;
    }
    return t;
}

…takes about 900 ms to complete on my machine, whereas…

#include <math.h>
int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += powf(x,3.0f);
    }
    return t;
}

…takes about 6600 ms to complete.

Update 2:

The following program:

#include <math.h>
int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        /* The second parameter of __builtin_powif is an int;
           3.0f is implicitly converted to 3. */
        t += __builtin_powif(x,3.0f);
    }
    return t;
}

runs in 900 ms, the same as the first program.

Why isn't pow inlined to __builtin_powif?

Update 3:

With -ffast-math, the following program:

#include <math.h>
#include <iostream>
int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += powf(x, 3.0f);
    }
    std::cout << t;
}

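runs in 227 ms (and so does the x*x*x version). That is roughly 200 picoseconds per iteration. With -fopt-info, gcc reports "optimized: loop vectorized using 16 byte vectors" and "optimized: loop with 2 iterations completely unrolled", so I guess this means it processes 4 iterations at a time with SSE and pipelines 2 such batches (8 iterations at a time in total), or something like that?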
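The guess above can be modelled in scalar code. The sketch below is only an illustration of the idea, not what gcc actually emitted: it keeps eight independent partial sums (4 lanes times 2 vectors, as assumed in the guess) and combines them at the end. Splitting a float sum this way changes the order of the additions, which is why the compiler needs the reassociation permitted by -ffast-math before it can vectorize the reduction. The function name is hypothetical.

```cpp
// Scalar model of a vectorized reduction: eight independent partial sums,
// standing in for 2 SSE registers of 4 floats each (an assumption taken
// from the guess in the question, not from the generated assembly).
float sum_cubes_8_accumulators(int n) {
    float acc[8] = {0, 0, 0, 0, 0, 0, 0, 0};
    int i = 0;
    // Main loop: each "lane" accumulates every eighth element.
    for (; i + 8 <= n; i += 8) {
        for (int lane = 0; lane < 8; lane++) {
            const float x = static_cast<float>(i + lane);
            acc[lane] += x * x * x;
        }
    }
    // Horizontal reduction of the partial sums.
    float t = 0;
    for (int lane = 0; lane < 8; lane++) {
        t += acc[lane];
    }
    // Scalar tail for the remaining n % 8 elements.
    for (; i < n; i++) {
        const float x = static_cast<float>(i);
        t += x * x * x;
    }
    return t;
}
```

For large n the result can differ from the strictly sequential loop in the low-order bits, because float addition is not associative.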

The gcc documentation page on built-ins is explicit (emphasis mine):

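Built-in Function: double __builtin_powi (double, int)

Returns the first argument raised to the power of the second. Unlike the pow function no guarantees about precision and rounding are made.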

Built-in Function: float __builtin_powif (float, int)

Similar to __builtin_powi, except the argument and return types are float.

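Since __builtin_powif performs the same as a plain product, the extra time must be going to the work pow has to do in order to guarantee its precision and rounding.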
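To illustrate why an integer-exponent power can be as cheap as a plain product, here is a minimal square-and-multiply sketch. It is a hypothetical stand-in, not gcc's implementation of __builtin_powif; for n = 3 it reduces to two multiplications, the same count as x*x*x.

```cpp
#include <cassert>

// Hypothetical integer-power routine using square-and-multiply.
// Each multiplication rounds, so the result carries no accuracy
// guarantee beyond that of the individual products.
float powi_sketch(float x, int n) {
    assert(n >= 0);  // negative exponents are left out of this sketch
    float result = 1.0f;
    float base = x;
    while (n > 0) {
        if (n & 1) {
            result *= base;  // include this power of x in the product
        }
        base *= base;        // x, x^2, x^4, ...
        n >>= 1;
    }
    return result;
}
```

With a constant exponent the loop can be unrolled at compile time, leaving only the multiplications.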

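Assuming your compiler chooses to simply call pow from the shared library, as in https://godbolt.org/z/re3baK (without -ffast-math):

I haven't looked at how pow(float, float) is implemented, but I can see a few points.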

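x*x*x is inlined, whereas pow cannot be because it lives in a shared library: this is a difference in function call overhead
Is the exponent 3.0 a constant? If the compiler knows something is constant, it may generate more efficient code
- x*x*x: only generates the assembly for two floating-point multiplications
- pow: this has to handle every possible exponent value, so it probably uses general-purpose code (less efficient, possibly including loops)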
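The contrast in the last two points can be sketched as follows. The general version below uses the identity x^y = 2^(y * log2(x)), which is one textbook way to handle an arbitrary exponent; it is an assumption for illustration and not how any particular libm implements pow (a real implementation also handles zero, negative bases, infinities and NaN, and takes care to keep the rounding error small). Both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Naive general-exponent power, valid for x > 0 only:
// x^y = 2^(y * log2(x)). Needs a logarithm and an exponential per call.
float pow_general_sketch(float x, float y) {
    assert(x > 0.0f);  // the identity does not hold for x <= 0
    return std::exp2(y * std::log2(x));
}

// Special case for a known exponent of 3: just two multiplications.
float cube(float x) {
    return x * x * x;
}
```

The general path costs two transcendental function evaluations per call, while the special case costs two multiplications, which is the kind of gap the bullets above describe.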
