我有一个Compare()
函数,看起来像这样:
inline bool Compare(bool greater, int p1, int p2) {
if (greater) return p1>=p2;
else return p1<=p2;
}
我决定优化以避免分支:
inline bool Compare2(bool greater, int p1, int p2) {
bool ret[2] = {p1<=p2,p1>=p2};
return ret[greater];
}
然后我这样测试:
bool x = true;
int M = 100000;
int N = 100;
bool a[N];
int b[N];
int c[N];
for (int i=0;i<N; ++i) {
a[i] = rand()%2;
b[i] = rand()%128;
c[i] = rand()%128;
}
// Timed the below loop with both Compare() and Compare2()
for (int j=0; j<M; ++j) {
for (int i=0; i<N; ++i) {
x ^= Compare(a[i],b[i],c[i]);
}
}
结果:Compare(): 3.14ns avg
Compare2(): 1.61ns avg
我会说case-closed,避免分支FTW。但为了完整起见,我替换了
a[i] = rand()%2;
:
a[i] = true;
得到了完全相同的~3.14ns的测量值。假设没有分支发生,编译器实际上重写Compare()
以避免if
语句。那么,为什么Compare2()
更快呢?
不幸的是,我不懂汇编代码,否则我就会尝试自己回答这个问题了。
EDIT:下面是一些汇编:
_Z7Comparebii:
.LFB4:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movl %edi, %eax
movl %esi, -8(%rbp)
movl %edx, -12(%rbp)
movb %al, -4(%rbp)
cmpb $0, -4(%rbp)
je .L2
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setge %al
jmp .L3
.L2:
movl -8(%rbp), %eax
cmpl -12(%rbp), %eax
setle %al
.L3:
leave
ret
.cfi_endproc
.LFE4:
.size _Z7Comparebii, .-_Z7Comparebii
.section .text._Z8Compare2bii,"axG",@progbits,_Z8Compare2bii,comdat
.weak _Z8Compare2bii
.type _Z8Compare2bii, @function
_Z8Compare2bii:
.LFB5:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movl %edi, %eax
movl %esi, -24(%rbp)
movl %edx, -28(%rbp)
movb %al, -20(%rbp)
movw $0, -16(%rbp)
movl -24(%rbp), %eax
cmpl -28(%rbp), %eax
setle %al
movb %al, -16(%rbp)
movl -24(%rbp), %eax
cmpl -28(%rbp), %eax
setge %al
movb %al, -15(%rbp)
movzbl -20(%rbp), %eax
cltq
movzbl -16(%rbp,%rax), %eax
leave
ret
.cfi_endproc
.LFE5:
.size _Z8Compare2bii, .-_Z8Compare2bii
.text
现在,执行测试的实际代码可能使用了上述两个函数的内联版本,因此有可能这是要分析的错误代码。话虽如此,我在Compare()
中看到一个jmp
命令,所以我认为这意味着它是分支。如果是这样,我想这个问题变成:为什么当我将a[i]
从rand()%2
更改为true
(或false
)时,分支预测器没有提高Compare()
的性能?
EDIT2:我将"分支预测"替换为"分支",以使我的帖子更明智。
我想我已经弄明白大部分了。
当我在OP编辑中发布函数的汇编时,我注意到内联版本可能不同。我没有检查或发布计时代码,因为它更复杂,因为我认为内联的过程不会改变是否在Compare()
中发生分支。
当我取消内联函数并重复测量时,我得到了以下结果:
Compare(): 7.18ns avg
Compare2(): 3.15ns avg
然后,当我用a[i]=false
替换a[i]=rand()%2
时,我得到了以下内容:
Compare(): 2.59ns avg
Compare2(): 3.16ns avg
这展示了分支预测的增益。事实上,a[i]
替换最初没有产生任何改进,这表明内联删除了分支。
所以最后一个谜团是为什么内联的Compare2()
优于内联的Compare()
。我想我可以发布时序代码的汇编。似乎有足够的理由认为,函数内联方式中的一些怪癖可能会导致这种情况,所以我很高兴在此结束我的研究。我将在我的应用程序中用Compare2()
替换Compare()
。
感谢您的宝贵意见。
编辑:我应该补充说,Compare2
击败所有其他的可能原因是处理器能够并行执行两种比较。这是我这样写函数的直觉。所有其他变体本质上都需要两个逻辑串行操作。
我编写了一个名为Celero的c++库,专门用于测试这种优化和替代方案。(无耻的自我推销:https://github.com/DigitalInBlue/Celero)
我使用以下代码运行您的案例:
class StackOverflowFixture : public celero::TestFixture
{
public:
StackOverflowFixture()
{
}
inline bool NoOp(bool greater, int p1, int p2)
{
return true;
}
inline bool Compare(bool greater, int p1, int p2)
{
if(greater == true)
{
return p1>=p2;
}
return p1<=p2;
}
inline bool Compare2(bool greater, int p1, int p2)
{
bool ret[2] = {p1<=p2,p1>=p2};
return ret[greater];
}
inline bool Compare3(bool greater, int p1, int p2)
{
return (!greater != !(p1 <= p2)) | (p1 == p2);
}
inline bool Compare4(bool greater, int p1, int p2)
{
return (greater ^ (p1 <= p2)) | (p1 == p2);
}
};
BASELINE_F(StackOverflow, Baseline, StackOverflowFixture, 100, 5000000)
{
celero::DoNotOptimizeAway(NoOp(rand()%2, rand(), rand()));
}
BENCHMARK_F(StackOverflow, Compare, StackOverflowFixture, 100, 5000000)
{
celero::DoNotOptimizeAway(Compare(rand()%2, rand(), rand()));
}
BENCHMARK_F(StackOverflow, Compare2, StackOverflowFixture, 100, 5000000)
{
celero::DoNotOptimizeAway(Compare2(rand()%2, rand(), rand()));
}
BENCHMARK_F(StackOverflow, Compare3, StackOverflowFixture, 100, 5000000)
{
celero::DoNotOptimizeAway(Compare3(rand()%2, rand(), rand()));
}
BENCHMARK_F(StackOverflow, Compare4, StackOverflowFixture, 100, 5000000)
{
celero::DoNotOptimizeAway(Compare4(rand()%2, rand(), rand()));
}
结果如下:
[==========]
[ CELERO ]
[==========]
[ STAGE ] Baselining
[==========]
[ RUN ] StackOverflow.Baseline -- 100 samples, 5000000 calls per run.
[ DONE ] StackOverflow.Baseline (0.690499 sec) [5000000 calls in 690499 usec] [0.138100 us/call] [7241140.103027 calls/sec]
[==========]
[ STAGE ] Benchmarking
[==========]
[ RUN ] StackOverflow.Compare -- 100 samples, 5000000 calls per run.
[ DONE ] StackOverflow.Compare (0.782818 sec) [5000000 calls in 782818 usec] [0.156564 us/call] [6387180.672902 calls/sec]
[ BASELINE ] StackOverflow.Compare 1.133699
[ RUN ] StackOverflow.Compare2 -- 100 samples, 5000000 calls per run.
[ DONE ] StackOverflow.Compare2 (0.700767 sec) [5000000 calls in 700767 usec] [0.140153 us/call] [7135039.178500 calls/sec]
[ BASELINE ] StackOverflow.Compare2 1.014870
[ RUN ] StackOverflow.Compare3 -- 100 samples, 5000000 calls per run.
[ DONE ] StackOverflow.Compare3 (0.709471 sec) [5000000 calls in 709471 usec] [0.141894 us/call] [7047504.408214 calls/sec]
[ BASELINE ] StackOverflow.Compare3 1.027476
[ RUN ] StackOverflow.Compare4 -- 100 samples, 5000000 calls per run.
[ DONE ] StackOverflow.Compare4 (0.712940 sec) [5000000 calls in 712940 usec] [0.142588 us/call] [7013212.893091 calls/sec]
[ BASELINE ] StackOverflow.Compare4 1.032500
[==========]
[ COMPLETE ]
[==========]
对于这个测试,看起来Compare2是这个微优化的最佳选择。
编辑:Compare2 Assembly(最佳情况):
cmp r8d, r9d
movzx eax, dl
setle BYTE PTR ret$[rsp]
cmp r8d, r9d
setge BYTE PTR ret$[rsp+1]
movzx eax, BYTE PTR ret$[rsp+rax]
Compare3 Assembly(次优情况):
xor r11d, r11d
cmp r8d, r9d
mov r10d, r11d
setg r10b
test dl, dl
mov ecx, r11d
sete cl
mov eax, r11d
cmp ecx, r10d
setne al
cmp r8d, r9d
sete r11b
or eax, r11d
这个怎么样?
inline bool Compare3(bool greater, int p1, int p2)
{
return (!greater != !(p1 <= p2)) | (p1 == p2);
}
或
inline bool Compare4(bool greater, int p1, int p2)
{
return (greater ^ (p1 <= p2)) | (p1 == p2);
}