Why is my assembly code much slower than the C implementation?



I am learning assembly, so I wrote a routine that returns the square root of its input if the input is non-negative, and 0 otherwise.

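I implemented the routine in both assembly and C, and I would like to understand why the C routine compiled with -O2 is so much faster than my assembly routine. The disassembly of the C routine looks slightly more complicated than my assembly routine, so I don't see what is going wrong.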

Assembly routine (srt.asm):

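global srt
section .text
srt:
    pxor    xmm1, xmm1      ; xmm1 = 0.0
    comisd  xmm0, xmm1      ; compare x with 0.0
    jbe     P               ; x <= 0 (or NaN): return 0
    sqrtsd  xmm0, xmm0      ; x > 0: return sqrt(x)
    retq
P:
    pxor    xmm0, xmm0      ; return 0.0
    retq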

I am assembling the above with

nasm -g -felf64 srt.asm

C routine (srtc.c):

#include <stdio.h>
#include <math.h>
#include <time.h>

/* Assembly routine from srt.asm, linked in as srt.o. */
extern double srt(double);

double srt1(double x)
{
    return sqrt( (x > 0) * x );
}

double srt2(double x)
{
    if( x > 0) return sqrt(x);
    return 0;
}

int main(void)
{
    double v = 0;
    clock_t start;
    clock_t end;
    double niter = 2e8;

    /* Time the assembly routine. */
    start = clock();
    v = 0;
    for( double i = 0; i < niter; i++ ) {
        v += srt(i);
    }
    end = clock();
    printf("time taken srt = %f v=%g\n", (double) (end - start)/CLOCKS_PER_SEC,v);

    /* Time the branchless C routine. */
    start = clock();
    v = 0;
    for( double i = 0; i < niter; i++ ) {
        v += srt1(i);
    }
    end = clock();
    printf("time taken srt1 = %f v=%g\n", (double) (end - start)/CLOCKS_PER_SEC,v);

    /* Time the branching C routine. */
    start = clock();
    v = 0;
    for( double i = 0; i < niter; i++ ) {
        v += srt2(i);
    }
    end = clock();
    printf("time taken srt2 = %f v=%g\n", (double) (end - start)/CLOCKS_PER_SEC,v);
    return 0;
}

The above is compiled with

gcc -g -O2 srt.o -o srtc srtc.c -lm

The output of the program is

time taken srt = 0.484375 v=1.88562e+12
time taken srt1 = 0.312500 v=1.88562e+12
time taken srt2 = 0.312500 v=1.88562e+12

So my assembly routine is noticeably slower.

The disassembled C code is

Disassembly of section .text:
0000000000000000 <srt1>:
0:   f3 0f 1e fa             endbr64 
4:   66 0f ef c9             pxor   xmm1,xmm1
8:   66 0f 2f c1             comisd xmm0,xmm1
c:   77 04                   ja     12 <srt1+0x12>
e:   f2 0f 59 c1             mulsd  xmm0,xmm1
12:   66 0f 2e c8             ucomisd xmm1,xmm0
16:   66 0f 28 d0             movapd xmm2,xmm0
1a:   f2 0f 51 d2             sqrtsd xmm2,xmm2
1e:   77 05                   ja     25 <srt1+0x25>
20:   66 0f 28 c2             movapd xmm0,xmm2
24:   c3                      ret    
25:   48 83 ec 18             sub    rsp,0x18
29:   f2 0f 11 54 24 08       movsd  QWORD PTR [rsp+0x8],xmm2
2f:   e8 00 00 00 00          call   34 <srt1+0x34>
34:   f2 0f 10 54 24 08       movsd  xmm2,QWORD PTR [rsp+0x8]
3a:   48 83 c4 18             add    rsp,0x18
3e:   66 0f 28 c2             movapd xmm0,xmm2
42:   c3                      ret    
43:   66 66 2e 0f 1f 84 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
4a:   00 00 00 00 
4e:   66 90                   xchg   ax,ax
0000000000000050 <srt2>:
50:   f3 0f 1e fa             endbr64 
54:   66 0f ef c9             pxor   xmm1,xmm1
58:   66 0f 2f c1             comisd xmm0,xmm1
5c:   66 0f 28 d1             movapd xmm2,xmm1
60:   77 0e                   ja     70 <srt2+0x20>
62:   66 0f 28 c2             movapd xmm0,xmm2
66:   c3                      ret    
67:   66 0f 1f 84 00 00 00    nop    WORD PTR [rax+rax*1+0x0]
6e:   00 00 
70:   66 0f 2e c8             ucomisd xmm1,xmm0
74:   66 0f 28 d0             movapd xmm2,xmm0
78:   f2 0f 51 d2             sqrtsd xmm2,xmm2
7c:   76 e4                   jbe    62 <srt2+0x12>
7e:   48 83 ec 18             sub    rsp,0x18
82:   f2 0f 11 54 24 08       movsd  QWORD PTR [rsp+0x8],xmm2
88:   e8 00 00 00 00          call   8d <srt2+0x3d>
8d:   f2 0f 10 54 24 08       movsd  xmm2,QWORD PTR [rsp+0x8]
93:   48 83 c4 18             add    rsp,0x18
97:   66 0f 28 c2             movapd xmm0,xmm2
9b:   c3                      ret    

A comment from Peter Cordes explains what is happening here: srt1 and srt2 are inlined into the loops in main, while srt is not. Quoting Peter Cordes:

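Oh right, simply being a non-inlined function is the problem. x86-64 System V has no call-preserved XMM registers, so the dependency chain through v (the addition) also includes a store/reload around the call to srt(), but not when srt1 or srt2 are inlined.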
