汇编语言(x86):如何创建循环来计算斐波那契数列

我使用Visual Studio 2013 Ultimate在MASM中编写汇编语言(x86)。我试图使用数组来计算n个元素的斐波那契序列。换句话说，我试图找到一个数组元素，获得它之前的两个元素，将它们相加，并将结果存储在另一个数组中。

我在设置索引寄存器以使此工作时遇到麻烦。

我的程序设置如下:

TITLE fibonacci.asm
INCLUDE Irvine32.inc
.data
fibInitial  BYTE 0, 1, 2, 3, 4, 5, 6
fibComputed BYTE 5 DUP(0)
.code
main PROC
MOVZX si, fibInitial
MOVZX di, fibComputed
MOV   cl, LENGTHOF fibInitial
L1:
MOV   ax, [si - 1]
MOV   dx, [si - 2]
MOV   bp, ax + dx
MOV   dl, TYPE fibInitial
MOVZX si, dl
MOV   [edi], bp
MOV   dh, TYPE fibComputed
MOVZX di, dl
loop L1
exit
main ENDP
END main

我不能编译这个，因为一个错误消息说"错误A2031:必须是索引或基本寄存器"行MOV ebp, ax + dx。然而，我确信还有其他的逻辑错误被我忽略了。

related: Code-golf打印Fib(10**9)的前1000位数字:我的x86 asm答案使用扩展精度adc循环，并将二进制转换为字符串。内环为速度而优化，其他部分为尺寸而优化。

计算斐波那契序列只需要保持两段状态:当前和前一个元素。我不知道你想用fibInitial做什么，除了计算它的长度。这不是perl的for $n (0..5)。

我知道你只是在学习asm，但我还是要谈谈性能。如果不知道什么是快的，什么是慢的，就没有太多的理由去学习asm。如果您不需要性能，让编译器从C源代码为您生成asm。其他链接请参见https://stackoverflow.com/tags/x86/info

为你的状态使用寄存器简化了在计算a[1]时需要查看a[-1]的问题。从curr=1、prev=0和a[0] = curr开始。生产"现代的";从0开始的斐波那契数列，从curr=0,prev=1开始。

你很幸运，我最近正在考虑一个有效的斐波那契循环代码，所以我花时间写了一个完整的函数。请参阅下面的展开和矢量化版本(节省了存储指令，但即使在32位CPU上编译也可以使64位整型更快):

; fib.asm
;void fib(int32_t *dest, uint32_t count);
; not-unrolled version.  See below for a version which avoids all the mov instructions
global fib
fib:
; 64bit SysV register-call ABI:
; args: rdi: output buffer pointer.  esi: count  (and you can assume the upper32 are zeroed, so using rsi is safe)
;; locals:  rsi: endp
;; eax: current   edx: prev
;; ecx: tmp
;; all of these are caller-saved in the SysV ABI, like r8-r11
;; so we can use them without push/pop to save/restore them.
;; The Windows ABI is different.
test   esi, esi       ; test a reg against itself instead of cmp esi, 0
jz     .early_out     ; count == 0.  
mov    eax, 1         ; current = 1
xor    edx, edx       ; prev    = 0
lea    rsi, [rdi + rsi * 4]  ; endp = &out[count];  // loop-end pointer
;; lea is very useful for combining add, shift, and non-destructive operation
;; this is equivalent to shl rsi, 4  /  add rsi, rdi
align 16
.loop:                    ; do {
mov    [rdi], eax     ;   *buf = current
add    rdi, 4         ;   buf++
lea    ecx, [rax + rdx] ; tmp = curr+prev = next_cur
mov    edx,  eax      ; prev = curr
mov    eax,  ecx      ; curr=tmp
;; see below for an unrolled version that doesn't need any reg->reg mov instructions
; you might think this would be faster:
; add  edx, eax    ; but it isn't
; xchg eax, edx    ; This is as slow as 3 mov instructions, but we only needed 2 thanks to using lea
cmp    rdi, rsi       ; } while(buf < endp);
jb    .loop           ; jump if (rdi BELOW rsi).  unsigned compare
;; the LOOP instruction is very slow, avoid it
.early_out:
ret

另一个循环条件可以是

dec     esi         ; often you'd use ecx for counts, but we had it in esi
jnz     .loop

AMD cpu可以融合cmp/branch，但不能融合dec/branch。Intel cpu也可以宏融合dec/jnz。(或符号小于零/大于零)。dec/inc不更新进位标志，所以你不能在unsignedja/jb以上/以下使用它们。我认为这个想法是你可以在循环中做一个adc(加上进位)，使用inc/dec作为循环计数器来不干扰进位标志，但是部分标志的减速使这在现代cpu上变得很糟糕。

lea ecx, [eax + edx]需要一个额外的字节(地址大小的前缀)，这就是为什么我使用32位的dest和64位的地址。(这些是lea在64位模式下的默认操作数大小)。对速度没有直接影响，只是通过代码大小间接影响。

循环体可以是:

mov  ecx, eax      ; tmp=curr.  This stays true after every iteration
.loop:
mov  [rdi], ecx
add  ecx, edx      ; tmp+=prev  ;; shorter encoding than lea
mov  edx, eax      ; prev=curr
mov  eax, ecx      ; curr=tmp

展开循环以进行更多的迭代将意味着更少的洗牌。而不是mov指令，您只需跟踪哪个寄存器保存哪个变量。也就是说，你用一种寄存器重命名来处理赋值。

.loop:     ;; on entry:       ; curr:eax  prev:edx
mov  [rdi], eax             ; store curr
add  edx, eax             ; curr:edx  prev:eax
.oddentry:
mov  [rdi + 4], edx         ; store curr
add  eax, edx             ; curr:eax  prev:edx
;; we're back to our starting state, so we can loop
add  rdi, 8
cmp  rdi, rsi
jb   .loop

展开的事情是，您需要清理剩下的任何奇怪的迭代。展开因子的2次幂可以使清理循环稍微容易一些，但是加12并不比加16快。(参见这篇文章的前一个版本，使用lea在第三寄存器中生成curr + prev，因为我没有意识到你实际上并不需要一个临时寄存器。感谢rcgldr捕捉到这一点。)

请参阅下面的完整版本，该版本可以处理任何计数。

测试前端(此版本新增:一个金丝雀元素，用于检测超过缓冲区末尾的asm错误。)

// fib-main.c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
void fib(uint32_t *buf, uint32_t count);
int main(int argc, const char *argv[]) {
uint32_t count = 15;
if (argc > 1) {
count = atoi(argv[1]);
}
uint32_t buf[count+1]; // allocated on the stack
// Fib overflows uint32 at count = 48, so it's not like a lot of space is useful
buf[count] = 0xdeadbeefUL;
// uint32_t count = sizeof(buf)/sizeof(buf[0]);
fib(buf, count);
for (uint32_t i ; i < count ; i++){
printf("%u ", buf[i]);
}
putchar('n');
if (buf[count] != 0xdeadbeefUL) {
printf("fib wrote past the end of buf: sentinel = %xn", buf[count]);
}
}

这段代码是完全工作和测试的(除非我错过了将本地文件中的更改复制回答案>.<):

peter@tesla:~/src/SO$ yasm -f elf64 fib.asm && gcc -std=gnu11 -g -Og fib-main.c fib.o
peter@tesla:~/src/SO$ ./a.out 48
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269 2178309 3524578 5702887 9227465 14930352 24157817 39088169 63245986 102334155 165580141 267914296 433494437 701408733 1134903170 1836311903 2971215073 512559680

展开版本再次感谢rcgldr，它让我思考如何在循环设置中处理奇数与偶数计数，而不是在最后进行清理迭代。

我选择了无分支的设置代码，它在起始指针上添加了4 * count%2。它可以是零，但加零比分支要便宜得多。斐波那契数列很快就会溢出寄存器，因此保持序言代码的紧凑和高效非常重要，而不仅仅是循环内的代码。(如果我们要优化的话，我们希望优化许多短长度的调用)。

; 64bit SysV register-call ABI
; args: rdi: output buffer pointer.  rsi: count
;; locals:  rsi: endp
;; eax: current   edx: prev
;; ecx: tmp
;; all of these are caller-saved in the SysV ABI, like r8-r11
;; so we can use them without push/pop to save/restore them.
;; The Windows ABI is different.
;void fib(int32_t *dest, uint32_t count);  // unrolled version
global fib
fib:
cmp    esi, 1
jb     .early_out       ; count below 1  (i.e. count==0, since it's unsigned)
mov    eax, 1           ; current = 1
mov    [rdi], eax
je     .early_out       ; count == 1, flags still set from cmp
;; need this 2nd early-out because the loop always does 2 iterations
;;; branchless handling of odd counts:
;;;   always do buf[0]=1, then start the loop from 0 or 1
;;; Writing to an address you just wrote to is very cheap
;;; mov/lea is about as cheap as best-case for branching (correctly-predicted test/jcc for count%2==0)
;;; and saves probably one unconditional jump that would be needed either in the odd or even branch
mov    edx, esi         ;; we could save this mov by using esi for prev, and loading the end pointer into a different reg
and    edx, eax         ; prev = count & 1 = count%2
lea    rsi, [rdi + rsi*4] ; end pointer: same regardless of starting at 0 or 1
lea    rdi, [rdi + rdx*4] ; buf += count%2
;; even count: loop starts at buf[0], with curr=1, prev=0
;; odd  count: loop starts at buf[1], with curr=1, prev=1
align 16  ;; the rest of this func is just *slightly* longer than 16B, so there's a lot of padding.  Tempting to omit this alignment for CPUs with a loop buffer.
.loop:                      ;; do {
mov    [rdi], eax       ;;   *buf = current
; on loop entry: curr:eax  prev:edx
add   edx, eax          ; curr:edx  prev:eax
;.oddentry: ; unused, we used a branchless sequence to handle odd counts
mov   [rdi+4], edx
add   eax, edx          ; curr:eax  prev:edx
;; back to our starting arrangement
add    rdi, 8           ;;   buf++
cmp    rdi, rsi         ;; } while(buf < endp);
jb    .loop
;   dec   esi   ;  set up for this version with sub esi, edx; instead of lea
;   jnz   .loop
.early_out:
ret

要生成以零开始的序列，执行

curr=count&1;   // and esi, 1
buf += curr;    // lea [rdi], [rdi + rsi*4]
prev= 1 ^ curr; // xor eax, esi

代替当前

curr = 1;
prev = count & 1;
buf += count & 1;

我们也可以通过使用esi保存prev来保存两个版本的mov指令，现在prev依赖于count。

;; loop prologue for sequence starting with 1 1 2 3
;; (using different regs and optimized for size by using fewer immediates)
mov    eax, 1               ; current = 1
cmp    esi, eax
jb     .early_out           ; count below 1
mov    [rdi], eax
je     .early_out           ; count == 1, flags still set from cmp
lea    rdx, [rdi + rsi*4]   ; endp
and    esi, eax             ; prev = count & 1
lea    rdi, [rdi + rsi*4]   ; buf += count & 1
;; eax:curr esi:prev    rdx:endp  rdi:buf
;; end of old code
;; loop prologue for sequence starting with 0 1 1 2
cmp    esi, 1
jb     .early_out           ; count below 1, no stores
mov    [rdi], 0             ; store first element
je     .early_out           ; count == 1, flags still set from cmp
lea    rdx, [rdi + rsi*4]   ; endp
mov    eax, 1               ; prev = 1
and    esi, eax             ; curr = count&1
lea    rdi, [rdi + rsi*4]   ; buf += count&1
xor    eax, esi             ; prev = 1^curr
;; ESI:curr EAX:prev  (opposite of other setup)
;;

;; optimized for code size, NOT speed.  Prob. could be smaller, esp. if we want to keep the loop start aligned, and jump between before and after it.
;; most of the savings are from avoiding mov reg, imm32,
;; and from counting down the loop counter, instead of checking an end-pointer.
;; loop prologue for sequence starting with 0 1 1 2
xor    edx, edx
cmp    esi, 1
jb     .early_out         ; count below 1, no stores
mov    [rdi], edx         ; store first element
je     .early_out         ; count == 1, flags still set from cmp
xor    eax, eax  ; movzx after setcc would be faster, but one more byte
shr    esi, 1             ; two counts per iteration, divide by two
;; shift sets CF = the last bit shifted out
setc   al                 ; curr =   count&1
setnc  dl                 ; prev = !(count&1)
lea    rdi, [rdi + rax*4] ; buf+= count&1
;; extra uop or partial register stall internally when reading eax after writing al, on Intel (except P4 & silvermont)
;; EAX:curr EDX:prev  (same as 1 1 2 setup)
;; even count: loop starts at buf[0], with curr=0, prev=1
;; odd  count: loop starts at buf[1], with curr=1, prev=0
.loop:
...
dec  esi                  ; 1B smaller than 64b cmp, needs count/2 in esi
jnz .loop
.early_out:
ret

矢量化:

斐波那契数列不是特别可并行的。没有简单的方法从F(i)和F(i-4)中得到F(i+4)或者类似的东西。对于vector，可以做的是减少内存存储。首先:

a = [f3 f2 f1 f0 ]   -> store this to buf
b = [f2 f1 f0 f-1]

然后a+=b; b+=a; a+=b; b+=a;产生:

a = [f7 f6 f5 f4 ]   -> store this to buf
b = [f6 f5 f4 f3 ]

当使用两个64位整数打包成128b向量时，这就不那么愚蠢了。即使在32位代码中，您也可以使用SSE进行64位整数运算。

这个答案的前一个版本有一个未完成的打包32位矢量版本，不能正确处理count%4 != 0。为了加载序列的前4个值，我使用pmovzxbd，所以当我只使用4B时，我不需要16B的数据。得到第一个-1…将序列中的1个值放入向量寄存器要容易得多，因为只有一个非零值需要加载和洗牌。

;void fib64_sse(uint64_t *dest, uint32_t count);
; using SSE for fewer but larger stores, and for 64bit integers even in 32bit mode
global fib64_sse
fib64_sse:
mov eax, 1
movd    xmm1, eax               ; xmm1 = [0 1] = [f0 f-1]
pshufd  xmm0, xmm1, 11001111b   ; xmm0 = [1 0] = [f1 f0]
sub esi, 2
jae .entry  ; make the common case faster with fewer branches
;; could put the handling for count==0 and count==1 right here, with its own ret
jmp .cleanup
align 16
.loop:                          ; do {
paddq   xmm0, xmm1          ; xmm0 = [ f3 f2 ]
.entry:
;; xmm1: [ f0 f-1 ]         ; on initial entry, count already decremented by 2
;; xmm0: [ f1 f0  ]
paddq   xmm1, xmm0          ; xmm1 = [ f4 f3 ]  (or [ f2 f1 ] on first iter)
movdqu  [rdi], xmm0         ; store 2nd last compute result, ready for cleanup of odd count
add     rdi, 16         ;   buf += 2
sub esi, 2
jae   .loop             ; } while((count-=2) >= 0);
.cleanup:
;; esi <= 0 : -2 on the count=0 special case, otherwise -1 or 0
;; xmm1: [ f_rc   f_rc-1 ]  ; rc = count Rounded down to even: count & ~1
;; xmm0: [ f_rc+1 f_rc   ]  ; f(rc+1) is the value we need to store if count was odd
cmp esi, -1
jne   .out  ; this could be a test on the Parity flag, with no extra cmp, if we wanted to be really hard to read and need a big comment explaining the logic
;; xmm1 = [f1 f0]
movhps  [rdi], xmm1         ; store the high 64b of xmm0.  There is no integer version of this insn, but that doesn't matter
.out:
ret

没有必要进一步展开，深链延迟限制了吞吐量，所以我们总是可以平均每个周期存储一个元素。减少循环开销可以帮助超线程，但这是相当小的。

正如您所看到的，处理所有的极端情况，即使展开两个也是非常复杂的。它需要额外的启动开销，即使当您试图优化它以将其保持在最低限度时也是如此。很容易产生大量的条件分支。

主要更新:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>
#ifdef USE32
void fib(uint32_t *buf, uint32_t count);
typedef uint32_t buftype_t;
#define FMTx PRIx32
#define FMTu PRIu32
#define FIB_FN fib
#define CANARY 0xdeadbeefUL
#else
void fib64_sse(uint64_t *buf, uint32_t count);
typedef uint64_t buftype_t;
#define FMTx PRIx64
#define FMTu PRIu64
#define FIB_FN fib64_sse
#define CANARY 0xdeadbeefdeadc0deULL
#endif
#define xstr(s) str(s)
#define str(s) #s
int main(int argc, const char *argv[]) {
uint32_t count = 15;
if (argc > 1) {
count = atoi(argv[1]);
}
int benchmark = argc > 2;
buftype_t buf[count+1]; // allocated on the stack
// Fib overflows uint32 at count = 48, so it's not like a lot of space is useful
buf[count] = CANARY;
// uint32_t count = sizeof(buf)/sizeof(buf[0]);
if (benchmark) {
int64_t reps = 1000000000 / count;
for (int i=0 ; i<=reps ; i++)
FIB_FN(buf, count);
} else {
FIB_FN(buf, count);
for (uint32_t i ; i < count ; i++){
printf("%" FMTu " ", buf[i]);
}
putchar('n');
}
if (buf[count] != CANARY) {
printf(xstr(FIB_FN) " wrote past the end of buf: sentinel = %" FMTx "n", buf[count]);
}
}

性能对于略低于8192的count(适合L1d缓存)，在我的Sandybridge i5-2500k上，每个周期1个存储(每个周期3.5条指令)的理论最大吞吐量附近运行。8192 * 4B/int = 32768 = L1缓存大小。在实践中，我看到~3.3到~3.4个周期。不过，我计算的是Linuxperf stat的整个程序，而不仅仅是紧循环。(注意，重复循环调用Fib函数，所以程序的大部分时间都花在它上面，尽管有一些启动开销。)

无论如何，继续展开没有任何意义。显然，在count=47之后，这不再是斐波那契序列，因为我们使用uint32_t。然而，对于大型count，吞吐量受到内存带宽的限制，下降到~2.6 insn/cycle。在这一点上，我们基本上是在看如何优化memset。

使用movdqu存储(fib64_sse)的64位整数版本每周期运行3次(每两个时钟一个128b存储)，数组大小约为L2缓存大小的1.5倍。(即./fib64 49152)。当数组大小增加到L3缓存大小的较大部分时，性能在L3缓存大小的3/4时下降到每周期约2 insn(每3个时钟一个存储)。在尺寸为>L3缓存。

因此，使用向量存储比使用标量存储对L1d缓存来说太大但适合L2缓存的数组要好。

考虑到fib(93) = 12200160415121876738是适合64位无符号整数的最大值，尝试优化它可能没有太多意义，除非计算fib(n)模取某个(通常是素数)大n。

有一种方法可以在log₂(n)次迭代中直接计算fib(n)，使用fibonacci的lucas序列法或矩阵法。卢卡斯序列更快，如下所示。这些可以修改为执行数学模数。

/* lucas sequence method */
uint64_t fibl(int n) {
uint64_t a, b, p, q, qq, aq;
a = q = 1;
b = p = 0;
while(1){
if(n & 1) {
aq = a*q;
a = b*q + aq + a*p;
b = b*p + aq;
}
n >>= 1;
if(n == 0)
break;
qq = q*q;
q = 2*p*q + qq;
p = p*p + qq;
}
return b;
}

.386
.model flat, stdcall
.stack 4096
ExitProcess proto, dwExitCode:dword
.data
fib word 1, 1, 5 dup(?);you create an array with the number of the fibonacci series that you want to get
.code
main proc
mov esi, offset fib ;set the stack index to the offset of the array.Note that this can also be set to 0
mov cx, lengthof fib ;set the counter for the array to the length of the array. This keeps track of the number of times your loop will go
L1: ;start the loop
mov ax, [esi]; move the first element to ax ;move the first element in the array to the ax register
add ax, [esi + type fib]; add the second element to the value in ax. Which gives the next element in the series
mov[esi + 2* type fib], ax; assign the addition to the third value in the array, i.e the next number in the fibonacci series
add esi, type fib;increment the index to move to the next value
loop L1; repeat
invoke ExitProcess, 0
main endp
end main

矢量化:

相关内容

最新更新

热门标签：