优化打印计数器的循环



我有一个很小的循环程序,将数字从5000000到1打印出来。我想让它运行最快。

我正在学习Linux X86-64组装nasm。

global  main
extern  printf
main:
    push    rbx                     
    mov rax,5000000d
print:
    push    rax                     
    push    rcx                     
    mov     rdi, format             
    mov     rsi, rax                
    call    printf                  
    pop     rcx                     
    pop     rax                     
    dec     rax                     
    jnz     print                   
    pop     rbx                     
    ret
format:
db  "%ld", 10, 0

printf的调用完全占主导地位,即使是高效效率的循环的运行时间。(您是否注意到即使您从未在任何地方使用它,也可以按下/pop rcx?也许使用慢速循环指令剩下的。

要了解有关编写有效X86 ASM的更多信息,请参见Agner Fog的优化组装指南。(以及他的微观结构指南,如果您想真正了解特定CPU的详细信息以及它们如何与众不同:一个Uarch CPU的最佳选择可能不在另一个CPU上。CPU比AMD上的CPU,但是CMOV和ADC在英特尔Predelwell上是2个UOPS,具有2个周期延迟。与AMD上的1个UOPS,因为3输入Alu M-ops(flags 二寄存器)不是AMD的问题。)另请参阅x86标签Wiki中的其他链接。


纯粹在不更改printf的5M调用的情况下纯粹优化循环仅作为如何正确编写循环的示例,而不是实际加速此代码。但是让我们从此开始:

; trivial fixes to loop efficiently while calling the same slow function
global  main
extern  printf
main:
    push    rbx
    mov     ebx, 5000000         ; don't waste a REX prefix for constants that fit in 32 bits
.print:
    ;; removed the push/pops from inside the loop.
    ; Use call-preserved regs instead of saving/restoring stuff inside a loop yourself.
    mov     edi, format          ; static data / code always has a 32-bit address
    mov     esi, ebx
    xor     eax, eax             ; The x86-64 SysV ABI requires al = number of FP args passed in FP registers for variadic functions
    call    printf                  
    dec     ebx
    jnz     .print
    pop     rbx                ; restore rbx, the one call-preserved reg we actually used.
    xor     eax,eax            ; successful exit status.
    ret
section .rodata       ; it's usually best to put constant data in a separate section of the text segment, not right next to code.
format:
db  "%ld", 10, 0

为了加快此速度,我们应该利用将连续整数转换为字符串的冗余。由于"5000000n"只有8个字节长(包括newline),因此字符串表示形式适合64位寄存器。

我们可以将该字符串存储到缓冲区中,并按字符串长度递增指针。(由于它的数字较小,只需将当前的字符串长度保留在寄存器中,您可以在它更改的特殊案例分支中进行更新。)

我们可以将字符串表示形式降低,以避免(re)将整数转换为十进制字符串。

由于随身携带/借用不会自然地在寄存器内传播,并且AAS指令在64位模式下不可用(并且仅在AX上工作,甚至不在EAX上,而且很慢),我们必须这样做我们自己。我们每次都会减少1个,因此我们知道会发生什么。我们可以通过展开10次来处理最不重要的数字,因此没有分支来处理它。

还要注意,由于我们想按打印顺序达到数字,因此随身携带的方向是错误的,因为x86是小的。如果有一个很好的方法可以利用其他字节顺序使我们的字符串使用,我们可以使用BSWAP或MOVBE。(但是请注意,Movbe R64是Skylake上的3个融合域UOPS,其中2个AluUOPS。BSWAPR64也是2个UOPS。)

也许我们应该在XMM矢量寄存器的两半中并行进行奇数/偶数计数器。但是,一旦字符串短于8B,这就停止工作良好。一次存储一个数字串,我们可以轻松重叠。尽管如此,我们可以在矢量规范中进行随身携带的操作,并与Movq和Movhps分别存储两半。或由于从0到5m的数字的4/5是7位数字,因此对于特殊情况下,我们可以存储整个16B向量的特殊情况。

处理较短字符串的更好方法: SSSE3 PSHUFB将两个字符串洗净左包装在矢量寄存器中,然后单个移动一次以一次存储两个。洗牌蒙版仅在字符串长度(数字数)更改时才需要更新,因此不经常执行的随身携带处理特殊案例代码也可以做到这一点。

循环热部分的矢量化应该非常简单明了,并且应大约双重性能。

;;; Optimized version: keep the string data in a register and modify it
;;; instead of doing the whole int->string conversion every time.
section  .bss
printbuf:  resb 1024*128 + 4096     ;  Buffer size ~= half L2 cache size on Intel SnB-family.  Or use a giant buffer that we write() once.  Or maybe vmsplice to give it away to the kernel, since we only run once.
global  main
extern  printf
main:
    push    rbx
    ; use some REX-only regs for values that we're always going to use a REX prefix with anyway for 64-bit operand size.
    mov     rdx, `5000000n`   ; (NASM string constants as integers work like little-endian, so AL = '5' = 0x35 and the high byte holds 'n' = 10).  Note that YASM doesn't support back-ticks for C-style backslash processing.
    mov     r9, 1<<56         ; decrement by 1 in the 2nd-last byte: LSB of the decimal string
    ;xor     r9d, r9d
    ;bts      r9, 56           ; IDK if this code-size optimization outside the loop would help or not.
    mov     eax, 8            ; string length.
    mov     edi, printbuf
.storeloop:
    ;;  rdx = "????x9n".  We compute the start value for the next iteration, i.e. counter -= 10 in rdx.
    mov     r8, rdx
    ;;  r8 = rdx.  We modify it to have each last digit from 9 down to 0 in sequence, and store those strings in the buffer.
    ;;  The string could be any length, always with the first ASCII digit in the low byte; our other constants are adjusted correctly for it
    ;; narrower than 8B means that our stores overlap, but that's fine.
    ;; Starting from here to compute the next unrolled iteration's starting value takes the `sub r8, r9` instructions off the critical path, vs. if we started from r8 at the bottom of the loop.  This gives out-of-order execution more to play with.
    ;;  It means each loop iteration's sequence of subs and stores are a separate dependency chain (except for the store addresses, but OOO can get ahead on those because we only pointer-increment every 2 stores).
    mov     [rdi], r8
    sub     r8, r9             ; r8 = "xxx8n"
    mov     [rdi + rax], r8    ; defer p += len by using a 2-reg addressing mode
    sub     r8, r9             ; r8 = "xxx7n"
    lea     edi, [rdi + rax*2]  ; if we had len*3 in another reg, we could defer this longer
           ;; our static buffer is guaranteed to be in the low 31 bits of address space so we can safely save a REX prefix on the LEA here.  Normally you shouldn't truncate pointers to 32-bits, but you asked for the fastest possible.  This won't hurt, and might help on some CPUs, especially with possible decode bottlenecks.
    ;; repeat that block 3 more times.
    ;; using a short inner loop for the 9..0 last digit might be a win on some CPUs (like maybe Core2), depending on their front-end loop-buffer capabilities if the frontend is a bottleneck at all here.
    ;; anyway, then for the last one:
    mov     [rdi], r8             ; r8 = "xxx1n"
    sub     r8, r9
    mov     [rdi + rax], r8       ; r8 = "xxx0n"
    lea     edi, [rdi + rax*2]

    ;; compute next iteration's RDX.  It's probably a win to interleave some of this into the loop body, but out-of-order execution should do a reasonably good job here.
    mov     rcx, r9
    shr     rcx, 8      ; maybe hoist this constant out, too
    ; rcx = 1 in the second-lowest digit
    sub     rdx, rcx
    ; detect carry when '0' (0x30) - 1 = 0x2F by checking the low bit of the high nibble in that byte.
    shl     rcx, 5
    test    rdx, rcx
    jz      .carry_second_digit
    ; .carry_second_digit is some complicated code to propagate carry as far as it needs to go, up to the most-significant digit.
    ; when it's done, it re-enters the loop at the top, with eax and r9 set appropriately.
    ; it only runs once per 100 digits, so it doesn't have to be super-fast
    ; maybe only do buffer-length checks in the carry-handling branch,
    ; in which case the jz .carry  can be  jnz .storeloop
    cmp     edi, esi              ; } while(p < endp)
    jbe     .storeloop
    ; write() system call on the buffer.
    ; Maybe need a loop around this instead of doing all 5M integer-strings in one giant buffer.
    pop     rbx
    xor     eax,eax            ; successful exit status.
    ret

这不是完全充实的,但应该对什么可能有效的想法。

如果使用SSE2进行矢量化,则可能使用标量整数寄存器来跟踪您何时需要突破并处理携带。即10.

的下场

即使这个标量版本也可能接近每个时钟维持一家商店,这使商店端口饱和。它们只有8b商店(当字符串变短时,有用的部分比那短),因此,如果我们不在缓存失误上瓶颈,我们肯定会在桌子上留下性能。但是,使用3GHz CPU和双通道DDR3-1600(〜25.6GB/s理论最大带宽),每个时钟8B足以用单个核心使主内存饱和。

我们可以并行化它,然后将5m。1范围范围范围范围。有了一些巧妙的数学,我们可以弄清楚什么字节来编写"2500000n"的第一个字符,否则我们可以按正确的顺序呼叫write()本身。(或使用相同的聪明数学来让他们独立致电pwrite(2),以不同的文件偏移,因此内核照顾了多个作家的所有同步。)

您实际上是在打印固定的字符串。我会将该字符串预先为一个较长的常数。

然后,该程序将变成对write的单个调用(或一个短循环来处理不完整的写入)。

最新更新