How to optimize a block copy with right shift and saturation to max=5 for Cortex-M3



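Basically, I need to make this code more efficient, either by reducing its overall size (and so its memory footprint) or by making it run faster. I am using Thumb-2 on a Cortex-M3.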

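I tried reducing the number of MOV instructions. That did shrink the code, but because of the way the code works, each individual step has to fetch a value and store its result back in a register, so I am stuck on how to improve it further. The code below is still in its original, unmodified state.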

        THUMB
        AREA RESET, CODE, READONLY
        EXPORT  __Vectors
        EXPORT Reset_Handler
__Vectors
        DCD 0x00180000     ; top of the stack
        DCD Reset_Handler  ; reset vector - where the program starts
        AREA 2a_Code, CODE, READONLY
Reset_Handler
        ENTRY
num_words EQU (end_source-source)/4  ; number of words to copy
start
        LDR r0,=source     ; point to the start of the area of memory to copy from
        LDR r1,=dest       ; point to the start of the area of memory to copy to
        MOV r2,#num_words  ; get the number of words to copy
        ; find out how many blocks of 8 words need to be copied - it is assumed
        ; that it is faster to load 8 data items at a time, rather than load
        ; individually
block
        MOVS r3,r2,LSR #3  ; find the number of blocks of 8 words
        BEQ individ        ; if no blocks to copy, just copy individual words
        ; copy and process blocks of 8 words
block_loop
        LDMIA r0!,{r5-r12}  ; get 8 words to copy as a block
        MOV r4,r5           ; get first item
        BL data_processing  ; process first item
        MOV r5,r4           ; keep first item
        MOV r4,r6           ; get second item
        BL data_processing  ; process second item
        MOV r6,r4           ; keep second item
        MOV r4,r7           ; get third item
        BL data_processing  ; process third item
        MOV r7,r4           ; keep third item
        MOV r4,r8           ; get fourth item
        BL data_processing  ; process fourth item
        MOV r8,r4           ; keep fourth item
        MOV r4,r9           ; get fifth item
        BL data_processing  ; process fifth item
        MOV r9,r4           ; keep fifth item
        MOV r4,r10          ; get sixth item
        BL data_processing  ; process sixth item
        MOV r10,r4          ; keep sixth item
        MOV r4,r11          ; get seventh item
        BL data_processing  ; process seventh item
        MOV r11,r4          ; keep seventh item
        MOV r4,r12          ; get eighth item
        BL data_processing  ; process eighth item
        MOV r12,r4          ; keep eighth item
        STMIA r1!,{r5-r12}  ; copy the 8 words
        SUBS r3,r3,#1       ; move on to the next block
        BNE block_loop      ; continue until last block reached
        ; there may now be some data items available (fewer than 8)
        ; find out how many of these individual words need to be copied
individ
        ANDS r3,r2,#7   ; find the number of words that remain to copy individually
        BEQ exit        ; skip individual copying if none remains
        ; copy the excess of words
individ_loop
        LDR r4,[r0],#4      ; get next word to copy
        BL data_processing  ; process the item read
        STR r4,[r1],#4      ; copy the word
        SUBS r3,r3,#1       ; move on to the next word
        BNE individ_loop    ; continue until the last word reached
        ; languish in an endless loop once all is done
exit
        B exit
        ; subroutine to scale a value by 0.5 and then saturate values to a maximum of 5
data_processing
        CMP r4,#10           ; check whether saturation is needed
        BLT divide_by_two    ; if not, just divide by 2
        MOV r4,#5            ; saturate to 5
        BX lr
divide_by_two
        MOV r4,r4,LSR #1     ; perform scaling
        BX lr
        AREA 2a_ROData, DATA, READONLY
source  ; some data to copy
        DCD 1,2,3,4,5,6,7,8,9,10,11,0,4,6,12,15,13,8,5,4,3,2,1,6,23,11,9,10
end_source
        AREA 2a_RWData, DATA, READWRITE
dest  ; copy to this area of memory
        SPACE end_source-source
end_dest
        END

In short, the code still has to leave each result in its register, but I would like it to be smaller or to execute faster. Thanks for your help.

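Here is a slightly optimized version of the main loop. Since you are programming for a Cortex-M3, superscalar or SIMD processing is not an option because your CPU does not support it. The main differences between this and your code are:

  • all relevant functions have been inlined
  • the logic has been optimized a little
  • useless move instructions have been omitted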

This code runs in 10 cycles per table entry, plus a few instructions for the initial branch and the final branch misprediction.

        .syntax unified
        .thumb
        @ r0: source
        @ r1: destination
        @ r2: number of words to copy
        @ the number in front of the comment is the number
        @ of cycles needed to execute the instruction
block:  cbz r2, .Lbxlr          @ 2 return if nothing to copy
.Loop:  ldmia r0!, {r3}         @ 2 load one item from source
        cmp r3, #10             @ 1 need to scale?
        ite lt                  @ 1 if r3 < 10:
        lsrlt r3, r3, #1        @ 1 then r3 >>= 1
        movge r3, #5            @ 1 else r3 = 5
        stmia r1!, {r3}         @ 2 store to destination
        subs r2, r2, #1         @ 1 decrement #words
        bne .Loop               @ 1 continue if not done yet
.Lbxlr: bx lr
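
To check the assembly against expected values, it can help to have a reference model on the host. The following is a minimal C sketch of what the loop above computes, not part of the original answer; the names `scale_and_saturate` and `block_copy` are made up for illustration. It mirrors the listing as written: the `lt` condition is a signed comparison, while `lsr` is a logical (unsigned) shift.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the per-item step: signed compare against 10, then either
 * saturate to 5 or halve with a logical shift right.
 * (Hypothetical helper name, not from the original post.) */
static uint32_t scale_and_saturate(uint32_t x)
{
    if ((int32_t)x < 10)   /* cmp + lt: signed "less than" */
        return x >> 1;     /* lsr #1: logical shift, halves the value */
    return 5;              /* 10 or more saturates to 5 */
}

/* Hypothetical C counterpart of the assembly loop: process n words. */
static void block_copy(const uint32_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = scale_and_saturate(src[i]);
}
```

Running `block_copy` over the `source` table from the question and comparing the result with the contents of `dest` after the assembly has run is one way to confirm that an optimized loop still behaves like the original.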

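By unrolling the loop once, you can get this down to 16 cycles per two entries (8 cycles per entry). Note that this nearly triples the code length for only a small performance gain.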

        .syntax unified
        .thumb
        @ r0: source
        @ r1: destination
        @ r2: number of words to copy
        @ the number in front of the comment is the number
        @ of cycles needed to execute the instruction
        @ first check if the number of elements is even or odd
        @ leave this out if it's known to be even
block:  tst r2, #1              @ 1 odd number of entries to copy?
        beq .Leven              @ 2 if even, skip the odd-element handling
        ldmia r0!, {r3}         @ 2 load one item from source
        cmp r3, #10             @ 1 need to scale?
        ite lt                  @ 1 if r3 < 10:
        lsrlt r3, r3, #1        @ 1 then r3 >>= 1
        movge r3, #5            @ 1 else r3 = 5
        stmia r1!, {r3}         @ 2 store to destination
        subs r2, r2, #1         @ 1 decrement #words
        @ check if any elements are left
        @ remove if you know that at least two elements are present
.Leven: cbz r2, .Lbxlr          @ 2 return if no entries left.
.Loop:  ldmia r0!, {r3, r4}     @ 3 load two items from source
        cmp r3, #10             @ 1 need to scale?
        ite lt                  @ 1 if r3 < 10:
        lsrlt r3, r3, #1        @ 1 then r3 >>= 1
        movge r3, #5            @ 1 else r3 = 5
        cmp r4, #10             @ 1 need to scale?
        ite lt                  @ 1 if r4 < 10:
        lsrlt r4, r4, #1        @ 1 then r4 >>= 1
        movge r4, #5            @ 1 else r4 = 5
        stmia r1!, {r3, r4}     @ 3 store to destination
        subs r2, r2, #2         @ 1 decrement #words twice
        bne .Loop               @ 1 continue if not done yet
.Lbxlr: bx lr
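
The control flow of the unrolled version (peel off one element if the count is odd, then work in pairs) may be easier to follow in a higher-level form. Here is a rough C sketch of the same structure, written for illustration only; `process_item` and `block_copy_unrolled2` are invented names, and the sketch assumes the same signed-compare and logical-shift behaviour as the listing.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-item step: halve values below 10 (signed compare), else return 5.
 * (Hypothetical helper name, not from the original post.) */
static uint32_t process_item(uint32_t x)
{
    return ((int32_t)x < 10) ? (x >> 1) : 5u;
}

/* Hypothetical C counterpart of the loop unrolled once. */
static void block_copy_unrolled2(const uint32_t *src, uint32_t *dst, size_t n)
{
    if (n & 1) {                 /* tst r2, #1: odd number of entries? */
        *dst++ = process_item(*src++);
        n--;                     /* the remaining count is now even */
    }
    while (n != 0) {             /* cbz on entry, bne at the bottom */
        uint32_t a = src[0];     /* ldmia r0!, {r3, r4} */
        uint32_t b = src[1];
        dst[0] = process_item(a);
        dst[1] = process_item(b);
        src += 2;
        dst += 2;
        n -= 2;                  /* subs r2, r2, #2 */
    }
}
```

Because the odd element is handled first, the count entering the pair loop is always even, which is why decrementing by two and testing for zero is enough to terminate it.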

Unrolling the loop four times would get you to 7 cycles per element, but I think that is overkill.
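
The 10, 8 and 7 cycle figures follow a simple pattern if the per-instruction counts annotated in the listings are taken at face value. As a rough estimate, assume that an `ldmia` or `stmia` of $u$ registers costs $1 + u$ cycles, that each element needs 4 cycles for the compare and IT block, and that `subs` and `bne` cost 1 cycle each, as annotated. The symbol $u$ (elements processed per loop iteration) is introduced here for illustration. The cost per element is then:

```latex
C(u) = \frac{\overbrace{2(1+u)}^{\text{ldmia + stmia}}
           + \overbrace{4u}^{\text{cmp + ite + lsr/mov}}
           + \overbrace{2}^{\text{subs + bne}}}{u}
     = 6 + \frac{4}{u},
\qquad C(1) = 10,\quad C(2) = 8,\quad C(4) = 7 .
```

Under these assumptions the per-element cost can never drop below 6 cycles, so each further doubling of the unroll factor buys less. The Cortex-M3 documentation also lists extra pipeline-refill cycles for taken branches, so real counts may come out somewhat higher than this estimate.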

Note that the assembly listings above use GNU as syntax. Adapting them to your assembler should be trivial.
