For a more detailed explanation of the Conditional Execution of ARM Instructions see Making the most of conditional execution.
Many other recipes also demonstrate good use of the barrel shifter.
When multiplying by a constant value, note that using the multiply instruction is often not the optimal solution. The issues involved are discussed in Multiply by a constant.

Notice that because data processing is cheap (much can be achieved in a single instruction), keeping a calculated result in a register for use a considerable time later may be less efficient than recalculating it when it is next needed: the register so freed can be used for another purpose in the meantime, reducing the amount of register spillage.

For example code which takes great care to optimise register usage, see some of the examples in Digital signal processing on the ARM.
Loop unrolling
Loop unrolling involves using more than one copy of the inner loop of an algorithm. As the examples below demonstrate, several benefits may be gained: the loop-termination test and branch are executed less often, reducing branch overhead; several single-register loads or stores can be combined into LDM and STM instructions; and operations which merely carry a value from one iteration to the next (such as the MOV below) need be executed less often.
Let us consider the number of execution cycles the following loop, which computes x[i] = y[i] - y[i+1], will take on an ARM6-based processor. In the analysis, IF stands for Instruction Fetch, WB for Register Write Back, R for Read, and W for Write.
LDR R2, [R0] ; Preload y[i]
Loop
LDR R3, [R0, #4]! ; Load y[i+1]
SUB R2, R2, R3 ; x[i] = y[i] - y[i+1]
STR R2, [R1], #4 ; Store x[i]
MOV R2, R3 ; y[i+1] is the next y[i]
CMP R0, R4 ; Finished ?
BLT Loop
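For reference, the operation this loop performs can be written in C as follows (a sketch; the function name and the use of int arrays are illustrative assumptions, not part of the original):

```c
/* Sketch: the ARM loop above computes x[i] = y[i] - y[i+1]
   for i = 0 .. n-1, where y holds n+1 elements.
   The name and element type are illustrative. */
static void first_difference(int *x, const int *y, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = y[i] - y[i + 1];
}
```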
The loop will execute in the following cycles: 6 IF, 1 R, 1 WB, 1 W, with a taken branch costing an additional 2 IF cycles. Therefore the total cycle count for processing a 100 element x[] array is:

799 IF (198 caused by branching), 101 R, 101 WB, 100 W (1101 cycles)
Code size: 7 instructions

Now consider the effects of unrolling this loop:
---------------------------------------------------
Times    | IFs caused by  | IF saving
unrolled | branching      |
---------------------------------------------------
2        | 98             | 100
3        | 66             | 132
4        | 48             | 150
5        | 38             | 160
10       | 18             | 180
100      | 0              | 198
---------------------------------------------------
Therefore, if code size is of any importance, unrolling more than about three times is unlikely to pay off in terms of branch overhead elimination.
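The branch-related IF counts above can be reproduced by a small calculation (a sketch, assuming as above a 2-cycle instruction-fetch penalty per taken branch and that the final branch falls through):

```c
/* Instruction fetches caused by branching when an n-element loop is
   unrolled k times: every iteration except the last ends with a taken
   branch, and each taken branch costs 2 extra instruction fetches. */
static int branch_if_cycles(int n, int k)
{
    int iterations = (n + k - 1) / k;   /* ceil(n / k) */
    return 2 * (iterations - 1);
}
```

For n = 100, subtracting the result from the original 198 branch-related fetches gives the IF saving.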
The number of LDRs or STRs which can be combined into a single LDM or STM is restricted by the number of available registers. In this instance ten registers are the most that are likely to be usable, which corresponds to unrolling the loop ten times in the above example. Another case to consider is unrolling three times, as this seems to be a good compromise for branch overhead reduction.

Upon examining the unrolled code below, it can be seen that the MOV need only be executed once per loop, saving a further 2 IF cycles per loop for the three-times unrolled code, and a further 9 IF cycles per loop for the ten-times unrolled code.
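In C terms, a three-times unrolled version of the difference loop looks like the following sketch (n is assumed to be a multiple of 3; names and types are illustrative):

```c
/* Three-times unrolled form of x[i] = y[i] - y[i+1].
   Assumes n is a multiple of 3 and y holds n+1 elements. */
static void first_difference_unrolled3(int *x, const int *y, int n)
{
    for (int i = 0; i < n; i += 3) {
        x[i]     = y[i]     - y[i + 1];
        x[i + 1] = y[i + 1] - y[i + 2];
        x[i + 2] = y[i + 2] - y[i + 3];
    }
}
```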
Here is the code unrolled three times and then optimised as described above:

LDR R2, [R0], #4 ; Preload y[i]
Loop
LDMIA R0!, {R3-R5} ; Load y[i+1] to y[i+3]
SUB R2, R2, R3 ; x[i] = y[i] - y[i+1]
SUB R3, R3, R4 ; x[i+1] = y[i+1] - y[i+2]
SUB R4, R4, R5 ; x[i+2] = y[i+2] - y[i+3]
STMIA R1!, {R2-R4} ; Store x[i] to x[i+2]
MOV R2, R5 ; y[i+3] is the next y[i]
CMP R0, R6 ; Finished ?
BLT Loop

Analysing how this code executes for a y[] array of size 100, as described above for the original code, produces the following results:

339 IF (66 caused by branching), 101 R, 34 WB, 100 W (574 cycles)
Code size: 9 instructions
Saving over the original code: 460 IF, 67 WB

Here is the code unrolled ten times and then optimised in the same way:
LDR R2, [R0], #4 ; Preload y[i]
Loop
LDMIA R0!, {R3-R12} ; Load y[i+1] to y[i+10]
SUB R2, R2, R3 ; x[i] = y[i] - y[i+1]
SUB R3, R3, R4 ; x[i+1] = y[i+1] - y[i+2]
SUB R4, R4, R5 ; x[i+2] = y[i+2] - y[i+3]
SUB R5, R5, R6 ; x[i+3] = y[i+3] - y[i+4]
SUB R6, R6, R7 ; x[i+4] = y[i+4] - y[i+5]
SUB R7, R7, R8 ; x[i+5] = y[i+5] - y[i+6]
SUB R8, R8, R9 ; x[i+6] = y[i+6] - y[i+7]
SUB R9, R9, R10 ; x[i+7] = y[i+7] - y[i+8]
SUB R10, R10, R11 ; x[i+8] = y[i+8] - y[i+9]
SUB R11, R11, R12 ; x[i+9] = y[i+9] - y[i+10]
STMIA R1!, {R2-R11} ; Store x[i] to x[i+9]
MOV R2, R12 ; y[i+10] is the next y[i]
CMP R0, R13 ; Finished ?
BLT Loop
Analysing how this code executes for a y[] array of size 100 produces the following results:

169 IF (18 caused by branching), 101 R, 10 WB, 100 W (380 cycles)
Code size: 16 instructions
Saving over the original code: 630 IF, 91 WB

Thus for this problem, unless the extra seven instructions make the code too large, unrolling ten times is likely to be the optimum solution.
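The IF totals quoted for the unrolled loops can be checked with a short calculation. This sketch assumes the structure of the optimised loops above: one LDM, k SUBs, one STM, one MOV, one CMP and one branch per iteration (k + 5 fetches), a 2-cycle fetch penalty per taken branch, and one fetch for the preload LDR:

```c
/* Total instruction fetches for the k-times unrolled difference loop
   over an n-element array. */
static int if_cycles(int n, int k)
{
    int iterations = (n + k - 1) / k;   /* ceil(n / k) */
    return (k + 5) * iterations        /* loop body fetches */
         + 2 * (iterations - 1)        /* taken-branch penalties */
         + 1;                          /* preload LDR */
}
```

For n = 100 this gives 339 fetches for the three-times and 169 for the ten-times unrolled loop, matching the analyses above.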
However, loop unrolling is not always a good idea, especially when the optimisation gained between one iteration and the next is small. Consider the following loop, which copies an area of memory:

Loop
LDMIA R0!,{R3-R14}
STMIA R1!,{R3-R14}
CMP R0, #LimitAddress
BNE Loop

This could be unrolled as follows:

Loop
LDMIA R0!,{R3-R14}
STMIA R1!,{R3-R14}
LDMIA R0!,{R3-R14}
STMIA R1!,{R3-R14}
LDMIA R0!,{R3-R14}
STMIA R1!,{R3-R14}
LDMIA R0!,{R3-R14}
STMIA R1!,{R3-R14}
CMP R0, #LimitAddress
BLT Loop

In this code the CMP and the conditional branch will be executed only a quarter as often, but this will give only a small saving. However, other issues should be taken into account:
Rearranging code so that the write buffer does not cause a stall in this way is often hard, but is frequently worth the effort, and in any case it is always wise to be aware of this performance factor.
Note that this advice is not applicable to systems which use the ARM FPA co-processor.
In a similar vein, commonly called subroutines (especially short ones) will often run more quickly on a cached ARM processor if the entry address is aligned so that it is loaded into the first word of a cache line; on the ARM610, for example, this means quad-word aligned. This ensures that all four words of the first line fetch are subsequently used by instruction fetches before another line fetch occurs. The technique is most useful for large programs which do not cache well: if the program does cache well, the number of times the code has to be fetched from memory is unlikely to be significant.
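As a sketch, aligning an entry point to such a boundary amounts to rounding its address up to a multiple of 16 bytes (the helper name is an illustrative assumption; in practice the assembler's alignment directive would be used):

```c
/* Round an address up to the next quad-word (16-byte) boundary, so a
   subroutine entry lands in the first word of an ARM610 cache line. */
static unsigned long align16(unsigned long addr)
{
    return (addr + 15UL) & ~15UL;
}
```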
Consider such a system in which the length of a memory burst is B. That is, when executing a long sequence of data operations, the memory accesses which result are one non-sequential memory cycle followed by B-1 sequential memory cycles; DRAM controlled by the ARM memory manager MEMC1a behaves in this way, for example.

This sequence of memory accesses will be broken up by several ARM instruction types: Load or Store (single or multiple), Data Swap, Branch, SWI and other instructions which modify the PC.
By placing these instructions carefully, so that they break up the normal sequence of memory cycles only where a non-sequential cycle was about to occur anyway, the number of sequential cycles which are turned into longer non-sequential cycles can be minimised.
For a memory system which has memory bursts of length B, the optimal position for instructions which break up the memory cycle sequence is 3 words before the next B-word boundary.

To help explain this, consider a memory system with memory bursts of length 4 (i.e. quad-word bursts); here the optimal position for these break-up instructions is 16 - 12 = 4 bytes past a quad-word boundary. To demonstrate that this is the case, imagine executing the following code in this system:
0x0000 Data Instr 1
0x0004 STR
0x0008 Data Instr 2
0x000C Data Instr 3
0x0010 Data Instr 4

Executing this code produces the following memory cycles (taking into account the ARM instruction pipeline):

Instruction Fetch 0x0000 (Non Seq)
Instruction Fetch 0x0004 (Seq)
Instruction Fetch 0x0008 (Seq) + Execute Data Instr 1
Instruction Fetch 0x000C (Seq) + Execute STR
Data Write (Non Seq)
Instruction Fetch 0x0010 (Non Seq) + Execute Data Instr 2
Instruction Fetch 0x0014 (Seq) + Execute Data Instr 3

The instruction fetch after the Data Write cycle had to be a non-sequential cycle, but since that fetch was from a quad-word aligned address it would have been non-sequential anyway. Hence the STR is optimally positioned to avoid changing sequential instruction fetches into non-sequential ones.
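The placement rule above (3 words before the next B-word boundary) can be expressed as a byte offset within a burst; this sketch assumes 4-byte words and bursts of at least 3 words:

```c
/* Optimal byte offset within a burst of burst_words words at which to
   place a sequence-breaking instruction: 3 words before the next
   boundary. For a 4-word burst this is 16 - 12 = 4 bytes. */
static int break_offset_bytes(int burst_words)
{
    return burst_words * 4 - 12;   /* (B - 3) words, in bytes */
}
```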