The first subsection below is concerned with how to design collections of C functions to maximise low-level efficiency. The following subsection is concerned with the efficiency of larger and more complicated functions.
Often, a leaf function is rather simple. On the ARM, if it is simple enough to compile using just 5 registers (a1-a4 and ip), it will carry no function entry or exit overhead. A surprising proportion of useful leaf functions can be compiled within this constraint.
Once registers have to be saved, it is efficient to save them using STM. In fact the more you can save at one go, the better. In a leaf function, all and only the registers which need to be saved will be saved by a single STMFD sp!,{regs,lr} on entry and a matching LDMFD sp!,{regs,pc} on exit.
In general, the cost of pushing some registers on entry and popping them on exit is very small compared to the cost of the useful work done by a leaf function which is complicated enough to need more than 5 registers.
Overall, you should expect a leaf function to carry virtually no function entry and exit overhead; and at worst, a small overhead, most likely in proportion to the useful work done by it.
On the ARM, if a function ends with a call to another function, that call can be converted to a tail continuation. In functions which need to save no registers, the effect can be dramatic. Consider, for example:
Here, armcc generates:
extern void *__sys_alloc(unsigned type, unsigned n_words);
#define NOTGCable 0x80000000
#define NOTMovable 0x40000000
void *malloc(unsigned n_bytes)
{
return __sys_alloc(NOTGCable+NOTMovable, n_bytes/4);
}
There is no function entry or exit overhead-just useful work massaging
arguments-and the function return has disappeared entirely - return is
direct from __sys_alloc to malloc's caller. In this case, the basic
call-return cost for the function pair has been reduced from:
malloc
MOV a2,a1,LSR #2
MOV a1,#&c0000000
B |__sys_alloc|
to:
BL + BL + MOV pc,lr + MOV pc,lr
a saving of 25%.
BL + B + MOV pc,lr
More complicated functions in which the only function calls are immediately before a return, collapse equally well. An artificial example is:
extern int f1(int), int f2(int, int);
armcc generates the following, wonderfully efficient code:
int f(int a, int b)
{
if (b == 0)
return a;
else if (b < 0)
return f2(a, -b);
else
return f2(b, a); /* argument order
swapped */
}
f CMP a2,#0
MOVEQS pc,lr
RSBLT a2,a2,#0
BLT f2
MOV a3,a1
MOV a1,a2
MOV a2,a3
B f2
In this case, the entry and register-save overhead caused by the
infrequent heavyweight path through the function applies to the much more
frequent lightweight path through it. To fix this, turn the heavyweight
path into a tail call. Yes, introducing another layer of function call
yields much more efficient code!
int f(Buffer *b)
{ if (b->n > 0)
{ /* The usual path through the function... */
/* 95% of all calls.*/
/* Simple calculation involving b->buf, b->n, etc.*/
return ...;
}
/* Exceptional path through the function... */
/* 5% of all calls. */
/* Complicated calculation involving calls
/* to other functions.*/
return ...;
}
int f2(Buffer *b)
{ /* Exceptional path through the function... */
/* 5% of all calls. */
/* Complicated calculation involving calls */
/* to other functions.*/
return ...;
}
If you are lucky, f() will now compile using only a1-a4 and ip and so
incur no entry overhead whatsoever. 95% of the time, the overhead on the
original f() will be reduced to zero.
int f(Buffer *b)
{ if (b->n > 0)
{ /* The usual path through the function... */
/* 95% of all calls.*/
/* Simple calculation involving b->buf, b->n, etc.*/
return ...;]
}
return f2(b);
}
This is quite a general source transformation technique and you should look for opportunities to use it and analogous transformations. It works for any processor to some extent; it works particulary well for the ARM because of the careful optimisation of tail continuation in lightweight functions.
Repeated application of this technique to the chain of six or so functions called for every character processed by the preprocessing phase of the ARM C compiler, improved the performance of the preprocessor (running on the ARM) by about 30%.
Under the ARM Procedure Call Standard, up to four argument words can be passed to a function in registers. Functions of up to four integral (not floating point) arguments are particularly efficient and incur very little overhead beyond that required to compute the argument expressions themselves (there may be a little register juggling in the called function, depending on its complexity).
If more arguments are needed, then the 5th, 6th, etc., words will be passed on the stack. This incurs the cost of an STR in the calling function and an LDR in the called function for each argument word beyond four.
How can argument passing overhead be minimised?
The ARM C compiler has a highly efficient register allocator which operates on complete functions and which tries to allocate the most frequently used variables to registers (taking loop nesting into account). It produces very good results unless the demand for registers seriously outstrips supply. And it has one shortcoming, namely that it allocates whole variables to registers, not separate live ranges.
As code generation proceeds for a function, new variables are created for expression temporaries. These are never reused in later expressions and cannot be spilled to memory. Usually, this causes no problems. However, a particularly pathological expression could, in principle, occupy most of the allocatable registers, forcing almost all program variables to be spilled to memory. Because the number of registers required to evaluate an expression is a logarithmic function of the number terms in it, it takes an expression of more than 32 terms to threaten the use of any variable register.
As a rule of thumb, avoid very large expressions (more than 30 terms).
The more serious problem is with long scope program variables. Our allocator either allocates a variable to a chosen register everywhere the variable is live, or it leaves the variable in memory. To help visualise the problem - and to see how to help the allocator - consider the following two program schemata:
In the left hand case, because the scope of 'i' is the whole function, if
'i' cannot be allocated to a register everywhere then all three loops will
suffer their loop index being in memory. On the other hand, in the right
hand case there are three separate variables called 'i', each of which
will be allocated separately by the register allocator.
int f() int f()
{ int i, j, ...; { int j, ...;
{ int i;
for (i = 0; i < lim; ++i) for (i = 0; i < lim; ++i)
{ {
... ...
} }
}
{ int i;
for (i = 0; i < lim; ++i) for (i = 0; i < lim; ++i)
{ /* register pressure in this {
loop forces 'i' to memory */
} }
}
{ int i;
for (i = 0; i < lim; ++i) for (i = 0; i < lim; ++i)
{ {
... ...
} }
}
} }
As a rule of thumb, keep variable declarations local, especially in large functions. Use additional block structure as illustrated here (right hand example), if necessary.
On the other hand, if this transformation is carried to excess, there may be bad results. When a local variable is spilled to memory, there is a stack adjustment on each entry to and exit from its containing scope. The ARM C compiler does this to minimise the space used by local variables. Suppose, for example, that in the right hand case above, each block declared a 1KB buffer as well as 'i'. Then adjusting the stack at every scope leads to stack usage of just over 1KB whereas adjusting it only at function entry leads to usage of more than 3KB.
In principle, the compiler could be more intelligent about adjusting the stack locally for large variables and only at function entry for small variables. For the moment, the programmer must be aware of these issues.
So, a modified rule of thumb is to cluster variable declarations into reasonable sub-scopes within large functions and to avoid doing so within the most deeply nested loops. This will most likely help the allocator without introducing unwanted costs associated with local stack adjustment.
A static variable, on the other hand, can only be accessed after the static base for the compilation unit has been loaded. So, the first such use in a function always costs 2 LDRs or an LDR and an STR. However, if there are many uses of static variables within a function then there is a good chance that the static base will become a global common subexpression (CSE) and that, overall, access to static variables will be no more expensive than to auto variables on the stack.
Extern variables are fundamentally more expensive: each has its own base pointer. Thus each access to an extern is likely to cost 2 LDRs or an LDR and an STR. It is much less likely that a pointer to an extern will become a global CSE - and almost certain that there cannot be several such CSEs - so if a function accesses lots of extern variables, it is bound to incur significant access costs.
A further cost occurs when a function is called: the compiler has to assume - in the absence of inter-procedural data flow analysis - that any non- const static or extern variable could be side-effected by the call. This severly limits the scope across which the value of a static or extern variable can be held in a register.
Sometimes a programmer can do better than a compiler could do, even a compiler that did interprocedural data flow analysis. An example in C is given by the standard streams: stdin, stdout and stderr. These are not pointers to const objects (the underlying FILE structs are modified by I/O operations), nor are they necessarily const pointers (they may be assignable in some implementations). Nonetheless, a function can almost always safely slave a reference to a stream in a local FILE * variable.
It is a common programming paradigm to mimic the standard streams in applications. Consider, for example, the shape of a typical non-leaf printing function:
In the left hand case, out has be be re-computed or re-loaded after each
call to print_... (and after each fprintf...). In the right hand case, 'f'
can be held in a register throughout the function (and probably will
be).
extern FILE *out; extern FILE *out;
/* the output stream */ /* the output stream */
void print_it(Thing *t) void print_it(Thing *t)
{ { FILE *f = out;
fprintf(out, ...); fprintf(f, ...);
print_1(t->first); print_1(t->first);
fprintf(out, ...); fprintf(f, ...);
print_2(t->second); print_2(t->second);
fprintf(out, ...); fprintf(f, ...);
... ...
} }
Uniform application of this transformation to the disassembly module of the ARM C compiler saved more than 5% of its code space.
In general, it is difficult and potentially dangerous to assert that no function you call (or any functions they in turn call) can affect the value of any static or extern variables of which you currently have local copies. However, the rewards can be considerable so it is usually worthwhile to work out at the program design stage which global variables are slavable locally and which are not. Trying to retrofit this improvement to exisiting code is usually hazardous, except in very simple cases like the above.
In the first role, switch() is hard to improve upon: the ARM C compiler does a good job of deciding when to compile jump tables and when to compile trees of if-then-elses. It is rare for a programmer to be able to improve upon this by writing if-then-else trees explicitly in the source.
In the second role, however, use of switch() is often mistaken. You can probably do better by being more aware of what is being computed and how.
In the example below, which is abstracted from an early version of the disassembly module of the ARM C Compiler, you will learn:
The compiler handles this code fragment well, generating 276 bytes of code
and string literals. But we could do better. If performance were not
critical (as it never is in disassembly) then we could look up the code in
a table of codes, in something like:
char *cond_of_instr(unsigned instr)
{
char *s;`
switch (instr & 0xf0000000)
{
case 0x00000000: s = "EQ"; break;
case 0x10000000: s = "NE"; break;
... ... ...
case 0xF0000000: s = "NV"; break;
}
return s;
}
char *cond_of_instr(unsigned instr)
{
static struct {char name[3]; unsigned code;}
conds[] = {
"EQ", 0x00000000,
"NE", 0x10000000,
....
"NV", 0xf0000000,
};
int j;
for (j = 0; j < sizeof(conds)/sizeof(conds[0]); ++j)
if ((instr & 0xf0000000) == conds[j].code)
return conds[j].name;
return "";
This fragment compiles to 68 bytes of code and 128 bytes of table data.
Already this is a 30% improvement on the switch() case, but this schema
has other advantages: it copes well with a random code to string mapping
and if the mapping is not random admits further optimisation. For example,
if the code is stored in a byte (char) instead of an unsigned and the
comparison is with (instr >> 28) rather than (instr &
0xF0000000) then only 60 bytes of code and 64 bytes of data are generated
for a total of 124 bytes.
}
Another advantage we have heard of for table lookup is that is is possible to share the same table between a disassembler and an assembler - the assembler looks up the mnemonic to obtain the code value, rather than the code value to obtain the mnemonic. Where performance is not critical, the symmetric property of lookup tables can sometimes be exploited to yield significant space savings.
Finally, by exploiting the denseness of the indexing and the uniformity of the returned value it is possible to do better again, both in size and performance, by direct indexing:
This expression of the problem causes a miserly 16 bytes of code and 64
bytes of string literal to be generated and is probably close to what an
experienced assembly language programmer would naturally write if asked to
code this function. It is the solution finally adopted in the ARM C
compiler's disassembler module.
char *cond_of_instr(unsigned instr)
{
return "\
EQ\0\0NE\0\0CC\0\0CS\0\0MI\0\0PL\0\0VS\0\0VC\0\0\
HI\0\0LS\0\0GE\0\0LT\0\0GT\0\0LE\0\0AL\0\0NV" + (instr >> 28)*4;
}
The uniform application of this transformation to the disassembler module of the ARM C compiler saved between 5% and 10% of its code space.
The moral of this tale is to think before using switch() to compute an in-line function, especially if code size is an important consideration. Switch() compiles to high performance code but often table lookup will be smaller; where the function's domain is dense, or piecewise dense, direct indexing into a table will often be both faster and smaller.