Wednesday, November 26, 2008

Frame pointers and Function Call Tracing


Function call tracing is really helpful when tracking down a problem, especially on an embedded system. On the ARM architecture, data abort events and the like provide useful data about where the problem was found, such as the instruction where it happened and the value each register held at that moment. However, that usually isn't enough, especially with functions like memcpy() that are called from many places in a program. Even worse: imagine running an RTOS. If we knew where memcpy() was called from, the problem would probably be a lot easier to trace and debug. Yesterday I was talking with David and he mentioned the function __builtin_return_address that comes with GCC. He uses it together with C++ (i386) to detect memory leaks; however, it's not available for the ARM architecture.

GCC and frame pointers

GCC implements a nice concept called the frame pointer. One of the CPU's registers is reserved and used as a frame pointer. Each time a function starts executing, the frame pointer is set to point right after the return address the caller has placed on the stack. Apart from giving the compiler an easier way to refer to function arguments, it can also be used to deduce who called us, who called the one who called us, and so on. A nice explanation can be found here.

However GCC turns on the -fomit-frame-pointer flag for some optimization levels so if we need the frame pointer we need to force it by adding -fno-omit-frame-pointer.
GCC provides a function called __builtin_return_address() with which you can trace up the function calls, but it's not available for ARM. However, I found a nice piece of code in the Linux kernel for the ARM architecture here. I just extracted this part (remember it's published under the GPL by Russell King):



.align 0
.type arm_return_addr %function
.global arm_return_addr

arm_return_addr:
mov ip, r0
mov r0, fp
3:
cmp r0, #0
beq 1f //@ frame list hit end, bail
cmp ip, #0
beq 2f //@ reached desired frame
ldr r0, [r0, #-12] // else continue, get next fp
sub ip, ip, #1
b 3b
2:
ldr r0, [r0, #-4] //@ get target return address
1:
mov pc, lr //return to the caller

The 'magic' values are -4 and -12, which are the positions, relative to the frame pointer, of the previous function's link register (return address) and frame pointer. These values come from analysing the push instruction executed at every function entry, which pushes, apart from other registers, {pc, lr, ip, fp} in that precise order onto the stack.
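
To make the two offsets concrete, here is a minimal C sketch of the same walk. It is not the kernel's code: frame_return_addr is a hypothetical name, and the starting frame pointer is passed in as a parameter so the logic can be tried on a simulated stack. It assumes the four-word {fp, ip, lr, pc} frame described above and a 32-bit target, where one word holds one pointer, so index -1 is byte offset -4 and index -3 is byte offset -12.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C equivalent of the assembly walk (illustration only).
 * fp points at the saved pc slot of the current frame:
 *   fp[ 0] = saved pc
 *   fp[-1] = saved lr  (return address, byte offset  -4 on ARM32)
 *   fp[-2] = saved ip  (caller's sp,    byte offset  -8 on ARM32)
 *   fp[-3] = saved fp  (previous frame, byte offset -12 on ARM32)
 */
void *frame_return_addr(const uintptr_t *fp, unsigned int num)
{
    /* Follow the chain of saved frame pointers num times. */
    while (fp != NULL && num > 0) {
        fp = (const uintptr_t *)fp[-3];
        num--;
    }
    /* A null frame pointer means the frame list ended: bail out. */
    return (fp != NULL) ? (void *)fp[-1] : NULL;
}
```

Asking for one frame more than the chain holds returns NULL, which matches the assembly version returning 0 when it hits the end of the frame list.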
The C prototype for arm_return_addr would be:



void * arm_return_addr( unsigned int num );

Where num is the number of frames to search back. Take care and remember that if you reach the last frame its value will be 0 and you should stop there.
What I did is to show the call trace once I get data abort or prefetch abort exception so I know what caused it and it's easier to track back.
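
A rough sketch of how such a trace could be collected is shown below. The names collect_call_trace and fake_return_addr are made up for this example. The lookup function is passed in as a pointer so the loop can be tried on a PC; on the target you would pass arm_return_addr itself, and the program must of course be built with frame pointers.

```c
#include <stddef.h>

/* Same signature as the arm_return_addr() prototype given above. */
typedef void *(*return_addr_fn)(unsigned int num);

/* Store up to max return addresses in trace, innermost caller first.
 * Stops at the first null address (end of the frame list).
 * Returns the number of entries written. */
size_t collect_call_trace(return_addr_fn get_return_addr,
                          void **trace, size_t max)
{
    size_t depth = 0;

    while (depth < max) {
        void *addr = get_return_addr((unsigned int)depth);
        if (addr == NULL)
            break;
        trace[depth++] = addr;
    }
    return depth;
}

/* Host-side stand-in only: pretends the call chain is three frames deep.
 * The addresses are invented for the example. */
void *fake_return_addr(unsigned int num)
{
    static void *const chain[] = {
        (void *)0x8190, (void *)0x7f70, (void *)0x2e3c
    };
    return (num < 3u) ? chain[num] : NULL;
}
```

In an abort handler, the collected addresses can then be printed over a serial port or stored in RAM for later inspection.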

Give me something I can understand: decrypting addresses

With the function discussed above we can get the return addresses, but where do we go from there? You can generate a listing with arm-elf-objdump and look the address up, but there is a nice tool called arm-elf-addr2line which will do it for you (thanks to David for pointing this out). Just give it the ELF file and the address, something like:

    arm-elf-addr2line --exe=yourelf.elf <address>

And it will output something like:

    main.c:90

Which means you can find it in main.c, line 90. Quite nice, isn't it?

Disadvantages

There is a stack penalty when using frame pointers. In one product I'm developing I saw a penalty of 20 to 30 words of stack compared to the same program compiled without frame pointers. That means at least 20*4 = 80 bytes of stack. Since that product runs an RTOS with more than 10 tasks, each with its own stack, that number multiplies and yields a significant increase in total stack usage. That can be a problem if RAM is scarce, not to mention that a tightly tuned program will probably crash the first time it's compiled with frame pointers because of stack overflows.

To see what causes this behaviour let's look at a simple C function:



int sumNumbers( int a, int b ) {
    return a + b;
}

When compiled with -fomit-frame-pointer we get:



00007f6c <sumNumbers>:
7f6c: e0810000 add r0, r1, r0
7f70: e12fff1e bx lr

But if we force frame pointers with -fno-omit-frame-pointer we obtain:



0000818c <sumNumbers>:
818c: e1a0c00d mov ip, sp
8190: e92dd800 push {fp, ip, lr, pc}
8194: e0810000 add r0, r1, r0
8198: e24cb004 sub fp, ip, #4 ; 0x4
819c: e89da800 ldm sp, {fp, sp, pc}

By using frame pointers GCC is obliged to push the registers fp, ip, lr and pc and to set up the frame pointer. That means bigger code and higher stack usage, at least for small or medium sized functions. Now you may see why the GCC documentation says "-O also turns on -fomit-frame-pointer on machines where doing so does not interfere with debugging."
Using frame pointers or not depends on whether you give priority to debugging or small code footprint (and smaller ram/stack footprint too).

It's important to remember that if the program is compiled without frame pointers then the arm_return_addr function must not be called, since the frame pointer register will contain other information that most likely has nothing to do with frame pointers.

Alternative methods

There is another method we can use with GCC. It involves using the -finstrument-functions compiler flag. That will force GCC to call user-defined functions when entering and exiting functions, so an array can be kept on RAM with the call tree. However that could slow down the whole program excessively. On the other hand care must be taken with multithreaded designs. For more information here is the GCC documentation.
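
As a minimal sketch of that idea: when a file is compiled with -finstrument-functions, GCC inserts calls to __cyg_profile_func_enter() and __cyg_profile_func_exit(), which the program has to provide. The log array, its depth of 32 and the names call_log and call_depth are my own assumptions for the example. It keeps a single log, so a multithreaded design would need one log per task.

```c
#include <stddef.h>

#define CALL_LOG_DEPTH 32u   /* arbitrary depth chosen for this sketch */

/* Functions currently executing, outermost first. */
void *call_log[CALL_LOG_DEPTH];
size_t call_depth;

/* The hooks must not be instrumented themselves, or they would recurse. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));

/* Called by GCC-generated code on entry to every instrumented function. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)call_site;
    if (call_depth < CALL_LOG_DEPTH)
        call_log[call_depth] = this_fn;
    call_depth++;   /* keep counting even when the log is full */
}

/* Called by GCC-generated code just before every instrumented function returns. */
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    if (call_depth > 0)
        call_depth--;
}
```

When something goes wrong, the first call_depth entries of call_log hold the addresses of the functions on the current call path, ready to be fed to addr2line.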

Tuesday, November 25, 2008

Signal conditioning: Mean might be too mean

mean

  • adjective- unkind or spiteful: a mean trick
  • noun- Mathematics: The average value of a set of numbers.

Measured analog signals often must be filtered before they can be used. Sometimes it's important to remove high frequency components, with anything from a slight low-pass filter to something more drastic like constant estimation. We could say that in many cases we're doing noise reduction by oversampling.

Averaging
The first method and probably the most intuitive one is by summing several samples and then calculating the mean.

Moving average

The conservative method would be to implement a FIR filter, leading to a moving average filter. This gives us a filtered sample for each incoming measurement. The simplest version is a non-weighted averaging filter, where each sample gets equal importance. Here is the Bode plot of a 20-point moving average filter with a sampling frequency of 10 kHz.

It's clear this is not a typical low-pass filter, even though it helps to remove high frequency components. Besides doing that, this filter also has very high attenuation at multiples of Fsample / N. This can be useful in some cases but it can also be an unwanted effect. The low-pass cutoff frequency (-3 dB) is roughly 0.44 * Fsample / N, a little below Fsample / (2*N).
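
A minimal sketch of such a non-weighted moving average in C follows. The names moving_avg_t and moving_avg_update, the 16-bit sample type and N = 20 are assumptions made for the example. It keeps a running sum so each new sample costs one subtraction, one addition and one division, instead of re-adding all N samples.

```c
#include <stdint.h>

#define MA_N 20u   /* filter length, as in the 20-point example above */

typedef struct {
    uint16_t samples[MA_N];  /* last MA_N samples (circular buffer) */
    uint32_t sum;            /* running sum of the buffer contents */
    uint32_t index;          /* position of the oldest sample */
} moving_avg_t;

/* Feed one sample and get one filtered sample back.
 * The state must be zero-initialised before the first call. */
uint16_t moving_avg_update(moving_avg_t *f, uint16_t sample)
{
    f->sum -= f->samples[f->index];   /* drop the oldest sample */
    f->sum += sample;                 /* add the newest one */
    f->samples[f->index] = sample;
    f->index = (f->index + 1u) % MA_N;
    return (uint16_t)(f->sum / MA_N);
}
```

Until the buffer has been filled once, the output ramps up from zero, because the empty slots count as zero samples.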

Averaging and subsampling

Another way is to take N samples and calculate their mean once they've all been acquired. Once that happens the whole buffer is cleared and measurements are accumulated again until N new samples are received and a new mean is calculated. This is similar to reducing the sampling rate, but no low-pass filter is being applied to the signal, so expect disastrous results for a non-constant signal, especially when its zero frequency component is comparable to its harmonics. Since the filter is applied in chunks of N samples and then all the previous data is discarded, the result depends on how the input signal is synchronized with respect to the accumulator in the averaging algorithm.

This method works well when estimating a constant value, or a really slowly varying one. However, in many cases, we can get better results with a Kalman filter.

What makes this method useful is its simplicity in both calculations and firmware implementation. All it takes is an accumulator, a count variable and a division once it's full of data.
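
The accumulator, the counter and the final division could look like the sketch below. The names block_avg_t and block_avg_update and the block length of 16 are assumptions for the example. A power-of-two length lets the compiler turn the division into a shift.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_N 16u   /* samples per block; chosen arbitrarily here */

typedef struct {
    uint32_t acc;     /* sum of the samples received so far */
    uint32_t count;   /* number of samples in the accumulator */
} block_avg_t;

/* Accumulate one sample. Returns true, and writes the mean to *mean,
 * only once every BLOCK_N samples; the state is then cleared. */
bool block_avg_update(block_avg_t *b, uint16_t sample, uint16_t *mean)
{
    b->acc += sample;
    b->count++;
    if (b->count < BLOCK_N)
        return false;

    *mean = (uint16_t)(b->acc / BLOCK_N);
    b->acc = 0;
    b->count = 0;
    return true;
}
```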

Kalman Filter - Constant estimation

Using a Kalman filter for constant estimation may look too complicated for such a simple application, but if the measurement noise can be modeled as white and its variance is known we can obtain very good results. The variance can also be measured, and even better: it could be calculated periodically, or at start-up if the signal is known to be constant for some time. Floating point calculations might be needed, although a fixed-point approach can be used too.

There is a good example on Kalman filter constant estimation in this paper by Greg Welch and Gary Bishop.
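
For a constant, the filter collapses to a handful of scalar operations. The sketch below follows the usual scalar predict/update equations; the names kalman1d_t and kalman1d_update are mine, and the values of q and r are something you have to choose or measure for your own signal (r must be greater than zero).

```c
typedef struct {
    double x;   /* current estimate of the constant */
    double p;   /* variance of the estimation error */
    double q;   /* process noise variance (0 for a true constant) */
    double r;   /* measurement noise variance, must be > 0 */
} kalman1d_t;

/* Process one measurement z and return the updated estimate. */
double kalman1d_update(kalman1d_t *k, double z)
{
    /* Time update: the constant model leaves x unchanged, only the
     * uncertainty grows by the process noise. */
    double p_prior = k->p + k->q;

    /* Measurement update. */
    double gain = p_prior / (p_prior + k->r);
    k->x += gain * (z - k->x);
    k->p = (1.0 - gain) * p_prior;
    return k->x;
}
```

With q = 0 the gain shrinks with every measurement and the estimate behaves like a running mean; a small non-zero q lets the filter keep following a slowly drifting value.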

Low Passing

The other popular and classic method is to apply a low-pass filter to the signal, preserving frequency components up to a certain value. FIR and IIR filters come to mind, but I won't discuss them here since that's a whole topic in itself. Using fixed point with these filters is quite easy and straightforward as long as we've chosen the right resolution for the fixed-point calculations. Otherwise stability problems can arise, especially with IIR filters.
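
Just to give a taste of the fixed-point side, here is a sketch of the simplest IIR low-pass, a single-pole filter y[n] = y[n-1] + (x[n] - y[n-1]) / 2^k. The name lowpass_update and k = 3 are assumptions for the example. The state is kept scaled by 2^k so the fraction that would otherwise be thrown away at each step is preserved.

```c
#include <stdint.h>

#define LP_SHIFT 3u   /* k: larger values give a lower cutoff frequency */

/* Single-pole low-pass for unsigned 16-bit samples.
 * *state holds the filter output multiplied by 2^LP_SHIFT and must
 * start at 0 (or at the first sample shifted left by LP_SHIFT). */
uint16_t lowpass_update(uint32_t *state, uint16_t sample)
{
    *state = *state - (*state >> LP_SHIFT) + sample;
    return (uint16_t)(*state >> LP_SHIFT);
}
```

With 16-bit samples and k = 3 the state needs at most 19 bits, so a 32-bit variable leaves plenty of headroom.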

Friday, November 21, 2008

To volatile or not to volatile - Part 2

Hamlet did know about volatile. He just ignored it.

Recalling Hamlet's action on volatileness and the previous post, we can 'trick' the compiler when working with ISRs and nested interrupts are not enabled.
We can't undeclare a variable or remove the volatile specifier once it has been declared that way in the same file. However, we can write our ISR code in a different source file and declare the variables as global: volatile in the file that uses them outside the ISR, and non-volatile in the file containing the ISR.
Here is the resulting code:



// ----- FILE: non_isr.c ------- //
volatile unsigned int var1,var2;

/** Just to use the variables as volatile
* If Enter_Critical() and Exit_Critical() are
* global functions then the volatile specifier above
* could be avoided too
*/
unsigned int getIntSum(void)
{
    unsigned int temp;

    Enter_Critical();
    temp = var1 + var2;
    Exit_Critical();

    return temp;
}

//----------------------------------

// ----- FILE: isr.c ------- //
unsigned int var1,var2;

void ISR_Handler(void)
{
// do something with var1 and var2
// nested interrupts not enabled,
// thus it's secure to use them as normal variables
}

A big disadvantage of doing this is that the variables have to be global; we can't declare them static since they must be shared between several source files.
As said in the previous post, it's only a matter of what you need in that specific situation. In some cases the ISR will take less time to execute, but in others it might take the same time or gain only a small performance boost.

Thursday, November 20, 2008

To volatile or not to volatile

Hamlet never had to worry about volatile specifiers, only existence.

UPDATE: There is a continuation on this topic here

The volatile keyword is something every developer has to deal with when working with interrupts or multiple threads. Its nature is quite simple: when present in a variable declaration it tells the compiler that the currently executing code is not the only code which may change the variable's value, so the compiler can't rely on a cached value (in a register, for instance) or make any other assumption based on past values. On one hand this means that we should use this keyword if we expect certain variables to behave that way. On the other hand it also means that much more code and many more memory accesses will be generated than with a normal (non-volatile) variable.

When first using volatile there is a myth about declaring everything shared between interrupts/threads and the main execution path or thread as volatile. Of course it will work, but it can lead to slower and larger code and it may not be necessary, especially when coding inside individual functions with critical sections or mutexes/semaphores.

Here is a piece of code, supposed to be a dumb interrupt handler:



extern volatile unsigned int x,y;
void __attribute__((interrupt ("IRQ"))) ISR_test(void)
{
    if( x > 0 ) // compiled as a comparison against 0, since x is unsigned
    {
        x = x + y;
    }
    else
        x = 2*y + x;
}


And the corresponding assembly listing for ARM7, compiling with gcc 4.2.2 optimization level 1:

2d7c: push {r1, r2, r3}
2d80: ldr r1, [pc, #68] // r1 = &x;
2d84: ldr r3, [r1] // r3 = x; READ X
2d88: cmp r3, #0 // r3 == 0 ?
2d8c: beq 2da8 // decision
2d90: ldr r2, [r1] // r2 = x; READ X
2d94: ldr r3, [pc, #52] // r3 = &y
2d98: ldr r3, [r3] // r3 = y; READ Y
2d9c: add r3, r3, r2 // r3 = r3 + r2;
2da0: str r3, [r1] // x = r3
2da4: b 2dc4 // ready to return
2da8: ldr r3, [pc, #32] // r3 = &y
2dac: ldr r3, [r3] // r3 = y READ Y
2db0: ldr r1, [pc, #20] // r1 = &x
2db4: ldr r2, [r1] // r2 = x READ X
2db8: lsl r3, r3, #1 // r3 = 2*r3
2dbc: add r3, r3, r2 // r3 = r3 + r2
2dc0: str r3, [r1] // x = r3
2dc4: pop {r1, r2, r3}
2dc8: subs pc, lr, #4 // 0x4
2dcc: .word 0x40004900 // &x
2dd0: .word 0x40004904 // &y

There are three instructions to read x's value and two to read y's. GCC did as we told it: it does not make any assumptions about the values. However, we may know of certain constraints which make the volatile keyword redundant, like knowing that nested interrupts are not enabled. That way nothing will interrupt our ISR. The same conclusion comes to mind if we use semaphores, mutexes or critical sections whenever a thread tries to access the variables. There are some exceptions I will comment on near the end. If we remove the volatile specifiers from both x and y we get:



2d7c: push {r1, r2, r3}
2d80: ldr r1, [pc, #48] // r1 = &x
2d84: ldr r2, [r1] // r2 = x
2d88: cmp r2, #0 // r2 == 0?
2d8c: ldrne r3, [pc, #40] // only if neq
2d90: ldrne r3, [r3] // only if neq
2d94: addne r3, r3, r2 // only if neq
2d98: strne r3, [r1] // only if neq
2d9c: ldreq r3, [pc, #24] // only if eq
2da0: ldreq r3, [r3] // only if eq
2da4: lsleq r3, r3, #1 // only if eq
2da8: ldreq r2, [pc, #8] // only if eq
2dac: streq r3, [r2] // only if eq
2db0: pop {r1, r2, r3}
2db4: subs pc, lr, #4
2db8: .word 0x40004900
2dbc: .word 0x40004904

Looks like GCC is doing some black magic! The conditional store, add and load instructions help the compiler avoid branches. The total number of instructions was reduced from 20 to 15. This is a simple example, but in a complex one register popping/pushing due to variable volatileness can slow things down even more. As said before, we may need the volatile specifier in certain situations, but if we know when to avoid it then the final code can be much more concise.
Recalling the mutexes here is another example:



extern void TakeMutex(void);
extern void GiveMutex(void);

int x,y;

int getSum( void ) {
int temp;
/* This should be in the critical section,
* however it's here to show what happens */
if ( x < 0 )
return 0;
if ( y < 0 )
return 0;

TakeMutex();

temp = x + y;

GiveMutex();

return temp;
}


Note that the checks on x and y before TakeMutex() (lines 10 and 12 of the listing) are atomic reads, since this is a 32-bit architecture and int is 32 bits wide for this compiler. One or both of those conditions are unlikely to hold on 8-bit and 16-bit processors, so it's not good practice if we're looking for portability. The assembly output is:




2e14: push {r4, r5, lr}
2e18: ldr r5, [pc, #64]
2e1c: ldr r3, [r5]
2e20: cmp r3, #0 ; 0x0
2e24: blt 2e50
2e28: ldr r4, [pc, #52]
2e2c: ldr r3, [r4]
2e30: cmp r3, #0 ; 0x0
2e34: blt 2e50
2e38: bl 2d7c (takemutex)
2e3c: ldr r2, [r4]
2e40: ldr r3, [r5]
2e44: add r4, r2, r3
2e48: bl 2d80 (givemutex)
2e4c: b 2e54
2e50: mov r4, #0 ; 0x0
2e54: mov r0, r4
2e58: pop {r4, r5, lr}
2e5c: bx lr
2e60: .word 0x40004900
2e64: .word 0x40004904

Since TakeMutex() and GiveMutex() are proper functions (defined somewhere else) GCC doesn't know what they will or won't do to x and y, so the code reads them again after the function calls. The only values that were cached are the variables' addresses, which of course won't change.
However, if TakeMutex() and GiveMutex() are macros we may get into trouble:



#define TakeMutex() asm volatile("nop")
#define GiveMutex() asm volatile("nop")

int getSum( void ) {
int temp;
/* This should be in the critical section,
* however it's here to show what happens */
if ( x < 0 )
return 0;
if ( y < 0 )
return 0;

TakeMutex();

temp = x + y;

GiveMutex();

return temp;
}

I accept that a nop won't protect anything, it's just a snippet. Here is the resulting assembly code:



2dcc: ldr r3, [pc, #48]
2dd0: ldr r2, [r3]
2dd4: cmp r2, #0
2dd8: blt 2dfc
2ddc: ldr r3, [pc, #36]
2de0: ldr r0, [r3]
2de4: cmp r0, #0 ; 0x0
2de8: blt 2dfc
2dec: nop
2df0: add r0, r0, r2
2df4: nop
2df8: bx lr
2dfc: mov r0, #0 ; 0x0
2e00: bx lr
2e04: .word 0x40004900
2e08: .word 0x40004904

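The values were cached in registers before the (fake) critical section, so the sum is computed from possibly stale data. It will be a mess, and a difficult error to track down too. We can avoid this by declaring the variables as volatile.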
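
An alternative that is not part of the listings above, offered only as a sketch: GCC's extended asm accepts a "memory" clobber, which tells the compiler that the statement may read or write any memory, so values cached in registers have to be reloaded afterwards. Adding it to the macros gives the same effect as the external function calls did, without making x and y volatile. The empty asm body is a stand-in; a real macro would disable interrupts or take a lock there.

```c
int x, y;

/* Stand-ins for real lock macros. The "memory" clobber is a compiler
 * barrier only: it emits no instruction and protects nothing at run time. */
#define TakeMutex() __asm__ __volatile__("" ::: "memory")
#define GiveMutex() __asm__ __volatile__("" ::: "memory")

int getSum(void)
{
    int temp;

    if (x < 0)
        return 0;
    if (y < 0)
        return 0;

    TakeMutex();
    temp = x + y;   /* x and y must be read again after the barrier */
    GiveMutex();

    return temp;
}
```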


We could avoid all these problems by declaring everything 'suspicious' as volatile, but if we want to optimize the code and make it really tight while still programming in C then the non-volatile approach is valid too, provided we know what we are doing and which assumptions we are making about the variables' types and values.



Wednesday, November 19, 2008

Writing portable Embedded code - Pin portability

EDIT: I've posted a C++ alternative using templates here.

Portability can be hard on embedded systems. The fact that we can code in C most of the time doesn't mean code is portable. Even worse, it means that you may have to rewrite many functions/macros in order to get the 'same' code to work on another platform/chip.

A nice approach is to write a simple yet powerful HAL (Hardware Abstraction Layer), sometimes called a driver. There are many protocols and peripherals whose functions are nearly the same between different vendors and chips, such as I2C, SPI, UART and MCI controllers. For SPI we would code at least three functions: SPI_init(), SPI_tx() and SPI_rx(). Actually SPI_tx() and SPI_rx() might be the same function due to SPI's full duplex capability. There might be other functions or macros to control the chip select line. This approach works nicely for standard peripherals, but we may need to access port pins individually to perform some task or to communicate with a device by bit-banging.

Most LCD character displays work the same way and use the same protocol and control lines. A 'universal' C module for character LCD control sounds great, but pin compatibility should be addressed first: it's not attractive if we need to change tens of lines to port it.

We'll first define some useful token-concatenating macros:




#define    _CAT3(a,b,c)  a## b ##c

#define    CAT3(a,b,c)   _CAT3(a,b,c)

#define    _CAT2(a,b)    a## b

#define    CAT2(a,b)    _CAT2(a,b)



The macros are defined in two levels so that the arguments are fully macro-expanded before they are pasted together; an argument used directly with ## is pasted as written, without being expanded. Kernighan and Ritchie's wonderful C book has good information about how that works. These macros can be defined in a global header file so they can be included whenever they're needed.
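
A small experiment shows the difference. The names PASTE_DIRECT, PASTE, PORT_NUM and the two variables are invented for this sketch; the variables stand in for registers whose names contain a port number.

```c
#define PASTE_DIRECT(a, b) a##b
#define PASTE(a, b)        PASTE_DIRECT(a, b)

#define PORT_NUM 2

int reg2 = 42;         /* the name we actually want to build */
int regPORT_NUM = 7;   /* the name built when PORT_NUM is not expanded */

/* Two levels: PORT_NUM is expanded to 2 first, then pasted. */
int read_two_level(void)
{
    return PASTE(reg, PORT_NUM);          /* expands to reg2 */
}

/* One level: ## pastes the argument exactly as written. */
int read_one_level(void)
{
    return PASTE_DIRECT(reg, PORT_NUM);   /* expands to regPORT_NUM */
}
```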

Basic operations on port pins include setting a pin as output or input, clearing a bit, setting a bit and reading its value when configured as input. With that in mind I defined another header file which looks like this, made specifically for the Philips LPC23xx family (ARM7):




/* Set bit */
#define FPIN_SET(port,bit) CAT3(FIO,port,SET) = (1<<(bit))
#define FPIN_SET_(port_bit) FPIN_SET(port_bit)


/* Clear bit */
#define FPIN_CLR(port,bit) CAT3(FIO,port,CLR) = (1<<(bit))
#define FPIN_CLR_(port_bit) FPIN_CLR(port_bit)


/* Set as input */
#define FPIN_AS_INPUT(port,bit) CAT3(FIO,port,DIR) &=~(1<<(bit))
#define FPIN_AS_INPUT_(port_bit) FPIN_AS_INPUT(port_bit)

/* Set as output */
#define FPIN_AS_OUTPUT(port,bit) CAT3(FIO,port,DIR) |= (1<<(bit))
#define FPIN_AS_OUTPUT_(port_bit) FPIN_AS_OUTPUT(port_bit)


/* when used as input */
#define FPIN_ISHIGH(port,bit) ( CAT3(FIO,port,PIN) & (1<<(bit)) )
#define FPIN_ISHIGH_(port_bit) FPIN_ISHIGH(port_bit)

/* returns !=0 if pin is LOW */
#define FPIN_ISLOW(port,bit) (!( CAT3(FIO,port,PIN)& (1<<(bit)) ))
#define FPIN_ISLOW_(port_bit) FPIN_ISLOW(port_bit)

With this done we can set bit 2.1 by issuing FPIN_SET(2,1), or clear it with FPIN_CLR(2,1). The macros ending with an underscore are meant to be used when the pin position is given as a #define macro, such as:



#define LEDA 2,1
#define LEDB 2,2

FPIN_AS_OUTPUT_( LEDA );
FPIN_SET_( LEDA );
FPIN_CLR_( LEDB );




I agree this may sound complicated, but by defining all these macros it's possible to manipulate all port pins easily and in a portable way. If we want to change the pin or port LEDA is using we only need to change it once; the macros will take care of the rest.
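
To check the expansion without the hardware at hand, the register names can be replaced by plain variables, as in the sketch below. FIO2DIR, FIO2SET and FIO2CLR are declared here as ordinary variables standing in for the real registers, CAT3_INNER is just a renamed helper, and led_on() and led_off() are hypothetical functions for the example.

```c
#include <stdint.h>

/* Plain variables standing in for the LPC23xx port 2 registers,
 * so the macro expansion can be observed on a PC. */
volatile uint32_t FIO2DIR, FIO2SET, FIO2CLR;

#define CAT3_INNER(a,b,c)  a##b##c
#define CAT3(a,b,c)        CAT3_INNER(a,b,c)

#define FPIN_SET(port,bit)         CAT3(FIO,port,SET) = (1u << (bit))
#define FPIN_SET_(port_bit)        FPIN_SET(port_bit)
#define FPIN_CLR(port,bit)         CAT3(FIO,port,CLR) = (1u << (bit))
#define FPIN_CLR_(port_bit)        FPIN_CLR(port_bit)
#define FPIN_AS_OUTPUT(port,bit)   CAT3(FIO,port,DIR) |= (1u << (bit))
#define FPIN_AS_OUTPUT_(port_bit)  FPIN_AS_OUTPUT(port_bit)

#define LEDA 2,1   /* port 2, bit 1: the only line to edit when rewiring */

void led_on(void)
{
    FPIN_AS_OUTPUT_(LEDA);   /* FIO2DIR |= (1u << 1) */
    FPIN_SET_(LEDA);         /* FIO2SET  = (1u << 1) */
}

void led_off(void)
{
    FPIN_CLR_(LEDA);         /* FIO2CLR  = (1u << 1) */
}
```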

If we were to do the same on an AVR it's a question of changing the macros as shown below. Don't forget ports are named with letters (A,B,C,D...) rather than numbers.




/* Set bit */
#define FPIN_SET(port,bit) CAT2(PORT,port) |= (1<<(bit))
#define FPIN_SET_(port_bit) FPIN_SET(port_bit)


/* Clear bit */
#define FPIN_CLR(port,bit) CAT2(PORT,port) &=~(1<<(bit))
#define FPIN_CLR_(port_bit) FPIN_CLR(port_bit)


/* Set as input */
#define FPIN_AS_INPUT(port,bit) CAT2(DDR,port) &= ~(1<<(bit))
#define FPIN_AS_INPUT_(port_bit) FPIN_AS_INPUT (port_bit)

/* Set as output */
#define FPIN_AS_OUTPUT(port,bit) CAT2(DDR,port) |= (1<<(bit))
#define FPIN_AS_OUTPUT_(port_bit) FPIN_AS_OUTPUT(port_bit)


/* when used as input */
#define FPIN_ISHIGH(port,bit) ( CAT2(PIN,port) & (1<<(bit)) )
#define FPIN_ISHIGH_(port_bit) FPIN_ISHIGH(port_bit)


/* returns !=0 if LOW */
#define FPIN_ISLOW(port,bit) (!( CAT2(PIN,port) & (1<<(bit))) )
#define FPIN_ISLOW_(port_bit) FPIN_ISLOW(port_bit)




Now the LCD routines are really portable. Minor changes might be needed if there are other pin registers to modify, but the basic pin functionality is covered by the macros defined above.

Tuesday, November 18, 2008

FreeRTOS - Coding _real_ real-time interrupts



Ever since I discovered the RTOS world I've been amazed how fast, simple and beautiful coding can become.

Using an RTOS for devices with hard real-time constraints can get quite difficult, especially if the developer doesn't know the nature of the RTOS, and particularly when dealing with real-time interrupts where latency plays an important role and has to be kept as low as possible.

Successful semaphore and queue synchronization (among other RTOS facilities) requires interrupts to be disabled during some processing. Interrupt latency will vary depending on how often those resources are checked. If we want to make sure certain interrupts are serviced as fast as possible we must give them a way to be processed independently of how the RTOS and our application are behaving.

Things get worse when a TCP/IP stack gets to interact with the RTOS' tasks. As an example, lwIP is able to work with an RTOS like FreeRTOS. To do so we need to let lwIP define critical sections; if we don't, we risk system stability.
Other data processing tasks or code might need critical sections too. If that involves disabling interrupts then interrupt latency will increase. If that happens we may lose the first two letters from 'RTOS'.

A trick to overcome this is to use interrupt priorities (if available) so that time-critical interrupts are placed in one priority group, while normal (i.e. RTOS) interrupts are in another. As an example, the ARM7 LPC2xxx family from Philips has an interrupt priority mask register where individual interrupt priority groups can be masked or unmasked.

There is an important issue to consider: don't use RTOS queues/semaphores/etc. from interrupts which are not disabled by the RTOS (the time-critical interrupts). There won't be any atomic protection or critical sections for them.

This might look as if we were losing the great advantages of using an RTOS, but usually that can be solved by implementing manual queues where they're needed. Also, if real time is such an issue for those interrupts, context switching is very likely to be a problem too. That means the action should be taken directly in the interrupt, if possible.

If the interrupt belongs to an encoder the only action to take is to increment or decrement a variable. If you need to provide audio data to a DAC or another peripheral you just send data from an array (usually a double-buffered one) to the corresponding register.
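
One possible shape for such a manual queue is a single-producer, single-consumer ring buffer: the time-critical ISR only writes head, the task only writes tail, so neither side needs a critical section. This is a sketch under assumptions: a single core where aligned 32-bit reads and writes are atomic (as on ARM7), exactly one producer and one consumer, and made-up names (ring_buffer_t, rb_push, rb_pop).

```c
#include <stdbool.h>
#include <stdint.h>

#define RB_SIZE 16u   /* number of slots; must be a power of two */

typedef struct {
    volatile uint32_t head;   /* written only by the producer (ISR) */
    volatile uint32_t tail;   /* written only by the consumer (task) */
    uint16_t data[RB_SIZE];
} ring_buffer_t;

/* Producer side, called from the time-critical ISR.
 * Returns false (and drops the sample) when the buffer is full. */
bool rb_push(ring_buffer_t *rb, uint16_t sample)
{
    uint32_t head = rb->head;

    if (head - rb->tail == RB_SIZE)
        return false;
    rb->data[head & (RB_SIZE - 1u)] = sample;
    rb->head = head + 1u;     /* publish only after the data is stored */
    return true;
}

/* Consumer side, called from a task. Returns false when empty. */
bool rb_pop(ring_buffer_t *rb, uint16_t *sample)
{
    uint32_t tail = rb->tail;

    if (tail == rb->head)
        return false;
    *sample = rb->data[tail & (RB_SIZE - 1u)];
    rb->tail = tail + 1u;     /* free the slot only after reading it */
    return true;
}
```

Since the ISR can't wake a task through the RTOS, the task has to poll the buffer, for example from a periodic task running fast enough not to let it fill up.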

The only change to be made to FreeRTOS is inside the port's code, in particular the macros used to disable and re-enable interrupts in portmacro.h. If Thumb mode is needed, portISR.c needs to be changed too.





#define portDISABLE_INTERRUPTS()    do{ VICSWPrioMask    = (1<<1); } while(0)
#define portENABLE_INTERRUPTS()        do{  VICSWPrioMask    = 0xFFFF; } while(0)




It's important to remember that we don't need to save or restore any context information in our critical interrupt handlers, since they won't interact with the RTOS (at least not directly). Such an ISR is coded like any ISR without an RTOS. That makes it faster than the ISRs that need queue or semaphore management. Of course, real critical sections are still needed when sharing information with our ISR, but that can be done by disabling all the interrupts, just like portDISABLE_INTERRUPTS() did before we changed it.

UPDATE: Later I found that portDISABLE_INTERRUPTS() and portENABLE_INTERRUPTS() are not the only macros used for enabling/disabling interrupts. There are other functions named vPortEnterCritical() and vPortExitCritical() which are used extensively throughout the FreeRTOS code, so those should be changed too. However, I haven't tried this yet.