Wednesday, March 25, 2009

Loop unrolling and speed-up techniques

Today I started some investigation on DTMF detection. After some research I decided to try the Goertzel algorithm, wrote some code from scratch and tried it on a real equipment based on an ARM7 core. It did fine actually, but after some benchmarking I thought it was somewhat slow.

I wrote the code thinking on future optimization but keeping simplicity in mind, specially because I didn't know how it was going to behave. However, as said above, calculations took much longer than what I expected, even though they were quite fast, exceeding the requirements for real-time processing (telephone channel, 8khz, mono). I still wanted to keep the code in C for portability so I tried some tricks.

I wrote the code with fixed-point math in mind, since floating point arithmetic would kill performance.

First, here are some basic definitions:



/* Struct for a single Goertzel calculator */
typedef struct
{
dsp_t sp1,sp2; //past values (IIR)
dsp_t coeff; //calculated only once

} GOERTZ_s;

/* Calculate Goertzel 'step' */
#define GOERTZ_calc( _s, _val ) \
{ \
dsp_t _gs = (_val) + FP_MULT2( _s.coeff, _s.sp1 ) - _s.sp2; \
_s.sp2 = _s.sp1; \
_s.sp1 = _gs; \
}

The first approach was the simplest one, I needed sixteen Goertzel calculations (8 frequencies + 8 second harmonic for the first frequencies). I had a buffer where the incoming PCM data was copied to so I had to process that array for each of the 16 Goertzel 'calculators'.



static inline void GOERTZ_processAll( dsp_t sample )
{
int i;
for (i=0; i < 16; i++)
GOERTZ_calc( goertzs[i], sample );
}

/* this is inside another function */
int i;
for (i=0; i < BUFFER_SIZE; i++)
GOERTZ_processAll( bufferData[i] );


Where BUFFER_SIZE is the size of bufferData[] and goertzs[] is a 16-element array of GOERTZ_s structs, already initialized. Notice the inline modifier in GOERTZ_processAll(). It's really important since function inlining will help in execution time, specially when the function is called so frequently.

That code worked alright, but was too slow, at least for my intuition. The progress into a more efficient scheme had many steps, but basically there were three important changes:

  • dsp_t was defined as int16_t, but the processor's 'native' word length is 32-bit, so changing dsp_t to be int32_t speeded up the calculations by removing cast and special assignment instructions.
  • Loop unrolling was done in GOERTZ_processAll(). That means more flash is used but it's a good price to pay for a good speed up. Besides that, here we are talking about repeating the same thing 16 times so it's not a big problem if we have, to say, a 512k flash microcontroller or dsp.
  • Instead of passing a single value to GOERTZ_processAll() a 8-byte array is used. This improves time considerably but also increases code size. As said in the previous point that wasn't an issue this case. Note that a 16-byte array won't necessarily increase performance. It depends on the core architecture, and in this case it made things worse (but better than single value parameter passing).

With these modifications I got a 100% performance increase, which means that there can be twice as much DTMF decoders compared to the non-optimized version.

Here is the code with the modifications:



static inline void GOERTZ_processAll( dsp_t samples[8] )
{
#define DO(j) \
GOERTZ_calc( goertzs[j], samples[0] ); \
GOERTZ_calc( goertzs[j], samples[1] ); \
GOERTZ_calc( goertzs[j], samples[2] ); \
GOERTZ_calc( goertzs[j], samples[3] ); \
GOERTZ_calc( goertzs[j], samples[4] ); \
GOERTZ_calc( goertzs[j], samples[5] ); \
GOERTZ_calc( goertzs[j], samples[6] ); \
GOERTZ_calc( goertzs[j], samples[7] );

DO(0);
DO(1);
DO(2);
DO(3);
DO(4);
DO(5);
DO(6);
DO(7);
DO(8);
DO(9);
DO(10);
DO(11);
DO(12);
DO(13);
DO(14);
DO(15);

#undef DO
}

/* somewhere in another function */
int i;
for (i=0; i < BUFFER_SIZE; i +=8 )
GOERTZ_processAll( bufferData + i );

But it can do better...

After observing how good performance went after this modifications I tried to do more loop unrolling to see if I could improve it. If the data buffer is large enough then the for loop that process 8 samples at a time wastes useful CPU instructions, so I included it inside a function called GOERTZ_processBuffer() which takes a buffer pointer and its length. There was another 100% performance increase from the previous optimization. This means a 4x speed up from the original code! Note that, once again, more flash is being used for loop unrolling.



static inline void GOERTZ_processBuffer( dsp_t *samples, int num )
{
#define DO(j) \
for ( i=0; i < num; i+=32 ) { \
GOERTZ_calc( gss[j], samples[0+i] ); \
GOERTZ_calc( gss[j], samples[1+i] ); \
GOERTZ_calc( gss[j], samples[2+i] ); \
GOERTZ_calc( gss[j], samples[3+i] ); \
... up to 32 .. ; }


DO(0);
DO(1);
DO(2);
DO(3);
DO(4);
DO(5);
DO(6);
DO(7);
DO(8);
DO(9);
DO(10);
DO(11);
DO(12);
DO(13);
DO(14);
DO(15);

#undef DO
}

Wednesday, March 18, 2009

QTimer and no monotonic clock support

I found myself dealing with qt-embedded and QTimers again. There's a board whose software configuration doesn't support monotonic clocks so changing the date back in time causes running QTimer's to cease activity.

I chose to patch qt-embedded 4.5.0. The fix is simple, I just had replace this function in src/corelib/kernel/qeventdispatcher_unix.cpp



void QTimerInfoList::registerTimer(int timerId, int interval, QObject *object)
{
QTimerInfo *t = new QTimerInfo;
t->id = timerId;
t->interval.tv_sec = interval / 1000;
t->interval.tv_usec = (interval % 1000) * 1000;
t->timeout = updateCurrentTime() + t->interval;
t->obj = object;
t->inTimerEvent = false;

timerInsert(t);
}

by this one:



void QTimerInfoList::registerTimer(int timerId, int interval, QObject *object)
{
/** add this 2 lines */
updateCurrentTime();
repairTimersIfNeeded();

QTimerInfo *t = new QTimerInfo;
t->id = timerId;
t->interval.tv_sec = interval / 1000;
t->interval.tv_usec = (interval % 1000) * 1000;
t->timeout = updateCurrentTime() + t->interval;
t->obj = object;
t->inTimerEvent = false;

timerInsert(t);
}

After that qt-embedded should be recompiled (a full recompilation isn't needed, the only file that needs to be compiled again is the one modified, and then a full relink and install; make will do the job).

What happens is that everytime a timer is registered, Qt will update the current time and try to fix the timers if it's needed. This will make newly registered timers work after the date is changed backwards, but old timers won't run properly. I created a class which tracks the QTimers present in the application so that it is able to stop and start them again. This is not the best solution but I had to do a quick fix on this and it works.

Wednesday, March 4, 2009

Some code metrics

I recently received an email with Jack Ganssle's last Embedded Muse, mentioning two tools to calculate code metrics. I wanted to try one of them so I downloaded SourceMonitor.

I ran SourceMonitor and computed the statistics for one of the largest projects I'm working on (embedded ARM7 processor, ethernet, USB along with live audio recording/playing among other functions). I got impressed by some numbers, here are the results:


NOTE:SourceMonitor processed my code only, excluding the TCP/IP stack and the FreeRTOS code which is part of the firmware too.







































Files191
Lines34,019
Statements12,765
Percent Branch Statements19.9
Percent Lines with Comments26.1
Functions538
Average Statements per
Function
25.7
Average Block Depth1.54

One of the first things that impressed me was the fact that 26% of the lines are comments, which is something I'm glad of, considering how hard it can be to maintain such a large project, specially if someone else who never got in touch with this code needs to change or add some functionality or worse: correct a bug. However, there is a trick here which may help SourceMonitor to show a percentage way high from an intuitive value: I usually comment functions in Doxygen style, so one to two lines are 'wasted' in spite of code clarity.

Regarding real statements, 12,765 lines mean about 37% of the total line count. This may look like an unbelievable lie, but actually it's due to many blank lines and comments that improve code readability. SourceMonitor's help clearly explains what 'statements' mean for C coding: "Statements: in C, computational statements are terminated with a semicolon character. Branches such as if, for, while, and goto are also counted as statements. Preprocessor directives #include, #define, and #undef are counted as statements. All other preprocessor directives are ignored. In addition all statements between an #else or #elif statement and its closing #endif statement are ignored, to eliminate fractured block"

Another interesting value is the one regarding the average of statements per function. Modularization and splitting code into relatively is a well known good practice that helps to ease code understanding and extension and also bug solving.

There are other metrics not calculated by SourceMonitor which would be nice to investigate, like preprocessor usage (#define, #else, #if, etc) which is something I often abuse of (in a good sense). It would be nice to be able to count the number of defined macros and macro usage too.

It is also important to remember that code metrics is just one side of the code. Lines can be counted and many statistics can be plotted but code organization is not something easy to measure. I could now say that I spent 26% of the time writing comments, which would sound scary to some people. I could also say I spent 37% of the time coding real statements. Both are big lies. Most of the time is spent thinking on how to implement this or that functionality and probably testing it does right (debugging takes time too). Coding is usually quite straight forward once one's brain is organized: that is, of course, not measurable with SourceMonitor.

UPDATE: I downloaded cloc and ran it over the same code, obtaining nearly 7200 blank lines which is about 21%, which I guess should have it's own post ;) The linux kernel 2.6.26 had about 13% of blank lines (see here), I guess I might be overcoding for beauty...