Functions

Fixpoint functions

Fixpoint can be used as a fast approximation to float point. More...

Functions

long long vabs32 (int x, int y)
 Calculate the length and angle of a vector.
long long vrot32 (int x, int y, int a)
 Rotate a vector by an angle.
void fft32c (int *data, const int *coef, unsigned pow2, int idft)
 Complex fast Fourier transform operating on 32-bit fixpoint data.
void fft32r (int *data, const int *coef, unsigned pow2, int idft)
 Real valued fast Fourier transform operating on 32-bit fixpoint data.
void fft16c (short *data, const short *coef, unsigned pow2, int idft)
 Complex fast Fourier transform operating on 16-bit fixpoint data.
void fft16r (short *data, const short *coef, unsigned pow2, int idft)
 Real fast Fourier transform operating on 16-bit fixpoint data.
unsigned sqrt32 (unsigned x)
 Fixpoint square root.
unsigned rec32 (unsigned x)
 Reciprocal of a 32-bit fixpoint number.
int mac32 (int x, int y, int a, unsigned s)
 Fixpoint multiply-accumulate function.
unsigned log32 (unsigned x)
 Base-2 logarithm of a fixpoint number.
unsigned exp32 (unsigned x)
 Base-2 exponential of a fixpoint number.

Detailed Description

Fixpoint can be used as a fast approximation to float point.

These functions can come handy in signal processing and control applications where you use fixpoint format data. Using fixpoint data needs a little getting used to it, but if you keep a good track of your radix point at all times, the fact that you use integer arithmetics instead of float one can save a lot of computing time. Naturally, your precision is not as high as by using doubles and the range of values you can represent is severely restricted, but usually so is your input and output and as long as you keep an eye on the magnitude of intermediate results, there will not be any problems.

If you are not familiar with fixpoint numbers, here is an extremely brief introduction. The idea is that you use integer numbers but you treat them as if they were composed of a whole part and a fractional part. An integer number, like 123 is actually the number 123.000... Nothing stops you to imagine that you move the decimal point to the left by two digits, so when a computation delivers the integer result of 123 you imagine that it actually means 1.23, by "moving" the radix point to the left. Usually the notation one uses is i.f where i is the number of bits that are treated as the integer part of the number and f is the number of bits that are treated as the fractional part of the number. The sum of i and f is the actual word length. Thus, if you use 4.28 format it means that the top 4 bits of the 32-bit word represent the integer part, the bottom 28 bits are the fractional part. In that format 1.0 corresponds to the binary value of 0x10000000 while 0.75 would be 0x0c000000.

As long as your operations are all linear, the position of the radix point is irrelevant. The binary result will be the same regardless of where you imagine the radix point to be. 56+67=123 while 5.6+6.7=12.3, it doesn't matter that you imagined that the decimal point is moved by one digit to the left. No matter how much you move the radix point, it will always work. However, for nonlinear operations that is no longer true. Consider that 56*67=3752 (radix point is after the last digit) but of course, if you move the radix point to the left by one digit in the multiplicands you get 5.6*6.7=32.52, and as you can see the radix point of the result is not at the same place as in the multiplcands. That's where you have to be very careful. There are some nonlinear operators where you can move the radix point in the input and easily work out the new position of the radix point of the output. When you multiply, every time you move the radix 1 bit to the left on the inputs, you will have a result that has its radix point moved by two positions. With square root it is the opposite, moving the radix point of the argument by two places will make the radix point of the result to move by 1 place. The radix point of division remains unmoved when you move the radix of the divident and the divisor by the same amount. There are nonlinear operators, such as the trigonometric functions, where the position of the radix point simply can not be changed. In that case, if you have an argument that is not in the correct format, you have to shift it before passing to the function so that the radix point is at the position where the function expects it. Similarly, you may have to shift the result of such a function to match your preferred radix position. For example, if you have say a sine function that expects a 4.28 input (in radians) and generates the result in the same format, but you otherwise use 2.30 format, then you have to shift you argument to the right by two bits before passing to the function and then shift the result to the left by two bits.

Unlike working with float-point numbers, the fixpoint format requires you to keep track of the radix point at each step. In addition, since your range is limited (in 4.28 signed format the largest number you can represent is 7.999...) you also have to be vigilant about possible overflow conditions. You have to make sure that no partial result is ever outside of the allowed range. This is of course true for all integer calculations as well, but while a 32-bit integer when procesing 8 or 10 bit data provides ample space for the extra bits, when you have maybe 3 or 4 integer bits, you can't just add, say, four numbers together and then divide the result by 4 to calculate their average, because adding the 4 numbers needs 2 extra bits on the left which you may not have. An other thing you have to consider is precision. If you divide a float point number by a power of 2, its precision will not change. It is because multiplication or division by a power of 2 does not touch the mantissa of the number, only its exponent. Your mantissa will contain the same number of significant bits as before. However, the equivalent operation on a fixpoint number is shifting to the right for division or to the left for multiplication. Shifting a fixpont number to the left can easily cause overflow, as detailed above. On the other hand, shifting a fixpoint number to the right will lose bits and thus precision. Therefore, you have to keep an eye on the precision angle of things as well.

Nevertheless, if you are willing to put in the effort, in exchange you will get functionality that normally requires float point calculations while using simple integer arithmetic. The speed increase can be quite significant. Consider that fixpoint addition is actually an integer addition, a single machine instruction executing in 1 clock while a float add takes about 60 clock cycles. This is, of course, the most extreme example but addition is easily the most frequent operation. Multiplication, probably the second most frequent operation, is about 3 times faster for fixpoint operations than for floats, not as stark a contrast as with addition but still a significant advantage.

This collection of functions in the library provides you basic fixed-point functionality. These routines are written in assembly, therefore they run in the same time regardless of whether you compile for the ARM or the THUMB instruction set.

To use these functions you have to include fixpoint.h.

The table below give you an indication of their performance. The execution times were measured on an LPC23xx chip running at 60MHz.

     vrot32() 8.7 µs           log32()  7.9 µs
     vabs32() 8.2 µs           exp32()  6.6 µs
     sqrt32() 3.5 µs           rec32()  2.5 µs
     mac32()  0.5 µs


An other operation quite often used in signal processing is the Fourier transform. The library contains both complex and real valued transforms and their inverses, using either 32 or 16 bit datum. The Fourier routines are all written in assembly, thus the execution speed does not depend on whether you compile for ARM or THUMB. The following table gives you an idea about the execution time for various transform sizes, using 32-bit transform. The 16-bit versions are approximately 10% faster. The times were measured on an LPC23xx running at 60MHz.

       Fourier transform execution time for various N-s, [ms]

     Transform         N = 16    32    64   128   256   512  1024
-----------------------------------------------------------------
2N real <-> N cplx       0.07  0.16  0.34  0.75  1.63  3.52  7.57
 N cplx <-> N cplx       0.06  0.14  0.30  0.67  1.47  3.19  6.90

Function Documentation

unsigned exp32 ( unsigned  x  ) 

Base-2 exponential of a fixpoint number.

The function calculates the base-2 exponential of its argument and returns the result minus 1. The argument is interpreted in 0.32 format, (the radix point is just left the MSB), that is 0x80000000 represents 0.5. Thus, the argument is in the range of (1,0]. The function raises 2 to the power of the argument and returns the fractional part of the result, in the same 0.32 format. The integer part of the result, which is always 1, is discarded.

exp32(x) := 2x - 1

Using this function as the core, it is very easy to write functions that use a different radix point location or a base other than 2.

The result is accurate to 29 bits.

Parameters:
x The function argument
Returns:
The fractional part of the base-2 exponent of x
void fft16c ( short *  data,
const short *  coef,
unsigned  pow2,
int  idft 
)

Complex fast Fourier transform operating on 16-bit fixpoint data.

This function performs a discrete Fourier transformation or inverse discrete Fourier transformation on a block of data. The number of data elements must be a power of two. The elements are represented as complex fixpoint numbers, both the real and imaginary parts occupying a 16-bit word.

This function is equivalent to the fft32c() function, except that both the data and the coefficients are stored in 16-bit shorts instead of 32-bit ints. This has the following consequences:

  • Your data can have at most 13 significant bits, thus, if you interpret it as integers, your range is [-0x1fff,+0x1fff] and if you interpret it as a fixpoint number with 3 integer and 13 fractional bits then your valid range is in (-1.0,+1.0).
  • When generating the sin() table that is passed in the coef argument, #define SCALE should be set to 32768.

For all other aspects the description of fft32c() applies. Note that the array of shorts that is the data must be aligned at an int boundary, the coefficient array can be aligned on a short boundary.

Parameters:
data Pointer to an array of short integers, the array is 2pow2+1 elements long. The array will be transformed in place. Although it is an array of shorts, it must be aligned on a 32-bit word boundary.
coef Pointer to an array of shorts that represent the values of the sin(x) function values between 0 and PI/2. The array has 2pow2-2+1 elements.
pow2 The base-2 logarithm of the number of data points. This number must be at least 2, that is, the smallest transform you can perform is 4 points. Note that the function does not check for validity of this argument, if you pass an incorrect number, the routine will crash.
idft If this argument is zero, then a forward transform is performed, if it is non-zero, then inverse transform is performed.
void fft16r ( short *  data,
const short *  coef,
unsigned  pow2,
int  idft 
)

Real fast Fourier transform operating on 16-bit fixpoint data.

This function performs a discrete Fourier transformation or inverse discrete Fourier transformation on a block of data. The number of data elements must be a power of two. The data elements are represented as fixpoint numbers while the transformed domain is represented as fixpoint complex numbers.

This function is equivalent to the fft32r() function, except that both the data and the coefficients are stored in 16-bit shorts instead of 32-bit ints. This has the following consequences:

  • Your data can have at most 13 significant bits, thus, if you interpret it as integers, your range is [-0x1fff,+0x1fff] and if you interpret it as a fixpoint number with 3 integer and 13 fractional bits then your valid range is in (-1.0,+1.0).
  • When generating the sin() table that is passed in the coef argument, #define SCALE should be set to 32768.

For all other aspects the description of fft32r() applies. Note that the array of shorts that is the data must be aligned at an int boundary, the coefficient array can be aligned on a short boundary.

Parameters:
data Pointer to an array of short integers, the array is 2pow2 elements long. The array will be transformed in place. Although it is an array of shorts, it must be aligned on a 32-bit word boundary.
coef Pointer to an array of shorts that represent the values of the sin(x) function values between 0 and PI/2. The array has 2pow2-2+1 elements. Note that the values are not in simple linearly increasing angular order.
pow2 The base-2 logarithm of the number of real data points. This number must be least 3, that is, the smallest transform you can perform is 8 real (and thus 4 complex) points. Note that the function does not check for validity of this argument, if you pass an incorrect number, the routine will crash.
idft If this argument is zero, then a forward transform is performed, if it is non-zero, then inverse transform is performed.
void fft32c ( int *  data,
const int *  coef,
unsigned  pow2,
int  idft 
)

Complex fast Fourier transform operating on 32-bit fixpoint data.

This function performs a discrete Fourier transformation or inverse discrete Fourier transformation on a block of data. The number of data elements must be a power of two. The elements are represented as complex fixpoint numbers, both the real and imaginary parts occupying a 32-bit word.

Since the Fourier transformation is a linear operator, the position of the radix point is irrelevant. However, there is a restriction that you can have at most 29 significant bits. Therefore, if your radix point is after the LSB of the word, your data is integers in the range of +/- 0x1ffffffff. On the other end, if your radix point is after the third most significant bit of the word, the data represents values between (-1,+1) with 29 fractional bits; you can have anything in between these two extremities.

Mathematically the definition of the Fourier transform and its inverse involves a scale factor. This implementation does not use the proper mathematical scaling. Instead, it does the transform the following way: If you treat your data as fixpoint numbers with 29 bits fractional part and your data is guaranteed to be within (-1,+1), then the forward transform will yield coefficient values in the range of (-1,+1) for F(0) and F(N/2) and (-0.5,+0.5) for all others, in the same fixpont format. If you now inverse transform that array, you will get your original data back (except for round-off errors).

Since the transform operates on fixpoint data, it is important that you provide as many bits as possible, to minimise calculation artefacts. If your data set is say 12-bit integers, you should shift them up by 17 bits before passing it to the function and shift the result down by 17 bits. This will reduce the effect of rounding errors (effectively a random noise that has a magnutide of a few LSBs).

The transformation is in place, the transformation result replaces the data. For the sake of simplicity we will call the original and the transformed space time domain and frequency domain, respectively, so that we can use simple engineering terms. If you have N data points in the time domain, corresponding to 0, t, 2t, 3t, ..., (N-1)t where t is the sampling period and you transform it, you will get the Fourier coefficients at 0, +f, -f, +2f, -2f, .. +(N/2-1)f, -(N/2-1)f and (N/2)f, where f is 1/(N*t). The first element is usually called the DC component while the last element, (N/2)f is the Nyquist frequency, i.e. half of 1/t, the sampling frequency. Since everything is complex, you have an array of 2N integers. The data is arranged in the array as follows (Re and Im denote the real and imaginary parts of a complex number, respectively):

        Array       Time            Freq.
        index       domain          domain
        ----------------------------------
        0           Re 0            Re 0
        1           Im 0            Im 0
        2           Re t            Re +f
        3           Im t            Im +f
        4           Re 2t           Re +2f
        5           Im 2t           Im +2f
        ...         ...             ...
        N-2         Re (N/2-1)t     Re +(N/2-1)f
        N-1         Im (N/2-1)t     Im +(N/2-1)f
        N           Re (N/2)t       Re (N/2)f
        N+1         Im (N/2)t       Im (N/2)f
        N+2         Re (N/2+1)t     Re -(N/2-1)f
        N+3         Im (N/2+1)t     Im -(N/2-1)f
        ...         ...             ...
        2N-4        Re (N-2)t       Re -2f
        2N-3        Im (N-2)t       Im -2f
        2N-2        Re (N-1)t       Re -f
        2N-1        Im (N-1)t       Im -f

To do the transformation, the routine needs a table of the sine values between 0 and PI/2. The table has N/4+1 elements. This table can be constructed by the following simple C routine:

    #include <math.h>

    #define PIo2    1.570796326794896619256404479703093102216
    #define SCALE   1073741824


    void fft32c_table_gen( int *table, int pow2 )
    {
    int     i, n;
    double  s;

        n = 1 << ( pow2 - 2 );
        s = PIo2 / n;

        for ( i = 0 ; i <= n ; i++ )

            table[ i ] = round( sin( s * i ) * SCALE );
    }

This table is consulted in the innermost loop of the transform. Therefore, the difference between the access cost of data stored in internal RAM or in the FLASH will have a direct, perceptible impact on the speed of the function. It depends on the chip in question, on an LPC23xx chip you can expect a 10-15% speed penalty if you store the table in FLASH.

This is a complex transform; if you want to transform purely real data, you should look at the fft32r() function which transforms between real time and complex frequency domains.

Parameters:
data Pointer to an array of integers, the array is 2pow2+1 elements long. The array will be transformed in place.
coef Pointer to an array of integers that represent the values of the sin(x) function values between 0 and PI/2. The array has 2pow2-2+1 elements.
pow2 The base-2 logarithm of the number of data points. This number must be at least 2, that is, the smallest transform you can perform is 4 points. Note that the function does not check for validity of this argument, if you pass an incorrect number, the routine will crash.
idft If this argument is zero, then a forward transform is performed, if it is non-zero, then inverse transform is performed.
void fft32r ( int *  data,
const int *  coef,
unsigned  pow2,
int  idft 
)

Real valued fast Fourier transform operating on 32-bit fixpoint data.

This function performs a discrete Fourier transformation or inverse discrete Fourier transformation on a block of data. The number of data elements must be a power of two. The data elements are represented as fixpoint numbers while the transformed domain is represented as fixpoint complex numbers.

The restrictions on the data format are identical to that of the fft32c() function, as are the scaling properties, except that for this function the resulting coefficients are all in the (+1,-1) range, that is, there is no difference between the 0th and (N/2)f coefficients and the rest.

The transformation is in place, the transformation result replaces the data. Using the same terms as for the fft32c() function, the data arrangement in memory is the following; note that for real-valued transforms the imaginary part of the DC and the Nyquist frequency coefficients are always zero, thus omitted:

        Array       Time            Freq.
        index       domain          domain
        ----------------------------------
        0           0               Re 0
        1           t               Re +(N/2)f
        2           2t              Re +f
        3           3t              Im +f
        4           4t              Re +2f
        5           5t              Im +2f
        ...         ...             ...
        N-4         (N-4)t          Re +(N/2-2)f
        N-3         (N-3)t          Im +(N/2-2)f
        N-2         (N-2)t          Re +(N/2-1)f
        N-1         (N-1)t          Im +(N/2-1)f

To do the transformation, the routine needs a table of the sine values between 0 and PI/2. The table has N/4+1 elements. This table can be constructed by the following simple C routine, note that it is different from the one used for fft32c():

    #include <math.h>

    #define PIo2    1.570796326794896619256404479703093102216
    #define SCALE   1073741824


    void fft32r_table_gen( int *table, int pow2 )
    {
    int     i, n;
    double  s, o;

        n = 1 << ( pow2 - 3 );
        s = PIo2 / n;
        o = -s / 2;

        for ( i = 0 ; i <= n ; i++ )

            table[ i ] = round( sin( s * i ) * SCALE );

        for ( i = 1 ; i < n ; i++ )

            table[ n + i ] = round( sin( s * i + o ) * SCALE );
    }

The effect of storing the table in RAM or in FLASH is the same as with the fft32c() function.

This function handles only real data in the time domain, if you need complex to complex Fourier transform, you should use the fft32c() function.

Parameters:
data Pointer to an array of integers, the array is 2pow2 elements long. The array will be transformed in place.
coef Pointer to an array of integers that represent the values of the sin(x) function values between 0 and PI/2. The array has 2pow2-2+1 elements. Note that the values are not in simple linearly increasing angular order.
pow2 The base-2 logarithm of the number of real data points. This number must be least 3, that is, the smallest transform you can perform is 8 real (and thus 4 complex) points. Note that the function does not check for validity of this argument, if you pass an incorrect number, the routine will crash.
idft If this argument is zero, then a forward transform is performed, if it is non-zero, then inverse transform is performed.
unsigned log32 ( unsigned  x  ) 

Base-2 logarithm of a fixpoint number.

The function calculates the base-2 logarithm of its argument plus 1. The argument is interpreted in 0.32 format and is in the range of (1,0]. The function then adds 1.0 to the argument and calculates the logarithm of that number. The value returned is in the same 0.32 format.

log32(x) := log2( x + 1 )

Using this function you can easily build functions that can calculate the logarithm of numbers with arbitrary radix point location. Also, changing the base of the logarithm to values other that 2 is trivial. The ldzcnt() and umul32x32_64() functions will prove to be very useful for that.

The result is accurate to 29 bits.

Parameters:
x The function argument
Returns:
The base-2 logarithm of 1+x
int mac32 ( int  x,
int  y,
int  a,
unsigned  s 
)

Fixpoint multiply-accumulate function.

The function multiplies its first two arguments as signed fixpoint numbers and then adds the third argument to the product and returns the result. The three arguments can all have different radix position. The result has the same radix position as that of the third argument.

The last argument tells the function how to format the multiplication result to mach the format of the additive argument. The number represents the sum of the number of integer bits in the multiplicands, minus the number of integer bits in the accumulator. For example, if all numbers are in 4.28 format then the last argument should be (4+4)-4 = 4. If one of the multiplicands is an integer (that is, it is in 32.0 format) while the other multiplicand is a 8.24 format and the accumulator is in 16.16 format then the last argumen should be (32+8)-16 = 24. Note that this number must be between 32 and 0.

Parameters:
x One of the multiplicands
y The other multiplicand
a The additive tag
s The scale factor, the difference between the number of integer bits in the multiplication result and in the additive argument, between 0 and 32.
Returns:
The value of a + SCALE_AND_ROUND( x * y )
unsigned rec32 ( unsigned  x  ) 

Reciprocal of a 32-bit fixpoint number.

This function calculates the reciprocal of a number. The argument is in 0.32 format (that is, the radix point is just left of its MSB) and it must be in the range of (1/4,1). The result is in 2.30 format (the radix point is between bit 30 and bit 29) and is in the range of [1,4). Note that an argument of 1/4 (corresponding to the 32-bit integer value of 0x40000000) or less generates a result that is not representable in the 2.30 format, therefore the function will return garbage.

The result is accurate to 31 significant bits.

Parameters:
x A number in 0.32 format, in the range of (1/4,1)
Returns:
1/x, in 2.30 format, in the range of [1,4)
unsigned sqrt32 ( unsigned  x  ) 

Fixpoint square root.

This function takes a number, which is in 1.31 fixpoint format, that is, the radix point is after the MSB of the word. The return value is the square root of that number, in the same format.

The result is accurate to 31 significant bits

Parameters:
x The number to calculate the square root of, it should be in 1.31 fixpoint format and within the range of [0.0,2.0).
Returns:
The square root of x, in the same 1.31 format and in the range of [0.0,sqrt(2))
long long vabs32 ( int  x,
int  y 
)

Calculate the length and angle of a vector.

This function processes a vector presented in Cartesian format and returns its length and angle. That is, it transforms the Cartesian represenation to polar form. The X and Y coordinates can be integers or fixpoint numbers with the radix point at any place; the only restriction is that they must have at most 28 significant bits. Thus, the most significant 4 bits must be all 0s or all 1s. If that criterion is not met, then the returned values will be garbage.

The function returns the length of the vector, in the same format as X and Y were presented. The length is always positive. Furthermore, the function returns the angle of the vector. The returned angle is a fix-point number which represents the angle in radians, multiplied by 228. That is, the radix point is after the 4th most significant bit of the 32 bit integer. The returned value is in [-PI,+PI].

In effect, the function calculates:

    length = sqrt( x*x + y*y ); // Length of the vector
    angle  = atan2( y, x );     // Qaudrant correct angle of the vector

The two returned values are concatenated into a 64-bit integer, in that way the use of a structure can be avoided. To decompose the result the following simple code fragment can be used:

    int x, y;               // The vector
    int angle, length;      // The returned values
    long long temp;         // A temporary variable

    temp = vabs32( x, y );
    angle = temp >> 32;     // The angle is in the upper 32 bits
    length = temp;          // The length is in the lower 32 bits

The result is accurate to 27 bits.

Parameters:
x The X coordinate of the vector
y The Y coordinate of the vector
Returns:
A 64-bit integer that in its upper 32 bits contains the angle of the vector, in signed 4.28 format and the lower 32 bits is the length of the vector, in the same format as the X and Y coordinates were. The length of the vector is always positive.
long long vrot32 ( int  x,
int  y,
int  a 
)

Rotate a vector by an angle.

This function rotates a vector by a given angle. The angle is represented by a signed fixpoint number, where the radix point is after the 4th most significant bit. The angle is expressed in radians. Any angle representable in the 4.28 format is allowed.

The vector components are signed integers (or fixpoint numbers, but the position of the radix point is up to you). Up to 28 significant bits are allowed, that is, the top 4 bits must be either all 0s or all 1s. If that's not the case, there will be internal overflows and you will get garbage in the result.

You can use this function to rotate a vector by a given angle, or, if you set x to the length of the vector and y to 0, to transform from polar to Cartesian representation of a vector or, by choosing the initial vector of (1,0) in your preferred representation, to simultaneously calculate the sine and the cosine of an angle.

In effect, the function calculates:

    x_out = x * cos( a ) - y * sin( a );
    y_out = x * sin( a ) + y * cos( a );

The returned X and Y valueas are packed into a 64-bit integer. To decompose the result into its components, the following code fragment can be used:

    int x, y, a;
    long long t;

    t = vrot32( x, y, a );
    x = t;
    y = t >> 32;

The result is accurate to 27 bits.

Parameters:
x The X coordinate of the vector to be rotated
y The Y coordinate of the vector to be rotated
a Rotational angle, in radians, multiplied by 228. This format allows you to represent angles in (-8,+8), which is rougly -458 to +485 degrees, that is, more than a full circle in each direction. The angular resolution is 2-28 radians, that is, about 0.0008 seconds of arc.
Returns:
A 64-bit integer that contains the rotated vector's X coordinate on its lower 32 bits and the Y coordinate on its upper 32 bits.