Spru565b - TMS320C64x DSP Library Programmer's Reference
Spru565b - TMS320C64x DSP Library Programmer's Reference
Programmer’s Reference
Texas Instruments Incorporated and its subsidiaries (TI) reserve the right to make corrections, modifications,
enhancements, improvements, and other changes to its products and services at any time and to discontinue
any product or service without notice. Customers should obtain the latest relevant information before placing
orders and should verify that such information is current and complete. All products are sold subject to TI’s terms
and conditions of sale supplied at the time of order acknowledgment.
TI warrants performance of its hardware products to the specifications applicable at the time of sale in
accordance with TI’s standard warranty. Testing and other quality control techniques are used to the extent TI
deems necessary to support this warranty. Except where mandated by government requirements, testing of all
parameters of each product is not necessarily performed.
TI assumes no liability for applications assistance or customer product design. Customers are responsible for
their products and applications using TI components. To minimize the risks associated with customer products
and applications, customers should provide adequate design and operating safeguards.
TI does not warrant or represent that any license, either express or implied, is granted under any TI patent right,
copyright, mask work right, or other TI intellectual property right relating to any combination, machine, or process
in which TI products or services are used. Information published by TI regarding third-party products or services
does not constitute a license from TI to use such products or services or a warranty or endorsement thereof.
Use of such information may require a license from a third party under the patents or other intellectual property
of the third party, or a license from TI under the patents or other intellectual property of TI.
Reproduction of information in TI data books or data sheets is permissible only if reproduction is without
alteration and is accompanied by all associated warranties, conditions, limitations, and notices. Reproduction
of this information with alteration is an unfair and deceptive business practice. TI is not responsible or liable for
such altered documentation.
Resale of TI products or services with statements different from or beyond the parameters stated by TI for that
product or service voids all express and any implied warranties for the associated TI product or service and
is an unfair and deceptive business practice. TI is not responsible or liable for any such statements.
Following are URLs where you can obtain information on other Texas Instruments products and application
solutions:
Products Applications
Amplifiers amplifier.ti.com Audio www.ti.com/audio
Data Converters dataconverter.ti.com Automotive www.ti.com/automotive
DSP dsp.ti.com Broadband www.ti.com/broadband
Interface interface.ti.com Digital Control www.ti.com/digitalcontrol
Logic logic.ti.com Military www.ti.com/military
Power Mgmt power.ti.com Optical Networking www.ti.com/opticalnetwork
Microcontrollers microcontroller.ti.com Security www.ti.com/security
Telephony www.ti.com/telephony
Video & Imaging www.ti.com/video
Wireless www.ti.com/wireless
Notational Conventions
- Macro names are written in uppercase text; function names are written in
lowercase.
The following books describe the TMS320C6x devices and related support
tools. To obtain a copy of any of these TI documents, call the Texas Instru-
ments Literature Response Center at (800) 477-8924. When ordering, please
identify the book by its title and literature number. Many of these documents
can be found on the Internet at http://www.ti.com.
iv
Trademarks
and program memories, the external memory interface (EMIF), the host
port interface (HPI), multichannel buffered serial ports (McBSPs), direct
memory access (DMA), enhanced DMA (EDMA), expansion bus,
clocking and phase-locked loop (PLL), and the power-down modes.
Trademarks
TMS320C6000, TMS320C64x, TMS320C62x, and Code Composer Studio
are trademarks of Texas Instruments.
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
Provides a brief introduction to the TI C64x DSPLIB, shows the organization of the routines con-
tained in the library, and lists the features and benefits of the DSPLIB.
1.1 Introduction to the TI C64x DSPLIB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
1.2 Features and Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4
vii
Contents
DSP_radix2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-9
DSP_r4fft . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-11
DSP_fft . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-14
DSP_fft16x16r . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-23
DSP_fft16x16t . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-33
DSP_fft16x32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-47
DSP_fft32x32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-49
DSP_fft32x32s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-51
DSP_ifft16x32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-53
DSP_ifft32x32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-55
4.4 Filtering and Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-57
DSP_fir_cplx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-57
DSP_fir_gen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-59
DSP_fir_r4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-61
DSP_fir_r8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-63
DSP_fir_sym . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-65
DSP_iir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-67
DSP_iirlat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-69
4.5 Math . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-71
DSP_dotp_sqr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-71
DSP_dotprod . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-73
DSP_maxval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-75
DSP_maxidx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-76
DSP_minval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-78
DSP_mul32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-79
DSP_neg32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-81
DSP_recip16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-82
DSP_vecsumsq . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-84
DSP_w_vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-85
4.6 Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-86
DSP_mat_mul . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-86
DSP_mat_trans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-88
4.7 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-89
DSP_bexp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-89
DSP_blk_eswap16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-91
DSP_blk_eswap32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-93
DSP_blk_eswap64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-95
DSP_blk_move . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-97
DSP_fltoq15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-98
DSP_minerror . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-99
DSP_q15tofl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-101
viii
Contents
C Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1
Defines terms and abbreviations used in this book.
Contents ix
Tables
Tables
2−1 DSPLIB Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-4
3−1 Argument Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
3−2 Adaptive Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
3−3 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
3−4 FFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
3−5 Filtering and Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
3−6 Math . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
3−7 Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
3−8 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
A−1 Q3.12 Bit Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A−2 Q.15 Bit Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A−3 Q.31 Low Memory Location Bit Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
A−4 Q.31 High Memory Location Bit Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
x
Chapter 1
Introduction
Topic Page
1-1
Introduction to the TI C64x DSPLIB
The TI DSPLIB includes commonly used DSP routines. Source code is pro-
vided that allows you to modify functions to match your specific needs.
The routines contained in the library are organized into the following seven dif-
ferent functional categories:
- Adaptive filtering
J DSP_firlms2
- Correlation
J DSP_autocor
- FFT
1-2
Introduction to the TI C64x DSPLIB
- Math
J DSP_dotp_sqr
J DSP_dotprod
J DSP_maxval
J DSP_maxidx
J DSP_minval
J DSP_mul32
J DSP_neg32
J DSP_recip16
J DSP_vecsumsq
J DSP_w_vec
- Matrix
J DSP_mat_mul
J DSP_mat_trans
- Miscellaneous
J DSP_bexp
J DSP_blk_eswap16
J DSP_blk_eswap32
J DSP_blk_eswap64
J DSP_blk_move
J DSP_fltoq15
J DSP_minerror
J DSP_q15tofl
Introduction 1-3
Features and Benefits
1-4
Chapter 2
This chapter provides information on how to install and rebuild the TI C64x
DSPLIB.
Topic Page
2-1
How to Install DSPLIB
Note:
You should read the README.txt file for specific details of the release.
The lib directory contains the library archive and the source archive. Please
install the contents of the lib directory in a directory pointed by your C_DIR en-
vironment. If you choose to install the contents in a different directory, make
2-2
How to Install DSPLIB
sure you update the C_DIR environment variable, for example, by adding the
following line in autoexec.bat file:
SET C_DIR=<install_dir>/lib;<install_dir>/include;%C_DIR%
or under Unix/csh:
setenv C_DIR ”<install_dir>/lib;<install_dir>/include;
$C_DIR”
If you set up a project under Code Composer Studio, you could add DSPLIB by
choosing dsp64x.lib from the menu Project → Add Files to Project. Also, you
should make sure that you link with the correct run-time support library and
DSPLIB by having the following lines in your linker command file:
−lrts6400.lib
−ldsp64x.lib
The include directory contains the header files necessary to be included in the
C code when you call a DSPLIB function from C code.
Size
Name (bits) Type Minimum Maximum
short 16 integer −32768 32767
TI DSPLIB functions typically operate over vector operands for greater effi-
ciency. Even though these routines can be used to process short arrays, or
even scalars (unless a minimum size requirement is noted), they will be slower
for these cases.
2-4
Using DSPLIB
In addition to correctly installing the DSPLIB software, you must follow these
steps to include a DSPLIB function in your code:
- Use a correct linker command file for the platform you use. Remember
most functions in dsp64x.lib are written assuming little-endian mode of op-
eration.
For example, if you want to call the Autocorrelation DSPLIB function, you
would add:
#include <DSP_autocor.h>
The C64x DSPLIB functions were written to be used from C. Calling the func-
tions from assembly language source code is possible as long as the calling
function conforms to the Texas Instruments C64x C compiler calling conven-
tions. For more information, refer to Section 8 (Runtime Environment) of
TMS320C6000 Optimizing C Compiler User’s Guide (SPRU187).
DSPLIB is tested under the Code Composer Studio environment against a ref-
erence C implementation. You can expect identical results between Reference
C implementation and its Assembly implementation when using test routines
that deal with fixed-point type results. The test routines that deal with floating
points typically allow an error margin of 0.000001 when comparing the results
of reference C code and DSPLIB assembly code.
In FFT functions, twiddle factors are generated with a fixed scale factor, i.e.,
32767(=2^15−1) for all 16-bit FFT functions, 1073741823(=2^30−1) for
DSP_fft32x32s, 2147483647(=2^31−1) for all other 32-bit FFT functions.
Twiddle factors cannot be scaled further to not scale input data. Since
DSP_fft16x16r and DSP_fft32x32s performs scaling by 2 at each radix-4
stage, the input data must be scaled by 2^ (log2(nx)−cei[log4(nx)−1]) to com-
pletely prevent overflow. In all other FFT functions, the input data must be
scaled by 2^ (log2(nx)) since no scaling is done by the functions.
2-6
Using DSPLIB
All of the functions in this library are designed to be used in systems with inter-
rupts. That is, it is not necessary to disable interrupts when calling any of these
functions. The functions in the library will disable interrupts as needed to pro-
tect the execution of code in tight loops and so on. Functions in this library fall
into three categories:
Note that all three categories of function tolerate interrupts. That is, an interrupt
can occur at any time without affecting the correctness of the function. The in-
terruptibility of the function only determines how long the kernel might delay
the processing of the interrupt.
2-8
Chapter 3
This chapter provides tables containing all DSPLIB functions, a brief descrip-
tion of each, and a page reference for more detailed information.
Topic Page
3-1
Arguments and Conventions Used
Argument Description
x,y Argument reflecting input data vector
nx,ny,nr Arguments reflecting the size of vectors x,y, and r, respectively. For
functions in the case nx = ny = nr, only nx has been used across.
3-2
DSPLIB Functions
- Adaptive filtering
- Correlation
- FFT
- Math
- Matrix functions
- Miscellaneous
void DSP_radix2 (int nx, short *x, short *w) Complex Forward FFT (radix 2) 4-9
void DSP_r4fft (int nx, short *x, short *w) Complex Forward FFT (radix 4) 4-11
void DSP_fft(short *w, int nx, short *x, short *y) Complex out of place, Forward 4-14
FFT (radix4) with digit reversal.
void DSP_fft16x16r(int nx, short *x, short *w, unsigned Cache-optimized mixed radix FFT 4-23
char *brev, short *y, int radix, int offset, int n_max) with scaling and rounding, digit
reversal, out of place. Input and
output: 16 bits, Twiddle factor: 16
bits
void DSP_fft16x16t(short *w, int nx, short *x, short *y) Mixed radix FFT with truncation, 4-33
digit reversal, out of place. Input
and output: 16 bits, Twiddle fac-
tor: 16 bits
void DSP_fft16x32(short *w, int nx, int *x, int *y) Extended precision, mixed radix 4-47
FFT, rounding, digit reversal, out
of place. Input and output: 32 bits,
Twiddle factor: 16 bits
void DSP_fft32x32(int *w, int nx, int *x, int *y) Extended precision, mixed radix 4-49
FFT, rounding, digit reversal, out
of place. Input and output: 32 bits,
Twiddle factor: 32 bits
void DSP_fft32x32s(int *w, int nx, int *x, int *y) Extended precision, mixed radix 4-51
FFT, digit reversal, out of place.,
with scaling and rounding. Input
and output: 32 bits, Twiddle fac-
tor: 32 bits
3-4
DSPLIB Function Tables
void DSP_ifft32x32(int *w, int nx, int *x, int *y) Extended precision, mixed radix 4-55
IFFT, digit reversal, out of place.,
with scaling and rounding. Input
and output: 32 bits, Twiddle fac-
tor: 32 bits
void DSP_fir_gen (short *x, short *h, short *r, int nh, int nr) FIR Filter (any nh) 4-59
void DSP_fir_r4 (short *x, short *h, short *r, int nh, int nr) FIR Filter (nh is a multiple of 4) 4-61
void DSP_fir_r8 (short *x, short *h, short *r, int nh, int nr) FIR Filter (nh is a multiple of 8) 4-63
void DSP_fir_sym (short *x, short *h, short *r, int nh, int nr, Symmetric FIR Filter (nh is a mul- 4-65
int s) tiple of 8)
void DSP_iir(short *r1, short *x, short *r2, short *h2, short IIR with 5 Coefficients 4-67
*h1, int nr)
void DSP_iirlat(short *x, int nx, short *k, int nk, int *b, All−pole IIR Lattice Filter 4-69
short *r)
int DSP_dotprod(short *x, short *y, int nx) Vector Dot Product 4-73
short DSP_maxval (short *x, int nx) Maximum Value of a Vector 4-75
int DSP_maxidx (short *x, int nx) Index of the Maximum Element of 4-76
a Vector
short DSP_minval (short *x, int nx) Minimum Value of a Vector 4-78
void DSP_mul32(int *x, int *y, int *r, short nx) 32-bit Vector Multiply 4-79
void DSP_neg32(int *x, int *r, short nx) 32-bit Vector Negate 4-81
void DSP_recip16 (short *x, short *rfrac, short *rexp, short 16-bit Reciprocal 4-82
nx)
void DSP_w_vec(short *x, short *y, short m, short *r, short Weighted Vector Sum 4-85
nr)
void DSP_mat_trans(short *x, short rows, short columns, Matrix Transpose 4-88
short *r)
3-6
DSPLIB Function Tables
void DSP_blk_eswap16(void *x, void *r, int nx) Endian-swap a block of 16-bit 4-91
values
void DSP_blk_eswap32(void *x, void *r, int nx) Endian-swap a block of 32-bit 4-93
values
void DSP_blk_eswap64(void *x, void *r, int nx) Endian-swap a block of 64-bit 4-95
values
void DSP_blk_move(short *x, short *r, int nx) Move a Block of Memory 4-97
void DSP_fltoq15 (float *x,short *r, short nx) Float to Q15 Conversion 4-98
int DSP_minerror (short *GSP0_TABLE,short *errCoefs, Minimum Energy Error Search 4-99
int *savePtr_ret)
void DSP_q15tofl (short *x, float *r, short nx) Q15 to Float Conversion 4-101
DSPLIB Reference
This chapter provides a list of the functions within the DSP library (DSPLIB)
organized into functional categories. The functions within each category are
listed in alphabetical order and include arguments, descriptions, algorithms,
benchmarks, and special requirements.
Topic Page
4-1
DSP_firlms2
Function long DSP_firlms2(short * restrict h, const short * restrict x, short b, int nh)
Description The Least Mean Square Adaptive Filter computes an update of all nh coeffi-
cients by adding the weighted error times the inputs to the original coefficients.
The input array includes the last nh inputs followed by a new single sample
input. The coefficient array includes nh coefficients.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
long DSP_firlms2(short h[ ],short x[ ], short b,
int nh)
{
int i;
long r = 0;
for (i = 0; i < nh; i++) {
h[i] += (x[i] * b) >> 15;
r += x[i + 1] * h[i];
}
return r;
}
Special Requirements
- This routine assumes 16-bit input and output.
4-2
DSP_firlms2
Implementation Notes
- The loop is unrolled 4 times.
4.2 Correlation
DSP_autocor Autocorrelation
Function void DSP_autocor(short * restrict r, const short * restrict x, int nx, int nr)
Description This routine accepts an input array of length nx + nr and performs nr autocor-
relations each of length nx producing nr output results. This is typically used
in VSELP code.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_autocor(short r[ ],short x[ ], int nx, int nr)
{
int i,k,sum;
for (i = 0; i < nr; i++){
sum = 0;
for (k = nr; k < nx+nr; k++)
sum += x[k] * x[k−i];
r[i] = (sum >> 15);
}
}
Special Requirements
- nx must be a multiple of 8.
- nr must be a multiple of 4.
Implementation Notes
- The inner loop is unrolled 8 times.
4-4
DSP_autocor
- The outer loop is conditionally executed in parallel with the inner loop. This
allows for a zero overhead outer loop.
4.3 FFT
NOTE: This function is provided for backward compatibility with the C62x
DSPLIB. It has not been optimized for the C64x architecture. The user is ad-
vised to use one of the newly added FFT functions which have been optimized
for the C64x.
Description This function bit-reverses the position of elements in complex vector x. This
function is used in conjunction with FFT routines to provide the correct format
for the FFT input or output data. The bit-reversal of a bit-reversed order array
yields a linear-order array.
Algorithm TI retains all rights, title and interest in this code and only authorizes the use
of this code on TI TMS320 DSPs manufactured by TI. This is the C equivalent
of the assembly code without restrictions. Note that the assembly code is hand
optimized and restrictions may apply.
void DSP_bitrev_cplx (int *x, short *index, int nx)
{
int i;
short i0, i1, i2, i3;
short j0, j1, j2, j3;
int xi0, xi1, xi2, xi3;
int xj0, xj1, xj2, xj3;
short t;
int a, b, ia, ib, ibs;
int mask;
int nbits, nbot, ntop, ndiff, n2, halfn;
short *xs = (short *) x;
4-6
DSP_bitrev_cplx
nbits = 0;
i = nx;
while (i > 1){
i = i >> 1;
nbits++;}
nbot = nbits >> 1;
ndiff = nbits & 1;
ntop = nbot + ndiff;
n2 = 1 << ntop;
mask = n2 − 1;
halfn = nx >> 1;
for (i0 = 0; i0 < halfn; i0 += 2) {
b = i0 & mask;
a = i0 >> nbot;
if (!b) ia = index[a];
ib = index[b];
ibs = ib << nbot;
j0 = ibs + ia;
t = i0 < j0;
xi0 = x[i0];
xj0 = x[j0];
if (t){x[i0] = xj0;
x[j0] = xi0;}
i1 = i0 + 1;
j1 = j0 + halfn;
xi1 = x[i1];
xj1 = x[j1];
x[i1] = xj1;
x[j1] = xi1;
i3 = i1 + halfn;
j3 = j1 + 1;
xi3 = x[i3];
xj3 = x[j3];
if (t){x[i3] = xj3;
x[j3] = xi3;}
}
}
Special Requirements
- nx must be a power of 2.
- If nx ≤ 4K, one can use the char (8-bit) data type for the “index” variable.
This would require changing the LDH when loading index values in the as-
sembly routine to LDB. This would further reduce the size of the Index
Table by half its size.
Implementation Notes
- Endian: The code is LITTLE ENDIAN.
Benchmarks The performance of this function has not yet been characterized on the C64x.
4-8
DSP_radix2
NOTE: This function is provided for backward compatibility with the C62x
DSPLIB. It has not been optimized for the C64x architecture. The user is ad-
vised to use one of the newly added FFT functions which have been optimized
for the C64x.
Function void DSP_radix2 (int nx, short * restrict x, const short * restrict w)
Description This routine is used to compute FFT of a complex sequence of size nx, a power
of 2, with “decimation-in-frequency decomposition” method. The output is in
bit-reversed order. Each complex value is with interleaved 16-bit real and
imaginary parts.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_radix2 (short x[ ],short nx,short w[ ])
{
short n1,n2,ie,ia,i,j,k,l;
short xt,yt,c,s;
n2 = nx;
ie = 1;
for (k=nx; k > 1; k = (k >> 1) ) {
n1 = n2;
n2 = n2>>1;
ia = 0;
for (j=0; j < n2; j++) {
c = w[2*ia];
s = w[2*ia+1];
ia = ia + ie;
for (i=j; i < nx; i += n1) {
l = i + n2;
xt = x[2*l] − x[2*i];
x[2*i] = x[2*i] + x[2*l];
yt = x[2*l+1] − x[2*i+1];
x[2*i+1] = x[2*i+1] + x[2*l+1];
x[2*l] = (c*xt + s*yt)>>15;
x[2*l+1] = (c*yt − s*xt)>>15;
}
}
ie = ie<<1;
}
}
Special Requirements
- 2 ≤ nx ≤ 32768 (nx is a power of 2)
- The FFT coefficients (twiddle factors) are generated using the program
tw_radix2 provided in the directory ‘support\fft’.
Implementation Notes
- Loads input x and coefficient w as words.
- Both loops j and i0 shown in the C code are placed in the INNERLOOP of
the assembly code.
Benchmarks The performance of this function has not yet been characterized on the C64x.
4-10
DSP_r4fft
NOTE: This function is provided for backward compatibility with the C62x
DSPLIB. It has not been optimized for the C64x architecture. The user is ad-
vised to use one of the newly added FFT functions which have been optimized
for the C64x.
Function void DSP_r4fft (int nx, short * restrict x, const short * restrict w)
Description This routine is used to compute FFT of a complex sequence size nx, a power
of 4, with “decimation-in-frequency decomposition” method. The output is in
digit-reversed order. Each complex value is with interleaved 16-bit real and
imaginary parts.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_r4fft (int nx, short x[ ], short w[ ])
{
int n1, n2, ie, ia1, ia2, ia3, i0, i1, i2, i3,
j, k;
short t, r1, r2, s1, s2, co1, co2, co3, si1,
si2, si3;
n2 = nx;
ie = 1;
for (k = nx; k > 1; k >>= 2) {
n1 = n2;
n2 >>= 2;
ia1 = 0;
for (j = 0; j < n2; j++) {
ia2 = ia1 + ia1;
ia3 = ia2 + ia1;
co1 = w[ia1 * 2 + 1];
4-12
DSP_r4fft
>>15;
x[2 * i3 + 1] = (s2 * co3−r2 *
si3)>>15;
}
}
ie <<= 2;
}
}
Special Requirements
- 4 ≤ nx ≤ 65536 (nx a power of 4)
- The FFT coefficients (twiddle factors) are generated using the program
tw_r4fft provided in the directory ‘support\fft’.
Implementation Notes
- Loads input x and coefficient w as words.
- Both loops j and i0 shown in the C code are placed in the INNERLOOP of
the assembly code.
Benchmarks The performance of this function has not yet been characterized on the C64x.
Function void DSP_fft (const short * restrict w, int nx, short * restrict x, short * restrict y)
Description This routine is used to compute an FFT of a complex sequence of size nx, a
power of 4, with “decimation-in-frequency decomposition” method. The output
is returned in a separate array y in normal order. This routine also performs
digit reversal as a special last step. Each complex value is stored as inter-
leaved 16-bit real and imaginary parts. The code uses a special ordering of
FFT factors and memory accesses to improve performance in the presence
of cache.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* The following macro is used to obtain a digit reversed index, of a given */
/* number i, into j where the number of bits in ”i” is ”m”. For the natural */
/* form of C code, this is done by first interchanging every set of ”2 bit” */
/* pairs, followed by exchanging nibbles, followed by exchanging bytes, and */
/* finally halfwords. To give an example, condider the following number: */
/* */
/* N = FEDCBA9876543210, where each digit represents a bit, the following */
/* steps illustrate the changes as the exchanges are performed: */
/* M = DCFE98BA54761032 is the number after every ”2 bits” are exchanged. */
/* O = 98BADCFE10325476 is the number after every nibble is exchanged. */
/* P = 1032547698BADCFE is the number after every byte is exchanged. */
/* Since only 16 digits were considered this represents the digit reversed */
/* index. Since the numbers are represented as 32 bits, there is one more */
4-14
DSP_fft
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
#ifndef NOASSUME
_nassert((int)x % 8 == 0);
_nassert((int)y % 8 == 0);
_nassert((int)w % 8 == 0);
_nassert(n >= 16);
_nassert(n < 32768);
#endif
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
/* Perform initial stages of FFT in place w/out digit reversal. */
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
#ifndef NOASSUME
#pragma MUST_ITERATE(1,,1);
#endif
for (stride = n, t = 0; stride > 4; stride >>= 2)
{
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
/* Perform each of the butterflies for this particular stride. */
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
s = stride >> 2;
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* stride represents the seperation between the inputs of the radix */
/* 4 butterfly. The C code breaks the FFT, into two cases, one when */
/* the stride between the elements is greater than 4, other when */
/* the stride is less than 4. Since stride is greater than 16, it */
/* can be guranteed that ”s” is greater than or equal to 4. */
/* In addition it can also be shown that the loop that shares this */
/* stride will iterate at least once. The number of times this */
/* loop iterates depends on how many butterflies in this stage */
/* share a twiddle factor. */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
#ifndef NOASSUME
_nassert(stride >= 16);
_nassert(s >= 4);
#pragma MUST_ITERATE(1,,1);
4-16
DSP_fft
#endif
for (i = 0; i < n; i += stride)
{
#ifndef NOASSUME
_nassert(i % 4 == 0);
_nassert(s >= 4);
#pragma MUST_ITERATE(2,,2);
#endif
for (j = 0; j < s; j += 2)
{
for (k = 0; k < 2; k++)
{
short w1c, w1s, w2c, w2s, w3c, w3s;
short x0r, x0i, x1r, x1i, x2r, x2i, x3r, x3i;
short y0r, y0i, y1r, y1i, y2r, y2i, y3r, y3i;
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
/* Read the four samples that are the input to this */
/* particular butterfly. */
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
x0r = x[2*(i+j+k ) + 0]; x0i = x[2*(i+j+k ) + 1];
x1r = x[2*(i+j+k + s) + 0]; x1i = x[2*(i+j+k + s) + 1];
x2r = x[2*(i+j+k + 2*s) + 0]; x2i = x[2*(i+j+k + 2*s) + 1];
x3r = x[2*(i+j+k + 3*s) + 0]; x3i = x[2*(i+j+k + 3*s) + 1];
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
/* Read the six twiddle factors that are needed for 3 */
/* of the four outputs. (The first output has no mpys.) */
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
w1s = w[t + 2*k + 6*j + 0]; w1c = w[t + 2*k + 6*j + 1];
w2s = w[t + 2*k + 6*j + 4]; w2c = w[t + 2*k + 6*j + 5];
w3s = w[t + 2*k + 6*j + 8]; w3c = w[t + 2*k + 6*j + 9];
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
/* Calculate the four outputs, remembering that radix4 */
/* FFT accepts 4 inputs and produces 4 outputs. If we */
/* imagine the inputs as being complex, and look at the */
/* first stage as an example: */
/* */
4-18
DSP_fft
y0r = xt0;
y0i = yt0;
y1r = (xt1 * w1c + yt1 * w1s) >> 15;
y1i = (yt1 * w1c − xt1 * w1s) >> 15;
y2r = (xt2 * w2c + yt2 * w2s) >> 15;
y2i = (yt2 * w2c − xt2 * w2s) >> 15;
y3r = (xt3 * w3c + yt3 * w3s) >> 15;
y3i = (yt3 * w3c − xt3 * w3s) >> 15;
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
/* Store the final results back to the input array. */
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
/* Offset to next subtable of twiddle factors. With each iteration */
/* of the above block, six twiddle factors get read, s times, */
/* hence the offset into the twiddle factor array is advanved by */
/* this amount. */
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
t += 6 * s;
}
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
/* Get the magnitude of ”n”, so we know how many digits to reverse. */
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
for (i = 31, m = 1; (n & (1 << i)) == 0; i−−, m++) ;
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
/* Perform final stage with digit reversal. */
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
s = n >> 2;
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* One of the nice features, of this last stage are that, no multiplies */
/* are required. In addition the data always strides by a fixed amount */
/* namely 8 elements. Since the data is stored as interleaved pairs, of */
/* real and imaginary data, the first eight elements contain the data */
/* for the first four complex inputs. These are the inputs to the first */
/* radix4 butterfly. */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
#ifndef NOASSUME
#pragma MUST_ITERATE(4,,4);
#endif
for (i = 0; i < n; i += 4)
4-20
DSP_fft
{
short x0r, x0i, x1r, x1i, x2r, x2i, x3r, x3i;
short y0r, y0i, y1r, y1i, y2r, y2i, y3r, y3i;
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
/* Read the four samples that are the input to this butterfly. */
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
x0r = x[2*(i + 0) + 0]; x0i = x[2*(i + 0) + 1];
x1r = x[2*(i + 1) + 0]; x1i = x[2*(i + 1) + 1];
x2r = x[2*(i + 2) + 0]; x2i = x[2*(i + 2) + 1];
x3r = x[2*(i + 3) + 0]; x3i = x[2*(i + 3) + 1];
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
/* Calculate the final FFT result from this butterfly. */
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
y0r = (x0r + x2r) + (x1r + x3r);
y0i = (x0i + x2i) + (x1i + x3i);
y1r = (x0r − x2r) + (x1i − x3i);
y1i = (x0i − x2i) − (x1r − x3r);
y2r = (x0r + x2r) − (x1r + x3r);
y2i = (x0i + x2i) − (x1i + x3i);
y3r = (x0r − x2r) − (x1i − x3i);
y3i = (x0i − x2i) + (x1r − x3r);
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
/* Digit reverse our address to convert the digit−reversed input */
/* into a linearized output order. This actually results in a */
/* digit−reversed store pattern since we’re loading linearly, but */
/* the end result is that the FFT bins are in linear order. */
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
DIG_REV(i, m, j); /* Note: Result is assigned to ’j’ by the macro. */
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
/* Store out the final FFT results. */
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
y[2*(j + 0) + 0] = y0r; y[2*(j + 0) + 1] = y0i;
y[2*(j + s) + 0] = y1r; y[2*(j + s) + 1] = y1i;
y[2*(j + 2*s) + 0] = y2r; y[2*(j + 2*s) + 1] = y2i;
y[2*(j + 3*s) + 0] = y3r; y[2*(j + 3*s) + 1] = y3i;
}
}
Special Requirements
- In-place computation is not allowed.
- Input data x[ ] is stored in the order real0, img0, real1, img1, ...
Implementation Notes
- Loads input x[ ] and coefficient w[ ] as double words.
- Both loops j and i0 shown in the C code are placed in the inner loop of the
assembly code.
4-22
DSP_fft16x16r
DSP_fft16x16r Complex Forward Mixed Radix 16- x 16-bit FFT With Rounding
Function void DSP_fft16x16r(int nx, short * restrict x, const short * restrict w, const un-
signed char * restrict brev, short * restrict y, int radix, int offset, int nmax)
brev[64] Pointer to bit reverse table containing 64 entries. Only required for
C code. Use NULL for assembly code since BITR instruction
is used instead.
Description This routine implements a cache-optimized complex forward mixed radix FFT
with scaling, rounding and digit reversal. Input data x[ ], output data y[ ] and
coefficients w[ ] are 16-bit. The output is returned in the separate array y[ ] in
normal order. Each complex value is stored as interleaved 16-bit real and
imaginary parts. The code uses a special ordering of FFT coefficients (also
called twiddle factors).
This redundant set of twiddle factors is size 2*N short samples. As pointed out
later dividing these twiddle factors by 2 will give an effective divide by 4 at each
stage to guarantee no overflow. The function is accurate to about 68dB of sig-
nal to noise ratio to the DFT function below:
void dft(int n, short x[], short y[])
{
int k,i, index;
const double PI = 3.14159654;
short * p_x;
double arg, fx_0, fx_1, fy_0, fy_1, co, si;
Scaling by 2 (i.e., >>1) takes place at each radix−4 stage except the last one.
A radix−4 stage could give a maximum bit−growth of 2 bits, which would re-
quire scaling by 4. To completely prevent overflow, the input data must be
scaled by 2^(BT−BS), where BT (total number of bit growth) = log2(nx) and BS
(number of scales by the functions) = ceil[log4(nx)−1]. All shifts are rounded
to reduce truncation noise power by 3dB.
The function takes the twiddle factors and input data, and calculates the FFT
producing the frequency domain data in the y[ ] array. As the FFT allows every
input point to effect every output point, in a cache based system this causes
cache thrashing. This is mitigated by allowing the main FFT of size N to be di-
vided into several steps, allowing as much data reuse as possible. For exam-
ple the following function:
is equivalent to:
4-24
DSP_fft16x16r
Notice how the 1st FFT function is called on the entire 1K data set it covers the
1st pass of the FFT until the butterfly size is 256.
The following 4 FFTs do 256-point FFTs 25% of the size. These continue down
to the end when the butterfly is of size 4. They use an index to the main twiddle
factor array of 0.75*2*N. This is because the twiddle factor array is composed
of successively decimated versions of the main array.
N not equal to a power of 4 can be used, i.e. 512. In this case to decompose
the FFT the following would be needed :
is equivalent to:
DSP_fft16x16r(512, &x[0], &w[0], y,brev,128, 0,512);
DSP_fft16x16r(128, &x[2*0], &w[2*384],y,brev,2, 0,512);
DSP_fft16x16r(128, &x[2*128],&w[2*384],y,brev,2, 128,512);
DSP_fft16x16r(128, &x[2*256],&w[2*384],y,brev,2, 256,512);
DSP_fft16x16r(128, &x[2*384],&w[2*384],y,brev,2, 384,512);
The twiddle factor array is composed of log4(N) sets of twiddle factors, (3/4)*N,
(3/16)*N, (3/64)*N, etc. The index into this array for each stage of the FFT is
calculated by summing these indices up appropriately. For multiple FFTs they
can share the same table by calling the small FFTs from further down in the
twiddle factor array, in the same way as the decomposition works for more data
reuse.
Thus, the above decomposition can be summarized for a general N, radix “rad”
as follows:
DSP_fft16x16r(N, &x[0], &w[0], brev,y,N/4,0, N)
DSP_fft16x16r(N/4,&x[0], &w[2*3*N/4],brev,y,rad,0, N)
DSP_fft16x16r(N/4,&x[2*N/4], &w[2*3*N/4],brev,y,rad,N/4, N)
DSP_fft16x16r(N/4,&x[2*N/2], &w[2*3*N/4],brev,y,rad,N/2, N)
DSP_fft16x16r(N/4,&x[2*3*N/4],&w[2*3*N/4],brev,y,rad,3*N/4,N)
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void fft16x16r
(
int n,
short *ptr_x,
short *ptr_w,
unsigned char *brev,
short *y,
int radix,
int offset,
int nmax
)
{
int i, l0, l1, l2, h2, predj;
int l1p1,l2p1,h2p1, tw_offset, stride, fft_jmp;
short xt0, yt0, xt1, yt1, xt2, yt2;
short si1,si2,si3,co1,co2,co3;
short xh0,xh1,xh20,xh21,xl0,xl1,xl20,xl21;
short x_0, x_1, x_l1, x_l1p1, x_h2 , x_h2p1, x_l2, x_l2p1;
4-26
DSP_fft16x16r
short *x,*w;
short *ptr_x0, *ptr_x2, *y0;
unsigned int j, k, j0, j1, k0, k1;
short x0, x1, x2, x3, x4, x5, x6, x7;
short xh0_0, xh1_0, xh0_1, xh1_1;
short xl0_0, xl1_0, xl0_1, xl1_1;
short yt3, yt4, yt5, yt6, yt7;
unsigned a, num;
stride = n; /* n is the number of complex samples */
tw_offset = 0;
while (stride > radix)
{
j = 0;
fft_jmp = stride + (stride>>1);
h2 = stride>>1;
l1 = stride;
l2 = stride + (stride>>1);
x = ptr_x;
w = ptr_w + tw_offset;
for (i = 0; i < n>>1; i += 2)
{
co1 = w[j+0];
si1 = w[j+1];
co2 = w[j+2];
si2 = w[j+3];
co3 = w[j+4];
si3 = w[j+5];
j += 6;
x_0 = x[0];
x_1 = x[1];
x_h2 = x[h2];
x_h2p1 = x[h2+1];
x_l1 = x[l1];
x_l1p1 = x[l1+1];
x_l2 = x[l2];
x_l2p1 = x[l2+1];
4-28
DSP_fft16x16r
j = offset>>2;
ptr_x0 = ptr_x;
y0 = y;
/* determine _norm(nmax) − 17 */
l0 = 31;
if (((nmax>>31)&1)==1)
num = ~nmax;
else
num = nmax;
if (!num)
l0 = 32;
else
{
a=num&0xFFFF0000; if (a) { l0−=16; num=a; }
a=num&0xFF00FF00; if (a) { l0−= 8; num=a; }
a=num&0xF0F0F0F0; if (a) { l0−= 4; num=a; }
a=num&0xCCCCCCCC; if (a) { l0−= 2; num=a; }
a=num&0xAAAAAAAA; if (a) { l0−= 1; }
}
l0 −= 1;
l0 −= 17;
if(radix == 2 || radix == 4)
for (i = 0; i < n; i += 4)
{
/* reversal computation */
j0 = (j ) & 0x3F;
j1 = (j >> 6) & 0x3F;
k0 = brev[j0];
k1 = brev[j1];
k = (k0 << 6) | k1;
if (l0 < 0)
k = k << −l0;
else
k = k >> l0;
j++; /* multiple of 4 index */
x0 = ptr_x0[0]; x1 = ptr_x0[1];
x2 = ptr_x0[2]; x3 = ptr_x0[3];
x4 = ptr_x0[4]; x5 = ptr_x0[5];
x6 = ptr_x0[6]; x7 = ptr_x0[7];
ptr_x0 += 8;
xh0_0 = x0 + x4;
xh1_0 = x1 + x5;
xh0_1 = x2 + x6;
xh1_1 = x3 + x7;
if (radix == 2)
{
xh0_0 = x0;
xh1_0 = x1;
xh0_1 = x2;
xh1_1 = x3;
}
4-30
DSP_fft16x16r
if (radix == 2)
{
yt7 = xl1_0 − xl0_1;
yt3 = xl1_0 + xl0_1;
}
y0[k] = yt0; y0[k+1] = yt1;
k += n>>1;
y0[k] = yt2; y0[k+1] = yt3;
k += n>>1;
y0[k] = yt4; y0[k+1] = yt5;
k += n>>1;
y0[k] = yt6; y0[k+1] = yt7;
}
}
Special Requirements
- In-place computation is not allowed.
- nx must be a power of 2 or 4.
- All data are in short precision or Q.15 format. Allowed input dynamic range
is 16 − (log2(nx)−ceil[log4(nx)−1]).
- The FFT coefficients (twiddle factors) are generated using the program
tw_fft16x16 provided in the directory ‘support\fft’. The scale factor must be
32767.5. The input data must be scaled by 2^ (log2(nx)−ceil[log4(nx)−1])
to completely prevent overflow.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- The routine uses log4(nx) − 1 stages of radix-4 transform and performs ei-
ther a radix-2 or radix-4 transform on the last stage depending on nx. If nx
- The butterfly is bit reversed, i.e. the inner 2 points of the butterfly are
crossed over, this has the effect of making the data come out in bit re-
versed rather than in radix 4 digit reversed order. This simplifies the last
pass of the loop. The BITR instruction is used to do the bit reversal out of
place.
4-32
DSP_fft16x16t
DSP_fft16x16t Complex Forward Mixed Radix 16- x 16-bit FFT With Truncation
Function void DSP_fft16x16t(const short * restrict w, int nx, short * restrict x, short * re-
strict y)
Description This routine computes a complex forward mixed radix FFT with truncation and
digit reversal. Input data x[ ], output data y[ ] and coefficients w[ ] are 16-bit.
The output is returned in the separate array y[ ] in normal order. Each complex
value is stored with interleaved real and imaginary parts. The code uses a spe-
cial ordering of FFT coefficients (also called twiddle factors) and memory ac-
cesses to improve performance in the presence of cache.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* The following macro is used to obtain a digit reversed index, of a given */
/* number i, into j where the number of bits in ”i” is ”m”. For the natural */
/* form of C code, this is done by first interchanging every set of ”2 bit” */
/* pairs, followed by exchanging nibbles, followed by exchanging bytes, and */
/* finally halfwords. To give an example, condider the following number: */
/* */
/* N = FEDCBA9876543210, where each digit represents a bit, the following */
/* steps illustrate the changes as the exchanges are performed: */
/* M = DCFE98BA54761032 is the number after every ”2 bits” are exchanged. */
/* O = 98BADCFE10325476 is the number after every nibble is exchanged. */
/* P = 1032547698BADCFE is the number after every byte is exchanged. */
/* Since only 16 digits were considered this represents the digit reversed */
/* index. Since the numbers are represented as 32 bits, there is one more */
/* step typically of exchanging the half words as well. */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
#if TMS320C6X
4-34
DSP_fft16x16t
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* Determine the magnitude od the number of points to be transformed. */
/* Check whether we can use a radix4 decomposition or a mixed radix */
/* transformation, by determining modulo 2. */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
for (i = 31, m = 1; (npoints & (1 << i)) == 0; i−−, m++) ;
radix = m & 1 ? 2 : 4;
norm = m − 2;
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* The stride is quartered with every iteration of the outer loop. It */
/* denotes the seperation between any two adjacent inputs to the butter */
/* −fly. This should start out at N/4, hence stride is initially set to */
/* N. For every stride, 6*stride twiddle factors are accessed. The */
/* ”tw_offset” is the offset within the current twiddle factor sub− */
/* table. This is set to zero, at the start of the code and is used to */
/* obtain the appropriate sub−table twiddle pointer by offseting it */
/* with the base pointer ”ptr_w”. */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
stride = npoints;
tw_offset = 0;
fft_jmp = 6 * stride;
while (stride > radix)
{
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* At the start of every iteration of the outer loop, ”j” is set */
/* to zero, as ”w” is pointing to the correct location within the */
/* twiddle factor array. For every iteration of the inner loop */
/* 6 * stride twiddle factors are accessed. For eg, */
/* */
/* #Iteration of outer loop # twiddle factors #times cycled */
/* 1 6 N/4 1 */
/* 2 6 N/16 4 */
/* ... */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
j = 0;
fft_jmp >>= 2;
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* Set up offsets to access ”N/4”, ”N/2”, ”3N/4” complex point or */
/* ”N/2”, ”N”, ”3N/2” half word */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
h2 = stride>>1;
l1 = stride;
l2 = stride + (stride >> 1);
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* Reset ”x” to point to the start of the input data array. */
/* ”tw_offset” starts off at 0, and increments by ”6 * stride” */
/* The stride quarters with every iteration of the outer loop */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
x = ptr_x;
w = ptr_w + tw_offset;
tw_offset += fft_jmp;
stride >>= 2;
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* The following loop iterates through the different butterflies, */
/* within a given stage. Recall that there are logN to base 4 */
/* stages. Certain butterflies share the twiddle factors. These */
/* are grouped together. On the very first stage there are no */
/* butterflies that share the twiddle factor, all N/4 butter− */
/* flies have different factors. On the next stage two sets of */
/* N/8 butterflies share the same twiddle factor. Hence after */
/* half the butterflies are performed, j the index into the */
/* factor array resets to 0, and the twiddle factors are reused. */
/* When this happens, the data pointer ’x’ is incremented by the */
/* fft_jmp amount. In addition the following code is unrolled to */
/* perform ”2” radix4 butterflies in parallel. */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
for (i = 0; i < npoints; i += 8)
{
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* Read the first 12 twiddle factors, six of which are used */
/* for one radix 4 butterfly and six of which are used for */
/* next one. */
4-36
DSP_fft16x16t
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
co10 = w[j+1]; si10 = w[j+0];
co11 = w[j+3]; si11 = w[j+2];
co20 = w[j+5]; si20 = w[j+4];
co21 = w[j+7]; si21 = w[j+6];
co30 = w[j+9]; si30 = w[j+8];
co31 = w[j+11]; si31 = w[j+10];
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* Read in the first complex input for the butterflies. */
/* 1st complex input to 1st butterfly: x[0] + jx[1] */
/* 1st complex input to 2nd butterfly: x[2] + jx[3] */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
x_0 = x[0]; x_1 = x[1];
x_2 = x[2]; x_3 = x[3];
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* Read in the complex inputs for the butterflies. Each of the*/
/* successive complex inputs of the butterfly are seperated */
/* by a fixed amount known as stride. The stride starts out */
/* at N/4, and quarters with every stage. */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
x_l1_0 = x[l1 ]; x_l1_1 = x[l1+1];
x_l1_2 = x[l1+2]; x_l1_3 = x[l1+3];
x_l2_0 = x[l2 ]; x_l2_1 = x[l2+1];
x_l2_2 = x[l2+2]; x_l2_3 = x[l2+3];
x_h2_0 = x[h2 ]; x_h2_1 = x[h2+1];
x_h2_2 = x[h2+2]; x_h2_3 = x[h2+3];
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* Two butterflies are evaluated in parallel. The following */
/* results will be shown for one butterfly only, although */
/* both are being evaluated in parallel. */
/* */
/* Perform DSP_radix2 style DIF butterflies. */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
xh0_0 = x_0 + x_l1_0; xh1_0 = x_1 + x_l1_1;
xh0_1 = x_2 + x_l1_2; xh1_1 = x_3 + x_l1_3;
xl0_0 = x_0 − x_l1_0; xl1_0 = x_1 − x_l1_1;
4-38
DSP_fft16x16t
4-40
DSP_fft16x16t
}
else
{
y1 = y0 + (int) (npoints >> 1);
y3 = y2 + (int) (npoints >> 1);
l1 = norm + 2;
j0 = 4;
n0 = npoints >> 2;
}
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* The following code reads data indentically for either a radix 4 */
/* or a radix 2 style decomposition. It writes out at different */
/* locations though. It checks if either half the points, or a */
/* quarter of the complex points have been exhausted to jump to */
/* pervent double reversal. */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
j = 0;
for (i = 0; i < npoints; i += 8)
{
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* Digit reverse the index starting from 0. The increment to ”j” */
/* is either by 4, or 8. */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
DIG_REV(j, l1, h2);
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* Read in the input data, from the first eight locations. These */
/* are transformed either as a radix4 or as a radix 2. */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
x_0 = x0[0]; x_1 = x0[1];
x_2 = x0[2]; x_3 = x0[3];
x_4 = x0[4]; x_5 = x0[5];
x_6 = x0[6]; x_7 = x0[7];
x0 += 8;
xh0_0 = x_0 + x_4; xh1_0 = x_1 + x_5;
xl0_0 = x_0 − x_4; xl1_0 = x_1 − x_5;
xh0_1 = x_2 + x_6; xh1_1 = x_3 + x_7;
4-42
DSP_fft16x16t
if (radix == 2)
{
n02 = x_8 + x_a; n03 = x_9 + x_b;
n22 = x_8 − x_a; n23 = x_9 − x_b;
n12 = x_c + x_e; n13 = x_d + x_f;
n32 = x_c − x_e; n33 = x_d − x_f;
}
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* Points that are read from succesive locations map to y, y[N/4] */
/* y[N/2], y[3N/4] in a radix4 scheme, y, y[N/8], y[N/2],y[5N/8] */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
y0[2*h2+2] = n02; y0[2*h2+3] = n03;
y1[2*h2+2] = n12; y1[2*h2+3] = n13;
y2[2*h2+2] = n22; y2[2*h2+3] = n23;
y3[2*h2+2] = n32; y3[2*h2+3] = n33;
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
/* Increment ”j” by ”j0”. If j equals n0, then increment both ”x0” */
/* and ”x2” so that double inversion is avoided. */
/*−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−*/
j += j0;
if (j == n0)
{
j += n0;
x0 += (int) npoints>>1;
x2 += (int) npoints>>1;
}
}
}
Special Requirements
- In-place computation is not allowed.
- The arrays for the complex input data x[ ], complex output data y[ ] and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary compo-
nents stored in adjacent locations in the array. The real components are
stored at even array indices, and the imaginary components are stored at
odd array indices. All data are in short precision or Q.15 format.
- The FFT coefficients (twiddle factors) are generated using the program
tw_fft16x16 provided in the directory ‘support\fft’. The scale factor must be
32767.5. No scaling is done with the function; thus the input data must be
scaled by 2^ log2(nx) to completely prevent overflow.
Implementation Notes
- Bank Conflicts: nx/8 bank conflicts occur.
The routine uses log4(nx) − 1 stages of radix-4 transform and performs either
a radix-2 or radix-4 transform on the last stage depending on nx. If nx is a pow-
er of 4,then this last stage is also a radix-4 transform, otherwise it is a radix-2
transform. The conventional Cooley Tukey FFT, is written using three loops.
The outermost loop “k” cycles through the stages. There are log N to the base
4 stages in all. The loop “j” cycles through the groups of butterflies with different
twiddle factors, loop “i” reuses the twiddle factors for the different butterflies
within a stage. It is interesting to note the following:
1 N/4 1 N/4
2 N/16 4 N/4
.. .. .. ..
1) Inner loop “i0” iterates a variable number of times. In particular the number
of iterations quadruples every time from 1..N/4. Hence software pipelining
a loop that iterates a variable number of times is not profitable.
2) Outer loop “j” iterates a variable number of times as well. However the
number of iterations is quartered every time from N/4 ..1. Hence the be-
havior in (a) and (b) are exactly opposite to each other.
3) If the two loops “i” and “j” are coalesced together then they will iterate for
a fixed number of times namely N/4. This allows us to combine the “i” and
“j” loops into 1 loop. Optimized implementations will make use of this fact.
4-44
DSP_fft16x16t
In addition the Cooley Tukey FFT accesses three twiddle factors per iteration
of the inner loop, as the butterflies that re-use twiddle factors are lumped to-
gether. This leads to accessing the twiddle factor array at three points each
separated by “ie”. Note that “ie” is initially 1, and is quadrupled with every itera-
tion. Therefore these three twiddle factors are not even contiguous in the array.
In order to vectorize the FFT, it is desirable to access twiddle factor array using
double word wide loads and fetch the twiddle factors needed. In order to do
this a modified twiddle factor array is created, in which the factors WN/4, WN/2,
W3N/4 are arranged to be contiguous. This eliminates the separation between
twiddle factors within a butterfly. However this implies that as the loop is tra-
versed from one stage to another, that we maintain a redundant version of the
twiddle factor array. Hence the size of the twiddle factor array increases as
compared to the normal Cooley Tukey FFT. The modified twiddle factor array
is of size “2 * N” where the conventional Cooley Tukey FFT is of size “3N/4”
where N is the number of complex points to be transformed. The routine that
generates the modified twiddle factor array was presented earlier. With the
above transformation of the FFT, both the input data and the twiddle factor
array can be accessed using double-word wide loads to enable packed data
processing.
The code shown here performs the bulk of the computation in place. However,
because digit-reversal cannot be performed in-place, the final result is written
to a separate array, y[].
There is one slight break in the flow of packed processing that needs to be
comprehended. The real part of the complex number is in the lower half, and
the imaginary part is in the upper half. The flow breaks in case of “xl0” and “xl1”
because in this case the real part needs to be combined with the imaginary part
because of the multiplication by “j”. This requires a packed quantity like
“xl21xl20” to be rotated as “xl20xl21” so that it can be combined using ADD2’s
and SUB2’s. Hence the natural version of C code shown below is transformed
using packed data processing as shown:
Also notice that xt1, yt1 end up on separate words, these need to be packed
together to take advantage of the packed twiddle factors that have been
loaded. In order for this to be achieved they are re-aligned as follows:
The packed words “yt1_xt1” allows the loaded “sc” twiddle factor to be used
for the complex multiplies. The real part of the complex multiply is implemented
using DOTP2. The imaginary part of the complex multiply is implemented us-
ing DOTPN2 after the twiddle factors are swizzled within the half word.
The actual twiddle factors for the FFT are cosine, − sine. The twiddle factors
stored in the table are cosine and sine, hence the sign of the ”sine” term is com-
prehended during multiplication as shown above.
4-46
DSP_fft16x32
DSP_fft16x32 Complex Forward Mixed Radix 16- x 32-bit FFT With Rounding
Function void DSP_fft16x32(const short * restrict w, int nx, int * restrict x, int * restrict y)
Description This routine computes an extended precision complex forward mixed radix
FFT with rounding and digit reversal. Input data x[ ] and output data y[ ] are
32-bit, coefficients w[ ] are 16-bit. The output is returned in the separate array
y[ ] in normal order. Each complex value is stored with interleaved real and
imaginary parts. The code uses a special ordering of FFT coefficients (also
called twiddle factors) and memory accesses to improve performance in the
presence of cache. The C code to generate the twiddle factors is the same as
the one used for the DSP_fft16x16r routine.
Algorithm The C equivalent of the assembly code without restrictions is similar to the one
shown for the DSP_fft16x16t routine. For further details refer to the source
code of the C version of this function which is provided with this library. Note
that the assembly code is hand optimized and restrictions may apply.
Special Requirements
- In-place computation is not allowed.
- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal
to 16 and less than 32768.
- The arrays for the complex input data x[ ], complex output data y[ ] and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary compo-
nents stored in adjacent locations in the array. The real components are
stored at even array indices, and the imaginary components are stored at
odd array indices.
- The FFT coefficients (twiddle factors) are generated using the program
tw_fft16x32 provided in the directory ‘support\fft’. The scale factor must be
32767.5. No scaling is done with the function; thus the input data must be
scaled by 2^ log2(nx) to completely prevent overflow.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- The routine uses log4(nx) − 1 stages of radix-4 transform and performs ei-
ther a radix-2 or radix-4 transform on the last stage depending on nx. If nx
is a power of 4,then this last stage is also a radix-4 transform, otherwise
it is a radix-2 transform.
4-48
DSP_fft32x32
DSP_fft32x32 Complex Forward Mixed Radix 32- x 32-bit FFT With Rounding
Function void DSP_fft32x32(const int * restrict w, int nx, int * restrict x, int * restrict y)
Description This routine computes an extended precision complex forward mixed radix
FFT with rounding and digit reversal. Input data x[ ], output data y[ ] and coeffi-
cients w[ ] are 32-bit. The output is returned in the separate array y[ ] in normal
order. Each complex value is stored with interleaved real and imaginary parts.
The code uses a special ordering of FFT coefficients (also called twiddle fac-
tors) and memory accesses to improve performance in the presence of cache.
The C code to generate the twiddle factors is similar to the one used for the
DSP_fft16x16r routine except that the factors are maintained at 32-bit preci-
sion.
Algorithm The C equivalent of the assembly code without restrictions is similar to the one
shown for the DSP_fft16x16t routine. For further details refer to the source
code of the C version of this function which is provided with this library. Note
that the assembly code is hand optimized and restrictions may apply.
Special Requirements
- In-place computation is not allowed.
- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal
to 16 and less than 32768.
- The arrays for the complex input data x[ ], complex output data y[ ] and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary compo-
nents stored in adjacent locations in the array. The real components are
stored at even array indices, and the imaginary components are stored at
odd array indices.
- The FFT coefficients (twiddle factors) are generated using the program
tw_fft32x32 provided in the directory ‘support\fft’. The scale factor must be
2147483647.5. No scaling is done with the function; thus the input data
must be scaled by 2^ log2(nx) to completely prevent overflow.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- The routine uses log4(nx) − 1 stages of radix-4 transform and performs ei-
ther a radix-2 or radix-4 transform on the last stage depending on nx. If nx
is a power of 4,then this last stage is also a radix-4 transform, otherwise
it is a radix-2 transform.
- The 32 by 32 multiplies are done with a 1.5 bit loss in accuracy. This comes
about because the contribution of the low 16 bits to the 32 bit result is not
computed. In addition the contribution of the low * high term is shifted by
16 as opposed to 15, for a loss of 0.5 bits after rounding.
4-50
DSP_fft32x32s
DSP_fft32x32s Complex Forward Mixed Radix 32- x 32-bit FFT With Scaling
Function void DSP_fft32x32s(const int * restrict w, int nx, int * restrict x, int * restrict y)
Description This routine computes an extended precision complex forward mixed radix
FFT with scaling, rounding and digit reversal. Input data x[ ], output data y[ ]
and coefficients w[ ] are 32-bit. The output is returned in the separate array y[ ]
in normal order. Each complex value is stored with interleaved real and imagi-
nary parts. The code uses a special ordering of FFT coefficients (also called
twiddle factors) and memory accesses to improve performance in the pres-
ence of cache. The C code to generate the twiddle factors is the same one
used for the DSP_fft32x32 routine.
Scaling by 2 (i.e., >>1) takes place at each radix−4 stage except for the last
one. A radix−4 stage can add a maximum of 2 bits, which would require scaling
by 4 to completely prevent overflow. Thus, the input data must be scaled by
2^(log2(nx)−ceil[log4(nx)−1]).
Algorithm The C equivalent of the assembly code without restrictions is similar to the one
shown for the fft16x16t routine. For further details refer to the source code of
the C version of this function which is provided with this library. Note that the
assembly code is hand optimized and restrictions may apply.
Special Requirements
- In-place computation is not allowed.
- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal
to 16 and less than 32768.
- The arrays for the complex input data x[ ], complex output data y[ ] and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary compo-
nents stored in adjacent locations in the array. The real components are
stored at even array indices, and the imaginary components are stored at
odd array indices.
- The FFT coefficients (twiddle factors) are generated using the program
tw_fft32x32 provided in the directory ‘support\fft’. The scale factor must be
1073741823.5. The input data must be scaled by 2^ (log2(nx) − ceil[
log4(nx)−1 ]) to completely prevent overflow.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- The routine uses log4(nx) − 1 stages of radix-4 transform and performs ei-
ther a radix-2 or radix-4 transform on the last stage depending on nx. If nx
is a power of 4,then this last stage is also a radix-4 transform, otherwise
it is a radix-2 transform.
- The 32 by 32 multiplies are done with a 1.5 bit loss in accuracy. This comes
about because the contribution of the low 16 bits to the 32 bit result is not
computed. In addition the contribution of the low * high term is shifted by
16 as opposed to 15, for a loss of 0.5 bits after rounding.
4-52
DSP_ifft16x32
DSP_ifft16x32 Complex Inverse Mixed Radix 16- x 32-bit FFT With Rounding
Function void DSP_ifft16x32(const short * restrict w, int nx, int * restrict x, int * restrict
y)
Description This routine computes an extended precision complex inverse mixed radix
FFT with rounding and digit reversal. Input data x[ ] and output data y[ ] are
32-bit, coefficients w[ ] are 16-bit. The output is returned in the separate array
y[ ] in normal order. Each complex value is stored with interleaved real and
imaginary parts. The code uses a special ordering of FFT coefficients (also
called twiddle factors) and memory accesses to improve performance in the
presence of cache.
In reality one can re-use fft16x32 to perform IFFT, by first conjugating the input,
performing the FFT, conjugating again. This allows fft16x32 to perform the
IFFT as well. However if the double conjugation needs to be avoided then this
routine uses the same twiddle factors as the FFT and performs an IFFT. The
change in the sign of the twiddle factors is adjusted for in the routine. Hence
this routine uses the same twiddle factors as the fft16x32 routine.
Algorithm The C equivalent of the assembly code without restrictions is similar to the one
shown for the fft16x16t routine. For further details refer to the source code of
the C version of this function which is provided with this library. Note that the
assembly code is hand optimized and restrictions may apply.
Special Requirements
- In-place computation is not allowed.
- The size of the FFT, nx, must be a power of 4 or 2 and greater than or equal
to 16 and less than 32768.
- The arrays for the complex input data x[ ], complex output data y[ ] and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary compo-
nents stored in adjacent locations in the array. The real components are
stored at even array indices, and the imaginary components are stored at
odd array indices.
- The FFT coefficients (twiddle factors) are generated using the program
tw_fft16x32 provided in the directory ‘support\fft’. The scale factor must be
32767.5. No scaling is done with the function; thus the input data must be
scaled by 2^ log2(nx) to completely prevent overflow.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- The routine uses log4(nx) − 1 stages of radix-4 transform and performs ei-
ther a radix-2 or radix-4 transform on the last stage depending on nx. If nx
is a power of 4,then this last stage is also a radix-4 transform, otherwise
it is a radix-2 transform.
4-54
DSP_ifft32x32
DSP_ifft32x32 Complex Inverse Mixed Radix 32- x 32-bit FFT With Rounding
Function void DSP_ifft32x32(const int * restrict w, int nx, int * restrict x, int * restrict y)
Arguments w[2*nx] Pointer to complex 32-bit FFT coefficients.
- The size of the IFFT, nx, must be a power of 4 or 2 and greater than or equal
to 16 and less than 32768.
- The arrays for the complex input data x[ ], complex output data y[ ] and
twiddle factors w[ ] must be double-word aligned.
- The input and output data are complex, with the real/imaginary compo-
nents stored in adjacent locations in the array. The real components are
stored at even array indices, and the imaginary components are stored at
odd array indices.
- The FFT coefficients (twiddle factors) are generated using the program
tw_fft32x32 provided in the directory ‘support\fft’. The scale factor must be
2147483647.5. No scaling is done with the function; thus the input data
must be scaled by 2^ log2(nx) to completely prevent overflow.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
- The routine uses log4(nx) − 1 stages of radix-4 transform and performs ei-
ther a radix-2 or radix-4 transform on the last stage depending on nx. If nx
is a power of 4,then this last stage is also a radix-4 transform, otherwise
it is a radix-2 transform.
4-56
DSP_fir_cplx
Function void DSP_fir_cplx (const short * restrict x, const short * restrict h, short * restrict
r, int nh, int nr)
Description This function implements the FIR filter for complex input data. The filter has
nr output samples and nh coefficients. Each array consists of an even and odd
term with even terms representing the real part and the odd terms the imagi-
nary part of the element. The pointer to input array x must point to the (nh)th
complex sample, i.e., element 2*(nh−1), upon entry to the function. The coeffi-
cients are expected in normal order.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_fir_cplx(short *x, short *h, short *r,short nh, short
nr)
{
short i,j;
int imag, real;
for (i = 0; i < 2*nr; i += 2){
imag = 0;
real = 0;
for (j = 0; j < 2*nh; j += 2){
real += h[j] * x[i−j] − h[j+1] * x[i+1−j];
imag += h[j] * x[i+1−j] + h[j+1] * x[i−j];
}
r[i] = (real >> 15);
r[i+1] = (imag >> 15);
}
}
Special Requirements
- The number of coefficients nh must be a multiple of 2.
Implementation Notes
- The outer loop is unrolled 4 times while the inner loop is not unrolled.
- ADDAH and SUBAH are used along with PACKH2 to perform accumula-
tion, shift and data packing.
Benchmarks Cycles nr * nh + 24
4-58
DSP_fir_gen
Function void DSP_fir_gen (const short * restrict x, const short * restrict h, short * restrict
r, int nh, int nr)
Description Computes a real FIR filter (direct-form) using coefficients stored in vector h[ ].
The real data input is stored in vector x[ ]. The filter output result is stored in
vector r[ ]. It operates on 16-bit data with a 32-bit accumulate. The filter calcu-
lates nr output samples using nh coefficients.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_fir_gen(short *x, short *h, short *r, int nh, int nr)
{
int i, j, sum;
Special Requirements
- nh, the number of coefficients, must be greater than or equal to 5. Coeffi-
cients must be in reverse order.
Implementation Notes
- Load double-word instruction is used to simultaneously load four values
in a single clock cycle.
- The inner loop is unrolled four times and will always compute a multiple
of 4 of nh and nr. If nh is not a multiple of 4, the code will fill in zeros to make
nh a multiple of 4.
- This code yields best performance when ratio of outer loop to inner loop
is less than or equal to 4.
4-60
DSP_fir_r4
Function void DSP_fir_r4 (const short * restrict x, const short * restrict h, short * restrict
r, int nh, int nr)
Description Computes a real FIR filter (direct-form) using coefficients stored in vector h[ ].
The real data input is stored in vector x[ ]. The filter output result is stored in
vector r[ ]. This FIR operates on 16-bit data with a 32-bit accumulate. The filter
calculates nr output samples using nh coefficients.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_fir_r4(short *x, short *h, short *r, int nh, int nr)
{
int i, j, sum;
Special Requirements
- nh, the number of coefficients, must be a multiple of 4 and greater than or
equal to 8. Coefficients must be in reverse order.
Implementation Notes
- The load double-word instruction is used to simultaneously load four val-
ues in a single clock cycle.
- The inner loop is unrolled four times and will always compute a multiple
of 4 output samples.
4-62
DSP_fir_r8
Function void DSP_fir_r8 (short *x, short *h, short *r, int nh, int nr)
Description Computes a real FIR filter (direct-form) using coefficients stored in vector h[
]. The real data input is stored in vector x[ ]. The filter output result is stored in
vector r[ ]. This FIR operates on 16-bit data with a 32-bit accumulate. The filter
calculates nr output samples using nh coefficients.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_fir_r8 (short *x, short *h, short *r, int nh, int nr)
{
int i, j, sum;
Special Requirements
- nh, the number of coefficients, must be a multiple of 8 and greater than or
equal to 8. Coefficients must be in reverse order.
Implementation Notes
- The load double-word instruction is used to simultaneously load four val-
ues in a single clock cycle.
- The inner loop is unrolled 4 times and will always compute a multiple of
4 output samples.
- The outer loop is conditionally executed in parallel with the inner loop. This
allows for a zero overhead outer loop.
4-64
DSP_fir_sym
Function void DSP_fir_sym (const short * restrict x, const short * restrict h, short * re-
strict r, int nh, int nr, int s)
Description This function applies a symmetric filter to the input samples. The filter tap array
h[] provides ‘nh+1’ total filter taps. The filter tap at h[nh] forms the center point
of the filter. The taps at h[nh − 1] through h[0] form a symmetric filter about this
central tap. The effective filter length is thus 2*nh+1 taps.
The filter is performed on 16-bit data with 16-bit coefficients, accumulating in-
termediate results to 40-bit precision. The accumulator is rounded and trun-
cated according to the value provided in ‘s’. This allows a variety of Q-points
to be used.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_fir_sym(short *x, short *h, short *r, int nh, int nr,
int s)
{
int i, j;
long y0;
long round = (long) 1 << (s − 1);
for (j = 0; j < nr; j++) {
y0 = round;
for (i = 0; i < nh; i++)
Special Requirements
- nh must be a multiple of 8. The number of original symmetric coefficients
is 2*nh+1. Only half (nh+1) are required.
- nr must be a multiple of 4.
Implementation Notes
- The load double-word instruction is used to simultaneously load four val-
ues in a single clock cycle.
4-66
DSP_iir
Function void DSP_iir (short * restrict r1, const short * restrict x, short * restrict r2, const
short * restrict h2, const short * restrict h1, int nr)
Arguments r1[nr+4] Output array (used in actual computation. First four elelemts must
have the previous outputs.)
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_iir(short *r1, short *x, short *r2, short *h2,
short *h1, int nr)
{
int j,i;
int sum;
for (i=0; i<nr; i++){
sum = h2[0] * x[4+i];
for (j = 1; j <= 4; j++)
sum += h2[j]*x[4+i−j]−h1[j]*r1[4+i−j];
r1[4+i] = (sum >> 15);
r2[i] = r1[4+i];
}
}
Special Requirements
- nr is greater than or equal to 8.
Implementation Notes
- Output array r1[ ] contains nr + 4 locations, r2[ ] contains nr locations for
storing nr output samples. The output samples are stored with an offset
of 4 into the r1[ ] array.
- The inner loop that iterated through the filter coefficients is completely un-
rolled.
Benchmarks Cycles 4 * nr + 21
4-68
DSP_iirlat
Function void DSP_iirlat(const short * restrict x, int nx, const short * restrict k, int nk, int
* restrict b, short * restrict r)
Description This routine implements a real all-pole IIR filter in lattice structure (AR lattice).
The filter consists of nk lattice stages. Each stage requires one reflection coef-
ficient k and one delay element b. The routine takes an input vector x[] and re-
turns the filter output in r[]. Prior to the first call of the routine the delay elements
in b[] should be set to zero. The input data may have to be pre-scaled to avoid
overflow or achieve better SNR. The reflections coefficients lie in the range
−1.0 < k < 1.0. The order of the coefficients is such that k[nk−1] corresponds
to the first lattice stage after the input and k[0] corresponds to the last stage.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void iirlat(short *x, int nx, short *k, int nk, int *b,
short *r)
{
int rt; /* output */
int i, j;
Special Requirements
- nk must be >= 4.
Implementation Notes
- Prolog and epilog of the inner loop are partially collapsed and overlapped
to reduce outer loop overhead.
4-70
DSP_dotp_sqr
4.5 Math
Function int DSP_dotp_sqr(int G, const short * restrict x, const short * restrict y, int * re-
strict r, int nx)
Description This routine performs a nx element dot product of x[ ] and y[ ] and stores it in
r. It also squares each element of y[ ] and accumulates it in G. G is passed back
to calling function in register A4. This computation of G is used in the VSELP
coder.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
int DSP_dotp_sqr (int G,short *x,short *y,int *r,
int nx)
{
short *y2;
short *endPtr2;
y2 = x;
for (endPtr2 = y2 + nx; y2 < endPtr2; y2++){
*r += *y * *y2;
G += *y * *y;
y++;
}
return(G);
}
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
Codesize 128
4-72
DSP_dotprod
Function int DSP_dotprod(const short * restrict x, const short * restrict y, int nx)
Description This routine takes two vectors and calculates their dot product. The inputs are
16-bit short data and the output is a 32-bit number.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
int DSP_dotprod(short x[ ],short y[ ], int nx)
{
int sum;
int i;
sum = 0;
for(i=0; i<nx; i++){
sum += (x[i] * y[i]);
}
return (sum);
}
Special Requirements
- The input length must be a multiple of 4.
- To avoid bank conflicts the input arrays x[ ] and y[ ] must be offset by 4 half-
words (8 bytes).
Implementation Notes
- The code is unrolled 4 times to enable full memory and multiplier band-
width to be utilized.
- Bank Conflicts: No bank conflicts occur if the input arrays x[ ] and y[ ] are
offset by 4 half-words (8 bytes).
Benchmarks Cycles nx / 4 + 15
4-74
DSP_maxval
Description This routine finds the element with maximum value in the input vector and re-
turns that value.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
short DSP_maxval(short x[ ], int nx)
{
int i, max;
max = −32768;
Implementation Notes
Benchmarks Cycles nx / 4 + 10
Arguments x[nx] Pointer to input vector of size nx. Must be double-word aligned.
Description This routine finds the max value of a vector and returns the index of that value.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
int DSP_maxidx(short x[ ], int nx)
{
int max, index, i;
max = −32768;
for (i = 0; i < nx; i++)
if (x[i] > max) {
max = x[i];
index = i;
}
return index;
}
Special Requirements
- nx must be a multiple of 16 and greater than or equal to 48.
Implementation Notes
- The code is unrolled 16 times to enable the full bandwidth of LDDW and
MAX2 instructions to be utilized. This splits the search into 16 sub-ranges.
The global maximum is then found from the list of maximums of the sub-
ranges. Then, using this offset from the sub-ranges, the global maximum
and the index of it are found using a simple match. For common maxi-
mums in multiple ranges, the index will be different to the above C code.
4-76
DSP_maxidx
Benchmarks Cycles 5 * nx / 16 + 42
Description This routine finds the minimum value of a vector and returns the value.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
short DSP_minval(short x[ ], int nx)
{
int i, min;
min = 32767;
Implementation Notes
- The input data is loaded using double word wide loads, and the MIN2 in-
struction is used to get to the minimum.
4-78
DSP_mul32
Function void DSP_mul32(const int * restrict x, const int * restrict y, int * restrict r, short
nx)
Arguments x[nx] Pointer to input data vector 1 of size nx. Must be double-word
aligned.
Description The function performs a Q.31 x Q.31 multiply and returns the upper 32 bits of
the result. The result of the intermediate multiplies are accumulated into a
40-bit long register pair as there could be potential overflow. The contribution
of the multiplication of the two lower 16-bit halves are not considered. The out-
put is in Q.30 format. Results are accurate to least significant bit.
Algorithm In the comments below, X and Y are the two input values. Xhigh and Xlow rep-
resent the upper and lower 16 bits of X. This is the C equivalent of the assembly
code without restrictions. Note that the assembly code is hand optimized and
restrictions may apply.
void DSP_mul32(const int *x, const int *y, int *r,
short nx)
{
short i;
int a,b,c,d,e;
for(i=nx;i>0;i−−)
{
a=*(x++);
b=*(y++);
c=_mpyluhs(a,b); /* Xlow*Yhigh */
d=_mpyhslu(a,b); /* Xhigh*Ylow */
e=_mpyh(a,b); /* Xhigh*Yhigh */
d+=c; /* Xhigh*Ylow+Xlow*Yhigh */
d=d>>16; /* (Xhigh*Ylow+Xlow*Yhigh)>>16 */
e+=d; /* Xhigh*Yhigh + */
/* (Xhigh*Ylow+Xlow*Yhigh)>>16 */
*(r++)=e;
}
}
Special Requirements
- nx must be a multiple of 8 and greater than or equal to 16.
Implementation Notes
- The MPYHI instruction is used to perform 16 x 32 multiplies to form 48-bit
intermediate results.
4-80
DSP_neg32
Arguments x[nx] Pointer to input data vector 1 of size nx with 32-bit elements.
Must be double-word aligned.
r[nx] Pointer to output data vector of size nx with 32-bit elements.
Must be double-word aligned.
nx Number of elements of input and output vectors. Must be a
multiple of 4 and ≥8.
Description This function negates the elements of a vector (32-bit elements). The input and
output arrays must not be overlapped except for the special case where the
input and output pointers are exactly equal.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_neg32(int *x, int *r, short nx)
{
short i;
for(i=nx; i>0; i−−)
*(r++)=−*(x++);
}
Special Requirements
- nx must be a multiple of 4 and greater than or equal to 8.
Implementation Notes
- The loop is unrolled twice and pipelined.
Function void DSP_recip16 (short *x, short *rfrac, short *rexp, short nx)
Description This routine returns the fractional and exponential portion of the reciprocal of
an array x[ ] of Q.15 numbers. The fractional portion rfrac is returned in Q.15
format. Since the reciprocal is always greater than 1, it returns an exponent
such that:
(rfrac[i] * 2rexp[i]) = true reciprocal
The output is accurate up to the least significant bit of rfrac, but note that this
bit could carry over and change rexp. For a reciprocal of 0, the procedure will
return a fractional part of 7FFFh and an exponent of 16.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_recip16(short *x, short *rfrac, short *rexp, short
nx)
{
int i,j,a,b;
short neg, normal;
for(i=nx; i>0; i−−)
{
a=*(x++);
if(a<0) /* take absolute value */
{
a=−a;
neg=1;
}
else neg=0;
normal=_norm(a); /* normalize number */
a=a<<normal;
4-82
DSP_recip16
Implementation Notes
- The conditional subtract instruction, SUBC, is used for division. SUBC is
used once for every bit of quotient needed (15).
Benchmarks Cycles 8 * nx + 14
Description This routine returns the sum of squares of the elements contained in the vector
x[ ].
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
int DSP_vecsumsq(short x[ ], int nx)
{
int i, sum=0;
Implementation Notes
- The code is unrolled 4 times to enable full memory and multiplier band-
width to be utilized.
4-84
DSP_w_vec
m Weighting factor
Implementation Notes
- Input is loaded in double-words.
4.6 Matrix
Function void DSP_mat_mul(const short * restrict x, int r1, int c1, const short * restrict
y, int c2, short * restrict r, int qs)
Description This function computes the expression “r = x * y” for the matrices x and y. The
columnar dimension of x must match the row dimension of y. The resulting ma-
trix has the same number of rows as x and the same number of columns as y.
The values stored in the matrices are assumed to be fixed-point or integer val-
ues. All intermediate sums are retained to 32-bit precision, and no overflow
checking is performed. The results are right-shifted by a user-specified
amount, and then truncated to 16 bits.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that the
assembly code is hand optimized and restrictions may apply.
void DSP_mat_mul(short *x, int r1, int c1, short *y, int c2,
short *r, int qs)
{
int i, j, k;
int sum;
/* −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− */
/* Multiply each row in x by each column in y. The */
/* product of row m in x and column n in y is placed */
4-86
DSP_mat_mul
Special Requirements
- The arrays x[], y[], and r[] are stored in distinct arrays. That is, in-place
processing is not allowed.
- The input matrices have minimum dimensions of at least 1 row and 1 col-
umn, and maximum dimensions of 32767 rows and 32767 columns.
Implementation Notes
- The ‘i’ loop and ‘k’ loops are unrolled 2x. The ’j’ loop is unrolled 4x. For di-
mensions that are not multiples of the various loops’ unroll factors, this
code calculates extra results beyond the edges of the matrix. These extra
results are ultimately discarded. This allows the loops to be unrolled for
efficient operation on large matrices while not losing flexibility.
- Interruptibility: This code blocks interrupts during its innermost loop. In-
terrupts are not blocked otherwise. As a result, interrupts can be blocked
for up to 0.25*c1’ + 16 cycles at a time.
Benchmarks Cycles 0.25 * ( r1’ * c2’ * c1’ ) + 2.25 * ( r1’ * c2’ ) + 11, where:
r1’ = 2 * ceil(r1/2.0) (r1 rounded up to next even)
c1’ = 2 * ceil(c1/2.0) (c1 rounded up to next even)
c2’ = 4 * ceil(c2/4.0) (c2 rounded up to next mult of 4)
For r1= 1, c1= 1, c2= 1: 33 cycles
For r1= 8, c1=20, c2= 8: 475 cycles
Function void DSP_mat_trans (const short *x, short rows, short columns, short *r)
Arguments x[rows*columns] Pointer to input matrix.
Implementation Notes
- Data from four adjacent rows, spaced “columns” apart are read, and a lo-
cal 4x4 transpose is performed in the register file. This leads to four double
words, that are “rows” apart. These loads and stores can cause bank con-
flicts, hence non-aligned loads and stores are used.
- Bank Conflicts: No bank conflicts occur.
4-88
DSP_bexp
4.7 Miscellaneous
return short Return value is the maximum exponent that may be used in
scaling.
Description Computes the exponents (number of extra sign bits) of all values in the input
vector x[ ] and returns the minimum exponent. This will be useful in determining
the maximum shift value that may be used in scaling a block of data.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
short DSP_bexp(const int *x, short nx)
{
int min_val =_norm(x[0]);
short n;
int i;
for(i=1;i<nx;i++)
{
n =_norm(x[i]); /* _norm(x) = number of */
/* redundant sign bits */
if(n<min_val) min_val=n;
}
return min_val;
}
Special Requirements
- nx must be a multiple of 8.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
4-90
DSP_blk_eswap16
if (r)
{
_x = (char *)x;
_r = (char *)r;
} else
{
_x = (char *)x;
_r = (char *)r;
}
Special Requirements
- Input and output arrays do not overlap, except in the very specific case that
“r == NULL” so that the operation occurs in-place.
- The input array and output array are expected to be double-word aligned,
and a multiple of 8 half-words must be processed.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
4-92
DSP_blk_eswap32
Description The data in the x[] array is endian swapped, meaning that the byte-order of the
bytes within each word of the r[] array is reversed. This is meant to facilitate
moving big-endian data to a little-endian system or vice-versa.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that the
assembly code is hand optimized and restrictions may apply.
void DSP_blk_eswap32(void *x, void *r, int nx)
{
int i;
char *_x, *_r;
if (r)
{
_x = (char *)x;
_r = (char *)r;
} else
{
_x = (char *)x;
_r = (char *)r;
}
t2 = _x[i*4 + 1];
t3 = _x[i*4 + 0];
_r[i*4 + 0] = t0;
_r[i*4 + 1] = t1;
_r[i*4 + 2] = t2;
_r[i*4 + 3] = t3;
}
}
Special Requirements
- Input and output arrays do not overlap, except in the very specific case that
“r == NULL” so that the operation occurs in-place.
- The input array and output array are expected to be double-word aligned,
and a multiple of 4 words must be processed.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
4-94
DSP_blk_eswap64
Description The data in the x[] array is endian swapped, meaning that the byte-order of the
bytes within each double-word of the r[] array is reversed. This is meant to facil-
itate moving big-endian data to a little-endian system or vice-versa.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that the
assembly code is hand optimized and restrictions may apply.
void DSP_blk_eswap64(void *x, void *r, int nx)
{
int i;
char *_x, *_r;
if (r)
{
_x = (char *)x;
_r = (char *)r;
} else
{
_x = (char *)x;
_r = (char *)r;
}
t2 = _x[i*8 + 5];
t3 = _x[i*8 + 4];
t4 = _x[i*8 + 3];
t5 = _x[i*8 + 2];
t6 = _x[i*8 + 1];
t7 = _x[i*8 + 0];
_r[i*8 + 0] = t0;
_r[i*8 + 1] = t1;
_r[i*8 + 2] = t2;
_r[i*8 + 3] = t3;
_r[i*8 + 4] = t4;
_r[i*8 + 5] = t5;
_r[i*8 + 6] = t6;
_r[i*8 + 7] = t7;
}
}
Special Requirements
- Input and output arrays do not overlap, except in the very specific case that
“r == NULL” so that the operation occurs in-place.
- The input array and output array are expected to be double-word aligned,
and a multiple of 2 double-words must be processed.
Implementation Notes
- Bank Conflicts: No bank conflicts occur.
4-96
DSP_blk_move
Description This routine moves nx 16-bit elements from one memory location pointed to
by x to another pointed to by r. The source and destination blocks can be over-
lapped.
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_blk_move(short *x, short *r, int nx)
{
int i;
if( r < x )
{
for (I = 0; I < nx; i++)
r[i] = x[i];
} else
{
for (I = nx−1; I >= 0; i−−)
r[i] = x[i];
}
}
Implementation Notes
- Twin input and output pointers are used.
// saturate to 16−bit //
if (a>32767) a = 32767;
if (a<−32768) a = −32768;
r[i] = (short) a;
}
}
Special Requirements nx must be a multiple of 2.
Implementation Notes
- Loop is unrolled twice.
4-98
DSP_minerror
Function int minerror (const short * restrict GSP0_TABLE, const short * restrict err-
Coefs, int * restrict max_index)
Algorithm This is the C equivalent of the assembly code without restrictions. Note that the
assembly code is hand optimized and restrictions may apply.
int minerr
(
const short *restrict GSP0_TABLE,
const short *restrict errCoefs,
int *restrict max_index
)
{
int val, maxVal = −50;
int i, j;
for (i = 0; i < GSP0_NUM; i++)
{
for (val = 0, j = 0; j < GSP0_TERMS; j++)
val += GSP0_TABLE[i*GSP0_TERMS+j] * errCoefs[j];
Implementation Notes
- The load double-word instruction is used to simultaneously load four val-
ues in a single clock cycle.
4-100
DSP_q15tofl
Description Converts the values stored in vector x[ ] in Q.15 format to IEEE floating point
numbers in output vector r[ ].
Algorithm This is the C equivalent of the assembly code without restrictions. Note that
the assembly code is hand optimized and restrictions may apply.
void DSP_q15tofl(short *x, float *r, int nx)
{
int i;
for (i=0;i<nx;i++)
r[i] = (float) x[i] / 0x8000;
}
Implementation Notes
- Loop is unrolled twice
Benchmarks Cycles 2 * nx + 14
Performance/Fractional Q Formats
Topic Page
A-1
Performance Considerations
For more information on additional stall cycles due to memory hierarchy, see
the Signal Processing Examples Using TMS320C64x Digital Signal Process-
ing Library (SPRA884). The TMS320C6000 DSP Cache User’s Guide
(SPRU656A) presents how to optimize algorithms and function calls for better
cache performance.
A-2
Fractional Q Formats
Unless specifically noted, DSPLIB functions use Q15 format, or to be more ex-
act, Q0.15. In a Qm.n format, there are m bits used to represent the two’s com-
plement integer portion of the number, and n bits used to represent the two’s
complement fractional portion. m+n+1 bits are needed to store a general Qm.n
number. The extra bit is needed to store the sign of the number in the most-sig-
nificant bit position. The representable integer range is specified by (−2 m,2 m)
and the finest fractional resolution is 2 −n.
For example, the most commonly used format is Q.15. Q.15 means that a
16-bit word is used to express a signed number between positive and negative
one. The most-significant binary digit is interpreted as the sign bit in any Q for-
mat number. Thus, in Q.15 format, the decimal point is placed immediately to
the right of the sign bit. The fractional portion to the right of the sign bit is stored
in regular two’s complement format.
Q.3.12 format places the sign bit after the fourth binary digit from the right, and
the next 12 bits contain the two’s complement fractional component. The
approximate allowable range of numbers in Q.3.12 representation is (−8,8)
and the finest fractional resolution is 2−12 = 2.441 × 10−4.
Bit 15 14 13 12 11 10 9 … 0
Value S I3 I2 I1 Q11 Q10 Q9 … Q0
Q.15 format places the sign bit at the leftmost binary digit, and the next 15 left-
most bits contain the two’s complement fractional component. The approxi-
mate allowable range of numbers in Q.15 representation is (−1,1) and the fin-
est fractional resolution is 2−15 = 3.05 × 10−5.
Bit 15 14 13 12 11 10 9 … 0
Value S Q14 Q13 Q12 Q11 Q10 Q9 … Q0
Q.31 format spans two 16-bit memory words. The 16-bit word stored in the low-
er memory location contains the 16 least significant bits, and the higher
memory location contains the most significant 15 bits and the sign bit. The
approximate allowable range of numbers in Q.31 representation is (−1,1) and
the finest fractional resolution is 2−31 = 4.66 × 10−10.
Bit 15 14 13 12 … 3 2 1 0
Value Q15 Q14 Q13 Q12 … Q3 Q2 Q1 Q0
Bit 15 14 13 12 … 3 2 1 0
Value S Q30 Q29 Q28 … Q19 Q18 Q17 Q16
A-4
Appendix
AppendixBA
Topic Page
B-1
DSPLIB Software Updates
B-2
Appendix
AppendixCA
Glossary
A
address: The location of program code or data stored; an individually
accessible memory location.
assert: To make a digital logic device pin active. If the pin is active low, then
a low voltage on the pin asserts it. If the pin is active high, then a high
voltage asserts it.
B
bit: A binary digit, either a 0 or 1.
big endian: An addressing protocol in which bytes are numbered from left
to right within a word. More significant bytes in a word have lower
numbered addresses. Endian ordering is specific to hardware and is
determined at reset. See also little endian.
block: The three least significant bits of the program address. These
correspond to the address within a fetch packet of the first instruction
being addressed.
C-1
Glossary
C
cache: A fast storage buffer in the central processing unit of a computer.
cache controller: System component that coordinates program accesses
between CPU program fetch mechanism, cache, and external memory.
CCS: Code Composer Studio.
central processing unit (CPU): The portion of the processor involved in
arithmetic, shifting, and Boolean logic operations, as well as the
generation of data- and program-memory addresses. The CPU includes
the central arithmetic logic unit (CALU), the multiplier, and the auxiliary
register arithmetic unit (ARAU).
chip support library (CSL): The CSL is a set of application programming
interfaces (APIs) consisting of target side DSP code used to configure
and control all on-chip peripherals.
clock cycle: A periodic or sequence of events based on the input from the
external clock.
clock modes: Options used by the clock generator to change the internal
CPU clock frequency to a fraction or multiple of the frequency of the input
clock signal.
code: A set of instructions written to perform a task; a computer program or
part of a program.
coder-decoder or compression/decompression (codec): A device that
codes in one direction of transmission and decodes in another direction
of transmission.
compiler: A computer program that translates programs in a high-level
language into their assembly-language equivalents.
C-2
Glossary
control register: A register that contains bit fields that define the way a
device operates.
D
device ID: Configuration register that identifies each peripheral component
interconnect (PCI).
DMA source: The module where the DMA data originates. DMA data is read
from the DMA source.
DMA transfer: The process of transferring data from one part of memory to
another. Each DMA transfer consists of a read bus cycle (source to DMA
holding register) and a write bus cycle (DMA holding register to
destination).
DSP_autocor: Autocorrelation
Glossary C-3
Glossary
DSP_fft16x32: Complex forward mixed radix 16- x 32-bit FFT with rounding.
DSP_fft32x32: Complex forward mixed radix 32- x 32-bit FFT with rounding.
DSP_fft32x32s: Complex forward mixed radix 32- x 32-bit FFT with scaling.
C-4
Glossary
E
evaluation module (EVM): Board and software tools that allow the user to
evaluate a specific device.
external interrupt: A hardware interrupt triggered by a specific value on a
pin.
external memory interface (EMIF): Microprocessor hardware that is used
to read to and write from off-chip memory.
F
fast Fourier transform (FFT): An efficient method of computing the discrete
Fourier transform algorithm, which transforms functions between the
time domain and the frequency domain.
fetch packet: A contiguous 8-word series of instructions fetched by the CPU
and aligned on an 8-word boundary.
FFT: See fast fourier transform.
flag: A binary status indicator whose state indicates whether a particular
condition has occurred or is in effect.
frame: An 8-word space in the cache RAMs. Each fetch packet in the cache
resides in only one frame. A cache update loads a frame with the
requested fetch packet. The cache contains 512 frames.
G
global interrupt enable bit (GIE): A bit in the control status register (CSR)
that is used to enable or disable maskable interrupts.
Glossary C-5
Glossary
H
HAL: Hardware abstraction layer of the CSL. The HAL underlies the service
layer and provides it a set of macros and constants for manipulating the
peripheral registers at the lowest level. It is a low-level symbolic interface
into the hardware providing symbols that describe peripheral
registers/bitfields and macros for manipulating them.
host: A device to which other devices (peripherals) are connected and that
generally controls those devices.
host port interface (HPI): A parallel interface that the CPU uses to
communicate with a host processor.
HPI: See host port interface; see also HPI module.
I
index: A relative offset in the program address that specifies which of the
512 frames in the cache into which the current access is mapped.
indirect addressing: An addressing mode in which an address points to
another pointer rather than to the actual data; this mode is prohibited in
RISC architecture.
instruction fetch packet: A group of up to eight instructions held in memory
for execution by the CPU.
internal interrupt: A hardware interrupt caused by an on-chip peripheral.
interrupt: A signal sent by hardware or software to a processor requesting
attention. An interrupt tells the processor to suspend its current
operation, save the current task status, and perform a particular set of
instructions. Interrupts communicate with the operating system and
prioritize tasks to be performed.
interrupt service fetch packet (ISFP): A fetch packet used to service
interrupts. If eight instructions are insufficient, the user must branch out
of this block for additional interrupt service. If the delay slots of the branch
do not reside within the ISFP, execution continues from execute packets
in the next fetch packet (the next ISFP).
interrupt service routine (ISR): A module of code that is executed in
response to a hardware or software interrupt.
interrupt service table (IST) A table containing a corresponding entry for
each of the 16 physical interrupts. Each entry is a single-fetch packet and
has a label associated with it.
C-6
Glossary
L
least significant bit (LSB): The lowest-order bit in a word.
linker: A software tool that combines object files to form an object module,
which can be loaded into memory and executed.
little endian: An addressing protocol in which bytes are numbered from right
to left within a word. More significant bytes in a word have
higher-numbered addresses. Endian ordering is specific to hardware
and is determined at reset. See also big endian.
M
maskable interrupt: A hardware interrupt that can be enabled or disabled
through software.
memory map: A graphical representation of a computer system’s memory,
showing the locations of program space, data space, reserved space,
and other memory-resident elements.
memory-mapped register: An on-chip register mapped to an address in
memory. Some memory-mapped registers are mapped to data memory,
and some are mapped to input/output memory.
most significant bit (MSB): The highest order bit in a word.
m-law companding: See compress and expand (compand).
multichannel buffered serial port (McBSP): An on-chip full-duplex circuit
that provides direct serial communication through several channels to
external serial devices.
multiplexer: A device for selecting one of several available signals.
N
nonmaskable interrupt (NMI): An interrupt that can be neither masked nor
disabled.
Glossary C-7
Glossary
O
object file: A file that has been assembled or linked and contains machine
language object code.
P
peripheral: A device connected to and usually controlled by a host device.
R
random-access memory (RAM): A type of memory device in which the
individual locations can be accessed in any order.
reset: A means of bringing the CPU to a known state by setting the registers
and control bits to predetermined values and signaling execution to start
at a specified address.
C-8
Glossary
S
service layer: The top layer of the 2-layer chip support library architecture
providing high-level APIs into the CSL and BSL. The service layer is
where the actual APIs are defined and is the layer the user interfaces to.
system software: The blanketing term used to denote collectively the chip
support libraries and board support libraries.
T
tag: The 18 most significant bits of the program address. This value
corresponds to the physical address of the fetch packet that is in that
frame.
TIMER module: TIMER is an API module used for configuring the timer
registers.
W
word: A multiple of eight bits that is operated upon as a unit. For the C6x,
a word is 32 bits in length.
Glossary C-9
Index
Index
C DSP_bitrev_cplx
defined C-3
DSPLIB reference 4-6
cache, defined C-2
DSP_blk_eswap16, defined C-3
cache controller, defined C-2
DSP_blk_eswap32, defined C-3
CCS, defined C-2
DSP_blk_eswap64, defined C-3
central processing unit (CPU), defined C-2
DSP_blk_move
chip support library, defined C-2
defined C-4
clock cycle, defined C-2 DSPLIB reference 4-91, 4-93, 4-95, 4-97
clock modes, defined C-2 DSP_dotp_sqr
code, defined C-2 defined C-4
coder-decoder, defined C-2 DSPLIB reference 4-71
Index-1
Index
Index-2
Index
Index-3
Index
H N
HAL, defined C-6 nonmaskable interrupt (NMI), defined C-7
host, defined C-6
host port interface (HPI), defined C-6 O
HPI, defined C-6
object file, defined C-8
off chip, defined C-8
I on chip, defined C-8
include directory 2-3 overflow and scaling 2-6, 2-7
index, defined C-6
indirect addressing, defined C-6 P
installing DSPLIB 2-2
performance considerations A-2
instruction fetch packet, defined C-6
peripheral, defined C-8
internal interrupt, defined C-6 program cache, defined C-8
internal peripherals, defined C-7 program memory, defined C-8
interrupt, defined C-6 PWR, defined C-8
interrupt service fetch packet (ISFP), defined C-6 PWR module, defined C-8
interrupt service routine (ISR), defined C-6
interrupt service table (IST), defined C-6
IST, defined C-7
Q
Q.3.12 bit fields A-3
L Q.3.12 format A-3
Q.3.15 bit fields A-3
least significant bit (LSB), defined C-7 Q.3.15 format A-3
lib directory 2-2 Q.31 format A-4
linker, defined C-7 Q.31 high-memory location bit fields A-4
little endian, defined C-7 Q.31 low-memory location bit fields A-4
Index-4
Index
S U
service layer, defined C-9 using DSPLIB 2-4
software updates B-2
STDINC module, defined C-9
synchronous-burst static random-access memory
W
(SBSRAM), defined C-9 word, defined C-9
Index-5