TMS320C62X Assembly Benchmarks at Texas Instruments


'C6000 Benchmarks
         Filters
         Vector
         FFTs
         Search
         Math
         Graphics
         Telecom

FILTERS

Benchmark	Description	Formula
FIR - coefficients a multiple of 4	This FIR assumes the number of filter coefficients is a multiple of 4 and the number of output samples is a multiple of 2. It operates on 16-bit data with a 32-bit accumulate. This routine has no memory hits regardless of where the x, h, and y arrays are located in memory. The filter has M output samples and N coefficients.	M*(N+8)/2 + 6. For N=32 and M=100: 2006 cycles or 10.03 µsec
FIR - coefficients a multiple of 8	This FIR assumes the number of filter coefficients is a multiple of 8 and the number of output samples is a multiple of 2. It operates on 16-bit data with a 32-bit accumulate. This routine has no memory hits regardless of where the x, h, and y arrays are located in memory. The filter has M output samples and N coefficients.	M*N/2 + 13. For N=32 and M=100: 1613 cycles or 8.06 µsec
Complex FIR	This FIR operates on complex 16-bit data with a complex 32-bit accumulate. This routine has no memory hits regardless of where the x, h, and y arrays are located in memory. The filter has M output samples and N coefficients.	2*M*N + 10. For M=100 and N=32: 6410 cycles or 32 µsec
LMS FIR - coefficients a multiple of 2	Least Mean Square Adaptive Filter. Computes an update of all N coefficients by adding the weighted error times the inputs to the original coefficients. This assumes a single input sample followed by the last N-1 inputs, and N coefficients.	1.5*N+16. For N=30: 61 cycles or 305 nsec
LMS FIR - coefficients a multiple of 8	Least Mean Square Adaptive Filter. Computes an update of all N coefficients by adding the weighted error times the inputs to the original coefficients, followed by an FIR with N coefficients and M output samples and an error calculation. This assumes that N is a multiple of 8. (N=number of data samples, multiple of 8 >=8)	M*(9/8*N+15)+5
IIR filter	Performs an autoregressive moving-average (ARMA) filter with 4 autoregressive filter coefficients and 5 moving-average filter coefficients for M output samples. The output vector is stored to two locations. This routine is used as a high-pass filter in the VSELP vocoder.	M*5 + 16. For M=160: 816 cycles or 4.08 µsec
FIR Circular	Finite Impulse Response Filter. Uses circular addressing with initial index. Performs filtering 2 samples at a time. (N=number of data samples, even >=2) (M=number of filter coefficients, multiple of 4 >=4)	M*(N+11)/2+13. For N=32 and M=32: 701 cycles or 3.505 µsec
Lattice Analysis	Lattice Filter - Inverse - Analysis. (N=number of coefficients)	1.5*N+10. For N=10: 25 cycles or 125 nsec
Lattice Synthesis	Lattice Filter - Forward - Synthesis. (N=number of data samples, even >= 6)	2N+18. For N=10: 38 cycles or 190 nsec
IIR with 4 biquads cascaded	Infinite Impulse Response Filter. Direct Form II - 4 Multiplies. Processes 2 samples at a time. (N=number of cascaded biquads)	4N+16. For N=10: 56 cycles or 280 nsec
Autocorrelation	Performs autocorrelation of a 16-bit vector. Nested loop with M inner loop multiply accumulates and outer loops.	(N/2)*M + 16 + M/4. For N=160 and M=10: 816 cycles or 4.08 µsec
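
The cycle formulas and times above can be cross-checked with a short calculation. The sketch below is in Python and covers only the three plain FIR rows. The page does not state a clock rate; the 5 ns cycle used here is inferred from figures such as "2006 cycles or 10.03 µsec" and should be treated as an assumption. All function names are hypothetical.

```python
# Assumption: 5 ns per cycle (a 200 MHz clock), inferred from the table's
# "2006 cycles or 10.03 usec"; the page itself does not state the clock rate.
CYCLE_NS = 5


def fir_mult4_cycles(m, n):
    """M*(N+8)/2 + 6 for M output samples and N coefficients."""
    if n % 4 or m % 2:
        raise ValueError("N must be a multiple of 4 and M a multiple of 2")
    return m * (n + 8) // 2 + 6


def fir_mult8_cycles(m, n):
    """M*N/2 + 13 for M output samples and N coefficients."""
    if n % 8 or m % 2:
        raise ValueError("N must be a multiple of 8 and M a multiple of 2")
    return m * n // 2 + 13


def complex_fir_cycles(m, n):
    """2*M*N + 10 for M output samples and N coefficients."""
    return 2 * m * n + 10


def cycles_to_usec(cycles):
    """Convert a cycle count to microseconds under the assumed cycle time."""
    return cycles * CYCLE_NS / 1000


if __name__ == "__main__":
    for name, fn in (("FIR x4", fir_mult4_cycles),
                     ("FIR x8", fir_mult8_cycles),
                     ("Complex FIR", complex_fir_cycles)):
        cycles = fn(100, 32)
        print(name, cycles, "cycles,", cycles_to_usec(cycles), "usec")
```

With M=100 and N=32 the three functions return 2006, 1613, and 6410 cycles, matching the table.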

VECTOR

Benchmark	Description	Formula
Dot product	Dot product of two vectors of length N.	N/2 + 8. For N = 100: 58 cycles or 290 nsec
Weighted vector sum	Performs an N-element vector sum of two vectors, with one vector weighted by a constant. The result is stored in a third vector.	N+10. For N = 40: 49 cycles or 245 nsec
Vector dot product and square	Performs an N-element dot product; in addition, each of the N elements of one of the vectors is squared and accumulated. This is used to compute G in the VSELP coder.	N + 8. For N = 40: 48 cycles or 240 nsec
Block move	Moves N 16-bit elements from one memory location to another.	N/2 + 5. For N = 40: 25 cycles or 125 nsec
Sum of squares	Each of the N elements in a vector is squared and accumulated. This particular loop is used to compute Gl in the VSELP vocoder codebook search.	(N-1)/2 + 9. For N = 21: 19 cycles
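
As a rough reference for what three of these kernels compute, here is a minimal Python sketch. It models only the arithmetic described in the table, not the assembly routines: Python integers do not overflow, and any fixed-point scaling of the weighted sum is left out because the page does not describe it. The function names are hypothetical.

```python
def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def weighted_vector_sum(a, b, weight):
    """out[i] = weight * a[i] + b[i], returned as a third vector.

    Assumption: no fixed-point scaling or rounding is applied.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [weight * x + y for x, y in zip(a, b)]


def sum_of_squares(v):
    """Square each element and accumulate the results."""
    return sum(x * x for x in v)


if __name__ == "__main__":
    print(dot_product([1, 2, 3], [4, 5, 6]))
    print(weighted_vector_sum([1, 2], [10, 20], 3))
    print(sum_of_squares([1, 2, 3]))
```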

FFTs

Benchmark	Description	Formula
Two-level-cache efficient Complex Radix 4 FFT	Complex Radix 4 FFT of size N. This FFT uses a redundant sequence (N twiddles for N-point FFT) of twiddle factors to allow a linear access through the data. This linear access consumes the entire contents of a cache line before it uses another one, resulting in efficient cache usage.	10 * log4(N) * (0.25 * N + 3) + 22. For N = 1024: 12972 cycles
Complex Radix 4 FFT	Complex Radix 4 FFT of size N.	log4(N) * (10 * N/4 + 33) + 7 + N/4. For N = 1024: 13228 cycles or 66 µsec
Complex Radix 2 FFT	Complex Radix 2 FFT of size N.	log2(N) * (4 * N/2 + 7) + 9 + N/4. For N = 1024: 20815 cycles or 104 µsec
Bit Reverse	The Bit-Reverse routine performs a bit reversal on an array of N 16-bit complex data values.	7*(N/4 + 2) + 14. For N = 1024: 1820 cycles or 9.1 µsec. Lookup table size: 32 halfwords (64 bytes)
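
The reordering done by the Bit Reverse row can be illustrated with a small Python sketch. It computes each reversed index bit by bit, which is not how the benchmarked routine works (that one uses a 32-halfword lookup table), so treat this as a model of the result only. The function names are hypothetical.

```python
def bit_reverse_index(i, bits):
    """Reverse the lowest `bits` bits of the non-negative integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # shift the lowest bit of i into r
        i >>= 1
    return r


def bit_reverse_permute(data):
    """Return a copy of data with element i taken from the bit-reversed index.

    The length must be a power of two. Each element may be any object,
    for example a (real, imag) tuple standing in for one complex sample.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    return [data[bit_reverse_index(i, bits)] for i in range(n)]


if __name__ == "__main__":
    print(bit_reverse_permute(list(range(8))))
```

Applying the permutation twice returns the original order, which is a convenient self-check.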

SEARCH

Benchmark	Description	Formula
Minimum energy error search	Performs a dot product on 256 pairs of 9-element vectors and searches for the pair of vectors which produces the maximum dot product result. This is a large part of the VSELP vocoder codebook search.	(256/2)*9 + 14 = 1166 cycles or 5.83 µsec
Vector Max	Finds the maximum value in a vector of length N.	N/2 + 13. For N = 100: 64 cycles or 320 nsec
Vector Max Index	Finds the maximum value in a vector of length N and stores the index of that location.	2N/3 + 12. For N = 100: 79 cycles or 395 nsec
Codebook search for VSELP	Performs the VSELP vocoder codebook search. The C source code for this was written by Motorola Systems Research Laboratories and is authorized by Motorola for use in the development of North American digital cellular standards. As such, the C code cannot be shown here. This routine performs the entire v_srch.c function as written by Motorola. It calculates the correlations between the weighted basis vectors and the weighted speech vector (Rm's), C0, and 0.25 * sum of Djj for G0. It then calculates all Dmj and finishes calculating G0. Finally, it initializes the best vector to be code vector zero and performs the search by finding the vector that produces the highest C^2/G value.	Loop1 + Loop2 + Loop3 = 342 + 639 + 2087 = 3068 cycles
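
The two simpler searches in this table can be modeled in a few lines of Python. This is a sketch of the described behavior only: the tie-breaking rule (first occurrence wins) is an assumption, the function names are hypothetical, and nothing here reflects the Motorola v_srch.c code.

```python
def vector_max_index(v):
    """Return (index, value) of the maximum element of a non-empty vector.

    Assumption: on ties, the first occurrence is reported.
    """
    if not v:
        raise ValueError("vector must not be empty")
    best = 0
    for i in range(1, len(v)):
        if v[i] > v[best]:
            best = i
    return best, v[best]


def max_dot_product_pair(pairs):
    """Return (index, dot product) of the pair with the largest dot product.

    `pairs` is a sequence of (a, b) vector pairs; the table's benchmark
    uses 256 pairs of 9-element vectors.
    """
    best_index, best_value = None, None
    for i, (a, b) in enumerate(pairs):
        value = sum(x * y for x, y in zip(a, b))
        if best_value is None or value > best_value:
            best_index, best_value = i, value
    if best_index is None:
        raise ValueError("at least one pair is required")
    return best_index, best_value


if __name__ == "__main__":
    print(vector_max_index([3, 9, 2, 9]))
    print(max_dot_product_pair([([1, 2], [3, 4]), ([5, 6], [7, 8])]))
```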

MATH

Benchmark	Description	Formula
ADD40	Adds two 40-bit values to produce a 40-bit result. This code sample is NOT a complete function!	N/A
ADD64	Adds two 64-bit values to produce a 64-bit result. This code sample is NOT a complete function!	N/A
SUB40	Subtracts one 40-bit value from another 40-bit value to produce a 40-bit result. This code sample is NOT a complete function!	N/A
SUB64	Subtracts one 64-bit value from another 64-bit value to produce a 64-bit result. This code sample is NOT a complete function!	N/A
DIVMOD32	This routine divides two 32-bit values and returns their quotient and remainder. The inputs are 32-bit numbers, and the result is a 32-bit number. Cycles: minimum execution 16 cycles, maximum execution 41 cycles. This code sample is NOT a complete function!	N/A
DIVMODU32	This routine divides two unsigned 32-bit values and returns their quotient and remainder. The inputs are unsigned 32-bit numbers, and the result is an unsigned 32-bit number. Cycles: minimum execution 18 cycles, maximum execution 42 cycles. This code sample is NOT a complete function!	N/A
MPY32	This routine takes two 32-bit integer values and calculates their product. The inputs are 32-bit integers, and the result is a 32-bit integer. Cycles: see routine. This code sample is NOT a complete function!	N/A
MPY3240	This routine takes two 32-bit integer values and calculates their product. The inputs are 32-bit integers, and the result is a 40-bit integer. Cycles: see routine. This code sample is NOT a complete function!	N/A
MPYU3240	This routine takes two 32-bit unsigned integer values and calculates their product. The inputs are 32-bit unsigned integers, and the result is a 40-bit unsigned integer. Cycles: see routine. This code sample is NOT a complete function!	N/A
MPY40	This routine takes two 40-bit integer values and calculates their product. The inputs are 40-bit integers, and the result is a 40-bit integer. Cycles: see routine. This code sample is NOT a complete function!	N/A
MPY3264	This routine takes two 32-bit integer values and calculates their product. The inputs are 32-bit integers, and the result is a 64-bit integer. Cycles: see routine.	N/A
MPYU3264	This routine takes two 32-bit unsigned integer values and calculates their product. The inputs are 32-bit unsigned integers, and the result is a 64-bit unsigned integer. Cycles: see routine.	N/A
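
The routines themselves are not reproduced on this page. As an illustration of one common way to build a wide product from narrower multiplies, the Python sketch below forms an unsigned 32x32 -> 64-bit product from four 16x16-bit partial products. Whether MPYU3264 is structured this way is an assumption, and the function name is hypothetical.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mpyu_32x32_64(a, b):
    """Unsigned 32x32 -> 64-bit product from 16x16-bit partial products.

    Assumption: this only illustrates the decomposition; it is not the
    benchmarked routine.
    """
    if not (0 <= a <= MASK32 and 0 <= b <= MASK32):
        raise ValueError("operands must be unsigned 32-bit values")
    a_lo, a_hi = a & MASK16, a >> 16
    b_lo, b_hi = b & MASK16, b >> 16
    low = a_lo * b_lo                   # weight 2**0
    mid = a_lo * b_hi + a_hi * b_lo     # weight 2**16
    high = a_hi * b_hi                  # weight 2**32
    return (high << 32) + (mid << 16) + low


if __name__ == "__main__":
    x, y = 0x12345678, 0x9ABCDEF0
    print(hex(mpyu_32x32_64(x, y)), mpyu_32x32_64(x, y) == x * y)
```

The result can be checked against Python's built-in multiplication, since Python integers are arbitrary precision.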

GRAPHICS

Benchmark	Description	Formula
8x8 Block IDCT - IEEE-1180 Compliant	The idct_8x8 algorithm performs an IEEE-1180 compliant IDCT, complete with rounding and saturation to signed 9-bit quantities. The array should be aligned to a 32-bit boundary, and be laid out equivalently to the C array idct_data[num_idcts+1][8][8]. The input coefficients are assumed to be signed 12-bit cosine terms.	Cycles = 62 + 168 * num_idcts for num_idcts >= 1. 230 cycles or 1.15 µsec for one 8x8 block of data
8x8 Block FDCT With Rounding	The fdct routine accepts a list of 8x8 pixel blocks and performs FDCTs on each. The array should be laid out equivalently to the C array dct_data[num_fdcts+1][8][8]. All operations in this array are performed entirely in-place. Input values are stored in shorts, and may be in the range [-512,511]. Input terms are expected to be signed 11Q0 values, producing signed 15Q0 results.	Cycles = 48 + 160 * num_fdcts for num_fdcts >= 1. 208 cycles or 1.04 µsec for one 8x8 block of data
Gouraud	Gouraud shading of a scanline of pixels. Four pixels of a line are processed at a time. (N=pixels >=4, multiple of 4 pixels)	2N+7. For 1024 pixels taken 4 pixels at a time: 2055 cycles or 10.275 µsec
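
Two details from the DCT rows are easy to model: saturation to a signed 9-bit quantity, and the per-call cycle counts. The Python sketch below is illustrative only. The rounding step of the IDCT is not modeled because the page does not describe it, and the function names are hypothetical.

```python
def saturate_signed(value, bits=9):
    """Clamp an integer to the signed range of the given width.

    For 9 bits the range is -256..255.
    """
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def idct_cycles(num_idcts):
    """62 + 168 * num_idcts, valid for num_idcts >= 1."""
    if num_idcts < 1:
        raise ValueError("num_idcts must be >= 1")
    return 62 + 168 * num_idcts


def fdct_cycles(num_fdcts):
    """48 + 160 * num_fdcts, valid for num_fdcts >= 1."""
    if num_fdcts < 1:
        raise ValueError("num_fdcts must be >= 1")
    return 48 + 160 * num_fdcts


if __name__ == "__main__":
    print(saturate_signed(300), saturate_signed(-300), saturate_signed(17))
    print(idct_cycles(1), fdct_cycles(1))
```

Because the fixed overhead (62 or 48 cycles) is paid once per call, processing many blocks in one call brings the average cost per block close to 168 and 160 cycles respectively.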

TELECOM

Benchmark	Description	Formula
Viterbi Equalization	Viterbi Equalizer - GSM (N=number of data points)	43N + 2. For N=120: 5162 cycles or 25.810 µsec
Viterbi GSM	Viterbi Channel Decoder (GSM) (N=number of data points)	38N + 12 + N/4. For N=189: 7242 cycles or 36.21 µsec
Viterbi IS54	Viterbi Channel Decoder (IS54) (N=number of data points)	66.5*N+16. For N=189: 5934 cycles or 29.67 µsec
Viterbi V.32	Viterbi V.32 PSTN Trellis Decoder. (N=number of data points)	64 cycles or 320 nsec
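
Some of these formulas contain a fractional term (N/4), so the quoted cycle counts imply a rounding rule that the page does not state. The Python sketch below evaluates the two GSM formulas exactly and rounds up; rounding up is an assumption, chosen because it reproduces the 7242 cycles quoted for N=189. The function names are hypothetical.

```python
from fractions import Fraction
import math


def viterbi_equalizer_cycles(n):
    """43N + 2 (GSM Viterbi equalizer)."""
    return 43 * n + 2


def viterbi_gsm_cycles_exact(n):
    """38N + 12 + N/4 as an exact fraction (GSM channel decoder)."""
    return Fraction(38 * n + 12) + Fraction(n, 4)


def viterbi_gsm_cycles(n):
    """Exact value rounded up to a whole cycle (assumed rounding rule)."""
    return math.ceil(viterbi_gsm_cycles_exact(n))


if __name__ == "__main__":
    print(viterbi_equalizer_cycles(120))
    print(viterbi_gsm_cycles_exact(189), viterbi_gsm_cycles(189))
```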