TMS320C67X Assembly Benchmarks at Texas Instruments

   TMS320C6000™ Highest Performance DSP Platform

C67x™ Floating-Point Benchmarks
         Filters
         Vector
         FFTs
         Search
         Math
         3D-Graphics and Imaging

FILTERS

Benchmark	Description	Formula
Block FIR	The FIR assumes that the number of filter coefficients (numH) is a multiple of 2 and greater than or equal to 4 and the number of outputs (numY) is a multiple of 4 and greater than or equal to 4. The input, output, and coefficient arrays must start on the same double-word boundary to avoid memory bank hits.	((2*numH)+10)*(numY/4)+8; for numH=64 and numY=64: 2216 cycles or 13.296 µsec
Block IIR	The IIR assumes that the order is a multiple of 2 and greater than or equal to 4, and the number of outputs (numY) is a multiple of 2 and greater than or equal to order+2. To avoid bank hits, the input and output arrays must be aligned on opposite double-word boundaries, and the a and b coefficient arrays must be aligned on opposite double-word boundaries.	(order+10)*(numY-order)+15; for order=16 and numY=64: 1263 cycles or 7.578 µsec
Cascaded IIR Biquads	The Biquad assumes that the number of biquads (numB) is a multiple of 2 and greater than or equal to 2, and it processes one input and produces one output. There are no memory bank hits regardless of where the arguments are placed in memory.	4*(numB)+29; for numB=8: 61 cycles or 366 nsec
Circular Block FIR	The circular FIR assumes that the number of filter coefficients (hsize) is a multiple of 2 and greater than or equal to 4 and the number of outputs (ysize) is a multiple of 4 and greater than or equal to 4. The input, output, and coefficient arrays must start on the same double-word boundary to avoid memory bank hits. Circular addressing is used for the input array (x) with a circular buffer size 2^(size+1) and the routine uses "index" to define the initial offset into the buffer.	((2*hsize)+10)*(ysize/4)+9; for hsize=64 and ysize=64: 2217 cycles or 13.302 µsec
Convolution	The convolution assumes that the output array length (nr) is a multiple of 4 and greater than or equal to 4, and the second input array length (nb) is a multiple of 2 and greater than or equal to 4. The first input array length should be (nr+nb-1) where the first nb-1 and last nb-1 values are zero. If all three arrays are aligned on the same double-word boundary and nb is not a multiple of 4 there will be no memory bank hits (if it is a multiple of 4 there will be nr/4 bank hits).	(nb/2)*nr+(nr/2)*5+8; for nb=8 and nr=20: 138 cycles or 828 nsec
Cross Correlation	The Correlation assumes that the output array length (nr) is a multiple of 4 and greater than or equal to 4, and the second input array length (nb) is a multiple of 2 and greater than or equal to 4. The first input array length should be (nr+nb-1) where the first nb-1 and last nb-1 values are zero. If all three arrays are aligned on the same double-word boundary and nb is not a multiple of 4 there will be no memory bank hits (if it is a multiple of 4 there will be nr/4 bank hits).	(nb/2)*nr+(nr/2)*5+8; for nb=8 and nr=20: 138 cycles or 828 nsec
Autocorrelation	Autocorrelation assumes that the correlation is length M, the output array is length M and the input array is length (M+N) where the first M values are zero. The value of N should be a multiple of 2 and greater than or equal to 4. The value of M should be a multiple of 4 and greater than or equal to 4. To prevent memory bank hits, the input array should be aligned on an even double-word boundary (bank 0), and the output array should be aligned on the next word boundary (bank 2).	(N/2)*M+(M/2)*5+9; for M=8 and N=18: 101 cycles or 606 nsec
LMS FIR Filter	The Least Mean Squares adaptive FIR filter assumes that the number of coefficients (numH) is a multiple of 4 and at least 4. The number of inputs must be equal to numH+numY-1, where numY is the number of outputs.	((5*numH)/4+27)*numY+17; for numH=64 and numY=64: 6865 cycles or 41.19 µsec
Complex FIR Filter	The complex FIR filter assumes that the number of complex coefficients (numH) is a multiple of 2 and at least 4. The number of complex inputs must be equal to numH+numY-1, where numY is the number of complex outputs.	((2*numH)+14)*numY+17+numY-1; for numH=64 and numY=64: 9168 cycles or 55.008 µsec
Inverse Analysis Lattice Filter	This routine implements an inverse analysis lattice filter (FIR filter or IIR filter with no poles) and stores the result in f. The filter consists of n stages. The value of f is calculated by doing a multiply accumulate on the backward error coefficients, b, and filter gains, k. New backward error coefficients are also calculated.	4*n+22; for n=8: 54 cycles or 324 nsec
Forward Synthesis Lattice Filter	This routine implements a forward synthesis lattice filter (IIR filter with no zeros) and stores the result in f. The filter consists of n stages. The value of f is calculated by doing a multiply accumulate on the backward error coefficients, b, and filter gains, k. New backward error coefficients are also calculated. The value of n must be at least 4.	4*n+24; for n=8: 56 cycles or 336 nsec
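
The filter cycle formulas can be evaluated directly. The Python sketch below reproduces three of the listed figures and converts cycles to time. The 6 ns cycle time is an assumption inferred from the listed numbers (2216 cycles given as 13.296 µsec), not a value stated on this page, and the function names are illustrative only; they do not belong to any TI library.

```python
# Cycle time inferred from the listed figures (2216 cycles = 13.296 usec).
# This is an assumption, not a number stated in the benchmark page.
CYCLE_NS = 6.0


def block_fir_cycles(num_h, num_y):
    """Block FIR: ((2*numH)+10)*(numY/4)+8."""
    if num_h < 4 or num_h % 2 or num_y < 4 or num_y % 4:
        raise ValueError("numH must be even and >= 4; numY a multiple of 4 and >= 4")
    return (2 * num_h + 10) * (num_y // 4) + 8


def block_iir_cycles(order, num_y):
    """Block IIR: (order+10)*(numY-order)+15."""
    if order < 4 or order % 2 or num_y % 2 or num_y < order + 2:
        raise ValueError("order must be even and >= 4; numY even and >= order+2")
    return (order + 10) * (num_y - order) + 15


def lms_fir_cycles(num_h, num_y):
    """LMS FIR: ((5*numH)/4+27)*numY+17."""
    if num_h < 4 or num_h % 4:
        raise ValueError("numH must be a multiple of 4 and >= 4")
    return (5 * num_h // 4 + 27) * num_y + 17


def cycles_to_usec(cycles, cycle_ns=CYCLE_NS):
    """Convert a cycle count to microseconds for the assumed cycle time."""
    return cycles * cycle_ns / 1000.0


if __name__ == "__main__":
    for name, cycles in [
        ("Block FIR", block_fir_cycles(64, 64)),
        ("Block IIR", block_iir_cycles(16, 64)),
        ("LMS FIR", lms_fir_cycles(64, 64)),
    ]:
        print(f"{name}: {cycles} cycles, {cycles_to_usec(cycles):.3f} usec")
```

Passing a different `cycle_ns` to `cycles_to_usec` gives the corresponding times for another clock rate.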


VECTOR

Benchmark	Description	Formula
Dot product	The function performs the dot product of two vectors of length N where N is a multiple of 2 and greater than or equal to 10. No memory bank hits occur if the arrays are aligned on opposite double-word boundaries.	N/2+24; for N=100: 74 cycles or 444 nsec
Matrix-Vector Multiply (any size)	The function performs the multiplication of an n x m matrix by an m x 1 vector. The a and b arrays should be placed on opposite double-word boundaries to prevent memory bank hits.	(n+20)*m+1; for m=3 and n=3: 70 cycles or 420 nsec
Matrix-Vector Multiply (with even number of columns)	The function performs the multiplication of an n x m matrix by an m x 1 vector. The column dimension (m) must be greater than or equal to 2 and a multiple of 2. The a and b arrays should be placed on opposite double-word boundaries to prevent memory bank hits.	((n/2)+24)*m+7; for m=3 and n=20: 109 cycles or 654 nsec
Weighted vector sum	The function performs an N-element vector sum of two vectors with one vector weighted by a constant. The result is stored in a third vector. The value of N must be a multiple of 2 and greater than or equal to 12. To prevent bank hits, the two input vectors should be aligned on opposite double-word boundaries.	N+12; for N=100: 112 cycles or 672 nsec
Vector Sum	The function calculates the sum of two vectors of length N where N is a multiple of 2 and greater than or equal to 6. To avoid memory bank hits, the vectors should be aligned on opposite double-word boundaries.	N+8; for N=100: 108 cycles or 648 nsec
Sum of squares	The function calculates the sum of the squares of the N elements of the vector. The value N must be a multiple of 2 and greater than or equal to 12. This function performs extraneous loads.	N/2+24; for N=100: 74 cycles or 444 nsec
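
As a point of reference for the dot product entry, the following Python sketch models the computation together with the length constraints given in the description. It is a plain functional model under stated assumptions, not the benchmarked assembly routine: the split into two partial sums is an illustrative choice, and the function names are hypothetical.

```python
def dot_product(a, b):
    """Reference dot product with the length constraints listed for the benchmark."""
    n = len(a)
    if len(b) != n:
        raise ValueError("vectors must have the same length")
    if n < 10 or n % 2:
        raise ValueError("N must be a multiple of 2 and >= 10")
    # Two partial sums model processing the data in pairs. This is an
    # illustrative assumption, not a description of the TI assembly code.
    even = sum(a[i] * b[i] for i in range(0, n, 2))
    odd = sum(a[i] * b[i] for i in range(1, n, 2))
    return even + odd


def dot_product_cycles(n):
    """Cycle estimate from the table: N/2 + 24."""
    return n // 2 + 24


if __name__ == "__main__":
    x = [1.0] * 100
    y = [0.5] * 100
    print(dot_product(x, y), dot_product_cycles(len(x)))
```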


FFTs

Benchmark	Description	Formula
Complex Radix 4 FFT	The function calculates the complex Radix 4 DIF FFT of size N with digit-reversed output and normal order input.	(log4(N))*(14*N/4+23)+20; for N=1024: 18,055 cycles or 108.33 µsec
Complex Radix 2 FFT	The function calculates the complex Radix 2 DIT FFT of size N with bit-reversed output and coefficients, and normal order input.	((2*N)+23)*log2(N)+6; for N=1024: 20,716 cycles or 124.30 µsec
Inverse Complex Radix 2 FFT	The function calculates the inverse complex Radix 2 DIF FFT of size N with bit-reversed input, normal order output, and bit-reversed coefficients.	((2*N)+16)*log2(N)+25; for N=1024: 20,665 cycles or 124 µsec
Complex Bit-Reverse	The function performs the bit-reversal for an array of N complex SP floating point numbers. N must be a power of 2.	(N/4)*11+9; for N=1024: 2,825 cycles or 16.95 µsec
Two-level-cache efficient mixed-radix forward FFT	The function performs a mixed radix forward FFT for floating point input and coefficient data using a special sequence of coefficients. This FFT uses a redundant sequence of twiddle factors to allow a linear access through the data.	3.25*ceil(log4(N)-1)*N+3*N+179; for N=1024: 16,563 cycles
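
Several of the FFT entries produce or consume bit-reversed data, which the Complex Bit-Reverse routine reorders. The Python sketch below shows the standard bit-reversal permutation for a power-of-2 length as a functional model only; the function names are hypothetical and the code says nothing about how the benchmarked routine is scheduled.

```python
def bit_reverse_index(i, bits):
    """Reverse the lowest `bits` bits of index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reverse_permute(x):
    """Return a copy of x in bit-reversed index order; len(x) must be a power of 2."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("N must be a power of 2")
    bits = n.bit_length() - 1
    return [x[bit_reverse_index(i, bits)] for i in range(n)]


def bit_reverse_cycles(n):
    """Cycle estimate from the table: (N/4)*11 + 9."""
    return (n // 4) * 11 + 9


if __name__ == "__main__":
    data = [complex(k, -k) for k in range(8)]
    print(bit_reverse_permute(data))
    print(bit_reverse_cycles(1024))
```

Applying the permutation twice restores the original order, which is a convenient sanity check.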


SEARCH

Benchmark	Description	Formula
Vector Max	The function finds the maximum value in a vector of length N where N is a multiple of 5 and greater than or equal to 10. No memory bank hits occur regardless of where arguments are in memory.	3*N/5+14; for N=100: 74 cycles or 444 nsec


MATH

Benchmark	Description	Formula
Single Precision Floating Point Reciprocal	The function performs the reciprocal using the RCPSP instruction and 2 iterations of the Newton-Raphson algorithm to produce 23 bits of accuracy. 8 bits of accuracy can be achieved by simply using the RCPSP instruction by itself. 16 bits of accuracy is achieved with only one Newton-Raphson iteration.	28 cycles
Double Precision Floating Point Reciprocal	The function performs the reciprocal using the RCPDP instruction and 2 iterations of the Newton-Raphson algorithm.	84 cycles
Single Precision Floating Point Reciprocal Square Root	The function performs the reciprocal square root using the RSQRSP instruction and 2 iterations of the Newton-Raphson algorithm to produce 23 bits of accuracy. 8 bits of accuracy can be achieved by simply using the RSQRSP instruction by itself. 16 bits of accuracy is achieved with only one Newton-Raphson iteration.	34 cycles
Double Precision Floating Point Reciprocal Square Root	This function performs the DP square root reciprocal using the RSQRDP instruction and 3 iterations of the Newton-Raphson algorithm.	113 cycles
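
The accuracy figures for the single precision reciprocal (8, 16, then 23 bits) follow from the Newton-Raphson step x_next = x*(2 - a*x), which roughly doubles the number of correct bits per iteration. The Python sketch below illustrates this. The seed is modeled by rounding 1/a to 8 mantissa bits, which is an assumption standing in for the RCPSP result, and the arithmetic is done in Python floats rather than on the DSP; all function names are hypothetical.

```python
import math


def seed_reciprocal(a, bits=8):
    """Model a low-precision seed: 1/a rounded to `bits` mantissa bits.

    This stands in for a hardware estimate instruction and is an assumption,
    not a bit-exact model of RCPSP.
    """
    m, e = math.frexp(1.0 / a)  # 1/a = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << bits
    return math.ldexp(round(m * scale) / scale, e)


def reciprocal(a, iterations=2):
    """Refine the seed with Newton-Raphson steps for f(x) = 1/x - a."""
    x = seed_reciprocal(a)
    for _ in range(iterations):
        x = x * (2.0 - a * x)
    return x


def relative_error(a, x):
    """Relative error of x as an approximation of 1/a."""
    return abs(a * x - 1.0)


if __name__ == "__main__":
    a = 3.0
    for k in range(3):
        print(k, relative_error(a, reciprocal(a, k)))
```

If the seed has relative error e, one step leaves a relative error of e squared, so an 8-bit seed yields about 16 bits after one iteration and more than the 23 bits of a single precision mantissa after two.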


3D GRAPHICS AND IMAGING

Benchmark	Description	Formula
3D Geometry Transformation	This function performs the "front end" of a 3D graphics transformation pipeline. It performs geometry transformation, clipping preprocessing, perspective projection, and viewpoint mapping.	Approx 10.4M vertices/second
Collision Detection	This function takes a vector of 3D points and translates them in one dimension. The 1D distance from the translated point to the parameter "point" is calculated. If the distance is less than the parameter "distance", a collision is detected and the address of point is returned. There are no memory bank hits regardless of where the function parameters are placed in memory; but, the function performs extraneous loads.	(N/2)*3+32 (worst case); for N=10,000: 15,032 cycles or 90.192 µsec
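
The collision detection description can be read as a linear scan along one axis. The Python sketch below is one possible interpretation, with several assumptions: the `axis` and `offset` parameters are hypothetical names for the dimension and translation amount, "point" is treated as a scalar coordinate along that axis, and an index is returned where the original routine returns an address.

```python
def find_collision(points, axis, offset, point, distance):
    """Return the index of the first point that collides, or None.

    Each 3D point is translated by `offset` along `axis`; a collision is
    reported when the 1D distance to `point` is less than `distance`.
    """
    for i, p in enumerate(points):
        translated = p[axis] + offset
        if abs(translated - point) < distance:
            return i
    return None


def collision_cycles(n):
    """Worst-case cycle estimate from the table: (N/2)*3 + 32."""
    return (n // 2) * 3 + 32


if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (5.0, 1.0, 2.0), (9.0, 3.0, 4.0)]
    print(find_collision(pts, axis=0, offset=1.0, point=10.0, distance=0.5))
    print(collision_cycles(10000))
```

The worst case in the formula corresponds to scanning all N points without finding a collision.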
