TMS320C6000 Platform Overview

'C6000 Benchmarks: Filters, Vector, FFTs, Search, Math, Graphics, Telecom

Benchmark | Description | Formula |
---|---|---|
FIR - coefficients a multiple of 4 | This FIR assumes the number of filter coefficients is a multiple of 4 and the number of output samples is a multiple of 2. It operates on 16-bit data with a 32-bit accumulate. This routine has no memory hits regardless of where the x, h, and y arrays are located in memory. The filter produces M output samples from N coefficients. | M*(N+8)/2 + 6. For N=32 and M=100: 2006 cycles or 10.03 µsec |
FIR - coefficients a multiple of 8 | This FIR assumes the number of filter coefficients is a multiple of 8 and the number of output samples is a multiple of 2. It operates on 16-bit data with a 32-bit accumulate. This routine has no memory hits regardless of where the x, h, and y arrays are located in memory. The filter produces M output samples from N coefficients. | M*N/2 + 13. For N=32 and M=100: 1613 cycles or 8.06 µsec |
Complex FIR | This FIR operates on complex 16-bit data with a complex 32-bit accumulate. This routine has no memory hits regardless of where the x, h, and y arrays are located in memory. The filter produces M output samples from N coefficients. | 2*M*N + 10. For M = 100 and N = 32: 6410 cycles or 32 µsec |
LMS FIR - coefficients a multiple of 2 | Least Mean Square Adaptive Filter. Computes an update of all N coefficients by adding the weighted error times the inputs to the original coefficients. This assumes a single-sample input followed by the last N-1 inputs and N coefficients. | 1.5*N+16. For N=30: 61 cycles or 305 nsec |
LMS FIR - coefficients a multiple of 8 | Least Mean Square Adaptive Filter. Computes an update of all N coefficients by adding the weighted error times the inputs to the original coefficients, followed by an FIR with N coefficients and M output samples and an error calculation. This assumes that N is a multiple of 8. (N=number of data samples, multiple of 8 >=8) | M*(9/8*N+15)+5 |
IIR filter | Performs an auto-regressive moving-average (ARMA) filter with 4 auto-regressive filter coefficients and 5 moving-average filter coefficients for M output samples. The output vector is stored to two locations. This routine is used as a high-pass filter in the VSELP vocoder. | M*5 + 16. For M = 160: 816 cycles or 4.08 µsec |
FIR Circular | Finite Impulse Response Filter. Uses circular addressing with initial index. Performs filtering 2 samples at a time. (N=number of data samples, even >=2) (M=number of filter coefficients, multiple of 4 >=4) | M*(N+11)/2+13. For N=32 and M=32: 701 cycles or 3.505 µsec |
Lattice Analysis | Lattice Filter - Inverse - Analysis. (N=number of coefficients) | 1.5*N+10. For N=10: 25 cycles or 125 nsec |
Lattice Synthesis | Lattice Filter - Forward - Synthesis. (N=number of data samples, even >= 6) | 2N+18. For N=10: 38 cycles or 190 nsec |
IIR with 4 biquads cascaded | Infinite Impulse Response Filter. Direct Form II - 4 Multiplies. Processes 2 samples at a time. (N=number of cascaded biquads) | 4N+16 For N=10 56 cycles or 280 nsec |
Autocorrelation | Performs autocorrelation of a 16-bit vector. Nested loop with M inner loop multiply accumulates and outer loops. | (N/2) *M + 16 + M/4 For N=160 and M=10; 816 cycles or 4.08 µsec |
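
The entries in the Formula column can be checked by evaluating them directly. The following is a minimal sketch rather than TI code: the function names are illustrative, and the 5 ns cycle time is an assumption inferred from the table itself (2006 cycles is listed as 10.03 µsec).

```python
# Sketch only: evaluates three cycle-count formulas from the Filters table.
# CYCLE_NS is an assumption inferred from the table (2006 cycles = 10.03 usec).
CYCLE_NS = 5.0


def fir_mult4_cycles(m, n):
    """FIR with N a multiple of 4 and M a multiple of 2: M*(N+8)/2 + 6."""
    return m * (n + 8) // 2 + 6


def fir_mult8_cycles(m, n):
    """FIR with N a multiple of 8 and M a multiple of 2: M*N/2 + 13."""
    return m * n // 2 + 13


def complex_fir_cycles(m, n):
    """Complex FIR: 2*M*N + 10."""
    return 2 * m * n + 10


def cycles_to_usec(cycles):
    """Convert a cycle count to microseconds at the assumed cycle time."""
    return cycles * CYCLE_NS / 1000.0


if __name__ == "__main__":
    for name, formula in [
        ("FIR, N multiple of 4", fir_mult4_cycles),
        ("FIR, N multiple of 8", fir_mult8_cycles),
        ("Complex FIR", complex_fir_cycles),
    ]:
        cycles = formula(100, 32)  # M = 100 output samples, N = 32 coefficients
        print(f"{name}: {cycles} cycles, {cycles_to_usec(cycles):.3f} usec")
```

With M = 100 and N = 32 this reproduces the cycle counts listed for the three FIR rows.
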
Benchmark | Description | Formula |
---|---|---|
dot product | Dot product of two vectors of length N | N/2 + 8 For N = 100 58 cycles or 290 nsec |
Weighted vector sum | Performs an N element vector sum of two vectors with one vector weighted by constant. The result is stored in a third vector. | N+10 For N = 40: 49 cycles or 245 nsec |
Vector dot product and square | Performs an N element dot product and each of the N elements of one of the vectors is squared and accumulated. This is used to compute G in the VSELP coder. | N + 8 For N = 40: 48 cycles or 240 nsec |
Block move | Move N 16-bit elements from one memory location to another. | N/2 + 5 For N = 40: 25 cycles or 125 nsec |
Sum of squares | Each of N elements in a vector is squared and accumulated. This particular loop is used to compute Gl in the VSELP vocoder codebook search. | (N-1)/2 + 9 For N = 21: 19 cycles |
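
As a point of reference for what the vector kernels compute, here is a plain sketch of the dot product and the combined dot-product-and-square operation. It is not the optimized routine: the function names are hypothetical, the choice of which vector gets squared is an assumption (the description does not say), and Python integers do not model 16-bit inputs or 32-bit accumulator overflow.

```python
# Reference sketch of two Vector-table kernels (not TI's implementation).

def dot_product(a, b):
    """Sum of a[i] * b[i] over two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def dot_and_square(a, b):
    """Return (sum of a[i]*b[i], sum of b[i]*b[i]) in a single pass.

    Assumption: the second vector is the one that is squared.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = 0
    squares = 0
    for x, y in zip(a, b):
        dot += x * y
        squares += y * y
    return dot, squares
```

Computing both sums in one loop mirrors the description of the combined kernel, which accumulates the dot product and the squares over the same N elements.
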
Benchmark | Description | Formula |
---|---|---|
Two-level-cache efficient Complex Radix 4 FFT | Complex Radix 4 FFT of size N. This FFT uses a redundant sequence (N twiddles for N-point FFT) of twiddle factors to allow a linear access through the data. This linear access consumes the entire contents of a cache line before it uses another one resulting in efficient cache usage. | 10 * log4(N) * (0.25 * N + 3) + 22 for N = 1024, cycles = 12972 |
Complex Radix 4 FFT | Complex Radix 4 FFT of size N | log4(N) * (10 * N/4 + 33) + 7 + N/4. For N = 1024: 13228 cycles or 66 µsec |
Complex Radix 2 FFT | Complex Radix 2 FFT of size N | log2(N) * (4 * N/2 + 7) + 9 + N/4. For N = 1024: 20815 cycles or 104 µsec |
Bit Reverse | The Bit-Reverse routine performs bit reversal on an array of N 16-bit complex data elements. | Cycle Count: 7*(N/4 + 2) + 14. For N = 1024: Cycle Count = 1820 or 9.1 µs. Lookup Table Size: 32 Halfwords (64 Bytes) |
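
To illustrate what a bit-reversal pass does to the data order, here is a minimal sketch of a bit-reversed reordering. It is a straightforward reference version under stated assumptions, not the table-driven routine benchmarked above: the function names are hypothetical, and each element stands for one complex sample.

```python
# Reference sketch of bit-reversed reordering (not the lookup-table routine).

def bit_reverse_index(i, bits):
    """Reverse the lowest `bits` bits of index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reverse_permute(data):
    """Return a copy of data with element i taken from the bit-reversed index."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    return [data[bit_reverse_index(i, bits)] for i in range(n)]


def bit_reverse_cycles(n):
    """Cycle-count formula from the table: 7*(N/4 + 2) + 14."""
    return 7 * (n // 4 + 2) + 14
```

Applying `bit_reverse_permute` twice returns the original order, which is a quick sanity check for any implementation of this step.
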
Benchmark | Description | Formula |
---|---|---|
Minimum energy error search | Performs a dot product on 256 pairs of 9 element vectors and searches for the pair of vectors which produces the maximum dot product result. This is a large part of the VSELP vocoder codebook search. | (256/2)*9 + 14 = 1166 cycles or 5.83 µsec |
Vector Max | Finds the maximum value in a vector of length N. | N/2 + 13 For N = 100: 64 cycles or 320 nsec |
Vector Max Index | Finds the maximum value in a vector of length N and stores the index of that location. | 2N/3 + 12 For N = 100: 79 cycles or 395 nsec |
codebook search for VSELP | Performs the VSELP vocoder codebook search. The C source code for this was written by Motorola Systems Research Laboratories and is authorized by Motorola for use in the development of North American digital cellular standards. As such, the C code cannot be shown here. This routine performs the entire v_srch.c function as written by Motorola. It calculates the correlations between the weighted basis vectors and the weighted speech vector (Rm's), C0, and 0.25 * sum of Djj for G0. It then calculates all Dmj and finishes calculating G0. Finally, it initializes the best vector to be code vector zero and performs the search by finding the vector that produces the highest C^2/G value. | Loop1 + Loop2 + Loop3: 342 + 639 + 2087 = 3068 cycles |
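
The minimum energy error search described above (a dot product per vector pair, keeping the largest result) can be sketched as follows. This is a reference model only: the function name and return shape are hypothetical, and breaking ties in favor of the earliest pair is an assumption.

```python
# Reference sketch of the pairwise dot-product search (not TI's routine).

def best_pair_index(a_vectors, b_vectors):
    """Return (index, value) of the vector pair with the largest dot product.

    Assumption: when several pairs tie, the earliest index wins.
    """
    if not a_vectors or len(a_vectors) != len(b_vectors):
        raise ValueError("need two non-empty lists of equal length")
    best_index = 0
    best_value = None
    for i, (a, b) in enumerate(zip(a_vectors, b_vectors)):
        value = sum(x * y for x, y in zip(a, b))
        if best_value is None or value > best_value:
            best_index, best_value = i, value
    return best_index, best_value
```

In the benchmarked case there would be 256 pairs of 9-element vectors; the sketch accepts any number of pairs and any vector length.
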
Benchmark | Description | Formula |
---|---|---|
ADD40 | Adds two 40-bit values to produce a 40-bit result. This code sample is NOT a complete function! | N/A |
ADD64 | Adds two 64-bit values to produce a 64-bit result. This code sample is not a complete function! | N/A |
SUB40 | Subtracts one 40-bit value from another 40-bit value to produce a 40-bit result. This code sample is NOT a complete function! | N/A |
SUB64 | Subtracts one 64-bit value from another 64-bit value to produce a 64-bit result. This code sample is NOT a complete function! | N/A |
DIVMOD32 | This routine divides two 32-bit values and returns their quotient and remainder. The inputs are 32-bit numbers, and the result is a 32-bit number. Cycles (Min execution 16 cycles, Max execution 41 cycles). This code sample is NOT a complete function! | N/A |
DIVMODU32 | This routine divides two unsigned 32-bit values and returns their quotient and remainder. The inputs are unsigned 32-bit numbers, and the result is an unsigned 32-bit number. Cycles (Min execution 18 cycles, Max execution 42 cycles). This code sample is NOT a complete function! | N/A |
MPY32 | This routine takes two 32-bit integer values and calculates their product. The inputs are 32-bit integers, and the result is a 32-bit integer. Cycles (See routine). This code sample is NOT a complete function! | N/A |
MPY3240 | This routine takes two 32-bit integer values and calculates their product. The inputs are 32-bit integers, and the result is a 40-bit integer. Cycles (See routine). This code sample is NOT a complete function! | N/A |
MPYU3240 | This routine takes two 32-bit unsigned integer values and calculates their product. The inputs are 32-bit unsigned integers, and the result is a 40-bit unsigned integer. Cycles (See routine). This code sample is NOT a complete function! | N/A |
MPY40 | This routine takes two 40-bit integer values and calculates their product. The inputs are 40-bit integers, and the result is a 40-bit integer. Cycles (See routine). This code sample is NOT a complete function! | N/A |
MPY3264 | This routine takes two 32-bit integer values and calculates their product. The inputs are 32-bit integers, and the result is a 64-bit integer. Cycles (See routine). | N/A |
MPYU3264 | This routine takes two 32-bit unsigned integer values and calculates their product. The inputs are 32-bit unsigned integers, and the result is a 64-bit unsigned integer. Cycles (See routine). | N/A |
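
The wide-arithmetic entries above build 64-bit results out of narrower pieces. As a hedged illustration of the idea, the sketch below adds two 64-bit values held as 32-bit word pairs and forms an unsigned 32x32 product from 16-bit partial products. The function names and the (low word, high word) layout are assumptions made for this example; this is not the TI code sample.

```python
# Sketch of multi-word arithmetic on 32-bit words (illustrative, not TI code).
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def add64(a_lo, a_hi, b_lo, b_hi):
    """Add two 64-bit values given as (low, high) 32-bit words, wrapping at 64 bits."""
    low_sum = (a_lo & MASK32) + (b_lo & MASK32)
    carry = low_sum >> 32  # 1 if the low words overflowed, else 0
    high_sum = (a_hi & MASK32) + (b_hi & MASK32) + carry
    return low_sum & MASK32, high_sum & MASK32


def mpyu3264(a, b):
    """Unsigned 32x32 -> 64-bit product from four 16x16 partial products.

    Returns the result as (low word, high word).
    """
    a &= MASK32
    b &= MASK32
    a_lo, a_hi = a & MASK16, a >> 16
    b_lo, b_hi = b & MASK16, b >> 16
    product = (a_lo * b_lo) + ((a_lo * b_hi + a_hi * b_lo) << 16) + ((a_hi * b_hi) << 32)
    return product & MASK32, (product >> 32) & MASK32
```

Splitting the operands into 16-bit halves keeps every individual multiply within a 16x16 operation, and the shifts place each partial product at its correct weight.
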
Benchmark | Description | Formula |
---|---|---|
8x8 Block IDCT - IEEE-1180 Compliant | The idct_8x8 algorithm performs an IEEE-1180 compliant IDCT, complete with rounding and saturation to signed 9-bit quantities. The array should be aligned to a 32-bit boundary, and be laid out equivalently to the C array idct_data[num_idcts+1][8][8]. The input coefficients are assumed to be signed 12-bit cosine terms. | Cycles = 62 + 168 * num_idcts for num_idcts >= 1 230 cycles or 1.15 µs for one 8x8 Block of Data |
8x8 Block FDCT With Rounding | The fdct routine accepts a list of 8x8 pixel blocks and performs FDCTs on each. The array should be laid out equivalently to the C array dct_data[num_fdcts+1][8][8]. All operations in this array are performed entirely in-place. Input values are stored in shorts, and may be in the range [-512,511]. Input terms are expected to be signed 11Q0 values, producing signed 15Q0 results. | Cycles = 48 + 160 * num_fdcts for num_fdcts >= 1 208 cycles or 1.04 µs for one 8x8 Block of Data |
Gouraud | Gouraud Shading of a scanline of pixels. Four pixels of a line at a time are processed. (N=pixels >=4, multiple of 4 pixels) | 2N+7 For 1024 pixels taken 4 pixels at a time 2055 cycles or 10.275 µsec |
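
The IDCT row mentions saturation to signed 9-bit quantities and a cycle formula that holds for num_idcts >= 1. Both are simple to express; the sketch below is illustrative only, with hypothetical function names, and does not reproduce the IDCT itself.

```python
# Sketch of the saturation step and the IDCT cycle formula (illustrative only).

def saturate_signed(value, bits):
    """Clamp value to the range of a signed integer with the given bit width."""
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    return max(low, min(high, value))


def idct_cycles(num_idcts):
    """Cycle formula from the table: 62 + 168 * num_idcts, for num_idcts >= 1."""
    if num_idcts < 1:
        raise ValueError("formula is stated for num_idcts >= 1")
    return 62 + 168 * num_idcts
```

For a signed 9-bit output the clamp range is -256 to 255, so `saturate_signed(300, 9)` yields 255.
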
Benchmark | Description | Formula |
---|---|---|
Viterbi Equalization | Viterbi Equalizer - GSM (N=number of data points) | 43N + 2 For N=120 5162 cycles or 25.810 µsec |
Viterbi GSM | Viterbi Channel Decoder (GSM) (N=number of data points) | 38N + 12 + N/4. For N=189: 7242 cycles or 36.21 µsec |
Viterbi IS54 | Viterbi Channel Decoder (IS54) (N=number of data points) | 66.5*N+16. For N=89: 5934 cycles or 29.67 µsec |
Viterbi V.32 | Viterbi V.32 PSTN Trellis Decoder. (N=number of data points) | 64 cycles or 320 nsec |