Enhancing the Computational Performance of the C2000 Microcontroller Family

Trademarks

C2000 is a trademark of Texas Instruments.

All trademarks are the property of their respective owners.

1 Introduction

Real-time control systems require fast, efficient processing with minimal latency in order to maintain stability and boost overall performance. In addition, the increasing sophistication of modern motor systems, power electronics, smart grid technology, robotics, and similar applications requires the central processor to keep up with numerous tasks simultaneously.

The C2000 family of microcontrollers (MCUs) from Texas Instruments addresses these challenges with an array of integrated on-chip hardware math enhancements that dramatically increase the performance of the MCU in many real-time applications. The five key enhancements are:

Floating-Point Unit (FPU)
Control Law Accelerator (CLA)
Trigonometric Math Unit (TMU)
Fast Integer Division Unit (FINTDIV)
Viterbi, Complex Math, and CRC Unit (VCU)

Figure 1-1 System Block Diagram with Math Enhancements

At the center of each C2000 MCU lies a fast fixed-point central processing unit (CPU) that on its own provides excellent 32-bit processing capabilities. The FPU provides seamless integration of floating-point hardware into the CPU. To augment this further, the CLA provides an independent floating-point CPU operating at the full speed of the device and it is designed to perform control law computations with minimal latency. This effectively doubles the raw computing capabilities of the device. The TMU provides hardware support for common trigonometric math functions, while the FINTDIV enables fast integer division operations. The VCU adds hardware support for communications, complex math, and CRC calculations. This paper provides an overview of each of these math enhancements.

2 Floating-Point Unit (FPU)

Many control system designs start with simulation tools, where the algorithms are developed with floating-point math. These algorithms can then easily be ported to a microcontroller that has native floating-point math support. Floating-point math provides a large dynamic range, which makes code easier to develop than with fixed-point math: the programmer no longer needs to worry about scaling and saturation. Robustness also improves, since floating-point values do not wrap around the number line on an overflow or underflow, as they would in fixed-point math. These characteristics enable the high-performance mathematical capabilities that are needed for advanced control systems. In addition, the C2000 MCU architecture has been optimized to support high-level language programming, along with seamless support from a complete set of TI development tools.
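The wrap-around behavior described above can be illustrated with a minimal host-side C sketch. It assumes a Q15 format (a 16-bit integer scaled by 2^15, so values span [-1, 1)); the type and function names are hypothetical and are not part of any TI library.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed format: Q15, where real value = raw / 32768 and range is [-1, 1). */
typedef int16_t q15_t;

/* Unprotected add: the sum wraps around the number line on overflow.
 * The narrowing conversion back to int16_t is implementation-defined in
 * ISO C; it wraps as two's complement on mainstream compilers. */
static q15_t q15_add_wrap(q15_t a, q15_t b)
{
    return (q15_t)(uint16_t)((uint16_t)a + (uint16_t)b);
}

/* Protected add: the programmer must clamp (saturate) explicitly. */
static q15_t q15_add_sat(q15_t a, q15_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (q15_t)sum;
}

/* Convert a Q15 raw value to the real number it represents. */
static float q15_to_float(q15_t x)
{
    return (float)x / 32768.0f;
}

/* Usage: 0.75 + 0.75 does not fit in Q15 but is trivial in float. */
void q15_demo(void)
{
    const q15_t a = 24576;                              /* 0.75 in Q15 */
    assert(q15_to_float(q15_add_wrap(a, a)) == -0.5f);  /* wrapped */
    assert(q15_add_sat(a, a) == INT16_MAX);             /* clamped just below 1.0 */
    assert(0.75f + 0.75f == 1.5f);                      /* float: no scaling needed */
}
```

In the fixed-point version the choice between wrapping and saturating is the programmer's responsibility at every operation; the floating-point addition needs neither.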

The C2000 MCUs feature a C28x CPU that is designed around a 32-bit fixed-point accumulator-based architecture. It combines the best features of digital signal processor and microcontroller architectures. The addition of the FPU to the C28x fixed-point CPU enables the C2000 MCUs to support IEEE-754 single-precision floating-point operations in hardware. Devices with the C28x+FPU add an extended set of floating-point registers and instructions to the standard C28x architecture. The additional registers are eight floating-point result registers, a floating-point status register, and a repeat block register. The repeat block feature adds zero-overhead looping over a block of instructions, which is more flexible than the repeat-single instruction. All of the registers are shadowed, except the repeat block register. Shadowing is useful with high-priority interrupts for fast context save and restore of the floating-point registers.

Some C2000 MCUs are available with an FPU64 that provides hardware support for both IEEE-754 single-precision and double-precision floating-point operations. Devices with the C28x+FPU64 use the same registers as the FPU, with the addition of eight floating-point result extension registers for double-precision operations. The FPU64 supports all existing FPU single-precision floating-point instructions in addition to the 64-bit double-precision floating-point instructions.

The compiler tools provide C programming support for the CPU which makes it easy to write software, in addition to porting existing code. Since the FPU instructions are extensions of the standard C28x instruction set, most instructions operate in one or two pipeline cycles and some can be done in parallel. The FPU64 64-bit instructions operate in one to three pipeline cycles and some can be done in parallel, too. Floating-point performance dramatically enhances the mathematical computation horsepower used in signal processing and control algorithms.

Table 2-1 FPU Performance Improvements

Function	Type	FPU Cycles	FPU64 Cycles	Fixed-Point Cycles	Improvements/Comments
Complex FFT	512 pt	24243	43935	63192	2.61x (FPU) / 1.44x (FPU64) vs Fixed-Point
	1024 pt	53219	98683	141037	2.65x (FPU) / 1.43x (FPU64) vs Fixed-Point
Real FFT	512 pt	13670	20219	34513	2.52x (FPU) / 1.71x (FPU64) vs Fixed-Point
	1024 pt	30352	45476	76262	2.51x (FPU) / 1.68x (FPU64) vs Fixed-Point
Square Root	Compiler intrinsic	22	22	64	2.91x (FPU/FPU64) vs Fixed-Point – both modes use 32-bit floating-point arguments
Finite impulse response (FIR)	64 pts	119	280	111	0.93x (FPU) / 0.40x (FPU64) vs Fixed-Point – FIR algorithms using circular addressing mode
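
The improvement factors in Table 2-1 appear to be the fixed-point cycle count divided by the FPU (or FPU64) cycle count, so a value below 1, as in the FIR row, means the fixed-point version needs fewer cycles. A small C sketch of that arithmetic follows; the helper name is hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Speedup relative to a baseline: baseline cycles / candidate cycles. */
static double speedup(unsigned long baseline_cycles, unsigned long candidate_cycles)
{
    assert(candidate_cycles != 0);
    return (double)baseline_cycles / (double)candidate_cycles;
}

/* Usage: reproduce two entries of Table 2-1 to two decimal places. */
void speedup_demo(void)
{
    /* 512-point complex FFT: 63192 fixed-point vs 24243 FPU cycles. */
    assert(fabs(speedup(63192UL, 24243UL) - 2.61) < 0.005);
    /* 64-point FIR: 111 fixed-point vs 119 FPU cycles (below 1.0). */
    assert(fabs(speedup(111UL, 119UL) - 0.93) < 0.005);
}
```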

3 Control Law Accelerator (CLA)

Extremely high-performance computation and efficient processing are critical for today’s complex real-time control applications. Real-time control systems require minimal latency: the time delay between sampling, processing, and outputting must fit within a tight time window in order to meet performance objectives. For example, a typical digital power controller consists of an ADC to read the input signals (voltage and current), a math engine to compute the control law algorithms (PID, 2-pole/2-zero, and 3-pole/3-zero compensators), and a PWM channel to output the calculated waveform. Many advanced control systems would greatly benefit from an architecture that integrates these functions in such a way as to minimize latency, yielding the absolute minimum sample-to-output delay. Ideally, this architecture would execute time-critical control loops concurrently with the main CPU and free it up to perform other required tasks. In addition, the architecture must have a built-in protection mechanism to guard against over-current and over-voltage conditions. To address these important requirements, TI developed the CLA.

The CLA is a fully programmable, independent 32-bit floating-point hardware accelerator that is designed for math-intensive computations. This accelerator can offer a significant boost to the performance of typical math functions that are commonly found in control algorithms. The CLA is designed to execute real-time control algorithms in parallel with the C28x CPU, effectively doubling the computational performance. This makes the CLA well suited to running low-level control loops, which it completes in fewer cycles than the C28x CPU. Another advantage of the CLA is that, since it directly accesses memory, the overhead penalty for managing a data page pointer is removed. Additionally, the multiplier on the CLA does not require any delay slots, thus providing true single-cycle performance. A device using the CLA can achieve about a 1.3 times performance improvement over the C28x CPU for applications such as motor control and solar; Table 3-1 gives cycle counts for motor and power control examples. Furthermore, by using the CLA to service time-critical functions, the C28x CPU is freed up for other tasks, such as communications and diagnostics.

Table 3-1 CLA Performance Improvements

Application	CPU Cycles (Min/Max)	CLA Cycles (Min/Max)	Improvement
Motor AC Induction	888/952	639/694	1.39x (vs CPU)
Power CNTL 2p2z	48	39	1.23x (vs CPU)
Power CNTL 3p3z	68	52	1.31x (vs CPU)
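
For reference, the 2p2z (2-pole/2-zero) compensator benchmarked in Table 3-1 is a second-order difference equation. The following C sketch assumes the form u[n] = b0*e[n] + b1*e[n-1] + b2*e[n-2] - a1*u[n-1] - a2*u[n-2] with an output clamp; sign conventions vary between libraries, so this is an illustration and not the benchmarked code. The struct and function names are hypothetical, and the coefficients in the usage function are arbitrary test values, not a tuned design.

```c
#include <assert.h>

/* Assumed 2-pole/2-zero compensator state and coefficients. */
typedef struct {
    float b0, b1, b2;    /* numerator (zero) coefficients */
    float a1, a2;        /* denominator (pole) coefficients */
    float e1, e2;        /* previous two error samples */
    float u1, u2;        /* previous two outputs */
    float u_min, u_max;  /* output clamp, e.g. PWM duty limits */
} Comp2p2z;

/* Process one error sample and return the clamped controller output. */
static float comp2p2z_step(Comp2p2z *c, float e)
{
    float u = c->b0 * e + c->b1 * c->e1 + c->b2 * c->e2
            - c->a1 * c->u1 - c->a2 * c->u2;

    if (u > c->u_max) u = c->u_max;
    if (u < c->u_min) u = c->u_min;

    /* Shift the delay line; the clamped output is what gets remembered. */
    c->e2 = c->e1;  c->e1 = e;
    c->u2 = c->u1;  c->u1 = u;
    return u;
}

/* Usage: constant error of 1.0 with exactly representable coefficients. */
void comp2p2z_demo(void)
{
    Comp2p2z c = { .b0 = 1.0f, .b1 = 0.5f, .b2 = 0.25f,
                   .a1 = -0.5f, .a2 = 0.0f,
                   .u_min = -10.0f, .u_max = 10.0f };
    assert(comp2p2z_step(&c, 1.0f) == 1.0f);
    assert(comp2p2z_step(&c, 1.0f) == 2.0f);
    assert(comp2p2z_step(&c, 1.0f) == 2.75f);
}
```

Each step is five multiplies and four additions plus the clamp, which is consistent in scale with the tens of cycles reported for the 2p2z entry.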

Another key benefit of the CLA over hardware-based control law implementations is flexibility. The CLA is a fully software-programmable solution, so developers can freely modify their control system without the time and high cost required to redesign a hardware-based solution. In addition to these benefits, the CLA can perform compute-intensive functions such as complex and real FFTs. Table 3-2 lists the cycle counts.

Table 3-2 CLA Performance for FFT

Function	Type	Cycles
Complex FFT	256 pt	27323
	512 pt	64538
	1024 pt	133881
Real FFT	512 pt	37537
Real FFT	1024 pt	85012

The CLA is able to minimize latency because it has direct access to control peripherals such as the ADC and PWM modules, which in turn provides a fast trigger response. The CLA is able to read the ADC result register on the same cycle that the ADC sample conversion is completed. This “just-in-time” reading of the ADC reduces the sample-to-output delay and enables faster system response for higher frequency control loops.

Programming the CLA consists of initialization code and tasks. A task is similar to an interrupt service routine, and once started it runs to completion. Each task can be triggered by a variety of peripherals without CPU intervention. This makes the CLA very efficient, since it does not use interrupts for hardware synchronization and never has to perform context switching. Compared with the traditional interrupt-based scheme, the CLA approach eliminates jitter and makes the execution time deterministic. The CLA supports eight independent tasks, each of which is mapped to an event trigger, such as a timer or the availability of an ADC result. Separate tasks can be used to support multiple control loops or phases at the same time.
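
The run-to-completion task model described above can be sketched as a small host-side C simulation. This is a conceptual model only, not CLA code or a TI API: all names are hypothetical, and it assumes that lower-numbered tasks have higher priority and that a pending trigger is latched as one flag bit per task.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TASKS 8u

typedef void (*TaskFn)(void);

typedef struct {
    TaskFn   task[NUM_TASKS]; /* task[0] models Task 1, assumed highest priority */
    uint16_t pending;         /* bit i set: a trigger is latched for task[i] */
} TaskModel;

/* A peripheral event (e.g. ADC result ready) latches a trigger flag. */
static void task_trigger(TaskModel *m, unsigned i)
{
    assert(i < NUM_TASKS);
    m->pending |= (uint16_t)(1u << i);
}

/* Run the highest-priority pending task to completion.
 * Returns the index that ran, or -1 if nothing was pending. */
static int task_dispatch(TaskModel *m)
{
    for (unsigned i = 0; i < NUM_TASKS; ++i) {
        if (m->pending & (1u << i)) {
            m->pending &= (uint16_t)~(1u << i);
            if (m->task[i] != NULL) {
                m->task[i]();  /* no preemption: runs until it returns */
            }
            return (int)i;
        }
    }
    return -1;
}

/* Hypothetical task body standing in for a control loop. */
static int current_loop_runs;
static void current_loop_task(void) { current_loop_runs++; }

/* Usage: two triggers arrive; the lower-numbered task is served first. */
void task_demo(void)
{
    TaskModel m = { .task = { NULL }, .pending = 0 };
    current_loop_runs = 0;
    m.task[2] = current_loop_task;
    task_trigger(&m, 5);
    task_trigger(&m, 2);
    assert(task_dispatch(&m) == 2);
    assert(current_loop_runs == 1);
    assert(task_dispatch(&m) == 5);
    assert(task_dispatch(&m) == -1);
}
```

Because a task is never preempted in this model, there is no register state to save or restore between tasks, which is the property the text credits for the absence of context-switching overhead.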

Some C2000 devices feature an enhanced version of the CLA with the option of running the lowest priority task as a background task. Once triggered, it runs continuously until it is terminated or reset by the CLA or MCU. The remaining tasks in priority order can interrupt the background task when they are triggered. If needed, portions of the background task can be made uninterruptible. Typical uses of the background task include running continuous functions, such as communications and clean-up routines.

4 Trigonometric Math Unit (TMU)

The TMU is an extension of the FPU and enhances the instruction set of the C28x+FPU by efficiently executing trigonometric and arithmetic operations that are commonly used in control system applications. Similar to the FPU, the TMU is an IEEE-754 floating-point math unit tightly coupled with the CPU. However, where the FPU provides general-purpose floating-point math support, the TMU focuses on accelerating several specific trigonometric math operations that would otherwise be quite cycle intensive. These operations include sine, cosine, arctangent, divide, and square root. Some C2000 devices include an enhanced version of the TMU for supporting nonlinear PID applications. Additional instructions have been added for efficient computation of logarithm and inverse exponent operations which are used in the nonlinear control law. The TMU instructions include:

Table 4-1 TMU Supported Instructions Summary

Operation	C Equivalent Operation
Multiply by 2*pi	a = b * 2pi
Divide by 2*pi	a = b / 2pi
Divide	a = b / c
Square Root	a = sqrt(b)
Sin Per Unit	a = sin(b*2pi)
Cos Per Unit	a = cos(b*2pi)
Arc Tangent Per Unit	a = atan(b)/2pi
Arc Tangent 2 and Quadrant Operation	Operation to assist in calculating ATANPU2
Logarithm	a = log2(b)
Inverse Exponent	a = 2^(-|b|)

The TMU uses the same pipeline, memory bus architecture, and FPU registers as the C28x+FPU, thereby removing any special requirements for interrupt context save or restore.

The C2000 compiler has built-in support that allows automatic generation of the TMU instructions. The user writes code in C using math.h functions, and the compiler uses the TMU instructions, where applicable, instead of run-time support library calls. This results in significantly fewer cycles and dramatically increases the performance of trigonometric operations.

The TMU can have a significant impact on many commonly used real-time control algorithms such as:

Park and Inverse Park Transforms
Space Vector Generation
dq0 and Inverse dq0 Transforms
FFT Magnitude and Phase Calculations

For example, a Park Transform typically takes anywhere from 80 to more than 100 cycles to execute on the FPU. With the TMU, a Park Transform takes only 13 cycles, a reduction in cycle count of roughly 85 percent.

Figure 4-1 TMU Performance Improvement for Park Transform Example

In a typical system application, such as digital motor control (AC induction and permanent magnet) and 3-phase solar applications, about a 1.4 times performance improvement can be achieved using the TMU over just the FPU.

Table 4-2 TMU Performance Improvements

Application	FPU Cycles (Min/Max)	TMU Cycles (Min/Max)	Improvement
Motor AC Induction	888/952	593/670	1.42x (vs FPU)
Motor Permanent Magnet	783/786	547/592	1.32x (vs FPU)
Solar 3-Phase	1351/1358	985/983	1.38x (vs FPU)

An existing C28x design can realize an immediate advantage using the TMU without the need to rewrite any code. Simulation-based generated code can realize the same benefits. Portability is maintained since the same code can be used on TI MCUs with and without the TMU support.