SPRY288C April 2020 – December 2021 TMS320C28341 , TMS320C28342 , TMS320C28343 , TMS320C28343-Q1 , TMS320C28344 , TMS320C28345 , TMS320C28346 , TMS320C28346-Q1 , TMS320F280021 , TMS320F280021-Q1 , TMS320F280023 , TMS320F280023-Q1 , TMS320F280023C , TMS320F280025 , TMS320F280025-Q1 , TMS320F280025C , TMS320F280025C-Q1 , TMS320F280040-Q1 , TMS320F280040C-Q1 , TMS320F280041 , TMS320F280041-Q1 , TMS320F280041C , TMS320F280041C-Q1 , TMS320F280045 , TMS320F280048-Q1 , TMS320F280048C-Q1 , TMS320F280049 , TMS320F280049-Q1 , TMS320F280049C , TMS320F280049C-Q1 , TMS320F2802 , TMS320F2802-Q1 , TMS320F28020 , TMS320F280200 , TMS320F28021 , TMS320F28022 , TMS320F28022-Q1 , TMS320F280220 , TMS320F28023 , TMS320F28023-Q1 , TMS320F280230 , TMS320F28026 , TMS320F28026-Q1 , TMS320F28026F , TMS320F28027 , TMS320F28027-Q1 , TMS320F280270 , TMS320F28027F , TMS320F28027F-Q1 , TMS320F28030 , TMS320F28030-Q1 , TMS320F28031 , TMS320F28031-Q1 , TMS320F28032 , TMS320F28032-Q1 , TMS320F28033 , TMS320F28033-Q1 , TMS320F28034 , TMS320F28034-Q1 , TMS320F28035 , TMS320F28035-EP , TMS320F28035-Q1 , TMS320F28050 , TMS320F28051 , TMS320F28052 , TMS320F28052-Q1 , TMS320F28052F , TMS320F28052F-Q1 , TMS320F28052M , TMS320F28052M-Q1 , TMS320F28053 , TMS320F28054 , TMS320F28054-Q1 , TMS320F28054F , TMS320F28054F-Q1 , TMS320F28054M , TMS320F28054M-Q1 , TMS320F28055 , TMS320F2806 , TMS320F2806-Q1 , TMS320F28062 , TMS320F28062-Q1 , TMS320F28062F , TMS320F28062F-Q1 , TMS320F28063 , TMS320F28064 , TMS320F28065 , TMS320F28066 , TMS320F28066-Q1 , TMS320F28067 , TMS320F28067-Q1 , TMS320F28069 , TMS320F28069-Q1 , TMS320F28075 , TMS320F28075-Q1 , TMS320F28332 , TMS320F28333 , TMS320F28334 , TMS320F28335 , TMS320F28335-Q1 , TMS320F28374D , TMS320F28374S , TMS320F28375D , TMS320F28375S , TMS320F28375S-Q1 , TMS320F28376D , TMS320F28376S , TMS320F28377D , TMS320F28377D-EP , TMS320F28377D-Q1 , TMS320F28377S , TMS320F28377S-Q1 , TMS320F28379D , TMS320F28379D-Q1 , TMS320F28379S
C2000 is a trademark of Texas Instruments.
All trademarks are the property of their respective owners.
Real-time control systems require fast, efficient processing with minimal latency in order to maintain stability and boost overall performance. In addition, the increasing sophistication of modern motor systems, power electronics, smart grid technology, robotics, and similar applications requires the central processor to handle numerous tasks simultaneously.
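The C2000 family of microcontrollers (MCUs) from Texas Instruments addresses these challenges with an array of integrated on-chip hardware math enhancements that dramatically increase the performance of the MCU in many real-time applications. The five key enhancements are:

- Floating-Point Unit (FPU)
- Control Law Accelerator (CLA)
- Trigonometric Math Unit (TMU)
- Fast Integer Division (FINTDIV)
- Viterbi, Complex Math and CRC Unit (VCU)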
At the center of each C2000 MCU lies a fast fixed-point central processing unit (CPU) that on its own provides excellent 32-bit processing capability. The FPU integrates floating-point hardware seamlessly into the CPU. To augment this further, the CLA provides an independent floating-point processor that runs at the full speed of the device and is designed to perform control law computations with minimal latency, effectively doubling the raw computing capability of the device. The TMU provides hardware support for common trigonometric math functions, while the FINTDIV enables fast integer division operations. The VCU adds hardware support for communications, complex math, and CRC calculations. This paper provides an overview of each of these math enhancements.
Many control system designs start with simulation tools, where the algorithms are developed using floating-point math. These algorithms can then be ported easily to a microcontroller with native floating-point support. Floating-point math provides a large dynamic range, which makes code easier to develop than with fixed-point math: the programmer no longer needs to manage scaling and saturation. Robustness also improves, because floating-point values do not wrap around the number line on overflow or underflow as fixed-point values do. These characteristics provide the high-performance mathematical capabilities needed for advanced control systems. In addition, the C2000 MCU architecture is optimized for high-level language programming and is supported by a complete set of TI development tools.
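To make the wraparound point concrete, here is a minimal host-side C sketch (standard C, not C2000-specific) that contrasts a 16-bit Q15 fixed-point add, which wraps on overflow, with the same add in single-precision float. The Q15 format and the helper names `wrap_q15`, `q15_add`, `float_add`, and `float_overflow_saturates` are illustrative assumptions, not part of any TI library.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>

/* Hypothetical helper: keep the low 16 bits of x and read them as a
   two's-complement int16, which is what a 16-bit fixed-point register holds
   after an add that overflows. Unsigned masking keeps this well defined. */
int32_t wrap_q15(int32_t x)
{
    uint32_t u = (uint32_t)x & 0xFFFFu;
    return (u >= 0x8000u) ? (int32_t)u - 0x10000 : (int32_t)u;
}

/* Q15 add with no saturation logic, returned in real units. */
double q15_add(double a, double b)
{
    assert(a >= -1.0 && a < 1.0 && b >= -1.0 && b < 1.0);  /* Q15 range */
    int32_t qa = (int32_t)(a * 32768.0);   /* scale to Q15 */
    int32_t qb = (int32_t)(b * 32768.0);
    return wrap_q15(qa + qb) / 32768.0;    /* scale back to real units */
}

/* The same add in single-precision float. */
float float_add(float a, float b)
{
    return a + b;
}

/* Float overflow saturates to +infinity instead of wrapping negative;
   only +inf compares greater than FLT_MAX. */
int float_overflow_saturates(void)
{
    volatile float big = FLT_MAX;   /* volatile keeps the multiply at run time */
    float result = big * 2.0f;
    return result > FLT_MAX;
}
```

For example, `q15_add(0.75, 0.5)` returns -0.75 because the Q15 sum wraps past the sign bit, whereas `float_add(0.75f, 0.5f)` returns 1.25. Preventing that sign flip is the saturation logic a fixed-point design has to add by hand.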
The C2000 MCUs feature a C28x CPU built around a 32-bit fixed-point, accumulator-based architecture that combines the best features of digital signal processor and microcontroller architectures. Adding the FPU to the C28x fixed-point CPU enables C2000 MCUs to perform IEEE-754 single-precision floating-point operations in hardware. Devices with the C28x+FPU extend the standard C28x architecture with additional floating-point registers and instructions. The additional registers are eight floating-point result registers, a floating-point status register, and a repeat block register. The repeat block register supports zero-overhead looping over a block of instructions, which is more flexible than repeating a single instruction. All of these registers except the repeat block register are shadowed. Shadowing enables fast context save and restore of the floating-point registers for high-priority interrupts.
Some C2000 MCUs are available with an FPU64, which provides hardware support for both IEEE-754 single-precision and double-precision floating-point operations. Devices with the C28x+FPU64 use the same registers as the FPU, plus eight floating-point result extension registers for double-precision operations. The FPU64 supports all existing FPU single-precision instructions in addition to the 64-bit double-precision instructions.
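One way to see what double-precision hardware buys is the 24-bit significand of a single-precision float. The host-side C sketch below assumes `float` and `double` are IEEE-754 binary32 and binary64 (checked at compile time); it shows an accumulator that stops advancing in single precision but not in double. How C types map to FPU64 instructions on a device depends on compiler and ABI settings, which this sketch does not cover, and the function names are hypothetical.

```c
#include <assert.h>
#include <float.h>

/* This sketch assumes IEEE-754 binary32 float and binary64 double. */
static_assert(FLT_MANT_DIG == 24 && DBL_MANT_DIG == 53,
              "expects IEEE-754 single and double precision");

/* Add 1 in single precision. Above 2^24 (16777216) adjacent floats are 2 apart,
   so x + 1 rounds back to x and an accumulator silently stops advancing. */
float add_one_f(float x)
{
    volatile float r = x + 1.0f;   /* volatile forces rounding to float */
    return r;
}

/* The same add in double precision keeps the increment (53-bit significand). */
double add_one_d(double x)
{
    return x + 1.0;
}
```

For instance, `add_one_f(16777216.0f)` returns 16777216.0f unchanged, while `add_one_d(16777216.0)` returns 16777217.0, which is why long-running integrators or time bases may call for double precision.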
The compiler tools provide C programming support for the CPU, which makes it easy both to write new software and to port existing code. The FPU instructions extend the standard C28x instruction set; most execute in one or two pipeline cycles, and some can be executed in parallel. The FPU64 64-bit instructions execute in one to three pipeline cycles, and some of these can also be executed in parallel. Floating-point hardware substantially increases the computational horsepower available to signal processing and control algorithms. The following table compares cycle counts for FPU, FPU64, and fixed-point implementations of common functions; the 64-point FIR filter is the one case where the fixed-point version is faster.
Function | Type | FPU Cycles | FPU64 Cycles | Fixed-Point Cycles | Improvement/Comments |
---|---|---|---|---|---|
Complex FFT | 512 pt | 24243 | 43935 | 63192 | 2.61x (FPU) / 1.44x (FPU64) vs fixed-point |
Complex FFT | 1024 pt | 53219 | 98683 | 141037 | 2.65x (FPU) / 1.43x (FPU64) vs fixed-point |
Real FFT | 512 pt | 13670 | 20219 | 34513 | 2.52x (FPU) / 1.71x (FPU64) vs fixed-point |
Real FFT | 1024 pt | 30352 | 45476 | 76262 | 2.51x (FPU) / 1.68x (FPU64) vs fixed-point |
Square root | Compiler intrinsic | 22 | 22 | 64 | 2.91x (FPU/FPU64) vs fixed-point; both modes use 32-bit floating-point arguments |
Finite impulse response (FIR) | 64 pt | 119 | 280 | 111 | 0.93x (FPU) / 0.40x (FPU64) vs fixed-point; FIR algorithms use circular addressing mode |
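The FIR row refers to filters whose delay line uses circular addressing. As a rough, portable illustration of that structure (not the benchmarked code), here is a single-precision FIR in C whose delay line is a software circular buffer. The names `fir_f32`, `fir_init`, and `fir_step` and the 64-tap limit are hypothetical choices for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define FIR_MAX_TAPS 64   /* hypothetical limit, matching the 64-point case */

/* Hypothetical FIR state: coefficients plus a circular delay line. */
typedef struct {
    const float *coeffs;               /* h[0..ntaps-1] */
    float        delay[FIR_MAX_TAPS];  /* past input samples */
    size_t       ntaps;
    size_t       head;                 /* slot that receives the newest sample */
} fir_f32;

void fir_init(fir_f32 *f, const float *coeffs, size_t ntaps)
{
    assert(ntaps > 0 && ntaps <= FIR_MAX_TAPS);
    f->coeffs = coeffs;
    f->ntaps  = ntaps;
    f->head   = 0;
    for (size_t i = 0; i < ntaps; ++i)
        f->delay[i] = 0.0f;
}

/* y[n] = sum_k h[k] * x[n-k]. The newest sample overwrites the oldest slot,
   so no data is shifted; the index wraps instead, which is the bookkeeping
   that circular addressing performs in hardware. */
float fir_step(fir_f32 *f, float x)
{
    f->delay[f->head] = x;
    float acc = 0.0f;
    size_t idx = f->head;                              /* x[n] */
    for (size_t k = 0; k < f->ntaps; ++k) {
        acc += f->coeffs[k] * f->delay[idx];
        idx = (idx == 0) ? f->ntaps - 1 : idx - 1;     /* step back to x[n-k-1] */
    }
    f->head = (f->head + 1 == f->ntaps) ? 0 : f->head + 1;
    return acc;
}
```

With coefficients {1, 2, 3}, the input impulse 1, 0, 0, 0 produces 1, 2, 3, 0: the impulse response is the coefficient list itself.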
Extremely high-performance computation and efficient processing are critical for solving today’s complex real-time control applications. Real-time control systems require minimal latency: the time delay between sampling, processing, and outputting must fit within a tight window to meet performance objectives. For example, a typical digital power controller consists of an ADC to read the input signals (voltage and current), a math engine to compute the control law algorithms (PID, 2-pole/2-zero, and 3-pole/3-zero compensators), and a PWM channel to output the calculated waveform. Many advanced control systems would benefit greatly from an architecture that integrates these functions so as to minimize latency, yielding the smallest possible sample-to-output delay. Ideally, this architecture would execute time-critical control loops concurrently with the main CPU, freeing it to perform other required tasks. In addition, the architecture must have a built-in protection mechanism to guard against over-current and over-voltage conditions. To address these requirements, TI developed the CLA.
The CLA is a fully programmable, independent 32-bit floating-point hardware accelerator designed for math-intensive computations. It can significantly boost the performance of math functions commonly found in control algorithms. The CLA executes real-time control algorithms in parallel with the C28x CPU, effectively doubling the computational performance. This makes the CLA well suited to managing low-level control loops, which it executes in fewer cycles than the C28x CPU. Another advantage is that the CLA addresses memory directly, so there is no overhead for managing a data page pointer. Additionally, the CLA multiplier does not require any delay slots, providing true single-cycle performance. A device using the CLA can achieve about a 1.3 times performance improvement over the C28x CPU for applications such as motor control and solar; the following table shows results for motor control and digital power control loops. Furthermore, using the CLA to service time-critical functions frees the C28x CPU for other tasks, such as communications and diagnostics.
Application | CPU Execution Cycles (Min/Max) | CLA Execution Cycles (Min/Max) | Improvement |
---|---|---|---|
Motor AC Induction | 888/952 | 639/694 | 1.39x (vs CPU) |
Power CNTL 2p2z | 48 | 39 | 1.23x (vs CPU) |
Power CNTL 3p3z | 68 | 52 | 1.31x (vs CPU) |
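The Power CNTL rows measure 2-pole/2-zero and 3-pole/3-zero compensators. A 2p2z compensator is a second-order difference equation; the sketch below writes it in direct form I using one common sign convention, u[n] = b0 e[n] + b1 e[n-1] + b2 e[n-2] - a1 u[n-1] - a2 u[n-2], with output clamping. It is portable C with hypothetical names (`ctrl_2p2z`, `ctrl_2p2z_run`), not the implementation behind the cycle counts above.

```c
#include <assert.h>

/* Hypothetical 2-pole/2-zero compensator state (direct form I). */
typedef struct {
    float b0, b1, b2;   /* numerator (zero) coefficients */
    float a1, a2;       /* denominator (pole) coefficients */
    float e1, e2;       /* e[n-1], e[n-2] */
    float u1, u2;       /* u[n-1], u[n-2] */
    float umin, umax;   /* output limits, e.g. an allowed duty-cycle range */
} ctrl_2p2z;

/* One control-loop step: take the error sample, return the clamped output.
   Storing the clamped value in the history is a simple form of anti-windup. */
float ctrl_2p2z_run(ctrl_2p2z *c, float e)
{
    assert(c->umin <= c->umax);
    float u = c->b0 * e + c->b1 * c->e1 + c->b2 * c->e2
            - c->a1 * c->u1 - c->a2 * c->u2;
    if (u > c->umax) u = c->umax;
    if (u < c->umin) u = c->umin;
    c->e2 = c->e1;  c->e1 = e;   /* shift error history */
    c->u2 = c->u1;  c->u1 = u;   /* shift output history */
    return u;
}
```

With b0 = 1, a1 = -1, and all other coefficients zero, the structure reduces to a discrete integrator: a constant error of 1 produces 1, 2, and then 2.5 once an umax of 2.5 engages. A 3p3z version adds one more term to each history.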
Another key benefit of the CLA over hardware-based control law implementations is flexibility. The CLA is a fully software-programmable solution, so developers can freely modify their control system without the time and high cost required to redesign a hardware-based solution. In addition to these benefits, the CLA can perform compute-intensive functions such as complex and real FFTs. Table 3-2 lists the cycle counts.
Function | Type | CLA Cycles |
---|---|---|
Complex FFT | 256 pt | 27323 |
Complex FFT | 512 pt | 64538 |
Complex FFT | 1024 pt | 133881 |
Real FFT | 512 pt | 37537 |
Real FFT | 1024 pt | 85012 |
The CLA minimizes latency because it has direct access to the control peripherals, such as the ADC and PWM modules. This low-latency architecture and direct peripheral access provide a fast trigger response. The CLA can read the ADC result register on the same cycle that the ADC sample conversion completes. This “just-in-time” reading of the ADC reduces the sample-to-output delay and enables faster system response for higher-frequency control loops.
Programming the CLA consists of initialization code and tasks. A task is similar to an interrupt service routine; once started, it runs to completion. Each task can be triggered by a variety of peripherals without CPU intervention. This makes the CLA very efficient, since it does not use interrupts for hardware synchronization and does not need to perform any context switching. Compared with the traditional interrupt-based scheme, the CLA approach eliminates jitter and makes execution time deterministic. The CLA supports eight independent tasks, each mapped to an event trigger such as a timer or the availability of an ADC result. Separate tasks can be used to support multiple control loops or phases at the same time.
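The task arbitration described above can be modeled in a few lines. The sketch below is a host-side C model, not CLA code and not a TI register API: it assumes a hypothetical pending-flag bitmask in which bit k-1 is set when task k has been triggered, and the CLA's fixed priority order in which task 1 is highest and task 8 lowest. Because tasks run to completion, the next task is chosen only after the current one ends. The background-task option of the enhanced CLA is not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define CLA_NUM_TASKS 8

/* Hypothetical model: bit (k-1) of 'pending' is set when task k has been
   triggered; lower task numbers have higher priority.
   Returns the task to start next (1..8), or 0 if nothing is pending. */
int cla_next_task(uint8_t pending)
{
    for (int k = 1; k <= CLA_NUM_TASKS; ++k) {
        if (pending & (1u << (k - 1)))
            return k;
    }
    return 0;
}

/* Run-to-completion model: each selected task finishes before the next one
   is chosen, so pending tasks execute strictly in priority order.
   Records the execution order and returns the number of tasks run. */
int cla_run_pending(uint8_t pending, int order[CLA_NUM_TASKS])
{
    int n = 0;
    int k;
    while ((k = cla_next_task(pending)) != 0) {
        pending &= (uint8_t)~(1u << (k - 1));   /* model: clear flag as the task starts */
        assert(n < CLA_NUM_TASKS);
        order[n++] = k;                          /* the task body would run here */
    }
    return n;
}
```

With tasks 1, 6, and 8 all pending (mask 0xA1), the model runs them in the order 1, 6, 8.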
Some C2000 devices feature an enhanced version of the CLA with the option of running the lowest-priority task as a background task. Once triggered, the background task runs continuously until it is terminated or reset by the CLA or the MCU. The remaining tasks can interrupt the background task, in priority order, when they are triggered. If needed, portions of the background task can be made uninterruptible. Typical uses of the background task include continuous functions such as communications and clean-up routines.
The TMU is an extension of the FPU that enhances the instruction set of the C28x+FPU by efficiently executing trigonometric and arithmetic operations commonly used in control system applications. Like the FPU, the TMU is an IEEE-754 floating-point math unit tightly coupled with the CPU. However, whereas the FPU provides general-purpose floating-point math support, the TMU accelerates several specific operations that would otherwise be cycle intensive: sine, cosine, arctangent, division, and square root. Some C2000 devices include an enhanced version of the TMU that supports nonlinear PID applications, with additional instructions for efficient computation of the logarithm and inverse exponent operations used in the nonlinear control law. The TMU instructions include:
Operation | C Equivalent Operation |
---|---|
Multiply by 2*pi | a = b * 2 * pi |
Divide by 2*pi | a = b / (2 * pi) |
Divide | a = b / c |
Square Root | a = sqrt(b) |
Sin Per Unit | a = sin(b * 2 * pi) |
Cos Per Unit | a = cos(b * 2 * pi) |
Arc Tangent Per Unit | a = atan(b) / (2 * pi) |
Arc Tangent 2 and Quadrant Operation | Operation to assist in calculating ATANPU2 |
Logarithm | a = log2(b) |
Inverse Exponent | a = exp2(-fabs(b)), that is, 2 to the power of -abs(b) |
The TMU uses the same pipeline, memory bus architecture, and FPU registers as the C28x+FPU, thereby removing any special requirements for interrupt context save or restore.
The C2000 compiler has built-in support for automatically generating TMU instructions. The user writes C code using math.h functions, and the compiler emits TMU instructions, where applicable, instead of run-time support library calls. This significantly reduces the cycle count of trigonometric operations.
The TMU can have a significant impact on many commonly used real-time control algorithms. For example, a Park transform typically takes anywhere from 80 to more than 100 cycles to execute on the FPU alone. With the TMU, a Park transform takes only 13 cycles, roughly an 85 percent reduction in cycle count.
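A Park transform rotates the stationary-frame components (alpha, beta) into the rotating d-q frame using the sine and cosine of the rotor angle. The sketch below is portable C with hypothetical names (`ab_frame`, `dq_frame`, `park`, `ipark`); it takes sin and cos as inputs so it runs on any host. On a TMU-equipped device those inputs would typically come from math.h calls on a per-unit angle, such as sinf(theta_pu * 2 * pi), which the compiler can implement with the TMU per-unit sine and cosine instructions where applicable; that step is assumed and not shown here.

```c
#include <assert.h>

typedef struct { float alpha, beta; } ab_frame;   /* stationary (alpha-beta) frame */
typedef struct { float d, q; } dq_frame;           /* rotating (d-q) frame */

/* Park transform using precomputed sin(theta) and cos(theta):
     d =  alpha*cos(theta) + beta*sin(theta)
     q = -alpha*sin(theta) + beta*cos(theta)                        */
dq_frame park(ab_frame in, float sin_th, float cos_th)
{
    /* Debug-only sanity check that the inputs lie on the unit circle. */
    float r2 = sin_th * sin_th + cos_th * cos_th;
    assert(r2 > 0.99f && r2 < 1.01f);
    (void)r2;   /* avoid an unused-variable warning when NDEBUG is set */

    dq_frame out;
    out.d =  in.alpha * cos_th + in.beta * sin_th;
    out.q = -in.alpha * sin_th + in.beta * cos_th;
    return out;
}

/* Inverse Park transform: rotates d-q back into the stationary frame. */
ab_frame ipark(dq_frame in, float sin_th, float cos_th)
{
    ab_frame out;
    out.alpha = in.d * cos_th - in.q * sin_th;
    out.beta  = in.d * sin_th + in.q * cos_th;
    return out;
}
```

At theta = 0 (sin 0, cos 1) the transform passes (alpha, beta) = (2, 3) through unchanged; at theta = 90 degrees (sin 1, cos 0) it yields (d, q) = (3, -2), and `ipark` recovers (2, 3). The trigonometric evaluation, not the multiply-adds shown here, dominates the FPU-only cycle count.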
In typical system applications such as digital motor control (AC induction and permanent magnet) and 3-phase solar, the TMU delivers roughly a 1.3 to 1.4 times performance improvement over the FPU alone.
Application | FPU Execution Cycles (Min/Max) | TMU Execution Cycles (Min/Max) | Improvement |
---|---|---|---|
Motor AC Induction | 888/952 | 593/670 | 1.42x (vs FPU) |
Motor Permanent Magnet | 783/786 | 547/592 | 1.32x (vs FPU) |
Solar 3-Phase | 1351/1358 | 985/983 | 1.38x (vs FPU) |
An existing C28x design can gain an immediate advantage from the TMU without rewriting any code, and code generated from simulation tools realizes the same benefits. Portability is maintained because the same code runs on TI MCUs with and without TMU support.