SPRAD74 March 2023 AM62A3, AM62A3-Q1, AM62A7, AM62A7-Q1
TI's processors use a state-of-the-art deep learning accelerator design. TI has a long history of digital signal processors (DSPs), which have become increasingly integrated into other SoCs at TI; however, a DSP alone is insufficient for most vision deep learning models. Our deep learning accelerator is a tight coupling of a C7x DSP and a custom matrix-multiply accelerator (MMA), which massively increases performance on neural networks (NNs), especially the convolutional NNs (CNNs) common in vision AI.
The AM62A's deep learning accelerator uses a 256-bit C7x DSP and an MMA capable of performing 32x32 matrix multiplies on 8-bit integer values in a single clock cycle. When run at the maximum clock of 1 GHz, this provides a peak compute capacity of 2 TOPS, since the 32x32 matrix operation is 1024 multiply-accumulates (MACs, where each MAC counts as two operations). To ensure the MMA always has values to compute, the architecture includes multiple streaming engines that move 256 bits of data each clock cycle to the two input matrices and from the single output matrix. Depending on the layers composing the neural network architecture, outputs from the MMA may be sent through the C7x to compute any non-linear functions within the layer. Developers need not program this themselves; API calls from the Arm cores reduce the complexity of programming the accelerator, as described in the Edge AI software section.
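The 2-TOPS figure follows directly from the MAC count and the clock rate quoted above. A minimal sketch of that arithmetic (constant names are illustrative, not from any TI API):

```python
# Theoretical peak compute of the AM62A deep learning accelerator,
# derived from the figures in the text: a 32x32 MAC array clocked at 1 GHz.
MACS_PER_CYCLE = 32 * 32     # one 32x32 matrix operation = 1024 MACs per cycle
OPS_PER_MAC = 2              # each MAC counts as two operations (multiply + add)
CLOCK_HZ = 1_000_000_000     # maximum clock of 1 GHz

tops = MACS_PER_CYCLE * OPS_PER_MAC * CLOCK_HZ / 1e12
print(f"{tops:.3f} TOPS")    # 2.048 TOPS, quoted as ~2 TOPS
```

Note that the exact value is 2.048 TOPS; the datasheet rounds this to 2 TOPS.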
While TOPS is a common metric for quantifying machine learning performance on accelerators such as TPUs, VPUs, NPUs, and GPUs, one accelerator architecture may outperform another despite having a lower theoretical compute capacity. TI's architecture was designed to optimize power and performance by using a single large compute unit, the MMA, rather than many smaller ones in parallel. With many small units, more memory transfers are required because there is less reuse of the same data across subsequent execution cycles, and more transfers equate to higher power expenditure. Specially designed data-streaming engines ensure that the 256-bit buffers within the accelerator hold the necessary data. A well-optimized application uses a model whose dimensions at each layer completely fill the MMA.
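The consequence of the last point can be made concrete. Assuming the MMA processes data in 32-wide panels (matching the 32x32 MAC array described earlier), a layer whose channel count is not a multiple of 32 leaves part of each panel idle. The helper below is a hypothetical utilization estimate, not part of any TI tool:

```python
import math

MMA_DIM = 32  # panel width assumed from the 32x32 MAC array

def mma_utilization(channels: int) -> float:
    """Estimate the fraction of the MMA's 32-wide dimension used by a
    layer with `channels` channels; partially filled panels waste compute."""
    panels = math.ceil(channels / MMA_DIM)
    return channels / (panels * MMA_DIM)

# A layer with 64 channels fills the MMA exactly; 48 channels wastes 25%.
print(mma_utilization(64))  # 1.0
print(mma_utilization(48))  # 0.75
```

This is why, when selecting or designing a model for this accelerator, layer widths that are multiples of 32 make the best use of the available compute.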