SPRUJ28 User guide

SPRUJ28E November 2021 – September 2024 AM68, AM68A, TDA4AL-Q1, TDA4VE-Q1, TDA4VL-Q1

6.4.2.1 C71x DSP CPU

The C71x CPU is a true 64-bit core which provides the following key features:

Instruction fetch unit
Instruction dispatch unit, advanced instruction packing
Instruction decode unit
CPU dual datapath
- One 64-bit scalar side (side A) and one 512-bit vector side (side B)
- Side A includes the following functional units
  - Four main scalar processing units (.L1, .S1, .M1, .N1) capable of operating on up to 64-bit wide data
  - Two units (.D1, .D2) for address calculations, enabling parallel load/store operations
  - The following operations can be executed at the same time in a single clock cycle
    - One non-aligned 64-bit load or store operation
    - Two 64-bit arithmetic/logical operations (non-multiply arithmetic instructions)
    - One 128-bit multiply operation
- Side B includes the following functional units
  - Five main vector processing units (.L2, .S2, .M2, .N2, .C) capable of operating on up to 512-bit wide vector data
  - A predication processing unit (.P) for vector predication
  - The following vector operations can be executed at the same time in a single clock cycle
    - One non-aligned 512-bit load or store operation
    - Two 512-bit arithmetic/logical operations (non-multiply arithmetic instructions)
    - One 1024-bit multiply operation
    - One 512-bit correlation operation or regular arithmetic operation
    - One vector predicate manipulation operation
- Vector side B can perform up to 128×16-bit fixed-point multiply-accumulate (MAC), 4.0 times the MAC capacity of the C66x
- Vector side B can perform up to 80 single-precision FLOPs/cycle or 32 double-precision FLOPs/cycle, 5.0 times the floating-point throughput of the C66x
- The C71x CPU can load or store 64 bits and 512 bits of data in parallel per clock cycle, more than four times the bandwidth of the C66x. In addition, a novel streaming data interface can read an additional 1024 bits of data per clock cycle, for a total of 12 times the bandwidth of the C66x
- Large set of machine registers:
  - Up to 16x512-bit global vector registers
  - Up to 16x64-bit global scalar registers
  - Local registers
- Register file cross paths
  - Two cross paths, 64-bit each
  - Allow functional units from one data path to access a 64-bit operand from the opposite side global register file
  - The cross paths cannot access the local register files directly
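
The width and bandwidth figures in the datapath list can be cross-checked with simple arithmetic. The sketch below is illustrative only and is not C71x code. The lane counts follow directly from the 512-bit vector width. The C66x MAC and FLOP baselines are not stated in this section; they are back-computed from the quoted 4.0x and 5.0x ratios. The 128-bit-per-cycle C66x load/store width is an assumption introduced here for illustration, and all names are hypothetical.

```python
# Illustrative arithmetic only; names are hypothetical, not part of any TI API.
VECTOR_BITS = 512    # side B vector width (from the feature list)
SCALAR_BITS = 64     # side A scalar width (from the feature list)
STREAM_BITS = 1024   # extra bits per cycle read by the streaming interface


def lanes(element_bits: int, vector_bits: int = VECTOR_BITS) -> int:
    """Number of elements of the given width that fit in one vector."""
    return vector_bits // element_bits


# Load/store bandwidth per clock cycle, in bits.
ls_bits = SCALAR_BITS + VECTOR_BITS     # 576
total_bits = ls_bits + STREAM_BITS      # 1600

# C66x baselines implied by the ratios quoted in the feature list.
c66x_macs = 128 / 4.0       # 32 16-bit MACs
c66x_sp_flops = 80 / 5.0    # 16 single-precision FLOPs per cycle

# ASSUMPTION: C66x load/store width of 128 bits per cycle (not from this section).
C66X_LS_BITS_ASSUMED = 128

if __name__ == "__main__":
    for width in (8, 16, 32, 64):
        print(f"{width:2d}-bit elements per 512-bit vector: {lanes(width)}")
    print("load/store ratio vs assumed C66x:", ls_bits / C66X_LS_BITS_ASSUMED)
    print("ratio including streaming reads :", total_bits / C66X_LS_BITS_ASSUMED)
```

Under that assumed baseline the ratios come out to 4.5 and 12.5, which is consistent with the "more than four times" and "12 times" figures quoted in the list.
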
CPU control logic
- Support for two security states
  - Secure state: CPU can access both secure and non-secure space
  - Non-secure state: CPU can access only non-secure space
- Support for six privilege and execution levels for security and virtualization support with banked control registers for full isolation and protection
  - Secure root supervisor
  - Secure root user
  - Non-secure root supervisor
  - Non-secure root user
  - Non-secure guest supervisor
  - Non-secure guest user
- Pipeline can operate in both unprotected and protected modes
  - Unprotected mode: Traditional C6x VLIW DSP operating mode with exposed instruction delay slots
  - Protected mode: Control code execution mode where instruction delay slots are not exposed
- Scoreboarding mechanism to help hide memory system stalls and to allow the compiler to generate more efficient control code
- Supports recoverable interrupts and internal precise exceptions
  - No need to disable interrupts during multiple-assignment code. In-flight results in multiple-assignment code are recorded by pipeline capture queues. These queues allow any program to be interrupted at an arbitrary cycle, even during software-pipelined loops, and then restarted correctly upon return from the interrupt or exception handler
  - Pipeline capture queues may be unloaded and reloaded, allowing the OS to swap out tasks regardless of whether the task was operating in protected or unprotected pipeline mode
  - Programmable event priority, hardware automatic event mask based on priority levels
  - Hardware performs automatic stacking at event handler entry and automatic unstacking at event handler exit
- Nested loop counters (NLC): Hardware mechanism that facilitates low-overhead loop collapsing by performing the computations associated with nested loop counters and predicates, avoiding the need for explicit instructions
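
The loop collapsing that the NLC supports can be illustrated with a software model. The sketch below is plain Python, not C71x code, and the NLC programming model itself is not described in this section. `collapsed_indices` and `row_sums` are hypothetical helpers: they run a two-level loop nest as one flat loop and maintain the counters and an "end of inner loop" predicate explicitly, which is the bookkeeping the text says the hardware takes over.

```python
from typing import Iterator, List, Sequence, Tuple


def collapsed_indices(outer: int, inner: int) -> Iterator[Tuple[int, int, bool]]:
    """Yield (i, j, inner_done) for a 2-level nest run as one flat loop.

    inner_done is True on the last inner iteration, which is where
    per-row work (for example, storing an accumulator) would be predicated.
    """
    i = j = 0
    for _ in range(outer * inner):       # single collapsed loop
        inner_done = (j == inner - 1)
        yield i, j, inner_done
        # Counter updates that explicit instructions would otherwise perform.
        if inner_done:
            j = 0
            i += 1
        else:
            j += 1


def row_sums(matrix: Sequence[Sequence[int]]) -> List[int]:
    """Sum each row using the collapsed loop instead of nested loops."""
    rows, cols = len(matrix), len(matrix[0])
    sums: List[int] = []
    acc = 0
    for i, j, inner_done in collapsed_indices(rows, cols):
        acc += matrix[i][j]
        if inner_done:                   # predicated end-of-inner-loop work
            sums.append(acc)
            acc = 0
    return sums
```

For example, `row_sums([[1, 2, 3], [4, 5, 6]])` visits the six elements in one flat loop and emits one sum per row.
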
Test, debug, and interrupt logic
Enhanced instruction set architecture (ISA)
- Single-cycle 64-bit arithmetic, logical, and shift instructions
- Improvements to load/store instructions
  - Endian aware vector load instructions
  - Unpacking load / packing store instructions
  - Vector load/store with element reversal
  - Vector load with data duplications
- Specialized instructions to speed up key benchmarks
  - Horizontal MAX/MIN search acceleration instructions
  - Horizontal ADD, SUB instructions
  - Dedicated FIR instructions
  - Sliding window sum of absolute differences (SAD)
  - Maskable complex dot products
- Circular comparison instructions to accelerate Viterbi
- New and improved atomic instructions to replace LL/SC/CL combination
- Increased orthogonality between 32-bit, 64-bit, and vector operations compared to C66x
- Supports binlog instruction for vision algorithms
- Specialized instructions only allowed to execute at required privilege level and secure state
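
To make the specialized operation classes concrete, the sketch below gives scalar reference models of what a horizontal add, a horizontal max search, and a sliding-window SAD compute. This is an assumption-laden illustration in plain Python, not the C71x instruction semantics: element widths, saturation behavior, operand layout, and instruction names are not given in this section, and the function names here are hypothetical.

```python
from typing import List, Sequence, Tuple


def horizontal_add(vec: Sequence[int]) -> int:
    """Reduce all lanes of one vector to a single sum."""
    return sum(vec)


def horizontal_max(vec: Sequence[int]) -> Tuple[int, int]:
    """Return (maximum value, index of its first occurrence) across lanes."""
    best = max(vec)
    return best, list(vec).index(best)


def sliding_sad(signal: Sequence[int], template: Sequence[int]) -> List[int]:
    """Sum of absolute differences of `template` at every offset in `signal`."""
    n, m = len(signal), len(template)
    return [
        sum(abs(signal[k + t] - template[t]) for t in range(m))
        for k in range(n - m + 1)
    ]
```

A vector unit would evaluate each of these across all lanes at once; the models only pin down the expected results for checking such an implementation.
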
C71x OpenCL features
- Full compliance with IEEE 754 floating point standard
  - Supports subnormal numbers
- DVM support
  - Supports fencing for uTLB transactions
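
What subnormal support means in practice can be shown with IEEE 754 double-precision values. The sketch below runs on a host Python interpreter, not on the C71x, and assumes the interpreter's `float` is IEEE 754 binary64 (true on common platforms). `is_subnormal` is a hypothetical helper.

```python
import sys

SMALLEST_NORMAL = sys.float_info.min   # 2**-1022 for binary64
SMALLEST_SUBNORMAL = 2.0 ** -1074      # smallest positive binary64 value


def is_subnormal(x: float) -> bool:
    """True if x is nonzero and smaller in magnitude than any normal number."""
    return x != 0.0 and abs(x) < SMALLEST_NORMAL


# With subnormals (gradual underflow), the difference of two distinct
# values is never rounded to zero.
a = SMALLEST_NORMAL * 1.5
b = SMALLEST_NORMAL
diff = a - b                           # 2**-1023, a subnormal value

if __name__ == "__main__":
    print("diff is subnormal:", is_subnormal(diff))
    print("diff is nonzero  :", diff != 0.0)
```

On hardware that flushes subnormals to zero, `diff` would instead be zero even though `a != b`.
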

The matrix multiply accelerator (MMA) is included as a special functional unit in the C71x CorePac and is tightly coupled to CPU operation.