SPRUIV4D May   2020  – May 2024

 

  1.   1
  2.   Read This First
    1.     About This Manual
    2.     Related Documentation
    3.     Trademarks
  3. 2Introduction
    1. 2.1 C7000 Digital Signal Processor CPU Architecture Overview
    2. 2.2 C7000 Split Datapath and Functional Units
  4. 3C7000 C/C++ Compiler Options
    1. 3.1 Overview
    2. 3.2 Selecting Compiler Options for Performance
    3. 3.3 Understanding Compiler Optimization
      1. 3.3.1 Software Pipelining
      2. 3.3.2 Vectorization and Vector Predication
      3. 3.3.3 Automatic Use of Streaming Engine and Streaming Address Generator
      4. 3.3.4 Loop Collapsing and Loop Coalescing
      5. 3.3.5 Automatic Inlining
      6. 3.3.6 If Conversion
  5. 4Basic Code Optimization
    1. 4.1  Signed Types for Iteration Counters and Limits
    2. 4.2  Floating-Point Division
    3. 4.3  Loop-Carried Dependencies and the Restrict Keyword
      1. 4.3.1 Loop-Carried Dependencies
      2. 4.3.2 The Restrict Keyword
      3. 4.3.3 Run-Time Alias Disambiguation
    4. 4.4  Function Calls and Inlining
    5. 4.5  MUST_ITERATE and PROB_ITERATE Pragmas and Attributes
    6. 4.6  If Statements and Nested If Statements
    7. 4.7  Intrinsics
    8. 4.8  Vector Types
    9. 4.9  C++ Features to Use and Avoid
    10. 4.10 Streaming Engine
    11. 4.11 Streaming Address Generator
    12. 4.12 Optimized Libraries
    13. 4.13 Memory Optimizations
  6. 5Understanding the Assembly Comment Blocks
    1. 5.1 Software Pipelining Processing Stages
    2. 5.2 Software Pipeline Information Comment Block
      1. 5.2.1 Loop and Iteration Count Information
      2. 5.2.2 Dependency and Resource Bounds
      3. 5.2.3 Initiation Interval (ii) and Iterations
      4. 5.2.4 Constant Extensions
      5. 5.2.5 Resources Used and Register Tables
      6. 5.2.6 Stage Collapsing
      7. 5.2.7 Memory Bank Conflicts
      8. 5.2.8 Loop Duration Formula
    3. 5.3 Single Scheduled Iteration Comment Block
    4. 5.4 Identifying Pipeline Failures and Performance Issues
      1. 5.4.1 Issues that Prevent a Loop from Being Software Pipelined
      2. 5.4.2 Software Pipeline Failure Messages
      3. 5.4.3 Performance Issues
  7. 6Revision History

Memory Optimizations

Optimizations that improve the loading and storing of data are often crucial to the performance of an application. A detailed examination of useful memory optimizations on Keystone 3 devices is outside the scope of this document. However, the following are the most common optimizations used to aid memory system throughput and reduce memory hierarchy latency.

  • Blocking: Input, output, and temporary arrays/objects are often too large to fit into Multicore Shared Memory Controller (MSMC) or L2 memory. For example, when performing an algorithm over an entire 1000x1000 pixel image, the image is too large to fit into most or all configurations of L2 memory, and the algorithm may thrash the caches, leading to poor performance. Keeping the data as close to the CPU as possible improves memory system performance, but how do we do this when the image is too large to fit into the L2 cache? Depending on the algorithm, it may be useful to use a technique called "blocking," in which the algorithm is modified to operate on only a portion of the data at a given time. Once that "block" of data is processed, the algorithm moves to the next block. This technique is often paired with the other techniques in this list.
  • Direct Memory Access (DMA): Consider using the asynchronous DMA capabilities of the device to move new data into MSMC memory or L2 memory and DMA to move processed data out. This frees the C7000 CPU to perform computations while the DMA is readying data for the next frame, block, or layer.
  • Ping-Pong Buffers: Consider using ping-pong memory buffers so that the C7000 CPU is processing data in one buffer while a DMA transfer is occurring to/from another buffer. When the C7000 CPU is finished processing the first buffer, the algorithm switches to the second buffer, which now has new data as a result of a DMA transfer. Consider placing these buffers in MSMC or L2 memory, which is much faster than DDR memory.