SPRUIV4D May   2020  – May 2024

 

  1.   1
  2.   Read This First
    1.     About This Manual
    2.     Related Documentation
    3.     Trademarks
  3. 2Introduction
    1. 2.1 C7000 Digital Signal Processor CPU Architecture Overview
    2. 2.2 C7000 Split Datapath and Functional Units
  4. 3C7000 C/C++ Compiler Options
    1. 3.1 Overview
    2. 3.2 Selecting Compiler Options for Performance
    3. 3.3 Understanding Compiler Optimization
      1. 3.3.1 Software Pipelining
      2. 3.3.2 Vectorization and Vector Predication
      3. 3.3.3 Automatic Use of Streaming Engine and Streaming Address Generator
      4. 3.3.4 Loop Collapsing and Loop Coalescing
      5. 3.3.5 Automatic Inlining
      6. 3.3.6 If Conversion
  5. 4Basic Code Optimization
    1. 4.1  Signed Types for Iteration Counters and Limits
    2. 4.2  Floating-Point Division
    3. 4.3  Loop-Carried Dependencies and the Restrict Keyword
      1. 4.3.1 Loop-Carried Dependencies
      2. 4.3.2 The Restrict Keyword
      3. 4.3.3 Run-Time Alias Disambiguation
    4. 4.4  Function Calls and Inlining
    5. 4.5  MUST_ITERATE and PROB_ITERATE Pragmas and Attributes
    6. 4.6  If Statements and Nested If Statements
    7. 4.7  Intrinsics
    8. 4.8  Vector Types
    9. 4.9  C++ Features to Use and Avoid
    10. 4.10 Streaming Engine
    11. 4.11 Streaming Address Generator
    12. 4.12 Optimized Libraries
    13. 4.13 Memory Optimizations
  6. 5Understanding the Assembly Comment Blocks
    1. 5.1 Software Pipelining Processing Stages
    2. 5.2 Software Pipeline Information Comment Block
      1. 5.2.1 Loop and Iteration Count Information
      2. 5.2.2 Dependency and Resource Bounds
      3. 5.2.3 Initiation Interval (ii) and Iterations
      4. 5.2.4 Constant Extensions
      5. 5.2.5 Resources Used and Register Tables
      6. 5.2.6 Stage Collapsing
      7. 5.2.7 Memory Bank Conflicts
      8. 5.2.8 Loop Duration Formula
    3. 5.3 Single Scheduled Iteration Comment Block
    4. 5.4 Identifying Pipeline Failures and Performance Issues
      1. 5.4.1 Issues that Prevent a Loop from Being Software Pipelined
      2. 5.4.2 Software Pipeline Failure Messages
      3. 5.4.3 Performance Issues
  7. 6Revision History

Loop-Carried Dependencies

In certain cases when software pipelining, the compiler will not be able to overlap successive iterations of the loop in order to get the best performance. When the compiler is not able to overlap successive iterations of the loop, performance suffers: the initiation interval (ii, described earlier) will be larger than desired and few functional units will be simultaneously utilized.

In almost all cases, this is due to a loop-carried dependency. A loop-carried dependency prevents to some degree the execution of iteration i+1 from overlapping with iteration i. A loop-carried dependency bound is a lower limit on the initiation interval of the software pipelined loop (and thus a limit on the speed of the software pipelined loop). A loop-carried dependency bound arises because there is a cycle in the ordering constraints (dependencies) for a set of the instructions in a loop. Out of all these cycle lengths in the loop, the maximum loop-carried dependency cycle is the loop-carried dependency bound. This can occur even if there are plenty of functional units available to perform several iterations in parallel.

If the loop-carried dependency bound is greater than the partitioned resource bound, then one of the loop-carried dependencies is slowing the loop, as the initiation interval is always at least the maximum of the partitioned resource bound and the loop-carried dependency bound.

To reduce or eliminate a problematic loop-carried dependency, one must identify the cycle and then find a way to shorten or break it.

The following example shows a loop with a problematic loop-carried dependency.

void weighted_sum(int * a, int * b, int * out,                                 
                  int weight_a, int weight_b, int n)                           
{                                                                              
    #pragma UNROLL(1)                                                          
    #pragma MUST_ITERATE(1024, ,32)                                            
    for (int i = 0; i < n; i++)                                                
    {                                                                          
        out[i] = a[i] * weight_a + b[i] * weight_b;                            
    }                                                                          
}

The compiler-generated assembly code for this example (shown below) shows that the Loop Carried Dependency Bound in the Software Pipeline Information section of the assembly code is 7 cycles.

;*----------------------------------------------------------------------------* 
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : weighted_vector_sum_v3.cpp
;*      Loop source line                 : 10
;*      Loop opening brace source line   : 11
;*      Loop closing brace source line   : 13
;*      Known Minimum Iteration Count    : 1024                    
;*      Known Max Iteration Count Factor : 32
;*      Loop Carried Dependency Bound(^) : 12
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound       : 2 (pre-sched)
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 12 Schedule found with 2 iterations in parallel
. . .
;*----------------------------------------------------------------------------*
;*        SINGLE SCHEDULED ITERATION
;*
;*        ||$C$C51||:
;*   0              TICK                               ; [A_U] 
;*   1              LDW     .D2     *D1++(4),BM0      ; [A_D2] |12|  ^ 
;*     ||           LDW     .D1     *D2++(4),BM1      ; [A_D1] |12|  ^ 
;*   2              NOP     0x5     ; [A_B] 
;*   7              MPYWW   .M2     BM2,BM0,BL0       ; [B_M2] |12|  ^ 
;*     ||           MPYWW   .N2     BM3,BM1,BL1       ; [B_N2] |12|  ^ 
;*   8              NOP     0x3     ; [A_B] 
;*  11              ADDW    .L2     BL1,BL0,B0        ; [B_L2] |12|  ^ 
;*  12              STW     .D1X    B0,*D0++(4)       ; [A_D1] |12|  ^ 
;*     ||           BNL     .B1     ||$C$C51||        ; [A_B] |10| 
;*  13              ; BRANCHCC OCCURS {||$C$C51||}    ; [] |10| 
;*----------------------------------------------------------------------------*

The final software pipelined initiation interval of the software pipelined loop is at least the greater of the Loop Carried Dependency Bound and the Partitioned Resource Bound. When the Loop Carried Dependency Bound value is greater than the Partitioned Resource Bound value, this indicates the code has a loop-carried dependency bound problem that likely should be addressed. In other words, when the loop-carried dependence bound is greater than the partitioned resource bound, the software pipelined loop could likely run faster if the loop-carried dependency bound is eliminated. Therefore in this example, because the partitioned resource bound is 2 and the loop-carried dependency bound is 12, this code has an issue that should be investigated.

To identify the problem, we need to look at the instructions involved in the loop-carried dependency. These instructions are marked with the caret "^" symbol in the comment block in the compiler-generated assembly file. Notice that the load and store instructions are marked with a caret. This tells us the compiler thinks there may be a loop-carried dependence between successive iterations. This is likely because the compiler cannot prove the stores are writing to an area of memory that is independent of the location from which the load instructions are loading values. In absence of information about the locations of the pointers, arrays and address access patterns, the compiler must assume that successive iterations may load from the location of the previous iteration's stores. See Hand-Tuning Loops and Control Code on the TMS320C6000 (SPRA666) for more about loop-carried dependencies and how to identify them.