# C7000 C/C++ Optimization Guide

User's Guide



Literature Number: SPRUIV4C MAY 2020 – REVISED DECEMBER 2023

# **Table of Contents**



- Read This First
  - About This Manual
  - Related Documentation
- 1 Introduction
  - 1.1 C7000 Digital Signal Processor CPU Architecture Overview
  - 1.2 C7000 Split Datapath and Functional Units
- 2 C7000 C/C++ Compiler Options
  - 2.1 Overview
  - 2.2 Selecting Compiler Options for Performance
  - 2.3 Understanding Compiler Optimization
    - 2.3.1 Software Pipelining
    - 2.3.2 Vectorization and Vector Predication
    - 2.3.3 Automatic Use of Streaming Engine and Streaming Address Generator
    - 2.3.4 Loop Collapsing and Loop Coalescing
    - 2.3.5 Automatic Inlining
    - 2.3.6 If Conversion
- 3 Basic Code Optimization
  - 3.1 Signed Types for Iteration Counters and Limits
  - 3.2 Floating-Point Division
  - 3.3 Loop-Carried Dependencies and the Restrict Keyword
    - 3.3.1 Loop-Carried Dependencies
    - 3.3.2 The Restrict Keyword
    - 3.3.3 Run-Time Alias Disambiguation
  - 3.4 Function Calls and Inlining
  - 3.5 MUST_ITERATE and PROB_ITERATE Pragmas and Attributes
  - 3.6 If Statements and Nested If Statements
  - 3.7 Intrinsics
  - 3.8 Vector Types
  - 3.9 C++ Features to Use and Avoid
  - 3.10 Streaming Engine
  - 3.11 Streaming Address Generator
  - 3.12 Optimized Libraries
  - 3.13 Memory Optimizations
- 4 Understanding the Assembly Comment Blocks
  - 4.1 Software Pipelining Processing Stages
  - 4.2 Software Pipeline Information Comment Block
    - 4.2.1 Loop and Iteration Count Information
    - 4.2.2 Dependency and Resource Bounds
    - 4.2.3 Initiation Interval (ii) and Iterations
    - 4.2.4 Constant Extensions
    - 4.2.5 Resources Used and Register Tables
    - 4.2.6 Stage Collapsing
    - 4.2.7 Memory Bank Conflicts
    - 4.2.8 Loop Duration Formula
  - 4.3 Single Scheduled Iteration Comment Block
  - 4.4 Identifying Pipeline Failures and Performance Issues
    - 4.4.1 Issues that Prevent a Loop from Being Software Pipelined
    - 4.4.2 Software Pipeline Failure Messages
    - 4.4.3 Performance Issues



Table of Contents www.ti.com

- 5 Revision History

**List of Figures**

- Figure 1-1. C7000 Datapath Block Diagram
- Figure 2-1. C7000 Compiler Processing Stages
- Figure 2-2. Effects of Software Pipelining on Execution
- Figure 2-3. Loop Iterations with Prolog and Epilog



#### **About This Manual**

This document is for those who want to improve the performance of code running on C7000™ CPUs.

This guide is not intended to help optimize code for the memory/cache hierarchy, MSMC, DMA, or Matrix Multiply Accelerator (MMA).

Readers of this document should have the following:

- Knowledge of C and C++.
- Experience invoking the C7000 compiler using compiler options.
- Knowledge of basic assembly language concepts.
- Knowledge of CPU architectural features like registers, caches, and functional units.

#### **Related Documentation**

Use the following documents from Texas Instruments to supplement this user's guide:

- SPRUIG8 C7000 Optimizing C/C++ Compiler User's Guide
- SPRUIG4 C7000 Embedded Application Binary Interface (EABI) Users Guide
- SPRUIU4 C7x Instruction Guide (available through your TI Field Application Engineer)
- SPRUIP0 C71x DSP CPU, Instruction Set, and Matrix Multiply Accelerator Technical Reference Manual (available through your TI Field Application Engineer)
- SPRUIQ3 C71x DSP Corepac Technical Reference Manual (available through your TI Field Application Engineer)
- SPRU425 C6000™ Optimizing C Compiler Tutorial
- SPRA666 Hand-Tuning Loops and Control Code on the TMS320C6000™
- SPRABK5 Throughput Performance Guide for KeyStone™ II Devices
- SPRUIG5 C6000-to-C7000 Migration Users Guide

#### Trademarks

C7000™, C6000™, TMS320C6000™, and KeyStone™ are trademarks of Texas Instruments. All trademarks are the property of their respective owners.




# Chapter 1 Introduction



Before describing compiler options and source code strategies you can use to make code more efficient, it is necessary to know some information about the C7000 Digital Signal Processor and instruction set. This chapter provides an overview of the C7000 architecture, datapath, and functional units.

- 1.1 C7000 Digital Signal Processor CPU Architecture Overview
- 1.2 C7000 Split Datapath and Functional Units




#### 1.1 C7000 Digital Signal Processor CPU Architecture Overview

The C7000 CPU DSP architecture is the latest high-performance digital signal processor (DSP) architecture from Texas Instruments. It is featured in some Texas Instruments Keystone 3 devices. This Very-Long-Instruction-Word (VLIW) DSP has significant mathematical processing capabilities, due to its wide vector instructions and multiple functional units. This optimization guide can help developers get the most performance out of C7000 DSPs.

When integrated into a larger TI device, such as some Keystone 3 devices, the C7000 is often paired with a Matrix Multiply Accelerator (MMA), which can significantly improve the performance of certain machine learning networks. We recommend use of the TI Deep Learning library, which has been optimized to use the Matrix Multiply Accelerator. The TI Deep Learning library is part of the Processor SDK.

The C7000 DSP has vector (SIMD) instructions that are capable of performing up to 64 operations in a single instruction, depending on the data type and version of the C7000 CPU. Nearly all computational instructions on C7000 DSP cores are fully pipelined, which means independent instructions can be started on every clock cycle. This combination of vector instructions and pipelined behavior allows you to perform a large number of computations per cycle. The C7000 DSP cores feature both fixed-point and floating-point vector instructions.

Each C7000 DSP core has several functional units. On each clock cycle, each functional unit can execute an independent instruction. In this guide, we focus on the first generation of C7000 DSP cores, the C7100 and C7120. Because the C7100 and C7120 DSP cores have 13 functional units, up to 13 instructions can execute every clock cycle. In practice, some of the functional units are specialized for certain kinds of instructions, so for this and other reasons, it is common that not all 13 functional units execute an instruction every cycle.

For more information on the C7000 instruction set, please see the C71x DSP CPU, Instruction Set, and Matrix Multiply Accelerator Technical Reference Manual (SPRUIP0).


#### 1.2 C7000 Split Datapath and Functional Units

The following block diagram shows the datapath split on the C7100 DSP CPU. There is an A-side datapath and a B-side datapath. The diagram shows the functional units and multiple, heterogeneous register files. The A-side datapath is responsible for scalar computation, loading and storing scalars and vectors to and from memory, and control-flow (branches, calls). The B-side datapath handles vector math operations, permutations of data, and vector predication operations.



Figure 1-1. C7000 Datapath Block Diagram

To simplify the image above, some data movement capabilities and data paths are not shown in this figure.

- In general, a functional unit can write to any register file on the same datapath.
- Most functional units can obtain data from one or both of the streaming engines.
- There is one 64-bit cross path per datapath (A/B). Each cross path allows one read per cycle from the
  opposite side global register file.




C7100 and C7120 devices have a 512-bit vector width. C7504 and C7524 devices have a 256-bit vector width. Registers have 64 bits per register ("scalar") or a "vector-width" number of bits per register. Thus, C7100 and C7120 devices have 512-bit vector registers, while C7504 and C7524 devices have 256-bit vector registers.
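The relationship between vector width, element size, and the number of elements processed per vector instruction can be sketched in plain C. The helper name `lanes_per_vector` is hypothetical and is not part of any TI header; it only illustrates the arithmetic behind figures such as "up to 64 operations in a single instruction".

```c
#include <stddef.h>

/* Hypothetical helper (not a TI API): number of elements ("lanes") that
 * fit in one vector register of the given width. */
static size_t lanes_per_vector(size_t vector_bits, size_t element_bits)
{
    return vector_bits / element_bits;
}
```

For example, with a 512-bit vector width, 8-bit elements give 64 lanes and 32-bit elements give 16 lanes. With a 256-bit vector width, the same element sizes give 32 and 8 lanes.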

On a given datapath, there are several different kinds of register files. On a given datapath, each functional unit can write to the global register file on that datapath and most of the "local" register files on that datapath. However, only some functional units can read from a "local" register file.

- **D1 and D2 units:** These reside on the A-side datapath and can load from and store to memory. Two 64-bit loads can execute in parallel. Two 64-bit stores can execute in parallel. A 64-bit and vector-width load can execute in parallel with a 64-bit or vector-width store. It is not possible for two vector-width stores to execute in parallel or two vector-width loads to execute in parallel.
- **L1, S1, M1, and N1 units:** These are general-purpose functional units, handling a varied mix of scalar and small vector computation. The M1 and N1 functional units perform various multiplication instructions.
- **L2, S2, M2, and N2 units:** These are also general-purpose functional units, and can operate on full-width vector data. The M2 and N2 functional units perform various multiplication instructions.
- **B unit:** This unit handles indirect branches and calls.
- **C unit:** This unit performs permutations and shuffles of data.
- **P unit:** This unit computes predicates used to mask off vector lanes so particular lanes are not computed or are not stored to memory.

In addition to the D1 and D2 units providing CPU access to the memory hierarchy, the C7100 DSP has two "streaming engines" that facilitate a fast path to obtain data from memory. A *streaming engine* is a hardware feature that allows you (or the compiler) to specify a pattern of memory addresses to obtain from memory. The streaming engine will do its best to pre-fetch that data from the memory hierarchy into a scratchpad memory close to the CPU, to minimize CPU stalls due to cold cache misses.

# Chapter 2 C7000 C/C++ Compiler Options



This chapter describes the Texas Instruments C7000 C/C++ Compiler and options that can be used to optimize performance.

- 2.1 Overview
- 2.2 Selecting Compiler Options for Performance
- 2.3 Understanding Compiler Optimization



#### 2.1 Overview

The Texas Instruments C7000 compiler accepts C or C++ source input. When compiling, the compiler proceeds through several stages, as shown in the following figure.



Figure 2-1. C7000 Compiler Processing Stages

First, the source file is parsed to create a high-level intermediate representation that closely resembles the source language, but is more tailored for optimization transformations.

Files and functions (optionally) compiled with some level of optimization pass through the high-level optimizer, which performs function inlining, loop transformations, and other code optimizations.

Next, the high-level intermediate language is translated into a low-level intermediate language, which closely resembles assembly. The low-level optimizer and code generation pass performs partitioning, register allocation, software pipelining, instruction scheduling, and other optimizations.

The output of the code generation pass is the assembly file, which is assembled into an object file by the assembler and then linked into a library or executable by the linker.



#### 2.2 Selecting Compiler Options for Performance

After your application has been fully debugged and is working properly, it is time to begin the optimization process. First, you need to select appropriate compiler options. The following compiler options affect performance. See the *C7000 C/C++ Compiler User's Guide* (SPRUIG8) for more details on command-line options.

- --opt\_level=3 (-o3). The compiler performs function-level optimization at --opt\_level=2 and file-level optimization and function inlining at --opt\_level=3. At both --opt\_level=2 and --opt\_level=3, the compiler performs various loop optimizations, such as software pipelining, vectorization, and loop coalescing. By default, the --opt\_level switches optimize for performance. Such optimization can increase code size. If code size is an issue, do not reduce the level of optimization. Instead, use the --opt\_for\_speed (-mf) switch to change the optimization goal (performance versus code size) and the -oi option to control the amount of automatic inlining.
- --opt\_level=4 (-o4). Consider using this option to perform optimizations across all files at link-time. Using
  this option can increase compile time significantly. If used for any step, this option must be used at all
  compilation and linking steps. Source files can be compiled separately, as long as they are all compiled with
  --opt\_level=4. This optimization level cannot be used with --program\_level\_compile (-pm).
- --gen\_func\_subsections (-mo). Consider using this option if the source code uses many functions that are never called. This option places each function in its own input subsection, so the linker can exclude that function from the executable if it is never referenced. However, this optimization can increase code size, because there are minimum section alignment requirements the compiler must apply.
- --opt\_for\_speed=0 (-mf0) or --opt\_for\_speed=1 (-mf1). If code size is a concern, use these options when compiling files with functions that are not executed often or are not critical to performance. This tells the compiler to optimize for code size instead of performance. Do not lower the optimization level (--opt\_level) in an attempt to lower code size.

Do not use the --disable\_software\_pipelining (-mu) option if you are concerned about performance. This option turns off software pipelining. Software pipelining is critical to achieving high performance on most loops. This option can be a debugging tool, as it makes the assembly code easier to understand.

The following options provide additional information for debugging and performance evaluation purposes:

- --src\_interlist (-s). This option causes the compiler to emit into the compiler-generated assembly files a copy of what the source code looks like after high-level optimization. This output is placed in the assembly files as comments among the assembly code. The comments output from the optimizer look like C code and show the high-level transformations that have been applied such as inlining, loop coalescing, and vectorization. This option can be useful in helping you understand the assembly code and some of what the compiler is doing to optimize the performance of the code. This option turns on the --keep\_asm (-k) option, so the compiler-generated assembly (.asm) files will not be deleted.
- --debug\_software\_pipeline (-mw). This option emits extra information about software-pipelined loops, including the single-scheduled iteration of the loop. This information is used in loop tuning examples presented later in this document. This option turns on the --keep\_asm (-k) option, so the compiler-generated assembly (.asm) files will not be deleted.
- --gen\_opt\_info=2 (-on2). This option creates a .nfo file with the same base name as the .obj file. This file contains summary information regarding the high-level optimizations that have been applied, as well as providing advice.



#### 2.3 Understanding Compiler Optimization

Before you can interpret the assembly code and the software pipelining information within, it helps to understand some of what the compiler is trying to do with the C/C++ source code as it compiles it into assembly code.

#### 2.3.1 Software Pipelining

Very-Long Instruction Word (VLIW) digital signal processors (DSPs) like the C7000 depend on software pipelining of loops to achieve maximum performance. *Software pipelining* is a technique in which successive iterations of a source loop are overlapped so that the functional units on the CPU are utilized on as many cycles as possible throughout the loop.

The following figure shows loop iteration execution both without and with software pipelining. You can see that without software pipelining, loops are scheduled so that loop iteration *i* completes before iteration *i+1* begins. With software pipelining, iterations overlap. Thus, as long as correctness can be preserved, iteration *i+1* can start before iteration *i* finishes. This generally permits a much higher utilization of the machine's resources than might be achieved from other scheduling techniques. In a software-pipelined loop, even though a single iteration might take *s* cycles to complete, a new iteration is initiated every *ii* cycles.



Figure 2-2. Effects of Software Pipelining on Execution

In an efficient software pipelined loop, *ii* is much less than *s*. *ii* is called the initiation interval; it is the number of cycles between starting iteration *i* and starting iteration *i+1*. *s* is the number of cycles for the first iteration to complete, or equivalently, the length of a single scheduled iteration of the software-pipelined loop.
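These definitions lead to a simple idealized estimate of loop duration. If iteration *k* starts at cycle *k* × *ii*, the last of *n* iterations starts at cycle (*n* − 1) × *ii* and finishes *s* cycles later. The sketch below compares that with non-overlapped execution. It ignores memory stalls and any code outside the loop, and the function names are illustrative only.

```c
/* Idealized cycle counts derived from the definitions of ii and s.
 * Stalls and loop setup code are ignored. */

/* Each iteration runs to completion before the next one starts. */
static unsigned long cycles_without_pipelining(unsigned long n, unsigned long s)
{
    return n * s;
}

/* A new iteration starts every ii cycles; the last one needs s cycles. */
static unsigned long cycles_with_pipelining(unsigned long n,
                                            unsigned long ii,
                                            unsigned long s)
{
    if (n == 0)
        return 0;
    return (n - 1) * ii + s;
}
```

With hypothetical values n = 1024, ii = 2, and s = 14, the estimate is 2060 cycles with software pipelining versus 14336 cycles without it.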

The compiler attempts to software pipeline the innermost source loops. These are loops that do not have any other loops within them. Note that during the compilation process, software pipelining occurs after inlining and after loop transformations that may combine loops, so in certain cases you may see the compiler software pipelining more of your code than you expect.



After software pipelining, the loop has three major phases, as shown in the following figure:

- **pipe-up (prolog)** phase during which the overlapped iterations are started.
- **steady-state (kernel)** phase during which iterations continue to be started.
- **pipe-down (epilog)** phase during which any iterations that have not yet completed are allowed to finish.



Figure 2-3. Loop Iterations with Prolog and Epilog
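One way to see how the three phases arise is to count the iterations in flight at each step. The sketch below assumes a loop of `n` iterations whose schedule spans `k` stages of *ii* cycles each, so that `k` iterations overlap in steady state. The helper `iterations_in_flight` is hypothetical and models only the overlap, not the actual instruction schedule.

```c
/* Number of iterations in flight during step t, where each step lasts
 * ii cycles. n is the iteration count and k is the number of stages.
 * Steps run from 0 to n + k - 2. */
static int iterations_in_flight(int t, int n, int k)
{
    /* Newest iteration that has already started. */
    int newest = (t < n - 1) ? t : n - 1;
    /* Oldest iteration that has not yet finished. */
    int oldest = (t - k + 1 > 0) ? t - k + 1 : 0;
    return (newest >= oldest) ? newest - oldest + 1 : 0;
}
```

For n = 10 and k = 3, the count rises 1, 2 during the prolog (steps 0 and 1), stays at 3 during the kernel (steps 2 through 9), and falls 2, 1 during the epilog (steps 10 and 11).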

The following example shows the source code for a simple weighted vector sum.
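A minimal sketch of such a function is shown here in C. The function name `weighted_sum` and the parameter names are assumptions. The pragma arguments are chosen to agree with the known minimum iteration count of 1024 and the iteration count factor of 32 reported in the software pipeline information for this example.

```c
/* Sketch of a weighted vector sum; names are assumed for illustration. */
void weighted_sum(int *restrict a, int *restrict b, int *restrict out,
                  int weight_a, int weight_b, int n)
{
    /* UNROLL(1): ask the compiler not to unroll (and so not vectorize). */
    #pragma UNROLL(1)
    /* MUST_ITERATE(min, max, multiple): at least 1024 iterations, no
     * stated maximum, iteration count a multiple of 32. */
    #pragma MUST_ITERATE(1024, , 32)
    for (int i = 0; i < n; i++) {
        out[i] = a[i] * weight_a + b[i] * weight_b;
    }
}
```

Each iteration performs two loads, two multiplies, one add, and one store. A compiler that does not recognize these pragmas is expected to ignore them, possibly with a warning.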

To simplify this first software-pipelining example, two pragmas are used:

- The UNROLL pragma tells the compiler not to perform vectorization, which is a transformation technique that is demonstrated in the next section.
- The MUST\_ITERATE pragma conveys information on how many times the loop executes and is explained later in this document. The example uses this pragma to prevent a "duplicate loop" from being generated, which is also explained later in this document.

Then we compile this code with the following command:

```
cl7x --opt_level=3 --debug_software_pipeline --src_interlist --symdebug:none weighted_vector_sum.cpp
```

The --symdebug:none option prevents the compiler from generating debug information and the associated debug directives in the assembly. This debug information is not relevant to the discussion in this document and, if included, would unnecessarily lengthen the examples shown here. Normally, you would not turn off debug generation, as the generation of debug information does not degrade performance.



Because the --src\_interlist option is used, the compiler-generated assembly file is not deleted and has the following contents:

```
;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : weighted_vector_sum.cpp
;*      Loop source line                 : 10
;*      Loop opening brace source line   : 11
;*      Loop closing brace source line   : 13
;*      Known Minimum Iteration Count    : 1024
;*      Known Max Iteration Count Factor : 32
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound       : 2 (pre-sched)
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 2  Schedule found with 7 iterations in parallel
;*
;*      Partitioned Resource Bound(*)    : 2 (post-sched)
;*
;*----------------------------------------------------------------------------*
;*        SINGLE SCHEDULED ITERATION
;*
;*        ||$C$C36||:
;*   0              TICK                              ; [A_U]
;*   1              SLDW    .D1     *D1++(4),BM0      ; [A_D1] |12|
;*     ||           SLDW    .D2     *D2++(4),BM1      ; [A_D2] |12|
;*   2              NOP     0x5                       ; [A_B]
;*   7              MPYWW   .N2     BM2,BM0,BL0       ; [B_N] |12|
;*     ||           MPYWW   .M2     BM3,BM1,BL1       ; [B_M2] |12|
;*   8              NOP     0x3                       ; [A_B]
;*  11              ADDW    .L2     BL1,BL0,B0        ; [B_L2] |12|
;*  12              STW     .D1X    B0,*D0++(4)       ; [A_D1] |12|
;*     ||           BNL     .B1     ||$C$C36||        ; [A_B] |10|
;*  13              ; BRANCHCC OCCURS {||$C$C36||}    ; [] |10|
;*----------------------------------------------------------------------------*
||$C$L1||: ; PIPED LOOP PROLOG
              EXCLUSIVE CPU CYCLES: 8
                                                          ; [A_U] (R) (SP) <1,0>
; [A_D1] |12| (P) <1,1>
; [A_D2] |12| (P) <1,1>
              SLDW
                         .D1
                                   D1++(4), BM1
                                   *D2++(4),BM0
| |
              SLDW
                         . D2
                                                          ; [B_L2] |7| (R)
              MV
                        .L2X
                                   A7,B0
II
                                                            ; [A_U] (P) <2,0>
              TICK
                         .L2X
                                   A8,B1
                                                          ; [B_L2] |7| (R)
; [A_D1] |12| (P) <2,1>
; [A_D2] |12| (P) <2,1>
              ΜV
                                   *D1++(4),BM0
              SLDW
                         .D1
                                   D2++(4), BM1
П
              SLDW
                         .D2
                         .s2
              ΜV
                                   B0,BM2
                                                          ; [B_S2] (R)
                                                          ; [B_L2] (R)
; [A_U] (P) <3,0>
              MV
                         .L2
                                   B1,BM3
              TICK
                        .N2
              MPYWW
                                   BM2,BM1,BL0
                                                          ; [B_N] | 12 | (P) < 0.7 >
                                                          ; [B_M2] |12| (P) <0,7>
; [A_D1] |12| (P) <3,1>
              MPYWW
                        .M2
                                   BM3,BM0,BL1
              SLDW
                         .D1
                                   *D1++(4),BM0
| |
              SLDW
                         .D2
                                   *D2++(4),BM1
                                                          ; [A_D2] |12| (P) <3,1>
              TICK
                                                           ; [A_U] (P) <4,0>
                                                          ; [B_N] |12| (P) <1,7>
; [B_M2] |12| (P) <1,7>
; [A_D1] |12| (P) <4,1>
; [A_D2] |12| (P) <4,1>
              MPYWW
                         .N2
                                   BM2,BM1,BL0
                                   BM3,BM0,BL1
              MPYWW
                        .M2
                        .D1
                                   *D1++(4),BM0
              SLDW
              SLDW
                         .D2
                                   D2++(4), BM1
                         .D2
              MV
                                                          ; [A_D2] (R); [A_D1] (R)
                                                             [A_D2] (R)
                                   SP, 0xfffffff8, SP
              ADDD
                        .D1
                                                           ; [A_U] (P) <5,0>
              TICK
```



```
||$C$L2||:    ; PIPED LOOP KERNEL
;          EXCLUSIVE CPU CYCLES: 2

           ADDW    .L2     BL1,BL0,B0        ; [B_L2] |12| <0,11>
||         MPYWW   .N2     BM2,BM0,BL0       ; [B_N] |12| <2,7>
||         MPYWW   .M2     BM3,BM1,BL1       ; [B_M2] |12| <2,7>
||         SLDW    .D1     *D1++(4),BM0      ; [A_D1] |12| <5,1>
||         SLDW    .D2     *D2++(4),BM1      ; [A_D2] |12| <5,1>

           BNL     .B1     ||$C$L2||         ; [A_B] |10| <0,12>
||         STW     .D1X    B0,*D0++(4)       ; [A_D1] |12| <0,12>
||         TICK                              ; [A_U] <6,0>
;**
||$C$L3||:    ; PIPED LOOP EPILOG
;          EXCLUSIVE CPU CYCLES: 7
;**
;**    return;
           ADDD    .D2     SP,0x8,SP         ; [A_D2] (0)
||         LDD     .D1     *SP(16),A9        ; [A_D1] (0)
||         ADDW    .L2     BL1,BL0,B0        ; [B_L2] |12| (E) <4,11>
||         MPYWW   .N2     BM2,BM0,BL1       ; [B_N] |12| (E) <6,7>
||         MPYWW   .M2     BM3,BM1,BL0       ; [B_M2] |12| (E) <6,7>

           STW     .D1X    B0,*D0++(4)       ; [A_D1] |12| (E) <4,12>

           ADDW    .L2     BL1,BL0,B0        ; [B_L2] |12| (E) <5,11>

           STW     .D1X    B0,*D0++(4)       ; [A_D1] |12| (E) <5,12>

           ADDW    .L2     BL0,BL1,B0        ; [B_L2] |12| (E) <6,11>

           STW     .D1X    B0,*D0++(4)       ; [A_D1] |12| (E) <6,12>

           RET     .B1                       ; [A_B] (0)
||         PROT                              ; [A_U] (E)
           ; RETURN OCCURS {RP}              ; [] (0)
```

This assembly output shows the software pipelined loop from the compiler-generated assembly file along with part of the software pipelining information comment block, which includes important information about various characteristics of the loop.

If the compiler successfully software pipelines a loop, the compiler-generated assembly code includes a software pipeline information comment block containing a message of the form "ii = xx Schedule found with yy iterations in parallel". The *initiation interval* (ii) is a measure of how often the software pipelined loop is able to start executing a new iteration of the loop. The smaller the initiation interval, the fewer cycles it takes to execute the entire loop. The software pipeline information also includes the source lines from which the loop originates, a description of the resource and latency requirements for the loop, and whether the loop was unrolled (among other information). When compiling with -mw, the information also contains a copy of the single scheduled iteration.

In this example, the achieved initiation interval (ii) is 2 cycles, and the number of iterations that will run in parallel is 7.

The comment block also includes a *single-scheduled iteration* view of the software pipelined loop. This view allows you to see how the compiler transformed the code and how the compiler scheduled one iteration of the loop in order to overlap iterations in software pipelining. See Section 4.2 for more information on how to interpret the information in this comment block.

#### 2.3.2 Vectorization and Vector Predication

The C7000 instruction set has many powerful single-instruction, multiple-data (SIMD) instructions that can perform multiple operations in a single instruction. To take advantage of this, the compiler tries to *vectorize* the source code when possible and profitable. Vectorization usually involves using vector (SIMD) instructions to perform an operation on several loop iterations of data at a time.



The following example removes the UNROLL pragma and the MUST\_ITERATE pragma from the example in the previous section. The UNROLL(1) pragma prevented certain loop-transformation optimizations in the C7000 compiler.
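A minimal sketch of such a loop is shown below. The function name `weighted_sum`, the parameter names, and the loop body are assumptions made for illustration; they are chosen to agree with the multiply-and-add pattern visible in the optimizer output. The sketch spells the qualifier `__restrict` so that it also builds with host compilers, whereas the examples in this guide use `restrict`.

```cpp
// weighted_vector_sum_v2.cpp (illustrative sketch; names are assumptions)
void weighted_sum(int * __restrict a, int * __restrict b, int * __restrict out,
                  int weight_a, int weight_b, int n)
{
    // No UNROLL or MUST_ITERATE pragma: the trip count n is unknown at
    // compile time, and the compiler is free to vectorize the loop.
    for (int i = 0; i < n; i++)
    {
        out[i] = a[i] * weight_a + b[i] * weight_b;
    }
}
```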

The following shows the resulting internal compiler code, which has been vectorized. Vectorization by the compiler can be inferred from the "+= 16" address increments and from "32x16" in the names of optimizer temporary variables (indicating that the temporary variable holds 16 32-bit elements).

```
;*** 6      if ( !((d$1 == 1)&U$33) ) goto g5;
;*** 6      VP$25 = VP$24;
;***
;*** 7      VP$20 = VP$25;
;*** 7      _vstore_pred_p_P64_S32(VP$20, &(*(packed int (*)<[16]>)U$47),
                *(packed int (*)<[16]>)U$38*VRC$s32x16$001+*(packed int (*)<[16]>)U$42*VRC$s32x16$002);
;*** 6      U$38 += 16;
;*** 6      U$42 += 16;
;*** 6      U$47 += 16;
;*** 6      --d$1;
;*** 6      if ( L$1 = L$1-1 ) goto g3;
```

The software pipeline information block from the resulting assembly file is as follows:

```
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : weighted_vector_sum_v2.cpp
;*      Loop source line                 : 6
;*      Loop opening brace source line   : 6
;*      Loop closing brace source line   : 8
;*      Loop Unroll Multiple             : 16x
;*      Known Minimum Iteration Count    : 1
;*      Known Max Iteration Count Factor : 1
;*      Loop Carried Dependency Bound(^) :
;*      Unpartitioned Resource Bound     :
;*      Partitioned Resource Bound       : 2 (pre-sched)
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 2 Schedule found with 7 iterations in parallel
;*
;*        SINGLE SCHEDULED ITERATION
;*
;*        ||$C$C41||:
;*   0              TICK                                  ; [A_U]
;*   1              VLD16W  .D1     *D0++(64),VBM0        ; [A_D1] |7| [SI]
;*   2              VLD16W  .D1     *D1++(64),VBM0        ; [A_D1] |7| [SI]
;*   3              NOP             0x4                   ; [A_B]
;*   7              VMPYWW  .N2     VBM2,VBM0,VBL0        ; [B_N2] |7|
;*   8              VMPYWW  .N2     VBM1,VBM0,VBL1        ; [B_N2] |7|
;*   9              CMPEQW  .L1     AL0,0x1,D3            ; [A_L1] |6|
;*  10              ANDW    .D2     D2,D3,AL1             ; [A_D2] |6|
;*     ||           ADDW    .L1     AL0,0xffffffff,AL0    ; [A_L1] |6|
;*  11              CMPEQW  .S1     AL1,0,A0              ; [A_S1] |6|
;*  12      [!A0]   MV      .P2     P1,P0                 ; [B_P] |6| CASE-1
;*     ||           VADDW   .L2     VBL1,VBL0,VB0         ; [B_L2] |7|
;*  13              VSTP16W .D2     P0,VB0,*A1(0)         ; [A_D2] |7|
;*     ||           ADDD    .M1     A1,0x40,A1            ; [A_M1] |6| [C1]
;*     ||           BNL     .B1     ||$C$C41||            ; [A_B] |6|
;*  14              ; BRANCHCC OCCURS {||$C$C41||}        ; [] |6|
```



Comparing this output with the output in the previous section shows these effects of vectorization:

- The "optimizer" code after several high-level optimization steps, including vectorization. (This "optimizer" code appears in the assembly when using the -os compiler option.) The address increments are by 16 and there are optimizer temporary variables with the partial name of 32x16, indicating 16 32-bit elements.
- The "SOFTWARE PIPELINE INFORMATION" comment block in the assembly file shows that the loop has been unrolled by 16x. This may or may not indicate vectorization has occurred, but is often associated with vectorization.
- The software pipelined loop now uses the VMPYWW and VADDW instructions. The 'V' in the instruction mnemonics often (but not always) indicates that the compiler has vectorized a code sequence (using vector/SIMD instructions).
- Larger address increments in load and store instructions can be another clue that vectorization has occurred.

In this loop, the compiler does not know how many times the loop will execute. Therefore in our example, the compiler must not store to memory an entire vector on the last loop iteration if the number of loop iterations is not a multiple of the number of elements in the vector width that was chosen. For example, if the original (unvectorized) loop will execute 40 iterations and the compiler vectorized the loop by 16, the last optimized iteration will compute 16 elements, but only 8 of them should be stored to memory.

The C7000 ISA has certain vector predication features, where a vector predicate affects which lanes of a vector operation should be performed. In this case, a BITXPND instruction generates a vector predicate that is used in a vector-predicate-aware store instruction. This vector store instruction (VSTP16W) uses the vector predicate to prevent storing to memory those elements on the last iteration that were computed only as a result of the vectorization process and would not have been computed or stored in the original loop. The compiler attempts to perform vector predication automatically during the vectorization process. Vector predication helps avoid the need for generating peeled loop iterations, which can inhibit loop nest optimizations.
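The effect of a predicated vector store can be modeled in plain C++. The following is a minimal sketch, not the compiler's implementation: the function name `weighted_sum_predicated_model` and the constant `kLanes` are hypothetical, and each "vector" is modeled as 16 scalar lanes. Unlike the hardware, which computes every lane and suppresses only the store, the model skips inactive lanes entirely so that it never reads past the end of the input arrays.

```cpp
#include <algorithm>

constexpr int kLanes = 16;  // 16 32-bit lanes, matching VLD16W/VSTP16W

// Scalar model of a loop vectorized by 16 whose stores are predicated.
void weighted_sum_predicated_model(const int *a, const int *b, int *out,
                                   int weight_a, int weight_b, int n)
{
    for (int base = 0; base < n; base += kLanes)
    {
        // Lanes that correspond to iterations of the original scalar loop.
        const int active = std::min(kLanes, n - base);

        for (int lane = 0; lane < kLanes; lane++)
        {
            const bool pred = lane < active;  // one bit of the vector predicate
            if (pred)                         // inactive lanes store nothing
            {
                out[base + lane] = a[base + lane] * weight_a
                                 + b[base + lane] * weight_b;
            }
        }
    }
}
```

With n equal to 40, the third pass through the outer loop has 8 active lanes, so only `out[32]` through `out[39]` are written on that pass.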

#### Note

Vector predicated stores may lead to page faults if the Corepac Memory Management Unit (CMMU) is enabled and the store overlaps an illegal memory page. Any memory range that will be within 63 bytes of an illegal memory page at run-time should be reduced in length in the linker command file. For more information, see the *C7000 C/C++ Compiler User's Guide* (SPRUIG8).

You can avoid vector predication if you give the compiler information about the number of loop iterations using the MUST\_ITERATE pragma. For example, if the loop in the previous example is known to execute only in multiples of 32 and the minimum iteration count is 1024, then the following example improves the generated assembly code:
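A minimal sketch of the loop with this pragma follows. The function and parameter names are assumptions carried over for illustration, and `__restrict` is used so that the sketch also builds with host compilers (which ignore the TI-specific pragma). The `assert` is not required by the pragma; it is added here only to document the same contract in debug builds, because MUST\_ITERATE is a promise to the compiler rather than a run-time check.

```cpp
#include <cassert>

// weighted_vector_sum_v3.cpp (illustrative sketch; names are assumptions)
void weighted_sum(int * __restrict a, int * __restrict b, int * __restrict out,
                  int weight_a, int weight_b, int n)
{
    // Same contract as the pragma below, checked only in debug builds.
    assert(n >= 1024 && n % 32 == 0);

    // At least 1024 iterations, no stated maximum, always a multiple of 32.
    #pragma MUST_ITERATE(1024, ,32)
    for (int i = 0; i < n; i++)
    {
        out[i] = a[i] * weight_a + b[i] * weight_b;
    }
}
```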



When compiled, this modified example generates the following software pipeline information block:

```
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : weighted_vector_sum_v3.cpp
;*      Loop source line                 : 7
;*      Loop opening brace source line   : 7
;*      Loop closing brace source line   : 9
;*      Loop Unroll Multiple             : 32x
;*      Known Minimum Iteration Count    : 32
;*      Known Max Iteration Count Factor :
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     :
;*      Partitioned Resource Bound       : 4 (pre-sched)
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 4 Schedule found with 5 iterations in parallel
;*
;*        SINGLE SCHEDULED ITERATION
;*
;*        ||$C$C36||:
;*   0              TICK                                  ; [A_U]
;*   1              VLD16W  .D1     *D1++(128),VBM0       ; [A_D1] |8| [SI][C1]
;*   2              VLD16W  .D1     *D1(-64),VBM0         ; [A_D1] |8| [C1]
;*   3              VLD16W  .D1     *D2++(128),VBM0       ; [A_D1] |8| [SI][C1]
;*   4              VLD16W  .D1     *D2(-64),VBM0         ; [A_D1] |8| [C1]
;*   5              NOP             0x2                   ; [A_B]
;*   7              VMPYWW  .N2     VBM2,VBM0,VBL1        ; [B_N2] |8|
;*   8              VMPYWW  .N2     VBM2,VBM0,VBL0        ; [B_N2] |8|
;*   9              VMPYWW  .N2     VBM1,VBM0,VBL2        ; [B_N2] |8|
;*  10              VMPYWW  .N2     VBM1,VBM0,VBL1        ; [B_N2] |8|
;*  11              NOP             0x2                   ; [A_B]
;*  13              VADDW   .L2     VBL2,VBL1,VB0         ; [B_L2] |8|
;*  14              VST16W  .D2     VB0,*D0(0)            ; [A_D2] |8|
;*     ||           VADDW   .L2     VBL1,VBL0,VB0         ; [B_L2] |8|
;*  15              VST16W  .D2     VB0,*D0(64)           ; [A_D2] |8| [C0]
;*  16              ADDD    .D2     D0,0x80,D0            ; [A_D2] |7| [C0]
;*     ||           BNL     .B1     ||$C$C36||            ; [A_B] |7|
;*  17              ; BRANCHCC OCCURS {||$C$C36||}        ; [] |7|
```

Due to the added MUST\_ITERATE pragma, the compiler knows that vector predication is never needed and does not perform vector predication. As a result, the compiler removes the CMPEQW, ANDW, VSTP16W, and other instructions associated with the vector predication.



#### 2.3.3 Automatic Use of Streaming Engine and Streaming Address Generator

The compiler can use the Streaming Engine (SE) and/or the Streaming Address Generator (SA) automatically if the --auto\_stream=no\_saving option is used on C7100 and C7120 devices or the --auto\_stream=saving option is used on C7504 and later devices.

If the weighted\_vector\_sum\_v3.cpp example in Section 2.3.2 is compiled with the --auto\_stream=no\_saving option, the following software pipeline information block is generated. (The generated assembly in this example is for C7100.)

```
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : weighted_vector_sum_v2.cpp
;*      Loop source line                 :
;*      Loop opening brace source line   :
;*      Loop closing brace source line   : 9
;*      Loop Unroll Multiple             : 32x
;*      Known Minimum Iteration Count    : 32
;*      Known Max Iteration Count Factor : 1
;*      Loop Carried Dependency Bound(^) :
;*      Unpartitioned Resource Bound     :
;*      Partitioned Resource Bound       : 2 (pre-sched)
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 2 Schedule found with 4 iterations in parallel
;*
;*        SINGLE SCHEDULED ITERATION
;*
;*        ||$C$C36||:
;*   0              TICK                                  ; [A_U]
;*   1              VMPYWW  .N2     VBM1,SE0++,VBL0       ; [B_N2] |8| ^
;*     ||           VMPYWW  .M2     VBM0,SE1++,VBL1       ; [B_M2] |8| ^
;*   2              VMPYWW  .N2     VBM1,SE0++,VBL0       ; [B_N2] |8|
;*     ||           VMPYWW  .M2     VBM0,SE1++,VBL1       ; [B_M2] |8|
;*   3              NOP             0x2                   ; [A_B]
;*   5              VADDW   .L2     VBL1,VBL0,VB0         ; [B_L2] |8|
;*   6              VST16W  .D2     VB0,*D0(0)            ; [A_D2] |8|
;*     ||           VADDW   .L2     VBL1,VBL0,VB0         ; [B_L2] |8|
;*   7              VST16W  .D2     VB0,*D0(64)           ; [A_D2] |8| [C0]
;*     ||           ADDD    .D1     D0,0x80,D0            ; [A_D1] |7| [C1]
;*     ||           BNL     .B1     ||$C$C36||            ; [A_B] |7|
;*   8              ; BRANCHCC OCCURS {||$C$C36||}        ; [] |7|
```

In this case, the compiler uses SE0 and SE1 to replace the loads that previously set a lower ii bound of 4. With these loads instead being performed with SEs, an ii of 2 is achieved. To use the SEs in the above example, the compiler must configure and open them. The configuration and open actions appear in comments that the --src\_interlist option adds before the loop.

By default, the compiler uses the SE or SA only if using them appears to be profitable and legal.

For profitability, a key consideration is that using the SEs or SAs comes with a processing overhead; the compiler does not necessarily know whether this overhead is profitable. In the example, the MUST\_ITERATE pragma indicates the minimum iteration count is 1024, which convinces the compiler that use of SEs or SAs is likely profitable, so the compiler performs the transformation. If the compiler is not using the SE or SA and you want to cause the compiler to use them, indicating the number of iterations with the MUST\_ITERATE or PROB\_ITERATE pragma can help.



For legality, most reasons for not using the SE or the SA relate to whether an addressing pattern can always be mapped to an SE or SA. These reasons include, but are not limited to:

- Iteration counter (ICNT) values that exceed the range of an unsigned 32-bit type. For example, this occurs in for (i = 0; i < icnt; i++) when i and icnt are 64-bit types.
- DIM values that exceed the range of a signed 32-bit type. For example, this occurs in data\_in[i\*dim] when dim is a 64-bit type.
- Additions or multiplies in addressing that exceed the range of a signed 32-bit type. For example, this occurs in data\_in[i\*dim] when i or dim is a 64-bit type.
- Addressing exceeding the range of INT\_MIN to INT\_MAX elements. For example, in int16\_ptr[i] when int16\_ptr is an int16 \* and i is an int, the maximum range is INT\_MIN\*16 elements to INT\_MAX\*16 elements.

Each of these is an edge case that is unlikely to occur in practice. To allow the compiler to ignore them, use the --assume\_addresses\_ok\_for\_stream option.
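Several of the cases listed above depend on the width of the types used for the loop counter and the stride. The following minimal sketch contrasts the two situations; the function names are hypothetical, and whether the compiler actually maps either loop to an SE or SA still depends on its profitability analysis and the options in effect.

```cpp
#include <cstdint>

// 64-bit counter and stride: ICNT, DIM, and the product i * dim may all
// exceed the 32-bit ranges that an SE or SA can represent.
int sum_strided_wide(const int *data_in, int64_t icnt, int64_t dim)
{
    int sum = 0;
    for (int64_t i = 0; i < icnt; i++)
    {
        sum += data_in[i * dim];
    }
    return sum;
}

// 32-bit signed counter and stride: the counter and stride themselves
// fit the ranges described in the list above.
int sum_strided_narrow(const int *data_in, int icnt, int dim)
{
    int sum = 0;
    for (int i = 0; i < icnt; i++)
    {
        sum += data_in[i * dim];
    }
    return sum;
}
```

Both functions compute the same result for inputs that fit in 32 bits; the difference lies only in what the compiler can prove about the addressing pattern.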

If using the SE or SA is not profitable in practice, you can override the --auto\_stream and/or --assume\_addresses\_ok\_for\_stream options for a single function using the FUNCTION\_OPTIONS pragma.

If the code explicitly uses the SE or SA in a function, the compiler does not choose to use either the SE or the SA for optimization. In this case, the compiler assumes that the code handles all aspects of optimization with the SE and SA within that function.

For further information on automatic use of the SE and SA and the associated compiler options, see the C7000 C/C++ Compiler User's Guide (SPRUIG8).

#### 2.3.4 Loop Collapsing and Loop Coalescing

The compiler attempts to *collapse* or *coalesce* nested loops if it is legal and can improve performance. A *nested loop* is a set of two loops where one loop resides inside of another enclosing loop. Both collapsing and coalescing involve transforming a nested loop into a single loop. Collapsing takes place when there is no code in the outer loop. Coalescing takes place when there is code in the outer loop.

After the two nested loops are combined into one loop, the code that was in the body of the outer loop must be transformed so that it conditionally executes only when necessary. Collapsing and coalescing can have performance benefits because only one pipe-up and pipe-down are executed when the loop nest is executed, instead of a pipe-down and pipe-up of the inner loop every time the outer loop executes when loop coalescing/collapsing is not performed.

In order to perform loop collapsing or loop coalescing, the combined loop must be able to be software pipelined. This means that the loop nest must not contain function calls. The loops must each have a signed counting iterator that iterates a fixed amount each time. That is, the inner loop must not iterate a different number of times depending on which outer loop iteration execution is in. Also, the outer loop must not contain too much code, otherwise the transformation will not improve performance. If the outer loop carries a memory dependence, loop coalescing and loop collapsing likely will not be performed.
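The transformation can be pictured at the source level. The following is a minimal sketch only: the compiler performs coalescing internally, and the function names here are hypothetical. The first function is a loop nest with code in the outer loop; the second is a hand-written single loop in which the former outer-loop code executes conditionally. The sketch assumes `cols` is at least 1.

```cpp
// Nested form: the outer loop contains its own code (acc setup and store).
void row_sums_nested(const int *in, int *row_sum, int rows, int cols)
{
    for (int r = 0; r < rows; r++)
    {
        int acc = 0;                      // outer-loop code
        for (int c = 0; c < cols; c++)
        {
            acc += in[r * cols + c];
        }
        row_sum[r] = acc;                 // outer-loop code
    }
}

// Coalesced form: one loop over all elements; the outer-loop code runs
// only on the iteration that finishes a row.
void row_sums_coalesced(const int *in, int *row_sum, int rows, int cols)
{
    int acc = 0;
    int c = 0;
    int r = 0;
    for (int k = 0; k < rows * cols; k++)
    {
        acc += in[k];
        if (++c == cols)                  // last inner iteration of this row
        {
            row_sum[r++] = acc;
            acc = 0;
            c = 0;
        }
    }
}
```

In the coalesced form there is a single loop to pipe up and pipe down, which is the source of the performance benefit described above.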

When loop collapsing or loop coalescing takes place, the software pipelined loop indicates the beginning loop source line ("Loop source line") near the top of the software pipeline information comment block. When this source line number references an outer loop, this indicates that the inner loop has been fully unrolled or the compiler has performed loop coalescing or collapsing. In cases of loop coalescing, the compiler uses special instructions, such as NLCINIT, TICK, GETP, and BNL. A description of these hardware features, encompassing what is known as the "NLC", is beyond the scope of this document. More details of the NLC may be found in the C71x DSP CPU, Instruction Set, and Matrix Multiply Accelerator Technical Reference Manual (SPRUIP0).

#### 2.3.5 Automatic Inlining

The compiler sometimes takes functions defined in header files and places the code at the call site. This allows software pipelining in an enclosing loop and thus improves performance. The compiler may also do this to eliminate the cost of calling and returning from a function.



In the following example, the add\_and\_saturate\_to\_255() function sums two values and caps the sum at 255 if the sum is over 255. This function is called from a function in inlining.cpp, which includes the inlining.h file via a preprocessor #include directive.

```
// inlining.cpp
// Compile with "cl7x -mv7100 --opt_level=3
//   --debug_software_pipeline --src_interlist"
#include "inlining.h"
void saturated_vector_sum(int * restrict a, int * restrict b,
                          int * restrict out, int n)
{
    #pragma MUST_ITERATE(1024,,)
    #pragma UNROLL(1)
    for (int i = 0; i < n; i++)
    {
        out[i] = add_and_saturate_to_255(a[i], b[i]);
    }
}

// inlining.h
int add_and_saturate_to_255(int a, int b)
{
    int sum = a + b;
    if (sum > 255) sum = 255;
    return sum;
}
```

In this case, the compiler will inline the call to add\_and\_saturate\_to\_255() so that software pipelining can be performed. You can determine that inlining has been performed by looking at the bottom of the generated assembly file. Here, the compiler places a comment that add\_and\_saturate\_to\_255() has been inlined. Note that the function's identifier has been modified due to C++ name mangling.

```
;; Inlined function references:
;; [0] _Z23add_and_saturate_to_255ii
```

The inlining can also be seen in the generated assembly code, because there is no CALL instruction to a function in the loop. In fact, because of the inlining (and thus the elimination of the call to a function), the loop can be software pipelined. Software pipelining cannot occur if there is a call to another function in the loop. Note that because of code size concerns, not every call that can be inlined will be inlined automatically. See the *C7000 Optimizing Compiler User's Guide* for more information on inlining.

```
;*        SINGLE SCHEDULED ITERATION
;*
;*        ||$C$C44||:
;*   0              TICK                               ; [A_U]
;*   1              SLDW    .D1     *D1++(4),BL0       ; [A_D1]
;*   2              SLDW    .D2     *D2++(4),BL1       ; [A_D2] |5|
;*   3              NOP             0x5                ; [A_B]
;*   8              ADDW    .L2     BL1,BL0,BL1        ; [B_L2]
;*   9              VMINW   .L2     BL2,BL1,B0         ; [B_L2] |5|
;*  10              STW     .D1X    B0,*D0++(4)        ; [A_D1] |5|
;*     ||           BNL     .B1     ||$C$C44||         ; [A_B] |11|
;*  11              ; BRANCHCC OCCURS {||$C$C44||}     ; [] |11|
```

#### 2.3.6 If Conversion

In order to software pipeline a loop (and thus improve performance), the only branch that may occur in a loop is a branch back to the top of the loop. Branches for if-then and if-then-else statements or for other control-flow constructs will prevent software pipelining.

To get around this limitation, the compiler performs *if-conversion*. If-conversion attempts to remove branches associated with if-then and if-then-else statements, by predicating instructions so that they conditionally execute depending on the test in the "if" statement. As long as there are not too many nesting levels, too many condition terms, or too many instructions in the if-then or if-then-else statements, if-conversion usually succeeds.

The following example demonstrates if-conversion. In order to software pipeline the "for" loop in this C++ code, if-conversion must be performed. The pragmas are used to prevent the compiler from vectorizing and generating additional code that is not important for this example.

```
// if_conversion.cpp
// Compile with "cl7x -mv7100 --opt_level=3 --debug_software_pipeline
//   --src_interlist --symdebug:none if_conversion.cpp"
void function_1(int * restrict a, int *restrict b, int *restrict out, int n)
{
    #pragma UNROLL(1)
    #pragma MUST_ITERATE(1024, ,32)
    for (int i = 0; i < n; i++)
    {
        int result;
        if (a[i] < b[i])
            result = a[i] + b[i];
        else
            result = 0;
        out[i] = result;
    }
}
```

After compilation, the single-scheduled iteration of the loop in the software pipeline information comment block looks like the following:

```
;*        SINGLE SCHEDULED ITERATION
;*
;*        ||$C$C65||:
;*   0              TICK                               ; [A_U]
;*   1              SLDW    .D1     *D2++(4),A1        ; [A_D1] |17|
;*     ||           SLDW    .D2     *D1++(4),A2
;*   2              NOP             0x5                ; [A_B]
;*   7              CMPGEW  .L1     A2,A1,A0           ; |17|
;*   8      [!A0]   ADDW    .D2     A1,A2,D3           ; [A_D2] |17|
;*   9      [ A0]   MVKU32  .S1     0,D3               ; [A_S1] |17|
;*  10              STW     .D1     D3,*D0++(4)        ; [A_D1] |17|
;*     ||           BNL     .B1     ||$C$C65||         ; [A_B] |9|
;*  11              ; BRANCHCC OCCURS {||$C$C65||}     ; [] |9|
```

The instruction [!A0] ADDW.D2 A1,A2,D3 represents the "then" part of the if statement. The instruction [A0] MVKU32.S1 0,D3 represents the "else" part of the if statement. The CMPGEW instruction computes the if-condition and puts the result into a predicate register, which is used to conditionally execute the ADDW and MVKU32 instructions.

### Chapter 3

# **Basic Code Optimization**



This chapter discusses basic code optimization techniques that can be applied to C/C++ code that will run on the C7000 DSP core.

- 3.1 Signed Types for Iteration Counters and Limits
- 3.2 Floating-Point Division
- 3.3 Loop-Carried Dependencies and the Restrict Keyword
- 3.4 Function Calls and Inlining
- 3.5 MUST\_ITERATE and PROB\_ITERATE Pragmas and Attributes
- 3.6 If Statements and Nested If Statements
- 3.7 Intrinsics
- 3.8 Vector Types
- 3.9 C++ Features to Use and Avoid
- 3.10 Streaming Engine
- 3.11 Streaming Address Generator
- 3.12 Optimized Libraries
- 3.13 Memory Optimizations



Basic Code Optimization www.ti.com

#### 3.1 Signed Types for Iteration Counters and Limits

In order for automatic vectorization to occur, the iteration counters and iteration limits for loops should have signed types. In other words, use int rather than unsigned int.

The C language standard defines the behavior for unsigned arithmetic overflow, but not for signed arithmetic overflow.

In the unsigned case, an overflowing value will "wrap around". Therefore, the compiler must assume (in certain cases) that the loop counter may wrap around, and thus cannot make certain necessary conclusions about the behavior of the loop.

In the signed case, the compiler can assume the iteration counter will not overflow, because signed overflow has undefined behavior according to the C standard. Thus, the compiler can make certain conclusions about the behavior of the loop and from there may be able to vectorize the loop.
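As a minimal sketch of this guideline, the two functions below differ only in the type of the iteration counter and limit. The function names are hypothetical, and n is assumed to be even. Whether either loop is actually vectorized depends on the compiler version and options.

```c
/* Preferred: signed counter and limit. Signed overflow is undefined, so the
 * compiler may assume that i += 2 never wraps. */
void pair_sum_signed(const int *restrict in, int *restrict out, int n)
{
    for (int i = 0; i < n; i += 2)
        out[i / 2] = in[i] + in[i + 1];
}

/* Same loop with an unsigned counter and limit. If n were close to UINT_MAX,
 * i += 2 could wrap to a small value, so the compiler has a harder time
 * proving how many iterations the loop executes. */
void pair_sum_unsigned(const int *restrict in, int *restrict out, unsigned int n)
{
    for (unsigned int i = 0; i < n; i += 2)
        out[i / 2] = in[i] + in[i + 1];
}
```

Both functions produce the same results for ordinary inputs; the difference lies only in what the compiler is allowed to assume.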

#### 3.2 Floating-Point Division

Floating-point division operations can be costly. Often, a division operation results in a run-time-support call to a predefined function that implements floating-point division. Such calls prevent software pipelining.

If your code divides by a constant that is known at compile time, consider pre-calculating the 1/constant value and replacing the division operation with a multiplication by 1/constant. The compiler automatically performs this optimization only if the 1/constant value can be exactly represented in an IEEE-754 float or double. For example, dividing by 4.0 qualifies because 0.25 is exactly representable, but dividing by 10.0 does not because 0.1 is not.
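The manual rewrite can be sketched as follows. The function names and the divisor 3.0f are chosen for illustration only. Because 1/3 is not exactly representable, the two versions may differ in the last bits of the result, so this rewrite is appropriate only when that small difference is acceptable.

```c
/* Division inside the loop: may turn into a run-time-support call. */
void scale_div(const float *restrict in, float *restrict out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] / 3.0f;
}

/* Reciprocal computed once from a compile-time constant; the loop body
 * contains only a multiplication. */
void scale_mul(const float *restrict in, float *restrict out, int n)
{
    const float inv = 1.0f / 3.0f;
    for (int i = 0; i < n; i++)
        out[i] = in[i] * inv;
}
```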

#### 3.3 Loop-Carried Dependencies and the Restrict Keyword

To maximize the efficiency of generated code, the C7000 compiler schedules as many instructions as possible in parallel, especially during software pipelining. To schedule instructions in parallel, the compiler must determine the relationships, or dependencies, between instructions. Dependency means that one instruction must occur before another; for example, a variable must be loaded from memory before it can be used. Because only independent instructions can execute in parallel, dependencies inhibit parallelism.

- If the compiler cannot determine that two instructions are independent, it assumes a dependency and schedules the two instructions sequentially, accounting for any latencies needed to complete the first instruction.
- If the compiler can determine that two instructions are independent of one another, it may be able to schedule them in parallel.

#### 3.3.1 Loop-Carried Dependencies

In certain cases when software pipelining, the compiler will not be able to overlap successive iterations of the loop in order to get the best performance. When the compiler is not able to overlap successive iterations of the loop, performance suffers: the initiation interval (ii, described earlier) will be larger than desired and few functional units will be simultaneously utilized.

In almost all cases, this is due to a *loop-carried dependency*. A loop-carried dependency prevents to some degree the execution of iteration i+1 from overlapping with iteration i. A *loop-carried dependency bound* is a lower limit on the initiation interval of the software pipelined loop (and thus a limit on the speed of the software pipelined loop). A loop-carried dependency bound arises because there is a cycle in the ordering constraints (dependencies) for a set of the instructions in a loop. If the loop contains several such cycles, the longest one determines the loop-carried dependency bound. This can occur even if there are plenty of functional units available to perform several iterations in parallel.

If the loop-carried dependency bound is greater than the partitioned resource bound, then one of the loop-carried dependencies is slowing the loop, as the initiation interval is always at least the maximum of the partitioned resource bound and the loop-carried dependency bound.

To reduce or eliminate a problematic loop-carried dependency, one must identify the cycle and then find a way to shorten or break it.

The following example shows a loop with a problematic loop-carried dependency.
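A loop of the following shape produces this kind of dependency. This is a sketch inferred from the assembly shown below (two loads, two multiplies, an add, and a store per iteration); the function name, parameter names, and the use of the \_\_C7000\_\_ predefined macro to guard the pragmas are assumptions. The pointers are deliberately not restrict-qualified.

```c
/* Hypothetical weighted vector sum. Nothing tells the compiler that out
 * does not overlap a or b, so it must assume that a store in one iteration
 * may feed a load in a later iteration. */
void weighted_sum(int *a, int *b, int *out, int weight_a, int weight_b, int n)
{
#ifdef __C7000__
    /* Assumed to match the listing: no unrolling, n >= 1024, n % 32 == 0. */
    #pragma UNROLL(1)
    #pragma MUST_ITERATE(1024, ,32)
#endif
    for (int i = 0; i < n; i++)
        out[i] = a[i] * weight_a + b[i] * weight_b;
}
```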

The compiler-generated assembly code for this example (shown below) shows that the Loop Carried Dependency Bound in the Software Pipeline Information section of the assembly code is 12 cycles.

```
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : weighted_vector_sum_v3.cpp
;*      Loop source line                 :
;*      Loop opening brace source line   :
;*      Loop closing brace source line   : 13
;*      Known Minimum Iteration Count    : 1024
;*      Known Max Iteration Count Factor : 32
;*      Loop Carried Dependency Bound(^) : 12
;*      Unpartitioned Resource Bound     :
;*      Partitioned Resource Bound       : 2 (pre-sched)
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 12 Schedule found with 2 iterations in parallel
;*
;*        SINGLE SCHEDULED ITERATION
;*
;*        ||$C$C51||:
;*   0              TICK                               ; [A_U]
;*   1              LDW     .D2     *D1++(4),BM0       ; [A_D2] |12| ^
;*     ||           LDW     .D1     *D2++(4),BM1       ; [A_D1] |12| ^
;*   2              NOP     0x5                        ; [A_B]
;*   7              MPYWW   .M2     BM2,BM0,BL0        ; [B_M2]
;*     ||           MPYWW   .N2     BM3,BM1,BL1        ; [B_N2] |12|
;*   8              NOP     0x3                        ; [A_B]
;*  11              ADDW            BL1,BL0,B0
;*  12              STW     .D1X    B0,*D0++(4)        ; [A_D1] |12| ^
;*     ||           BNL     .B1     ||$C$C51||         ; [A_B] |10|
;*  13              ; BRANCHCC OCCURS {||$C$C51||}     ; [] |10|
```

The initiation interval of the software pipelined loop is at least the greater of the Loop Carried Dependency Bound and the Partitioned Resource Bound. When the Loop Carried Dependency Bound is greater than the Partitioned Resource Bound, the loop could likely run faster if the loop-carried dependency were shortened or eliminated. In this example, the partitioned resource bound is 2 and the loop-carried dependency bound is 12, so this code has an issue that should be investigated.

To identify the problem, we need to look at the instructions involved in the loop-carried dependency. These instructions are marked with the caret "^" symbol in the comment block in the compiler-generated assembly file. Notice that the load and store instructions are marked with a caret. This tells us the compiler thinks there *may* be a loop-carried dependence between successive iterations. This is likely because the compiler cannot prove the stores are writing to an area of memory that is independent of the location from which the load instructions are loading values. In the absence of information about the locations of the pointers, arrays, and address access patterns, the compiler must assume that successive iterations may load from the location of the previous iteration's stores. See *Hand-Tuning Loops and Control Code on the TMS320C6000* (SPRA666) for more about loop-carried dependencies and how to identify them.



#### 3.3.2 The Restrict Keyword

To correct the problem in the previous example caused by a loop-carried dependency, we need to tell the compiler that these arrays do not overlap in memory, and thus there is no memory dependence from one iteration to the next.

Many common digital signal processing loops contain one or more load operations, some computation, and then a store operation. Typically, the loads are reading from an array and the stores are storing values to an array. If the compiler does not know that the arrays are separate (or do not overlap), the compiler must be conservative and assume that the stores of iteration i may be needed in the loads of iteration i+1, or i+2, etc. Therefore, it is important to tell the compiler if the load and store arrays inhabit entirely different memory areas (that is, the objects/arrays pointed-to do not overlap).

We can do this with the use of the restrict keyword. This keyword tells the compiler that throughout the scope of the variable (array name or pointer name used to access the array), accesses to that object or array will only be made through that array name or pointer name.

#### Note

This description of the restrict pointer is not precisely accurate; it is good enough for most purposes. If you wish to learn more about the restrict keyword, see the C standard or *Demystifying The Restrict Keyword*.

Using the restrict keyword effectively tells the compiler that the stores will not write to locations that later iterations' loads read from. Successive iterations can then be overlapped when the compiler performs software pipelining, which allows the generated code to run faster.

This C function example uses the restrict keyword. The resulting Software Pipeline Information comment block will show that when the restrict keyword is used, the loop-carried dependence bound is zero, while the partitioned resource bound is two. This leads to a much-improved initiation interval (ii) of two cycles.
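A sketch of such a function follows. It is inferred from the assembly listing in Section 3.3.1 (two loads, two multiplies, an add, and a store per iteration); the function name, parameter names, and the \_\_C7000\_\_ guard around the pragmas are assumptions.

```c
/* Hypothetical weighted vector sum with restrict-qualified pointers. The
 * caller promises that a, b, and out refer to non-overlapping arrays. */
void weighted_sum_restrict(int *restrict a, int *restrict b, int *restrict out,
                           int weight_a, int weight_b, int n)
{
#ifdef __C7000__
    /* Assumed iteration-count information: n >= 1024 and n % 32 == 0. */
    #pragma UNROLL(1)
    #pragma MUST_ITERATE(1024, ,32)
#endif
    for (int i = 0; i < n; i++)
        out[i] = a[i] * weight_a + b[i] * weight_b;
}
```

The source change is limited to the three qualifiers in the parameter list; the loop body is untouched.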

The Texas Instruments C7000 C/C++ Compiler allows the restrict keyword to be used in both C and C++ modes, despite the restrict keyword not being part of the C++14 or C++17 standards.

#### Note

If you use the restrict keyword incorrectly, for example on pointers to arrays that actually do overlap, the program has undefined behavior, and the code generated by the compiler may produce incorrect results.

#### 3.3.3 Run-Time Alias Disambiguation

Under certain limited circumstances, the compiler may generate two loops: one that assumes two pointers are not aliased and one that assumes the two pointers are aliased. It generates a run-time check to determine if the two pointers alias. This optimization is called *run-time alias disambiguation*. The advantage is that the loop that assumes non-aliased pointers can usually software pipeline at a much smaller initiation interval, leading to improved performance of the loop.

The compiler cannot always perform run-time alias disambiguation due to considerations that are too technical to describe here. In addition, certain further optimizations such as nested loop coalescing are inhibited when the compiler produces two different loops with a run-time alias check, so it is best to use the restrict keyword whenever legally possible.
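The idea can be illustrated at the source level as follows. This is only a conceptual sketch: the compiler performs the transformation internally, the exact form of its run-time check is not documented here, and the names disjoint and add_one are hypothetical.

```c
#include <stdint.h>

/* Returns nonzero when the n-element int arrays at p and q do not overlap. */
static int disjoint(const int *p, const int *q, int n)
{
    uintptr_t ps = (uintptr_t)p;
    uintptr_t qs = (uintptr_t)q;
    uintptr_t bytes = (uintptr_t)n * sizeof(int);
    return (ps + bytes <= qs) || (qs + bytes <= ps);
}

void add_one(int *in, int *out, int n)
{
    if (disjoint(in, out, n)) {
        /* No overlap: this version may assume independent loads and stores. */
        int *restrict src = in;
        int *restrict dst = out;
        for (int i = 0; i < n; i++)
            dst[i] = src[i] + 1;
    } else {
        /* Possible overlap: conservative version, strictly in order. */
        for (int i = 0; i < n; i++)
            out[i] = in[i] + 1;
    }
}
```

Writing restrict on the function parameters, when it is legal to do so, removes the need for both the check and the second loop.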

For further discussion and details regarding identifying and eliminating loop-carried dependencies, consult the following references:

- TMS320C6000 Programmer's Guide (SPRU198K), Section 2.2.2 "Memory Dependencies"
- Hand-Tuning Loops and Control Code on the TMS320C6000 (SPRA666), Section 4.1, "Using restrict qualifiers, MUST\_ITERATE pragmas, and nasserts()"



#### 3.4 Function Calls and Inlining

In some instances, function calls inhibit optimization. For instance, a loop containing a function call will not be software pipelined if the compiler is not able to inline the called function. In order to enable optimizations such as software pipelining, it may be necessary to define the called function in one of these ways:

- In the same source file as the call with the "inline" keyword
- In a .h file included with the #include preprocessor directive, with the called function using the keywords "static inline"

The compiler performs some amount of automatic inlining at the --opt\_level=3 and --opt\_level=4 optimization levels.
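A minimal sketch of the second approach is shown below. The helper clamp_to_byte and the loop function clamp_array are hypothetical names; the helper would normally live in a header file that is included by every source file that calls it.

```c
/* Would normally be placed in a header file. "static inline" gives every
 * including source file its own inlinable definition. */
static inline int clamp_to_byte(int x)
{
    if (x < 0)
        return 0;
    if (x > 255)
        return 255;
    return x;
}

/* The call below can be inlined because the definition is visible here,
 * which leaves the loop body free of function calls. */
void clamp_array(const int *restrict in, int *restrict out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = clamp_to_byte(in[i]);
}
```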

#### 3.5 MUST\_ITERATE and PROB\_ITERATE Pragmas and Attributes

The compiler can often generate faster code when the compiler knows how many times a loop will execute. Adding this information via the MUST\_ITERATE and PROB\_ITERATE pragmas and the TI\_must\_iterate and TI\_prob\_iterate C++ attributes can help the compiler:

- Determine if it is profitable to vectorize a loop
- Determine if it is profitable to perform certain loop optimizations and loop nest optimizations
- Determine if a redundant loop is needed (see Redundant Loops, below)

Before vectorizing a loop, the compiler tries to determine if the change will improve performance. It is helpful if the compiler has information about the iteration counts of the loop so the compiler can make better predictions about the profitability of vectorization. In the same way, the compiler also tries to determine if certain loop optimizations and loop-nest optimizations will be profitable, so information about the iteration counts of the loops can be helpful to the compiler.

#### Note

Do not provide incorrect information about the iteration count in the MUST\_ITERATE pragma or TI\_must\_iterate C++ attribute. If incorrect information is specified in this pragma/attribute, the compiler may create code that produces unexpected and incorrect behavior.
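As a hedged sketch, the pragma is placed immediately before the loop it describes and takes a minimum count, a maximum count, and a multiple, in the same form used by the examples in this guide. The function name, the specific values, and the \_\_C7000\_\_ guard are assumptions made for illustration; every caller must honor the stated promise.

```c
void vec_add(const int *restrict a, const int *restrict b,
             int *restrict out, int n)
{
#ifdef __C7000__
    /* Promise to the compiler: 64 <= n <= 4096 and n is a multiple of 8. */
    #pragma MUST_ITERATE(64, 4096, 8)
#endif
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Calling vec_add with n = 10 would violate the promise and could produce incorrect results on the target.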

**Redundant Loops:** In some cases, if the compiler does not know how many times a loop will execute, the compiler generates two different versions of the loop. Software pipelined loops often must execute a certain minimum number of iterations to be legal to execute. If the iteration count of the loop is less than this *minimum safe iteration count*, the compiler generates a run-time iteration count check and branches to either the software pipelined version of the loop, or a *duplicate loop*. That is, the compiler generates a "regular" version of the loop (that executes much more slowly).

The minimum safe iteration count depends on how many iterations were scheduled in parallel and how effectively the compiler was able to perform an optimization called *stage collapsing*. See Section 4.2.6 for more information.

The Software Pipeline Information in the comment block in the assembly file specifies the minimum safe iteration count (iteration count) of the loop and states whether the compiler has generated a duplicate loop.

Because the compiler must sometimes generate a redundant loop and the control code necessary to choose between the two loops, it is helpful to tell the compiler the minimum iteration count of the loop with a MUST\_ITERATE pragma when it is known, as the redundant loop may not be necessary. This can improve performance, especially when the loop is enclosed within an outer loop and if the compiler can then perform loop collapsing or other loop optimizations with the outer and inner loops.



The following example shows redundant loop generation information in the Software Pipeline Information section of the assembly comment block.

```
;* Redundant loop generated
;* Collapsed epilog stages : 5
;* Prolog not removed
;* Collapsed prolog stages : 0
```

#### 3.6 If Statements and Nested If Statements

Because software pipelined loops may not have any control flow except for the branch to the kernel, any calls or control-flow (if-statements) will prevent software pipelining. To mitigate the effect that control-flow inside a loop has on whether the loop is software pipelined, the compiler performs "if-conversion" on some if statements, which adds a proper predicate onto the instructions in the "then" and "else" clauses. Because there are a limited number of machine predicate registers, and because of other factors, you should limit the nesting level of if-statements inside loops you hope will software pipeline.
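One way to limit nesting is to combine the conditions into a single test, as in the sketch below. The function names are hypothetical, and whether the compiler if-converts either form depends on the compiler version and options.

```c
/* Two levels of nesting inside the loop. */
void threshold_nested(const int *restrict in, int *restrict out,
                      int lo, int hi, int n)
{
    for (int i = 0; i < n; i++) {
        int result = 0;
        if (in[i] > lo) {
            if (in[i] < hi)
                result = in[i];
        }
        out[i] = result;
    }
}

/* Same result with a single condition and one level of selection. */
void threshold_flat(const int *restrict in, int *restrict out,
                    int lo, int hi, int n)
{
    for (int i = 0; i < n; i++) {
        int v = in[i];
        int keep = (v > lo) & (v < hi);
        out[i] = keep ? v : 0;
    }
}
```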

#### 3.7 Intrinsics

If the compiler is not using the specialized C7000 instructions you would like it to use, or if the operation is not easily expressed in C (for example, saturated add), there may be an intrinsic available for use in C/C++ code. The available intrinsics appear in the .h files in the include directory of the compiler installation directory.
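To illustrate why an operation such as saturated add is awkward in plain C, a portable version is sketched below. The name sat_add32 is hypothetical and is not a C7000 intrinsic; on the target, look in the compiler's header files for an intrinsic that performs the same operation in a single instruction.

```c
#include <stdint.h>

/* Portable 32-bit saturating add: widen, add, then clamp to the int32_t
 * range. An intrinsic would replace the widening and the two comparisons. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX)
        return INT32_MAX;
    if (sum < INT32_MIN)
        return INT32_MIN;
    return (int32_t)sum;
}
```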

#### 3.8 Vector Types

If the compiler is not vectorizing a loop, consider using vector types. See the C7000 Optimizing C/C++ Compiler User's Guide (SPRUIG8) for more information.

Be aware that the compiler may vectorize some operations in a loop, but not others, leading to an inefficient loop. In this instance, it may be best to vectorize the loop by hand using vector types and intrinsics.

#### 3.9 C++ Features to Use and Avoid

Some C++ features incur a run-time penalty. Other features are handled completely at compile-time and thus do not cause a run-time penalty. A full discussion of which C++ features do and do not incur a run-time penalty is outside the scope of this document; discussion is available from several sources on the internet and in print.

Some features that do incur a run-time penalty are so useful in providing the desired level of abstraction and/or safety, that you should consider using them anyway. Here are some guidelines for some of the more commonly-used features:

These features have potential run-time overheads. Consider whether the benefits are worth the cost:

- Calls to new(), although this is essentially no more or less expensive than malloc()
- Use of the Standard Template Library (STL), mainly due to hidden calls to new()
- Exceptions / exception handling
- Run-Time Type Information (RTTI)
- Multiple inheritance
- Virtual functions (although the run-time cost is usually small)

Use these features freely, as they have little to no run-time overhead:

- Templates
- Operator overloading
- Function overloading
- Inlining
- Well-designed inheritance. In particular, calling a member function of a derived class incurs no penalty if the
  object type is known at compile-time.

The following features improve performance and should be used where possible:

- Use of const
- Use of constexpr



- Passing objects by-reference instead of passing objects by-value
- Constructs and expressions that can be evaluated at compile-time versus run-time

#### 3.10 Streaming Engine

The C7100 CPU has two *streaming engines*. A streaming engine is a feature of the C7000 CPU cores that aids in loading data from memory to the CPU. The streaming engines can significantly improve the performance of the memory hierarchy by prefetching data from memory to a location near the CPU. Prefetching data can significantly reduce the time needed to bring data into the CPU. It may also reduce the number of L1 data cache capacity misses as the L1 cache is bypassed for data accessed through the streaming engine.

The streaming engine supports up to a six-dimensional address access pattern. When the performance bottleneck involves reads from memory (if D unit resource bound dominates or cache misses dominate), consider using one or both of the streaming engines if the access pattern to the objects in memory is known in advance. Streaming engines have the greatest effect when used in conjunction with loops that are vectorized by hand. For more information on the streaming engine and code examples, please see the *C71x DSP CPU*, *Instruction Set, and Matrix Multiply Accelerator Technical Reference Manual* (SPRUIP0), the *C7000 Optimizing C/C++ Compiler User's Guide* (SPRUIG8), and the c7x\_strm.h file in the include directory of the compiler's installation directory.

The C7000 compiler does not yet make automatic use of the streaming engine feature.

#### 3.11 Streaming Address Generator

Use of a *streaming address generator* can help limit the number of instructions required to calculate an address used for a load or store instruction. This in turn can reduce the resource bound of the software-pipelined loop and so can positively affect the initiation interval of the loop (and thus improve performance of the loop). It can also allow loop collapsing or loop coalescing to occur, possibly leading to improved performance of the loop.

There are four streaming address generators on C7100 cores. For more information and code examples, please see the C71x DSP CPU, Instruction Set, and Matrix Multiply Accelerator Technical Reference Manual (SPRUIP0), the C7000 Optimizing C/C++ Compiler User's Guide (SPRUIG8) and the c7x\_strm.h file in the include directory of the compiler installation directory.

The C7000 compiler does not yet make automatic use of the streaming address generator feature.

#### 3.12 Optimized Libraries

If available, use libraries that are optimized for the C7000 core, such as TI's Deep Learning Library (TIDL). See the Processor SDK for Jacinto 7 documentation for more information about the available libraries.



#### 3.13 Memory Optimizations

Optimizations that improve the loading and storing of data are often crucial to the performance of an application. A detailed examination of useful memory optimizations on Keystone 3 devices is outside the scope of this document. However, the following are the most common optimizations used to aid memory system throughput and reduce memory hierarchy latency.

- **Blocking:** Input, output, and temporary arrays/objects are often too large to fit into Multicore Shared Memory Controller (MSMC) or L2 memory. For example, when performing an algorithm over an entire 1000x1000 pixel image, the image is too large to fit into most or all configurations of L2 memory, and the algorithm may thrash the caches, leading to poor performance. Keeping the data as close to the CPU as possible improves memory system performance, but how do we do this when the image is too large to fit into the L2 cache? Depending on the algorithm, it may be useful to use a technique called "blocking," in which the algorithm is modified to operate on only a portion of the data at a given time. Once that "block" of data is processed, the algorithm moves to the next block. This technique is often paired with the other techniques in this list.
- **Direct Memory Access (DMA):** Consider using the asynchronous DMA capabilities of the device to move new data into MSMC memory or L2 memory and DMA to move processed data out. This frees the C7000 CPU to perform computations while the DMA is readying data for the next frame, block, or layer.
- **Ping-Pong Buffers:** Consider using ping-pong memory buffers so that the C7000 CPU is processing data in one buffer while a DMA transfer is occurring to/from another buffer. When the C7000 CPU is finished processing the first buffer, the algorithm switches to the second buffer, which now has new data as a result of a DMA transfer. Consider placing these buffers in MSMC or L2 memory, which is much faster than DDR memory.
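The three techniques can be combined as in the following sketch. Everything here is illustrative: BLOCK, fetch_block, and sum_blocked are hypothetical names, and memcpy stands in for a DMA transfer. On a real device the transfer would be started asynchronously through the device's DMA interface, and the code would wait for it to complete before reading the buffer.

```c
#include <string.h>

#define BLOCK 256  /* hypothetical block size, in elements */

/* Stand-in for a DMA transfer from far memory into a near buffer. */
static void fetch_block(int *dst, const int *src, int count)
{
    memcpy(dst, src, (size_t)count * sizeof(int));
}

/* Sums n elements of data by processing one block at a time while the
 * next block is brought into the other ping-pong buffer. */
long sum_blocked(const int *data, int n)
{
    static int buf[2][BLOCK];  /* ping-pong buffers */
    long total = 0;
    int cur = 0;

    if (n <= 0)
        return 0;
    fetch_block(buf[cur], data, n < BLOCK ? n : BLOCK);

    for (int start = 0; start < n; start += BLOCK) {
        int count = (n - start < BLOCK) ? n - start : BLOCK;
        int next_start = start + BLOCK;

        /* Request the next block into the buffer that is not in use. */
        if (next_start < n) {
            int next_count = (n - next_start < BLOCK) ? n - next_start : BLOCK;
            fetch_block(buf[cur ^ 1], data + next_start, next_count);
        }

        /* Process the current block. */
        for (int i = 0; i < count; i++)
            total += buf[cur][i];

        cur ^= 1;  /* swap ping and pong */
    }
    return total;
}
```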

### Chapter 4

## **Understanding the Assembly Comment Blocks**



This chapter provides an explanation of the Software Pipeline Information block added to the assembly output by the --debug\_software\_pipeline compiler option and the Single Scheduled Iteration block added by the --src\_interlist compiler option.

- 4.1 Software Pipelining Processing Stages
- 4.2 Software Pipeline Information Comment Block
- 4.3 Single Scheduled Iteration Comment Block
- 4.4 Identifying Pipeline Failures and Performance Issues



#### 4.1 Software Pipelining Processing Stages

The C7000 compiler goes through three basic stages when software pipelining a loop. The three stages are:

1. Qualify the loop for software pipelining
2. Collect loop resource and dependency graph information
3. Attempt to software pipeline the loop

By the time the compiler tries to software pipeline an inner loop, the compiler may have applied certain transformations to the code in the loop, and also may have combined adjacent or nested loops.

#### Stage 1: Qualification

Several conditions must be met before software pipelining is allowed, or found to be legal, from the compiler's point of view. Two of the most common conditions that cause software pipelining to fail are:

- The loop cannot have too many instructions. Loops that are too big typically require more registers than are available and require a longer compilation time.
- Another function cannot be called from within the loop unless the called function is inlined. Any break in control flow makes it impossible to software pipeline, since multiple iterations are executing in parallel.

If any conditions for software pipelining are *not* met, qualification of the pipeline halts and a disqualification message appears. See Section 4.4.1 for troubleshooting and Section 4.2.1 for information provided during this stage.

If all conditions for software pipelining are met, the compiler continues to Stage 2.

#### Stage 2: Collecting Loop and Dependency Information

The second stage of software pipelining involves collecting loop resource and dependency graph information. See Section 4.2.2 for information about output from this stage.

#### **Stage 3: Software Pipelining Attempts**

Once the compiler has qualified the loop for software pipelining, partitioned it, and analyzed the necessary loop carry and resource requirements, it can attempt software pipelining.

The compiler attempts to software pipeline a loop starting at a certain *initiation interval* (ii). Each time a software pipelining attempt at a particular initiation interval fails, the ii is increased and another attempt is made. This can be seen in the Software Pipeline Information comment block. This process continues until a software pipelining attempt succeeds or ii is equal to the length of a scheduled loop with no software pipelining. If ii reaches the length of a scheduled loop with no software pipelining, attempts to software pipeline stop and the compiler generates a non-software-pipelined loop. See Section 4.2.3 for more about the information provided during this stage.
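The search can be modeled with the short sketch below. It is a simplified model and not the compiler's implementation: it assumes the search starts at the larger of the loop-carried dependency bound and the partitioned resource bound (the lower limit on ii described in Section 3.3.1) and increases ii by one per attempt. All names are hypothetical, and toy_try_schedule is a stand-in for the real scheduling attempt.

```c
/* Hypothetical callback: returns nonzero if a schedule exists at this ii. */
typedef int (*try_schedule_fn)(int ii);

/* Lower limit on ii: the larger of the two bounds. */
int lower_bound_ii(int dependency_bound, int resource_bound)
{
    return dependency_bound > resource_bound ? dependency_bound : resource_bound;
}

/* Returns the ii at which a schedule was found, or 0 if ii reached the
 * length of the non-pipelined loop and software pipelining was abandoned. */
int search_ii(int start_ii, int unpipelined_length, try_schedule_fn try_schedule)
{
    for (int ii = start_ii; ii < unpipelined_length; ii++) {
        if (try_schedule(ii))
            return ii;
    }
    return 0;
}

/* Toy stand-in used only to exercise the search: succeeds when ii >= 3. */
int toy_try_schedule(int ii)
{
    return ii >= 3;
}
```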

If a software pipelining attempt is not successful, the compiler provides additional feedback to help explain why. See Section 4.4.2 for a list of the most common software pipeline failures and strategies for mitigation.

After a successful software pipeline schedule and register allocation is found at a particular initiation interval, more information about the loop is displayed. See Section 4.2.4, Section 4.2.5, Section 4.2.6, Section 4.2.7, and Section 4.2.8.

#### 4.2 Software Pipeline Information Comment Block

The subsections that follow describe some of the information found in the Software Pipeline Information comment block that is added to the generated assembly source file when you use the --debug\_software\_pipeline compiler option. The --keep\_asm option is used automatically in this case to preserve the assembly output.

By understanding the feedback that is generated when the compiler pipelines a loop, you may be able to tune your C code to obtain better performance.



#### 4.2.1 Loop and Iteration Count Information

If the compiler qualifies the loop for software pipelining, the first few lines look like the following example:

The loop counter is called the "iteration counter" because it is the number of iterations through a loop. The statistics provided in this section of the block are:

- **Loop found in file, Loop source line, Loop opening brace source line, Loop closing brace source line**: Information about where the loop is located in the original C/C++ source code.
- **Known Minimum Iteration Count**: The minimum number of times the loop might execute given the amount of information available to the compiler.
- **Known Maximum Iteration Count**: The maximum number of times the loop might execute given the amount of information available to the compiler.
- **Known Max Iteration Count Factor**: The largest number known to divide evenly into the iteration count. Even though the exact value of the iteration count may not be known at compile time, it may be known that the value is a multiple of 2, 4, etc., which may allow more aggressive packed data/SIMD optimization.

The compiler tries to identify information about the loop counter such as minimum value (known minimum iteration count), and whether it is a multiple of something (has a known maximum iteration count factor).

If a Max Iteration Count Factor greater than 1 is known, the compiler might be more aggressive in packed data processing and loop unrolling optimizations. For example, if the exact value of a loop counter is not known but it is known that the value is a multiple of some number, the compiler may be better able to unroll the loop to improve performance.

#### 4.2.2 Dependency and Resource Bounds

The second stage of software pipelining involves collecting loop resource and dependency graph information. The results of Stage 2 are shown in the Software Pipeline Information comment block as follows:

```
;* Loop Carried Dependency Bound(^) : 2
;* Unpartitioned Resource Bound : 12
;* Partitioned Resource Bound : 12 (pre-sched)
```

The statistics provided in this section of the block are:

- **Loop Carried Dependency Bound**: The distance of the largest loop carry path, if one exists. A loop carry path occurs when one iteration of a loop writes a value that must be read in a future iteration. Instructions that are part of the loop carry bound are marked with the ^ symbol. The number shown is the minimum iteration interval that the loop carried dependencies allow for the loop.

If the Loop Carried Dependency Bound is larger than the Resource Bounds, there may be an inefficiency in the loop, and you may be able to improve performance by conveying additional information to the compiler. Potential solutions for this are discussed in Section 3.3.

- **Unpartitioned Resource Bound**: The best case resource bound minimum initiation interval (mii) before the compiler has partitioned each instruction to the A or B side.
- Partitioned Resource Bound (pre-sched, post-sched): The mii after instructions are partitioned to the A and B sides. Pre-scheduling and post-scheduling values are given. The post-scheduling value is the partitioned resource bound after scheduling occurs. Scheduling sometimes involves the addition of instructions, which may affect the resource bound.
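One way to convey additional information when the Loop Carried Dependency Bound is high, assuming the arrays involved never overlap, is the C `restrict` qualifier. Without it, the compiler may have to assume that a store in one iteration feeds a load in a later iteration. The function below is a hypothetical sketch and is not taken from this guide.

```c
#include <stdint.h>

/* Hypothetical example: scale an input array into a separate output array.
 * restrict asserts that dst and src never alias, so the store to dst[i]
 * cannot feed the load of src[i + 1] in the next iteration. */
void scale(int32_t *restrict dst, const int32_t *restrict src,
           int32_t gain, int n)
{
    int i;
    for (i = 0; i < n; i++)
    {
        dst[i] = src[i] * gain;
    }
}
```

The qualifier is a promise made by the programmer. If the buffers can overlap in some call, it must not be used.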



#### 4.2.3 Initiation Interval (ii) and Iterations

The following information is provided about software pipelining attempts:

- **Initiation interval (ii):** In the example, the compiler was able to construct a software pipelined loop that starts a new iteration every 13 cycles. The smaller the initiation interval, the fewer cycles it will take to execute the loop.
- **Iterations in parallel:** When in the steady-state (kernel), the example loop is executing different parts of three iterations at the same time. This means that before iteration n has completed, iterations n+1 and n+2 have begun.

```
;* Searching for software pipeline schedule at ...
;* ii = 12 Cannot allocate machine registers
...
;* ii = 12 Register is live too long
;* ii = 13 Schedule found with 3 iterations in parallel
```

#### 4.2.4 Constant Extensions

Each execute packet can hold up to two constant extensions.

```
;* Constant Extension #0 Used [C0] : 10
;* Constant Extension #1 Used [C1] : 10
```

Constant extension slots are for use by instructions in the execute packet if an instruction's operand constant is too large to fit in the encoding space within the instruction. For instructions that have a constant operand, the encoding space for the constant is usually only a few bits. If a constant will not fit in those few bits, the compiler may use a constant extension slot.

The "Constant Extension #n Used" feedback shows the number of constant extension slots used for each of the C0 and C1 slots.



#### 4.2.5 Resources Used and Register Tables

The Resource Partition table summarizes how the instructions have been assigned to various machine resources and how they have been partitioned between the A and B side. Examples are shown below.

An asterisk (\*) marks entries that determine the resource bound value (that is, the maximum mii). Because many C7000 instructions can execute on more than one functional unit, the table breaks the functional units into categories by possible resource combinations.

- **Individual Functional Units (.L, .S, .D, .M, .C units, etc.)** show the total number of instructions that specifically require that unit. Instructions that can operate on multiple functional units are not included in these counts.

| Resource | A-side | B-side |
|----------|--------|--------|
| .S units | 0      | 0      |
| .M units | 4      | 12\*   |

- **Grouped Functional Units (.M/.N, .L/.S, .L/.S/.C, etc.)** show the total number of instructions that can execute on all of the listed functional units. For example, if the .L/.S line shows an A-side value of 14 and a B-side value of 12, it means that there are 14 instructions that will execute on either .L1 or .S1 and 12 instructions that will execute on either .L2 or .S2.

```
;* .L/.S units 1 8
;* .L/.S/.C units 0 0
```

- **.X cross paths** shows the number of cross path buses needed to move data from one datapath to another (A-to-B or B-to-A).

```
;* .X cross paths 13* 0
```

- **Bound:** shows the minimum ii at which the loop can software pipeline when only considering instructions that can operate on the set of functional units listed on that line. For example, if the .L .S .LS line shows an A-side value of 3 and a B-side value of 2, the instructions that must go on .L or .S need .L1 and .S1 for at least three cycles of the software pipeline schedule and .L2 and .S2 for at least two cycles. The .L .S .LS notation means that the count takes into account instructions that can go only on .L, only on .S, or on either .L or .S.

```
;* Bound(.L .S .LS) 1 4
```

- **Register Usage Tables:** The compiler shows which CPU registers are used on each cycle of the software pipelined kernel. It is difficult to use this information to improve the performance of the loop, but the information can give you an idea of how many registers are active throughout the loop.

```
;* Regs Live Always : 6/ 1/ 4/
;* Max Regs Live : 56/26/29/
;* Max Cond Regs Live : 0/ 0/ 0/
```



#### 4.2.6 Stage Collapsing

In some cases, the compiler can reduce the minimum safe iteration count of a software pipelined loop through a transformation called *stage collapsing*. Information on stage collapsing is displayed in the Software Pipeline Information comment block. An example is shown below.

Stage collapsing always helps reduce code size. It is also usually beneficial for performance, because it can lower the minimum safe iteration count of the software pipelined loop. When the loop executes only a small number of times, it is then more likely that the (faster) software pipelined loop can run and that execution does not have to be transferred to the duplicate loop, which is slower and not software pipelined.

```
;* Epilog not entirely removed
;* Collapsed epilog stages : 2
;* Prolog not removed
;* Collapsed prolog stages : 0
;*
;* Max amt of load speculation : 128 bytes
;*
;* Minimum safe iteration count : 3 (after unrolling)
```

The feedback in the example above shows that two epilog stages were collapsed. However, the compiler was not able to collapse any prolog stages and thus was not able to reduce the minimum safe iteration count of the software pipelined loop down to one (which is the best-case). There are complex technical reasons why a software pipelined loop prolog or epilog may not be removed, and it is difficult for a programmer to affect this outcome.

When performing stage collapsing, the compiler may generate code that executes load instructions *speculatively*, meaning that the result of the load might not be used. In cases where the compiler needs to speculatively execute load instructions, it only does so with load instructions that will not cause an exception if the address accessed is outside the range of legal memory. The feedback about "Max amt of load speculation" tells you how far outside the range of normal address accesses the load speculation will access.

#### 4.2.7 Memory Bank Conflicts

The compiler has limited understanding of the memory bank structure of the cache hierarchy and the alignment of the objects being accessed via memory. Nevertheless, the compiler tries to estimate the performance effect of memory bank conflict stalls caused by an unlucky memory alignment. It presents this information in the Software Pipeline Information comment block.

```
;* Mem bank conflicts/iter(est.) : { min 0.000, est 0.000, max 0.000 }
;* Mem bank perf. penalty (est.) : 0.0%
```

#### 4.2.8 Loop Duration Formula

The compiler also emits a formula for the number of cycles it will take to execute the software pipelined loop in question. Because the compiler often schedules the prolog and/or epilog in parallel with some of the other code surrounding the loop, this formula is not precise when trying to compute the expected number of cycles for an entire function.

```
;* Total cycles (est.) : 13 + iteration_cnt * 4
```



#### 4.3 Single Scheduled Iteration Comment Block

Because the iterations of a software-pipelined loop overlap, it can be difficult to understand the assembly code corresponding to the loop. If source code is compiled with the --debug\_software\_pipeline option, a Single Scheduled Iteration comment block is added to the generated assembly source file. Examining this code makes it easier to understand what the compiler has done and in turn makes optimizing the loop easier.

```
;*        SINGLE SCHEDULED ITERATION
;*
;*        ||$C$C51||:
;*   0              TICK                               ; [A_U]
;*   1              LDW     .D2     *D1++(4),BM0       ; [A_D2] |12|
;*     ||           LDW     .D1     *D2++(4),BM1       ; [A_D1] |12|
;*   2              NOP     0x5                        ; [A_B]
;*   7              MPYWW           BM2,BM0,BL0
;*     ||           MPYWW   .N2     BM3,BM1,BL1        ; [B_N2] |12|
;*   8              NOP     0x3                        ; [A_B]
;*  11              ADDW    .L2     BL1,BL0,B0         ; [B_L2] |12|
;*  12              STW     .D1X    B0,*D0++(4)        ; [A_D1] |12|
;*     ||           BNL     .B1     ||$C$C51||         ; [A_B] |10|
;*  13              ; BRANCHCC OCCURS {||$C$C51||}     ; [] |10|
```

#### 4.4 Identifying Pipeline Failures and Performance Issues

The subsections that follow explain situations that may prevent loops from being optimized.

#### 4.4.1 Issues that Prevent a Loop from Being Software Pipelined

The following situations may prevent a loop from being eligible for software pipelining. These can be detected by examining the assembly output and the Software Pipeline Information in the comment block.

- Loop contains function calls: Although a software pipelined loop can contain intrinsics, it cannot contain function calls. This includes code that will result in a call to un-inlinable run-time support routines, such as floating-point division. You may attempt to inline small, user-defined functions; see Section 2.3.5.
- Loop contains control code: In some cases, the compiler cannot remove all of the control flow from if-then-else statements or "?:" statements. You may attempt to optimize such situations by using if statements only around code that updates memory and around variables whose values are calculated inside the loop and used only outside the loop.
- Conditionally incremented loop control variable is not software pipelined. If a loop contains a loop control variable that is conditionally incremented, the compiler will not be able to software pipeline the loop.

```
for (i = 0; i < x; i++)
{
    ...
    if (b > a)
        i += 2;
}
```

- Too many instructions. Oversized loops typically cannot be scheduled due to the large number of registers needed. In addition, some large loops require an undue amount of time for compilation. A potential solution may be to break the loop into multiple smaller loops.
- Uninitialized iteration counter. The loop counter may not have been set to an initial value.
- Cannot identify iteration counter. The loop control is too complex. Try to simplify the loop.
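For the function-call case, one option is to make a small helper visible to the compiler as an inlining candidate. The sketch below is hypothetical (the names `clamp_u8` and `clamp_array` are not from this guide), and whether the call is actually inlined depends on the optimization options in effect.

```c
#include <stdint.h>

/* Small helper defined in the same file so the compiler can inline it. */
static inline int32_t clamp_u8(int32_t v)
{
    if (v < 0)
    {
        return 0;
    }
    if (v > 255)
    {
        return 255;
    }
    return v;
}

/* Once clamp_u8 is inlined, the loop body no longer contains a call. */
void clamp_array(int32_t *restrict dst, const int32_t *restrict src, int n)
{
    int i;
    for (i = 0; i < n; i++)
    {
        dst[i] = clamp_u8(src[i]);
    }
}
```

The Software Pipeline Information comment block for the loop in `clamp_array` indicates whether the loop qualified after inlining.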



#### 4.4.2 Software Pipeline Failure Messages

Possible software pipeline failure messages provided by the compiler include the following:

- Address increment too large. During software pipelining, the compiler allows reordering of all loads and stores occurring from the same array or pointer. This maximizes flexibility in scheduling. Once a schedule is found, the compiler returns and adds the appropriate offsets and increments/decrements to each load and store. Sometimes, the loads and/or stores end up being offset too far from each other after reordering (the limit for standard load pointers is +/- 32). If this happens, try to restructure the loop so that the pointers are closer together or to rewrite the pointers to use precomputed register offsets.
- Cannot allocate machine registers. After software pipelining and finding a valid schedule, the compiler allocates all values in the loop to specific machine registers. In some cases, the compiler runs out of machine registers in which it can allocate values of variables and intermediate results. If this happens, either try to simplify the loop or break the loop up into multiple smaller loops. In some cases, the compiler can successfully software pipeline a loop at a higher initiation interval (ii).
- Cycle Count Too High. Not Profitable. In rare cases, the iteration interval of a software pipelined loop is higher than a non-pipelined loop. In this case it is more efficient to execute the non-software pipelined loop. A possible solution is to split the loop into multiple loops or reduce the complexity of the loop.
- **Did not find schedule.** Sometimes the compiler simply cannot find a valid software pipeline schedule at a particular initiation interval. A possible solution is to split the loop into multiple loops or reduce the complexity of the loop.
- Iterations in parallel > max. iteration count. Not all loops can be profitably pipelined. Based on the available information for the largest possible iteration count, the compiler estimates that it will always be more profitable to execute a non-software-pipelined version than to execute the pipelined version, given the schedule found at the current initiation interval. A possible solution may be to unroll the loop completely.
- Iterations in parallel > min. iteration count. Based on the available information on the minimum iteration count, it is not always safe to execute the pipelined version of the loop. Normally, a redundant loop would be generated. However, in this case, redundant loop generation has been suppressed via the --opt\_for\_speed=3 or lower option. A possible solution is to add the MUST\_ITERATE pragma to give the compiler more information on the minimum iteration count of the loop.
- Register is live too long. Sometimes the compiler finds a valid software pipeline schedule, but one or more of the values is live too long. The lifetime of a register is determined by the cycle time between when a value is written into the register and the last cycle this value is read by another instruction. By definition, a variable can never be live longer than the ii of the loop, because the next iteration of the loop overwrites that value before it is read. After this message, the compiler provides a detailed description of which values are live too long:

```
ii = 11 Register is live too long
|72| -> |74|
|73| -> |75|
```

The numbers 72, 73, 74, and 75 in this example correspond to line numbers and can be mapped back to the offending instructions. The compiler aggressively attempts to both prevent and fix live-too-long conditions. Techniques you can use to resolve them have a low probability of success and are therefore not discussed in this document. In addition, the compiler can usually find a successful software pipeline schedule at a higher initiation interval (ii).
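Several of the messages above suggest splitting a loop into multiple smaller loops. A minimal sketch of that transformation follows, assuming the two computations are independent. The function names and the arithmetic are hypothetical.

```c
#include <stdint.h>

/* Original form: one loop produces both outputs, so the values needed by
 * both computations are live in the same loop body. */
void sum_and_product(int32_t *restrict sum, int32_t *restrict prod,
                     const int32_t *restrict a, const int32_t *restrict b,
                     int n)
{
    int i;
    for (i = 0; i < n; i++)
    {
        sum[i]  = a[i] + b[i];
        prod[i] = a[i] * b[i];
    }
}

/* Split form: each loop is smaller and is scheduled on its own. */
void sum_then_product(int32_t *restrict sum, int32_t *restrict prod,
                      const int32_t *restrict a, const int32_t *restrict b,
                      int n)
{
    int i;
    for (i = 0; i < n; i++)
    {
        sum[i] = a[i] + b[i];
    }
    for (i = 0; i < n; i++)
    {
        prod[i] = a[i] * b[i];
    }
}
```

The split form reads `a` and `b` twice, so it is worth comparing the feedback and the measured cycles for both forms before keeping the change.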



#### 4.4.3 Performance Issues

You can find the following issues by examining the assembly source and the Software Pipeline Information comment block. Potential solutions are given for each condition.

- Two Loops are Generated, One Not Software Pipelined / Duplicate Loop Generated. If you see the message "Duplicate Loop Generated" in the Software Pipeline Information comment block, or you notice a second version of the loop that is not software pipelined, it may mean that when the iteration count of the loop is too low, it is illegal to execute the software pipelined version of the loop that the compiler has created. To generate only the software pipelined version of the loop, the compiler needs to prove that the minimum iteration count of the loop is high enough to always safely execute the pipelined version. If the minimum number of iterations of the loop is known, using the MUST\_ITERATE pragma to give the compiler this information may help eliminate the duplicate loop.
- Loop Carried Dependency Bound is Larger than the Partitioned Resource Bound. If you see a loop carried dependency bound that is higher than the partitioned resource bound, you likely have one of two problems. First, the compiler may think there is a memory dependence from a store to a subsequent load. See the "Memory Dependencies" section of the TMS320C6000 Programmer's Guide (SPRU198) for more information. Second, a computation in one iteration of the loop may be used in the next iteration of the loop. In this case, the only option is to try to eliminate the flow of information from one iteration to the next, thereby making the iterations more independent of each other.
- Large Outer Loop Overhead in Nested Loop. If the inner loop count of a nested loop is relatively small, the time to execute the outer loop can become a large percentage of the total execution time. For cases where this seems to degrade the overall loop nest performance, two approaches can be tried. First, if there are not too many instructions in the outer loop, you may want to give a hint to the compiler that it should coalesce the loop nest. Try using the COALESCE\_LOOP pragma and check the relative performance of the entire loop nest. If the COALESCE\_LOOP pragma does not work, and the number of iterations of the inner loop is small and does not vary, fully unrolling the inner loop by hand may improve performance of the nested loop because the outer loop may then be software pipelined.
- There are Memory Bank Conflicts. If the compiler generates two memory accesses in one cycle and those accesses reside within the same memory block in the cache hierarchy, a memory bank stall can occur. To avoid this degradation, place the two accesses in different memory blocks through use of the DATA\_ALIGN pragma.

See the C7000 Optimizing C/C++ Compiler User's Guide (SPRUIG8) for information about pragmas.
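As a sketch of the hand-unrolling approach for a small, fixed inner iteration count, the hypothetical 4-tap filter below replaces the inner loop with straight-line code so that only the outer loop remains. The names `fir4_nested` and `fir4_unrolled` and the tap count are assumptions for illustration.

```c
#include <stdint.h>

#define TAPS 4

/* Nested form: the inner loop always runs TAPS times.
 * x must hold at least n + TAPS - 1 elements. */
void fir4_nested(int32_t *restrict y, const int32_t *restrict x,
                 const int32_t *restrict h, int n)
{
    int i, k;
    for (i = 0; i < n; i++)
    {
        int32_t acc = 0;
        for (k = 0; k < TAPS; k++)
        {
            acc += x[i + k] * h[k];
        }
        y[i] = acc;
    }
}

/* Inner loop unrolled by hand: a single loop is left for the compiler. */
void fir4_unrolled(int32_t *restrict y, const int32_t *restrict x,
                   const int32_t *restrict h, int n)
{
    int i;
    for (i = 0; i < n; i++)
    {
        y[i] = x[i] * h[0] + x[i + 1] * h[1]
             + x[i + 2] * h[2] + x[i + 3] * h[3];
    }
}
```

Both functions compute the same result. The unrolled form only applies when the inner iteration count is a compile-time constant.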

# **Revision History**



NOTE: Page numbers for previous revisions may differ from page numbers in the current version.

| Changes from January 21, 2022 to December 15, 2023 (from Revision B (January 2022) to Revision C (December 2023)) | Page |
|-------------------------------------------------------------------------------------------------------------------|------|
| Added C7504 and C7524 sizes to vector width and vector register descriptions                                       | 9    |
| Changed "trip count" to "iteration count" throughout to match new software pipelined loop information            | 14   |
| Updated and extended information about vectorization and vector predication                                       | 17   |
| Added section about automatic use of Streaming Engine and Streaming Address Generator                             | 21   |

| Changes from March 15, 2021 to January 21, 2022 (from Revision A (March 2021) to Revision B (January 2022)) | Page |
|--------------------------------------------------------------------------------------------------------------|------|
| Updated split datapath and functional unit diagram and description                                           |      |

| Changes from May 1, 2020 to March 15, 2021 (from Revision (May 2020) to Revision A (March 2021)) | Page |
|---------------------------------------------------------------------------------------------------|------|
| Vector predicated stores generated by the compiler may trigger page fault exceptions in certain situations. This issue can be corrected in the linker command file. |      |


Mailing Address: Texas Instruments, Post Office Box 655303, Dallas, Texas 75265 Copyright © 2023, Texas Instruments Incorporated