SPRUIV4D May 2020 – May 2024
Very-Long Instruction Word (VLIW) digital signal processors (DSPs) like the C7000 depend on software pipelining of loops to achieve maximum performance. Software pipelining is a technique in which successive iterations of a source loop are overlapped so that the functional units on the CPU are utilized on as many cycles as possible throughout the loop.
The following figure shows loop iteration execution both without and with software pipelining. You can see that without software pipelining, loops are scheduled so that loop iteration i completes before iteration i+1 begins. With software pipelining, iterations overlap. Thus, as long as correctness can be preserved, iteration i+1 can start before iteration i finishes. This generally permits a much higher utilization of the machine’s resources than might be achieved from other scheduling techniques. In a software-pipelined loop, even though a single iteration might take s cycles to complete, a new iteration is initiated every ii cycles.
In an efficient software pipelined loop, ii is much less than s. ii is called the initiation interval; it is the number of cycles between starting iteration i and starting iteration i+1. s is the number of cycles for the first iteration to complete, or equivalently, the length of a single scheduled iteration of the software-pipelined loop.
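The relationship between ii and s can be illustrated with a simple cycle model. The following is a minimal sketch and is not taken from the C7000 tools: the function names sequential_cycles and pipelined_cycles are hypothetical, and the model assumes at least one iteration, no stalls, and no setup code outside the loop. The values n = 1024, s = 13, and ii = 2 are illustrative only.

```cpp
#include <cstdint>

// Idealized cycle counts for a loop of n iterations (n >= 1).
//   s  : cycles for one scheduled iteration to complete
//   ii : initiation interval (cycles between starting successive iterations)

// Without software pipelining, iteration i+1 starts only after
// iteration i completes, so every iteration costs the full s cycles.
constexpr std::uint64_t sequential_cycles(std::uint64_t n, std::uint64_t s) {
    return n * s;
}

// With software pipelining, a new iteration starts every ii cycles.
// The last iteration starts at cycle (n - 1) * ii and needs s cycles.
constexpr std::uint64_t pipelined_cycles(std::uint64_t n, std::uint64_t s,
                                         std::uint64_t ii) {
    return (n - 1) * ii + s;
}

// Illustrative values only (assumed, not measured).
static_assert(sequential_cycles(1024, 13) == 13312, "1024 * 13");
static_assert(pipelined_cycles(1024, 13, 2) == 2059, "1023 * 2 + 13");
```

Under these assumptions, the cost per iteration approaches ii as n grows, which is why a small ii matters more than a small s for long-running loops.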
The compiler attempts to software pipeline the innermost source loops. These are loops that do not have any other loops within them. Note that during the compilation process, software pipelining occurs after inlining and after loop transformations that may combine loops, so in certain cases you may see the compiler software pipelining more of your code than you expect.
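After software pipelining, the loop has three major phases: a prolog that fills the pipeline, a kernel that runs in steady state, and an epilog that drains the pipeline, as shown in the following figure: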
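As a rough illustration of these phases, the sketch below counts how many iterations are in flight during each ii-cycle window of an idealized software-pipelined loop. The function name iterations_in_flight and its parameters are hypothetical, and the model assumes every iteration spans exactly `stages` windows. The compiler's actual prolog and epilog can be shorter or longer than this model suggests.

```cpp
#include <algorithm>
#include <vector>

// Number of iterations in flight during each ii-cycle window of an
// idealized software-pipelined loop.
//   n      : number of loop iterations (n >= 1)
//   stages : number of ii-cycle windows one iteration spans (stages >= 1)
std::vector<int> iterations_in_flight(int n, int stages) {
    std::vector<int> active;
    const int windows = n + stages - 1;  // windows until the last iteration ends
    for (int w = 0; w < windows; ++w) {
        const int oldest = std::max(0, w - stages + 1);  // oldest iteration still running
        const int newest = std::min(n - 1, w);           // newest iteration started
        active.push_back(newest - oldest + 1);
    }
    return active;
}
```

For n = 5 and stages = 3, the result is 1, 2, 3, 3, 3, 2, 1. The rising counts correspond to the prolog, the run of 3s to the kernel, and the falling counts to the epilog.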
The following example shows the source code for a simple weighted vector sum.
// weighted_vector_sum.cpp
// Compile with "cl7x -mv7100 --opt_level=3 --debug_software_pipeline
// --src_interlist --symdebug:none weighted_vector_sum.cpp"
void weighted_sum(int * restrict a, int * restrict b, int * restrict out,
                  int weight_a, int weight_b, int n)
{
    #pragma UNROLL(1)
    #pragma MUST_ITERATE(1024, ,32)
    for (int i = 0; i < n; i++)
    {
        out[i] = a[i] * weight_a + b[i] * weight_b;
    }
}
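To simplify this first software-pipelining example, two pragmas are used: UNROLL(1), which tells the compiler not to unroll the loop, and MUST_ITERATE(1024, ,32), which tells the compiler that the loop executes at least 1024 times and that the iteration count is a multiple of 32.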
Then we compile this code with the following command:
cl7x --opt_level=3 --debug_software_pipeline --src_interlist --symdebug:none weighted_vector_sum.cpp
The --symdebug:none option prevents the compiler from generating debug information and the associated debug directives in the assembly. This debug information is not relevant to the discussion in this document and, if included, would unnecessarily lengthen the examples shown here. Normally, you would not turn off debug generation, because the generation of debug information does not degrade performance.
Because the --src_interlist option is used, the compiler-generated assembly file is not deleted and has the following contents:
;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop found in file : weighted_vector_sum.cpp
;* Loop source line : 10
;* Loop opening brace source line : 11
;* Loop closing brace source line : 13
;* Known Minimum Iteration Count : 1024
;* Known Max Iteration Count Factor : 32
;* Loop Carried Dependency Bound(^) : 0
;* Unpartitioned Resource Bound : 2
;* Partitioned Resource Bound : 2 (pre-sched)
;*
;* Searching for software pipeline schedule at ...
;* ii = 2 Schedule found with 7 iterations in parallel
;*
;* Partitioned Resource Bound(*) : 2 (post-sched)
. . .
;*----------------------------------------------------------------------------*
;* SINGLE SCHEDULED ITERATION
;*
;* ||$C$C36||:
;* 0 TICK ; [A_U]
;* 1 SLDW .D1 *D1++(4),BM0 ; [A_D1] |12|
;* || SLDW .D2 *D2++(4),BM1 ; [A_D2] |12|
;* 2 NOP 0x5 ; [A_B]
;* 7 MPYWW .N2 BM2,BM0,BL0 ; [B_N] |12|
;* || MPYWW .M2 BM3,BM1,BL1 ; [B_M2] |12|
;* 8 NOP 0x3 ; [A_B]
;* 11 ADDW .L2 BL1,BL0,B0 ; [B_L2] |12|
;* 12 STW .D1X B0,*D0++(4) ; [A_D1] |12|
;* || BNL .B1 ||$C$C36|| ; [A_B] |10|
;* 13 ; BRANCHCC OCCURS {||$C$C36||} ; [] |10|
;*----------------------------------------------------------------------------*
||$C$L1||: ; PIPED LOOP PROLOG
; EXCLUSIVE CPU CYCLES: 8
TICK ; [A_U] (R) (SP) <1,0>
|| SLDW .D1 *D1++(4),BM1 ; [A_D1] |12| (P) <1,1>
|| SLDW .D2 *D2++(4),BM0 ; [A_D2] |12| (P) <1,1>
MV .L2X A7,B0 ; [B_L2] |7| (R)
|| TICK ; [A_U] (P) <2,0>
MV .L2X A8,B1 ; [B_L2] |7| (R)
|| SLDW .D1 *D1++(4),BM0 ; [A_D1] |12| (P) <2,1>
|| SLDW .D2 *D2++(4),BM1 ; [A_D2] |12| (P) <2,1>
MV .S2 B0,BM2 ; [B_S2] (R)
|| MV .L2 B1,BM3 ; [B_L2] (R)
|| TICK ; [A_U] (P) <3,0>
MPYWW .N2 BM2,BM1,BL0 ; [B_N] |12| (P) <0,7>
|| MPYWW .M2 BM3,BM0,BL1 ; [B_M2] |12| (P) <0,7>
|| SLDW .D1 *D1++(4),BM0 ; [A_D1] |12| (P) <3,1>
|| SLDW .D2 *D2++(4),BM1 ; [A_D2] |12| (P) <3,1>
TICK ; [A_U] (P) <4,0>
MPYWW .N2 BM2,BM1,BL0 ; [B_N] |12| (P) <1,7>
|| MPYWW .M2 BM3,BM0,BL1 ; [B_M2] |12| (P) <1,7>
|| SLDW .D1 *D1++(4),BM0 ; [A_D1] |12| (P) <4,1>
|| SLDW .D2 *D2++(4),BM1 ; [A_D2] |12| (P) <4,1>
MV .D2 A6,D0 ; [A_D2] (R)
|| ADDD .D1 SP,0xfffffff8,SP ; [A_D1] (R)
|| TICK ; [A_U] (P) <5,0>
;** --------------------------------------------------------------------------*
||$C$L2||: ; PIPED LOOP KERNEL
; EXCLUSIVE CPU CYCLES: 2
ADDW .L2 BL1,BL0,B0 ; [B_L2] |12| <0,11>
|| MPYWW .N2 BM2,BM0,BL0 ; [B_N] |12| <2,7>
|| MPYWW .M2 BM3,BM1,BL1 ; [B_M2] |12| <2,7>
|| SLDW .D1 *D1++(4),BM0 ; [A_D1] |12| <5,1>
|| SLDW .D2 *D2++(4),BM1 ; [A_D2] |12| <5,1>
BNL .B1 ||$C$L2|| ; [A_B] |10| <0,12>
|| STW .D1X B0,*D0++(4) ; [A_D1] |12| <0,12>
|| TICK ; [A_U] <6,0>
;** --------------------------------------------------------------------------*
||$C$L3||: ; PIPED LOOP EPILOG
; EXCLUSIVE CPU CYCLES: 7
;** ----------------------- return;
ADDD .D2 SP,0x8,SP ; [A_D2] (O)
|| LDD .D1 *SP(16),A9 ; [A_D1] (O)
|| ADDW .L2 BL1,BL0,B0 ; [B_L2] |12| (E) <4,11>
|| MPYWW .N2 BM2,BM0,BL1 ; [B_N] |12| (E) <6,7>
|| MPYWW .M2 BM3,BM1,BL0 ; [B_M2] |12| (E) <6,7>
STW .D1X B0,*D0++(4) ; [A_D1] |12| (E) <4,12>
ADDW .L2 BL1,BL0,B0 ; [B_L2] |12| (E) <5,11>
STW .D1X B0,*D0++(4) ; [A_D1] |12| (E) <5,12>
ADDW .L2 BL0,BL1,B0 ; [B_L2] |12| (E) <6,11>
STW .D1X B0,*D0++(4) ; [A_D1] |12| (E) <6,12>
RET .B1 ; [A_B] (O)
|| PROT ; [A_U] (E)
; RETURN OCCURS {RP} ; [] (O)
This assembly output shows the software pipelined loop from the compiler-generated assembly file along with part of the software pipelining information comment block, which includes important information about various characteristics of the loop.
If the compiler successfully software pipelines a loop, the compiler-generated assembly code contains a software pipeline information comment block that contains a message of the form "ii = xx Schedule found with yy iterations in parallel". The initiation interval (ii) is a measure of how often the software pipelined loop is able to start executing a new iteration of the loop. The smaller the initiation interval, the fewer cycles it will take to execute the entire loop. The software-pipelined loop information also includes the source lines from which the loop originates, a description of the resource and latency requirements for the loop, and whether the loop was unrolled (among other information). When compiling with -mw, the information also contains a copy of the single scheduled iteration.
In this example, the achieved initiation interval (ii) is 2 cycles, and the number of iterations that will run in parallel is 7.
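These two numbers can be cross-checked against the single scheduled iteration. The sketch below assumes that the number of iterations in parallel equals the length of the single scheduled iteration divided by ii, rounded up, and that the listing's instruction cycles 0 through 12 give a length of 13 cycles. Both are readings of the listing rather than statements from the compiler, and the function name stages_in_parallel is hypothetical.

```cpp
// Ceiling division: number of ii-cycle stages needed to cover a single
// scheduled iteration of s cycles (s >= 1, ii >= 1).
constexpr int stages_in_parallel(int s, int ii) {
    return (s + ii - 1) / ii;
}

// Assumed reading of the listing: s = 13 (cycles 0..12), ii = 2.
static_assert(stages_in_parallel(13, 2) == 7, "ceil(13 / 2)");
```

This agrees with the kernel annotations in the listing, where the iteration index in angle brackets ranges from 0 to 6.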
The comment block also includes a single-scheduled iteration view of the software pipelined loop. This view allows you to see how the compiler transformed the code and how the compiler scheduled one iteration of the loop so that iterations can overlap in the software pipeline. See Section 5.2 for more information on how to interpret the information in this comment block.