SPRUIV4D May 2020 – May 2024
In certain cases when software
pipelining, the compiler will not be able to overlap successive iterations of the
loop in order to get the best performance. When the compiler is not able to overlap
successive iterations of the loop, performance suffers: the initiation interval
(ii
, described earlier) will be larger than desired and few
functional units will be simultaneously utilized.
In almost all cases, this is due to a loop-carried dependency. A loop-carried
dependency prevents to some degree the execution of iteration i+1
from overlapping with iteration i
. A loop-carried dependency
bound is a lower limit on the initiation interval of the software pipelined
loop (and thus a limit on the speed of the software pipelined loop). A loop-carried
dependency bound arises because there is a cycle in the ordering constraints
(dependencies) for a set of the instructions in a loop. Out of all these cycle
lengths in the loop, the maximum loop-carried dependency cycle is the loop-carried
dependency bound. This can occur even if there are plenty of functional units
available to perform several iterations in parallel.
If the loop-carried dependency bound is greater than the partitioned resource bound, then one of the loop-carried dependencies is slowing the loop, as the initiation interval is always at least the maximum of the partitioned resource bound and the loop-carried dependency bound.
To reduce or eliminate a problematic loop-carried dependency, one must identify the cycle and then find a way to shorten or break it.
The following example shows a loop with a problematic loop-carried dependency.
void weighted_sum(int * a, int * b, int * out,
int weight_a, int weight_b, int n)
{
#pragma UNROLL(1)
#pragma MUST_ITERATE(1024, ,32)
for (int i = 0; i < n; i++)
{
out[i] = a[i] * weight_a + b[i] * weight_b;
}
}
The compiler-generated assembly code for this example (shown below) shows that the Loop Carried Dependency Bound in the Software Pipeline Information section of the assembly code is 7 cycles.
;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop found in file : weighted_vector_sum_v3.cpp
;* Loop source line : 10
;* Loop opening brace source line : 11
;* Loop closing brace source line : 13
;* Known Minimum Iteration Count : 1024
;* Known Max Iteration Count Factor : 32
;* Loop Carried Dependency Bound(^) : 12
;* Unpartitioned Resource Bound : 2
;* Partitioned Resource Bound : 2 (pre-sched)
;*
;* Searching for software pipeline schedule at ...
;* ii = 12 Schedule found with 2 iterations in parallel
. . .
;*----------------------------------------------------------------------------*
;* SINGLE SCHEDULED ITERATION
;*
;* ||$C$C51||:
;* 0 TICK ; [A_U]
;* 1 LDW .D2 *D1++(4),BM0 ; [A_D2] |12| ^
;* || LDW .D1 *D2++(4),BM1 ; [A_D1] |12| ^
;* 2 NOP 0x5 ; [A_B]
;* 7 MPYWW .M2 BM2,BM0,BL0 ; [B_M2] |12| ^
;* || MPYWW .N2 BM3,BM1,BL1 ; [B_N2] |12| ^
;* 8 NOP 0x3 ; [A_B]
;* 11 ADDW .L2 BL1,BL0,B0 ; [B_L2] |12| ^
;* 12 STW .D1X B0,*D0++(4) ; [A_D1] |12| ^
;* || BNL .B1 ||$C$C51|| ; [A_B] |10|
;* 13 ; BRANCHCC OCCURS {||$C$C51||} ; [] |10|
;*----------------------------------------------------------------------------*
The final software pipelined initiation interval of the software pipelined loop is at least the greater of the Loop Carried Dependency Bound and the Partitioned Resource Bound. When the Loop Carried Dependency Bound value is greater than the Partitioned Resource Bound value, this indicates the code has a loop-carried dependency bound problem that likely should be addressed. In other words, when the loop-carried dependence bound is greater than the partitioned resource bound, the software pipelined loop could likely run faster if the loop-carried dependency bound is eliminated. Therefore in this example, because the partitioned resource bound is 2 and the loop-carried dependency bound is 12, this code has an issue that should be investigated.
To identify the problem, we need to
look at the instructions involved in the loop-carried dependency. These instructions
are marked with the caret "^
" symbol in the comment block in the
compiler-generated assembly file. Notice that the load and store instructions are
marked with a caret. This tells us the compiler thinks there may be a
loop-carried dependence between successive iterations. This is likely because the
compiler cannot prove the stores are writing to an area of memory that is
independent of the location from which the load instructions are loading values. In
absence of information about the locations of the pointers, arrays and address
access patterns, the compiler must assume that successive iterations may load from
the location of the previous iteration's stores. See Hand-Tuning Loops and
Control Code on the TMS320C6000 (SPRA666) for more about loop-carried dependencies and how to identify
them.