The C code for the pipeline write function is:
void pipeline_write(unsigned byte_cnt)
{
    long long *restrict dst = (long long *)ext_buf[1];
    unsigned int wrStartTime, wrStopTime;
    int i;

    /* Assert 8-byte alignment of the destination so the compiler can
     * freely schedule the 64-bit STDW stores. */
    _nassert((int)dst % 8 == 0);

    wrStartTime = CSL_tscRead();
    #pragma UNROLL(2)
    for (i = 0; i < byte_cnt/8; i++)
    {
        dst[i] = 0xDEADDEAD;
    }
    wrStopTime = CSL_tscRead();

    WBINVALIDATE

    /* Convert the elapsed TSC cycle count to a duration using the DSP
     * clock frequency. */
    wrDuration = (float)(wrStopTime - wrStartTime)/(DSP_FREQ/1000);
}
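As a usage sketch, the measured duration can be converted into a write bandwidth figure. This assumes DSP_FREQ is the DSP clock in Hz, so that DSP_FREQ/1000 is cycles per millisecond and wrDuration comes out in milliseconds; the helper name is illustrative and not part of the original benchmark:

float write_bandwidth_MBps(unsigned byte_cnt)
{
    /* wrDuration is in milliseconds under the DSP_FREQ-in-Hz assumption. */
    float seconds = wrDuration / 1000.0f;
    return ((float)byte_cnt / 1e6f) / seconds;  /* bytes -> MB, then MB/s */
}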
The software pipeline analysis reported by the compiler is:
;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file                 : ../pipeline_loop.c
;*      Loop source line                   : 89
;*      Loop opening brace source line     : 96
;*      Loop closing brace source line     : 98
;*      Loop Unroll Multiple               : 2x
;*      Known Minimum Trip Count           : 1
;*      Known Max Trip Count Factor        : 1
;*      Loop Carried Dependency Bound(^)   : 0
;*      Unpartitioned Resource Bound       : 1
;*      Partitioned Resource Bound(*)      : 1
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        0
;*      .D units                     1*       1*
;*      .M units                     0        0
;*      .X cross paths               0        0
;*      .T address paths             1*       1*
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          0        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        0
;*      Bound(.L .S .D .LS .LSD)     1*       1*
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 1 Schedule found with 2 iterations in parallel
;*----------------------------------------------------------------------------*
;*        SETUP CODE
;*
;*                  MV              B6,A3
;*                  ADD             8,A3,A3
;*                  MV              A4,B4
;*                  MV              A5,B5
;*
;*        SINGLE SCHEDULED ITERATION
;*
;*        $C$C94:
;*   0              STDW    .D2T2   B5:B4,*B6++(16)   ; |97|
;*     ||           STDW    .D1T1   A5:A4,*A3++(16)   ; |97|
;*     ||           SPBR            $C$C94
;*   1              NOP             1
;*   2              ; BRANCHCC OCCURS {$C$C94}        ; |89|
;*----------------------------------------------------------------------------*
The resulting pipeline can be viewed as in Figure 22. Note that this code keeps both store engines (the .D units) occupied on every cycle of the pipeline: with ii = 1 and two parallel STDW instructions, the loop issues 2 × 8 = 16 bytes of store data per cycle.
The CGEM (C66x CorePac) L2 cache controller can have up to 4 L2 line allocations in flight, each bringing in 128 bytes. The XMC can have an additional 8 pre-fetch requests (also 128 bytes each) in flight. In the best case, the L2 controller plus the XMC can therefore have 12 × 128 bytes = 1.5KB worth of requests outstanding at once.
NOTE
In terms of bus requests, the L2 cache controller and XMC actually issue 64-byte requests, so in the best case this is 24 64-byte requests.
In steady state, however, this should drop to 8 × 128 bytes = 1KB total in flight, because the XMC only issues new pre-fetches in two cases: (1) on recognizing a new stream, and (2) on getting hits to an existing stream. In steady state there are no new streams, so only case (2) applies. In the "100% pre-fetch hit" steady-state case, the L2 misses all hit in the XMC pre-fetch buffer and stop there, and the only traffic leaving the XMC is additional pre-fetches. The total number of outstanding requests is therefore limited to the total number of outstanding pre-fetches.
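As a quick sanity check of the two limits above (constants taken from the text; the macro names are illustrative only):

#define LINE_BYTES       128    /* bytes per L2 line / pre-fetch request */
#define L2_IN_FLIGHT       4    /* L2 controller line allocations        */
#define XMC_IN_FLIGHT      8    /* XMC pre-fetch requests                */

/* Best case, streams still being established:
 *   (L2_IN_FLIGHT + XMC_IN_FLIGHT) * LINE_BYTES = 12 * 128 = 1536 bytes (1.5KB)
 * Steady state, only pre-fetch hits generate new requests:
 *   XMC_IN_FLIGHT * LINE_BYTES = 8 * 128 = 1024 bytes (1KB)
 */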
To get higher performance, you would need an optimized copy loop that keeps 4 L2 misses pipelined as much as possible or, if XMC pre-fetch is enabled, at least 4 pre-fetch streams active, in order to saturate the DSP subsystem buses. To emulate this behavior, the following functions were defined; they read or write only the first 64-bit word of each 128-byte L2 cache line. Since they access only the first word of each line, they are named L2 Stride-Jmp Copy, L2 Stride-Jmp Read, and L2 Stride-Jmp Write, respectively.
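As a minimal sketch of the access pattern these functions use (the function name is illustrative, and the ext_buf destination, fill value, and timing scaffolding are assumed to match pipeline_write() above):

void l2_stridejmp_write(unsigned byte_cnt)
{
    long long *restrict dst = (long long *)ext_buf[1];
    unsigned int i;

    /* Step in units of 128 bytes = 16 long longs; only the first 64-bit
     * word of each L2 line is written, so every store misses to a new
     * line and the allocations (or pre-fetches) pipeline behind one
     * another instead of hitting in an already-allocated line. */
    for (i = 0; i < byte_cnt/8; i += 16)
    {
        dst[i] = 0xDEADDEAD;
    }
}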