Based on the observations made in Section 6.2, Table 31 lists the factors that affect the DSP CPU RD-WR performance.
Factors | Impact | General Recommendation |
---|---|---|
Source/Destination Memory | The transfer speed depends on the SRC/DST memory bandwidth. | Know the nature of the source and destination memory, specifically the frequency of operation and the bus width. |
Transfer Size versus Cache Size | Writing data buffers larger than the L2 cache size introduces L2 cache line write-backs along with L2 cache line reads (due to write allocate) at the MDMA port. | Expect a drop in performance when the data buffer being written is larger than the L2 cache size. |
Code Optimization | The more the load and store units in the DSP core are kept busy, the better the CPU read and write performance. | Use an optimized copy loop that keeps up to 4 L2 misses pipelined as much as possible, or, if XMC pre-fetch is enabled, keeps at least 4 pre-fetch streams active, to max out the DSP Subsystem MDMA bus. Use compiler options and software-pipelined loops to achieve this (see the copy-loop sketch following this table). |
MAR Register Pre-fetch Enable | Improves the CPU RD-WR throughput. | Enable pre-fetch in the MAR register for better CPU RD-WR throughput (see the MAR configuration sketch following this table). |
C66x OSS_BUS_CONFIG: MDMA posted versus non-posted writes | Posted writes give better performance than non-posted writes. | Enable posted writes whenever there is no risk of a race condition in which the data is read before the memory is actually updated. |
MMU Enable | Enabling the MMU leads to a slight drop in CPU RD-WR throughput. | |
MAR Register Cacheability | Improves the CPU RD-WR performance when regions are made cacheable. | Set the MAR cacheable bit for regions accessed by the DSP CPU (see the MAR configuration sketch following this table). |
Maximizing cache line reuse | Improves the CPU RD-WR performance. | Reuse the same memory locations within a cached line as often as possible: either re-read the same data, or write new data to already cached locations so that subsequent reads hit in the cache. |
Eviction of a line | Avoiding eviction of a line for as long as it is being reused improves the CPU RD-WR performance. | |
Stall cycles per miss | Reducing the number of stall cycles per miss improves the CPU RD-WR performance. | This can be achieved by exploiting miss pipelining (see the copy-loop sketch following this table). |
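
The following is a minimal sketch of a copy loop written so the TI C6000 compiler can software-pipeline it and keep several L2 misses (or pre-fetch streams) outstanding at once, illustrating the code-optimization and miss-pipelining recommendations above. The `restrict` keyword and the `MUST_ITERATE` pragma are standard TI C6000 compiler features; the unroll hints and the use of 64-bit accesses are tuning assumptions, not values mandated by this report.

```c
/*
 * Sketch of a pipelining-friendly copy loop for the C66x DSP.
 * Build with optimization enabled (for example, --opt_level=3) so the
 * compiler software-pipelines the loop.
 */
#include <stdint.h>

void fast_copy(uint64_t *restrict dst, const uint64_t *restrict src,
               uint32_t count64)
{
    uint32_t i;

    /* Tell the compiler the trip count is at least 8 and a multiple of 8,
     * so it can unroll and software-pipeline the loop aggressively.      */
    #pragma MUST_ITERATE(8, , 8)
    for (i = 0; i < count64; i++) {
        /* Wide (64-bit) accesses keep the load/store units busy and let
         * consecutive L2 misses overlap (miss pipelining).               */
        dst[i] = src[i];
    }
}
```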
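
The next sketch shows one way to mark a DDR region cacheable and pre-fetchable through the C66x MAR registers by direct register writes, as recommended in the MAR cacheability and pre-fetch rows above. The MAR base address (0x01848000) and the PC (bit 0) / PFX (bit 3) bit positions are taken from the C66x CorePac documentation; verify them against the TRM for your device before use. The DDR base address used here (0x80000000) is an assumption for illustration.

```c
/* Sketch: enable caching and XMC pre-fetch for 16MB regions via MAR. */
#include <stdint.h>

#define MAR_BASE  0x01848000u     /* Address of MAR0 (assumed per C66x CorePac) */
#define MAR_PC    (1u << 0)       /* Permit caching                             */
#define MAR_PFX   (1u << 3)       /* Enable XMC pre-fetch                       */

static inline void mar_enable_cache_prefetch(uint32_t region_addr)
{
    /* Each MAR register controls one 16MB region: MARn covers n * 16MB. */
    volatile uint32_t *mar = (volatile uint32_t *)MAR_BASE +
                             (region_addr >> 24);
    *mar |= (MAR_PC | MAR_PFX);
}

int main(void)
{
    /* Example: make the first 16MB of DDR (assumed at 0x80000000) cacheable
     * and pre-fetchable. Repeat for every 16MB region the DSP CPU accesses. */
    mar_enable_cache_prefetch(0x80000000u);
    return 0;
}
```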