Based on the observations made in Section 6.2, Table 31 lists the factors that affect the DSP CPU RD-WR performance.
Factors | Impact | General Recommendation |
---|---|---|
Source/Destination Memory | The transfer speed depends on the SRC/DST memory bandwidth. | Know the nature of the source and destination memory, specifically the frequency of operation and the bus width. |
Transfer Size versus Cache Size | Writing data buffers larger than the L2 cache size introduces L2 cache line write-backs along with L2 cache line reads (due to write allocate) at the MDMA port. | Expect a drop in performance when the data buffer being written is larger than the L2 cache size. |
Code Optimization | The more the load and store units in the DSP core are kept busy, the better the CPU read and write performance. | Use an optimized copy loop that keeps up to 4 L2 misses pipelined as much as possible, or, if XMC pre-fetch is enabled, keeps at least 4 pre-fetch streams active, to max out the DSP Subsystem MDMA bus. Use compiler options and software-pipelined loops to achieve this (see the copy-loop sketch following this table). |
MAR Register Pre-fetch Enable | Improves the CPU RD-WR throughput. | Enable pre-fetch in the MAR register for better CPU RD-WR throughput (see the MAR configuration sketch following this table). |
C66x OSS_BUS_CONFIG: MDMA posted versus non-posted writes | Posted writes give better performance than non-posted writes. | Enable posted writes whenever there is no risk of a race condition in which the data is read before the memory is actually updated. |
MMU Enable | Enabling the MMU leads to a slight drop in CPU RD-WR throughput. | |
MAR Register Cacheability | Improves the CPU RD-WR performance when regions are made cacheable. | Set the MAR cacheable bit for regions accessed by the DSP CPU (see the MAR configuration sketch following this table). |
Maximizing cache line reuse | Improves the CPU RD-WR performance. | Reuse the same memory locations within a cached line as often as possible: either re-read the same data, or write new data to already cached locations so that subsequent reads hit in the cache. |
Eviction of a line | Avoiding eviction of a line for as long as it is being reused improves the CPU RD-WR performance. | |
Stall cycles per miss | Reducing the number of stall cycles per miss improves the CPU RD-WR performance. | This can be achieved by exploiting miss pipelining (see the copy-loop sketch following this table). |
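
The following is a minimal sketch of a copy loop written so the TI C6000 compiler can software-pipeline it and keep several L2 misses (or pre-fetch streams) outstanding at once, illustrating the code-optimization and miss-pipelining recommendations above. The `restrict` keyword and the `MUST_ITERATE` pragma are standard TI C6000 compiler features; the unroll hints and the use of 64-bit accesses are tuning assumptions, not values mandated by this report.

```c
/*
 * Sketch of a pipelining-friendly copy loop for the C66x DSP.
 * Build with optimization enabled (for example, --opt_level=3) so the
 * compiler software-pipelines the loop.
 */
#include <stdint.h>

void fast_copy(uint64_t *restrict dst, const uint64_t *restrict src,
               uint32_t count64)
{
    uint32_t i;

    /* Tell the compiler the trip count is at least 8 and a multiple of 8,
     * so it can unroll and software-pipeline the loop aggressively.      */
    #pragma MUST_ITERATE(8, , 8)
    for (i = 0; i < count64; i++) {
        /* Wide (64-bit) accesses keep the load/store units busy and let
         * consecutive L2 misses overlap (miss pipelining).               */
        dst[i] = src[i];
    }
}
```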
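
The next sketch shows one way to mark a DDR region cacheable and pre-fetchable through the C66x MAR registers by direct register writes, as recommended in the MAR cacheability and pre-fetch rows above. The MAR base address (0x01848000) and the PC (bit 0) / PFX (bit 3) bit positions are taken from the C66x CorePac documentation; verify them against the TRM for your device before use. The DDR base address used here (0x80000000) is an assumption for illustration.

```c
/* Sketch: enable caching and XMC pre-fetch for 16MB regions via MAR. */
#include <stdint.h>

#define MAR_BASE  0x01848000u     /* Address of MAR0 (assumed per C66x CorePac) */
#define MAR_PC    (1u << 0)       /* Permit caching                             */
#define MAR_PFX   (1u << 3)       /* Enable XMC pre-fetch                       */

static inline void mar_enable_cache_prefetch(uint32_t region_addr)
{
    /* Each MAR register controls one 16MB region: MARn covers n * 16MB. */
    volatile uint32_t *mar = (volatile uint32_t *)MAR_BASE +
                             (region_addr >> 24);
    *mar |= (MAR_PC | MAR_PFX);
}

int main(void)
{
    /* Example: make the first 16MB of DDR (assumed at 0x80000000) cacheable
     * and pre-fetchable. Repeat for every 16MB region the DSP CPU accesses. */
    mar_enable_cache_prefetch(0x80000000u);
    return 0;
}
```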