SPRAC21A June 2016 – June 2019
OMAP-L132, OMAP-L138, TDA2E, TDA2EG-17, TDA2HF, TDA2HG, TDA2HV, TDA2LF, TDA2P-ABZ, TDA2P-ACD, TDA2SA, TDA2SG, TDA2SX, TDA3LA, TDA3LX, TDA3MA, TDA3MD, TDA3MV

 

TDA2xx and TDA2ex Performance
  Trademarks
  1 SoC Overview
    1.1 Introduction
    1.2 Acronyms and Definitions
    1.3 TDA2xx and TDA2ex System Interconnect
    1.4 Traffic Regulation Within the Interconnect
      1.4.1 Bandwidth Regulators
      1.4.2 Bandwidth Limiters
      1.4.3 Initiator Priority
    1.5 TDA2xx and TDA2ex Memory Subsystem
      1.5.1 Controller/PHY Timing Parameters
      1.5.2 Class of Service
      1.5.3 Prioritization Between DMM/SYS PORT or MPU Port to EMIF
    1.6 TDA2xx and TDA2ex Measurement Operating Frequencies
    1.7 System Instrumentation and Measurement Methodology
      1.7.1 GP Timers
      1.7.2 L3 Statistic Collectors
  2 Cortex-A15
    2.1 Level1 and Level2 Cache
    2.2 MMU
    2.3 Performance Control Mechanisms
      2.3.1 Cortex-A15 Knobs
      2.3.2 MMU Page Table Knobs
    2.4 Cortex-A15 CPU Read and Write Performance
      2.4.1 Cortex-A15 Functions
      2.4.2 Setup Limitations
      2.4.3 System Performance
        2.4.3.1 Cortex-A15 Stand-Alone Memory Read, Write, Copy
        2.4.3.2 Results
  3 System Enhanced Direct Memory Access (System EDMA)
    3.1 System EDMA Performance
      3.1.1 System EDMA Read and Write
      3.1.2 System EDMA Results
    3.2 System EDMA Observations
  4 DSP Subsystem EDMA
    4.1 DSP Subsystem EDMA Performance
      4.1.1 DSP Subsystem EDMA Read and Write
      4.1.2 DSP Subsystem EDMA Results
    4.2 DSP Subsystem EDMA Observations
  5 Embedded Vision Engine (EVE) Subsystem EDMA
    5.1 EVE EDMA Performance
      5.1.1 EVE EDMA Read and Write
      5.1.2 EVE EDMA Results
    5.2 EVE EDMA Observations
  6 DSP CPU
    6.1 DSP CPU Performance
      6.1.1 DSP CPU Read and Write
      6.1.2 Code Setup
        6.1.2.1 Pipeline Copy
        6.1.2.2 Pipeline Read
        6.1.2.3 Pipeline Write
        6.1.2.4 L2 Stride-Jmp Copy
        6.1.2.5 L2 Stride-Jmp Read
        6.1.2.6 L2 Stride-Jmp Write
    6.2 DSP CPU Observations
    6.3 Summary
  7 Cortex-M4 (IPU)
    7.1 Cortex-M4 CPU Performance
      7.1.1 Cortex-M4 CPU Read and Write
      7.1.2 Code Setup
      7.1.3 Cortex-M4 Functions
      7.1.4 Setup Limitations
    7.2 Cortex-M4 CPU Observations
      7.2.1 Cache Disable
      7.2.2 Cache Enable
    7.3 Summary
  8 USB IP
    8.1 Overview
    8.2 USB IP Performance
      8.2.1 Test Setup
      8.2.2 Results and Observations
      8.2.3 Summary
  9 PCIe IP
    9.1 Overview
    9.2 PCIe IP Performance
      9.2.1 Test Setup
      9.2.2 Results and Observations
  10 IVA-HD IP
    10.1 Overview
    10.2 H.264 Decoder
      10.2.1 Description
      10.2.2 Test Setup
      10.2.3 Test Results
    10.3 MJPEG Decoder
      10.3.1 Description
      10.3.2 Test Setup
      10.3.3 Test Results
  11 MMC IP
    11.1 MMC Read and Write Performance
      11.1.1 Test Description
      11.1.2 Test Results
    11.2 Summary
  12 SATA IP
    12.1 SATA Read and Write Performance
      12.1.1 Test Setup
      12.1.2 Observations
        12.1.2.1 RAW Performance
        12.1.2.2 SDK Performance
    12.2 Summary
  13 GMAC IP
    13.1 GMAC Receive/Transmit Performance
      13.1.1 Test Setup
      13.1.2 Test Description
        13.1.2.1 CPPI Buffer Descriptors
      13.1.3 Test Results
        13.1.3.1 Receive/Transmit Mode
        13.1.3.2 Receive Only Mode
        13.1.3.3 Transmit Only Mode
    13.2 Summary
  14 GPMC IP
    14.1 GPMC Read and Write Performance
      14.1.1 Test Setup
        14.1.1.1 NAND Flash
        14.1.1.2 NOR Flash
      14.1.2 Test Description
        14.1.2.1 Asynchronous NAND Flash Read/Write Using CPU Prefetch Mode
        14.1.2.2 Asynchronous NOR Flash Single Read
        14.1.2.3 Asynchronous NOR Flash Page Read
        14.1.2.4 Asynchronous NOR Flash Single Write
      14.1.3 Test Results
    14.2 Summary
  15 QSPI IP
    15.1 QSPI Read and Write Performance
      15.1.1 Test Setup
      15.1.2 Test Results
      15.1.3 Analysis
        15.1.3.1 Theoretical Calculations
        15.1.3.2 % Efficiency
    15.2 QSPI XIP Code Execution Performance
    15.3 Summary
  16 Standard Benchmarks
    16.1 Dhrystone
      16.1.1 Cortex-A15 Tests and Results
      16.1.2 Cortex-M4 Tests and Results
    16.2 LMbench
      16.2.1 LMbench Bandwidth
        16.2.1.1 TDA2xx and TDA2ex Cortex-A15 LMbench Bandwidth Results
        16.2.1.2 TDA2xx and TDA2ex Cortex-M4 LMbench Bandwidth Results
        16.2.1.3 Analysis
      16.2.2 LMbench Latency
        16.2.2.1 TDA2xx and TDA2ex Cortex-A15 LMbench Latency Results
        16.2.2.2 TDA2xx and TDA2ex Cortex-M4 LMbench Latency Results
        16.2.2.3 Analysis
    16.3 STREAM
      16.3.1 TDA2xx and TDA2ex Cortex-A15 STREAM Benchmark Results
      16.3.2 TDA2xx and TDA2ex Cortex-M4 STREAM Benchmark Results
  17 Error Checking and Correction (ECC)
    17.1 OCMC ECC Programming
    17.2 EMIF ECC Programming
    17.3 EMIF ECC Programming to Starterware Code Mapping
    17.4 Careabouts of Using EMIF ECC
      17.4.1 Restrictions Due to Non-Availability of Read Modify Write ECC Support in EMIF
        17.4.1.1 Un-Cached CPU Access of EMIF
        17.4.1.2 Cached CPU Access of EMIF
        17.4.1.3 Non CPU Access of EMIF Memory
        17.4.1.4 Debugger Access of EMIF via the Memory Browser/Watch Window
        17.4.1.5 Software Breakpoints While Debugging
      17.4.2 Compiler Optimization
      17.4.3 Restrictions Due to i882 Errata
      17.4.4 How to Find Who Caused the Unaligned Quanta Writes After the Interrupt
    17.5 Impact of ECC on Performance
  18 DDR3 Interleaved vs Non-Interleaved
    18.1 Interleaved versus Non-Interleaved Setup
    18.2 Impact of Interleaved vs Non-Interleaved DDR3 for a Single Initiator
    18.3 Impact of Interleaved vs Non-Interleaved DDR3 for Multiple Initiators
  19 DDR3 vs DDR2 Performance
    19.1 Impact of DDR2 vs DDR3 for a Single Initiator
    19.2 Impact of DDR2 vs DDR3 for Multiple Initiators
  20 Boot Time Profile
    20.1 ROM Boot Time Profile
    20.2 System Boot Time Profile
  21 L3 Statistics Collector Programming Model
  22 Reference
Revision History

4.2 DSP Subsystem EDMA Observations

NOTE

On the TDA2xx and TDA2ex devices, both DSP subsystem transfer controllers (TC0 and TC1) yield identical performance for all transfer scenarios. This is because they have the same configuration and, most importantly, the same FIFOSIZE for a given burst size.

EDMA channel parameters allow many different transfer configurations. Typical configurations result in the transfer controllers bursting the read/write data in default-burst-size (DBS) chunks, which keeps the buses fully utilized. In some configurations, however, the TC issues read/write commands that are smaller than the DBS, which reduces performance. To design a system properly, it is important to know which configurations offer the best performance for high-speed operations.

On TDA2xx and TDA2ex, the DSP subsystem EDMA has two transfer controllers (TC0 and TC1) that move data between slave endpoints. Their default configuration is shown in Table 21.

Table 21. Default Configuration for the Transfer Controllers

Register Field | Name | Description | TC0 | TC1
TCCFG[2:0] | FIFOSIZE | Channel FIFO size | 1024 bytes | 1024 bytes
TCCFG[5:4] | BUSWIDTH | Data transfer bus width | 16 bytes | 16 bytes
TCCFG[9:8] | DSTREGDEPTH | Destination register depth | 4 entries | 4 entries
- | DBS (default burst size) | Size of each data burst | Configurable | Configurable

The performance of an individual TC for paging (memory-to-memory) transfers is essentially dictated by the TC configuration. In most scenarios, the FIFOSIZE and DBS settings have the most significant impact on TC performance. The BUSWIDTH depends on the device architecture, and the DSTREGDEPTH value affects the number of in-flight transfers.

Figure 15. DSP EDMA Third-party Transfer Controller (EDMA_TPTC) Block Diagram

The DBS of each TC is set through the TC0_DBS and TC1_DBS fields of the C66x_OSS_BUS_CONFIG register, one of the TDA2xx and TDA2ex DSP subsystem OCP registers, as shown in Table 22.

Table 22. C66x_OSS_BUS_CONFIG

Address Offset: 0x0000 0014
Physical Address: 0x01D0 0014 (DSP view)
Instance: C66x_OCP_REGISTERS
Description: Bus configuration
Type: RW

Bit layout (bit 31 to bit 0):
[31] RESERVED | [30:28] SDMA_PRI | [27:25] RESERVED | [24] NOPOSTOVERRIDE | [23:22] RESERVED | [21:20] SDMA_L2PRES | [19:18] RESERVED | [17:16] CFG_L2PRES | [15:14] RESERVED | [13:12] TC1_L2PRES | [11:10] RESERVED | [9:8] TC0_L2PRES | [7:6] RESERVED | [5:4] TC1_DBS | [3:2] RESERVED | [1:0] TC0_DBS

Bits | Field Name | Description | Type | Reset
31 | RESERVED | Reserved | R | 0x0
30:28 | SDMA_PRI | Sets the CBA/VBusM priority for the CGEM SDMA port. Can typically be left at the default value. | R | 0x4
27:25 | RESERVED | Reserved | R | 0x0
24 | NOPOSTOVERRIDE | Non-posted writes setting | RW | 0x1
23:22 | RESERVED | Reserved | R | 0x0
21:20 | SDMA_L2PRES | OCP slave port L2 interconnect pressure, driven on the OCP MFlag to control arbitration within the L2 interconnect | RW | 0x0
19:18 | RESERVED | Reserved | R | 0x0
17:16 | CFG_L2PRES | CGEM CFG L2 interconnect pressure, driven on the OCP MFlag to control arbitration within the L2 interconnect | RW | 0x0
15:14 | RESERVED | Reserved | R | 0x0
13:12 | TC1_L2PRES | TC1 L2 interconnect pressure, driven on the OCP MFlag to control arbitration within the L2 interconnect | RW | 0x0
11:10 | RESERVED | Reserved | R | 0x0
9:8 | TC0_L2PRES | TC0 L2 interconnect pressure, driven on the OCP MFlag to control arbitration within the L2 interconnect | RW | 0x0
7:6 | RESERVED | Reserved | R | 0x0
5:4 | TC1_DBS | TC1 default burst size setting | RW | 0x3
3:2 | RESERVED | Reserved | R | 0x0
1:0 | TC0_DBS | TC0 default burst size setting | RW | 0x3
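The C helpers below are a minimal sketch of how the two DBS fields could be read and updated. They operate on a 32-bit copy of the C66x_OSS_BUS_CONFIG value and use the bit positions from Table 22. The helper and macro names are hypothetical.

The mapping from the 2-bit code to a byte count is also an assumption. This document only implies that the reset code 0x3 corresponds to the 128-byte default burst size.

```c
#include <stdint.h>

/* Field positions of C66x_OSS_BUS_CONFIG, as listed in Table 22. */
#define TC0_DBS_SHIFT   0u
#define TC1_DBS_SHIFT   4u
#define DBS_FIELD_MASK  0x3u   /* TC0_DBS and TC1_DBS are both 2-bit fields */

/* Reset value assembled from the Reset column of Table 22:
 * SDMA_PRI = 0x4 (bits 30:28), NOPOSTOVERRIDE = 0x1 (bit 24),
 * TC1_DBS = 0x3 (bits 5:4), TC0_DBS = 0x3 (bits 1:0); all other fields are 0. */
#define C66X_OSS_BUS_CONFIG_RESET \
    ((0x4u << 28) | (0x1u << 24) | (0x3u << TC1_DBS_SHIFT) | (0x3u << TC0_DBS_SHIFT))

/* Bit offset of the DBS field for TC0 (tc == 0) or TC1 (any other value). */
static uint32_t dbs_shift(unsigned tc)
{
    return (tc == 0u) ? TC0_DBS_SHIFT : TC1_DBS_SHIFT;
}

/* Extract the 2-bit DBS code of the selected TC from a register value. */
static uint32_t get_dbs_code(uint32_t reg, unsigned tc)
{
    return (reg >> dbs_shift(tc)) & DBS_FIELD_MASK;
}

/* Return reg with the DBS code of the selected TC replaced; all other bits are kept. */
static uint32_t set_dbs_code(uint32_t reg, unsigned tc, uint32_t code)
{
    uint32_t shift = dbs_shift(tc);
    reg &= ~(DBS_FIELD_MASK << shift);
    reg |= (code & DBS_FIELD_MASK) << shift;
    return reg;
}

/* ASSUMPTION: code n selects a (16 << n)-byte burst. This document only implies that
 * code 0x3 (the reset value) corresponds to the 128-byte default burst size;
 * confirm the other encodings in the device TRM before relying on them. */
static uint32_t dbs_code_to_bytes(uint32_t code)
{
    return 16u << (code & DBS_FIELD_MASK);
}
```

On the target, the register value would typically be obtained with a volatile 32-bit read of the physical address in Table 22. It would then be modified with set_dbs_code() and written back. That access code is left out because it depends on the platform's memory map and software environment.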

Together with the source and destination register sets, the TC read and write controllers are responsible for issuing optimally sized reads and writes to the slave endpoints. An optimally sized command is defined by the transfer controller default burst size (DBS).

The EDMA_TPTC attempts to issue the largest possible command size. This size is limited by the DBS value or by the ACNT (ABCNT_n[15:0]) and BCNT (ABCNT_n[31:16]) values of the TR. The EDMA_TPTC obeys the following rules:

1. The read/write controllers always issue commands less than or equal to the DBS value.
2. The first command of a 1D transfer aligns the addresses of all subsequent commands to a DBS boundary.

Table 23 lists the TR segmentation rules followed by the EDMA_TPTC. In summary, if the ACNT value (ABCNT_n[15:0]) is larger than the DBS value, the EDMA_TPTC breaks each ACNT array into DBS-sized commands to the source/destination addresses. The BCNT (ABCNT_n[31:16]) arrays are then serviced in succession.
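The segmentation rules can be illustrated with a simplified software model. The sketch rests on these assumptions:

- A command never exceeds the DBS.
- The first command of each array is shortened so that later commands start on a DBS boundary.
- Only one side (read or write) of the TC is modeled.

The function names are hypothetical, and this is not a cycle-accurate description of the EDMA_TPTC.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified model (assumption): split one ACNT-byte array starting at addr into
 * commands no larger than dbs. The first command is shortened, if needed, so that
 * every later command starts on a DBS boundary. Returns the number of commands;
 * if sizes is non-NULL, up to max_cmds command sizes are written to it. */
static size_t split_array(uint32_t addr, uint32_t acnt, uint32_t dbs,
                          uint32_t *sizes, size_t max_cmds)
{
    size_t n = 0;
    while (acnt > 0u) {
        uint32_t to_boundary = dbs - (addr % dbs);  /* bytes until the next DBS boundary */
        uint32_t len = (acnt < to_boundary) ? acnt : to_boundary;
        if (sizes != NULL && n < max_cmds) {
            sizes[n] = len;
        }
        n++;
        addr += len;
        acnt -= len;
    }
    return n;
}

/* Non-optimized 2D TR: the BCNT arrays are serviced in succession, and array b starts
 * at addr + b * bidx (BIDX is treated here as a non-negative byte offset). */
static size_t count_2d_commands(uint32_t addr, uint32_t acnt, uint32_t bcnt,
                                uint32_t bidx, uint32_t dbs)
{
    size_t total = 0;
    for (uint32_t b = 0; b < bcnt; b++) {
        total += split_array(addr + b * bidx, acnt, dbs, NULL, 0);
    }
    return total;
}
```

Consider a 128-byte DBS and a 300-byte array. Starting at 0x1010, the model issues commands of 112, 128, and 60 bytes. Starting at the aligned address 0x1000, it issues 128, 128, and 44 bytes.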

For a 2D transfer (BCNT arrays of ACNT bytes each), if the ACNT value is less than or equal to the DBS value, the TR may be optimized into a 1D transfer to maximize efficiency. The optimization takes place if the EDMA_TPTC recognizes that the 2D transfer is organized as a single dimension (ACNT == BIDX_n) and the ACNT value is a power of 2. Table 23 lists the complete set of conditions.

Table 23. DSP EDMA TC Optimization Rules

ACNT ≤ DBS | ACNT Is Power of 2 | BIDX = ACNT | BCNT ≤ 1023 | SAM/DAM = Increment | Description
Yes | Yes | Yes | Yes | Yes | Optimized
No | X | X | X | X | Not optimized
X | No | X | X | X | Not optimized
X | X | No | X | X | Not optimized
X | X | X | No | X | Not optimized
X | X | X | X | No | Not optimized

X = don't care.
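The conditions in Table 23 can be expressed as a single predicate, as in the sketch below. The addr_mode_t enum and the function names are hypothetical.

Applying the BIDX check to both the source and destination index is an assumption. It is one reading of how the single BIDX column in Table 23 applies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of the source/destination address mode (SAM/DAM). */
typedef enum { ADDR_MODE_INCREMENT, ADDR_MODE_CONSTANT } addr_mode_t;

static bool is_power_of_two(uint32_t x)
{
    return x != 0u && (x & (x - 1u)) == 0u;
}

/* Returns true when a 2D TR meets every condition in Table 23 and can therefore be
 * optimized into a 1D transfer. ASSUMPTION: the "BIDX = ACNT" condition is checked
 * for both the source and the destination B-index. */
static bool tr_is_optimized(uint32_t acnt, uint32_t bcnt,
                            int32_t src_bidx, int32_t dst_bidx,
                            addr_mode_t sam, addr_mode_t dam, uint32_t dbs)
{
    return acnt <= dbs
        && is_power_of_two(acnt)
        && src_bidx == (int32_t)acnt
        && dst_bidx == (int32_t)acnt
        && bcnt <= 1023u
        && sam == ADDR_MODE_INCREMENT
        && dam == ADDR_MODE_INCREMENT;
}
```

Failing any single condition makes the predicate return false. This mirrors the "X = don't care" rows of the table.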

In summary, Table 24 lists the factors that affect the EDMA performance.

Table 24. Factors Affecting System EDMA Performance

Factor | Impact | General Recommendation
Source/destination memory | The transfer speed depends on the SRC/DST memory bandwidth. | Know the nature of the source and destination memory, specifically the frequency of operation and the bus width.
Transfer size | Throughput is lower for small transfers because of transfer overhead/latency. | Configure the EDMA for larger transfer sizes; the throughput of small transfers is dominated by transfer overhead.
A-sync/AB-sync | Performance depends on the number of TRs (transfer requests); more TRs mean more overhead. | AB-synchronized transfers give better performance than chained A-synchronized transfers.
Source/destination BIDX | The TR is not optimized if BIDX is not equal to ACNT (see Table 23). | Whenever possible, follow the EDMA TC optimization guidelines. See the TPTC specification for optimization details.
Queue TC usage | Performance is the same for both TCs. | Both TCs have the same configuration and show the same performance.
Burst size | Determines the largest read/write command the TC can issue. | The default burst size for all transfer controllers is 128 bytes; this also gives the most efficient transfers/throughput in most memory-to-memory transfer scenarios.
Source/destination alignment | Slight performance degradation if the source/destination addresses are not aligned to DBS boundaries. | For smaller transfers, align source and destination addresses to DBS boundaries as much as possible.
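The transfer-size effect in Table 24 can be reasoned about with a first-order fixed-overhead model. This model is not taken from this report and is only an approximation. It introduces two symbols that would have to be measured for a given setup:

- t_0 is the per-transfer overhead.
- B_peak is the sustained bandwidth between the chosen source and destination.

```latex
T_{\mathrm{eff}}(N) = \frac{N}{t_0 + N / B_{\mathrm{peak}}},
\qquad
\lim_{N \to \infty} T_{\mathrm{eff}}(N) = B_{\mathrm{peak}},
\qquad
T_{\mathrm{eff}}\!\left(N_{1/2}\right) = \frac{B_{\mathrm{peak}}}{2}
\ \text{with}\ N_{1/2} = t_0 \, B_{\mathrm{peak}}
```

Here N is the transfer size in bytes. Transfers much smaller than N_{1/2} are overhead-bound (T_eff ≈ N / t_0). Under this model, fewer and larger transfers, such as one AB-synchronized TR instead of a chain of A-synchronized TRs, spend a smaller fraction of time in overhead.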