SPRAD90 Application note

SPRAD90 February 2023 AM62A3, AM62A3-Q1, AM62A7, AM62A7-Q1

3.1.1 LMBench

LMBench is a suite of microbenchmarks for processor cores and operating system primitives. The memory bandwidth and latency tests are the most relevant for modern embedded processors. The results vary slightly (< 10%) from run to run.

The LMBench benchmark bw_mem measures achieved memory copy performance. With the cp parameter it copies an array using an inline loop, and with the bcopy parameter it uses the runtime glibc version of the standard memcpy() function. glibc uses a highly optimized implementation that utilizes, for example, SIMD, resulting in higher performance.

The size parameter selects which level of the memory hierarchy is measured. With a size equal to or smaller than the cache size at a given level, the benchmark measures the bandwidth that software achieves at that level when doing a typical for loop or memcpy() type operation. The typical use is external memory bandwidth calculation, which calls for a size much larger than the caches, such as the 8M used in the commands below. bw_mem counts each byte that is read and then written as one byte, so the result is roughly half of the STREAM copy result.

The benchmark also allows creating parallel threads with the -P parameter. To get the maximum multi-core memory bandwidth, create as many threads as there are cores available to the operating system, which is 4 for AM62Ax Linux (-P 4). To fully characterize the AM62Ax, the LMBench tests are run on the full factorial combination of number of cores and clock frequency. The code block below shows the terminal printout of executing the LMBench commands.

root@am62axx-evm:~# bw_mem 8M bcopy
8.00 2125.96
root@am62axx-evm:~# bw_mem -P 2 8M bcopy
8.00 3408.74
root@am62axx-evm:~# bw_mem -P 4 8M bcopy
8.00 3884.24

root@am62axx-evm:~# bw_mem 8M cp
8.00 1108.49
root@am62axx-evm:~# bw_mem -P 2 8M cp
8.00 1671.98
root@am62axx-evm:~# bw_mem -P 4 8M cp
8.00 1976.17
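
The multi-core scaling in the printout above can be worked out with a few lines of Python. This is only a sketch: the helper names (parse_bw_mem, scaling, stream_style) are hypothetical and not part of LMBench, the single-thread runs are keyed as 1 thread, and the factor of two in stream_style reflects the "roughly half of STREAM copy" accounting described earlier, so it is an approximation rather than a measured STREAM result.

```python
# bw_mem prints "<size in MB> <bandwidth in MB/s>"; values copied from the printout.
RUNS = {
    ("bcopy", 1): "8.00 2125.96",
    ("bcopy", 2): "8.00 3408.74",
    ("bcopy", 4): "8.00 3884.24",
    ("cp", 1): "8.00 1108.49",
    ("cp", 2): "8.00 1671.98",
    ("cp", 4): "8.00 1976.17",
}


def parse_bw_mem(line):
    """Split one bw_mem result line into (size_mb, bandwidth_mb_s)."""
    size_mb, bandwidth_mb_s = line.split()
    return float(size_mb), float(bandwidth_mb_s)


def scaling(runs, mode):
    """Return {threads: speedup over the single-thread run} for one copy mode."""
    base = parse_bw_mem(runs[(mode, 1)])[1]
    return {
        threads: parse_bw_mem(line)[1] / base
        for (run_mode, threads), line in runs.items()
        if run_mode == mode
    }


def stream_style(bandwidth_mb_s):
    """Approximate STREAM-copy-style figure: count the read and the write separately."""
    return 2.0 * bandwidth_mb_s


for mode in ("bcopy", "cp"):
    for threads, speedup in sorted(scaling(RUNS, mode).items()):
        print(f"{mode:5s} threads={threads}: {speedup:.2f}x of single thread")
```

With these numbers, four threads deliver well under four times the single-thread bandwidth, which is the expected behavior once the copy is limited by the shared LPDDR4 interface rather than by the cores.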

Table 3-1 shows the measured bandwidth and the efficiency compared to the theoretical wire rate. The wire rate used is the LPDDR4 MT/s rate times the bus width in bytes, divided by two, because the read and the write that make up a copy both consume the bus.

Equation 1.

\text{Efficiency} = \frac{\text{Measured Speed}}{\frac{\text{LPDDR4 MT/s} \times \text{width}}{2}} = \frac{\text{Measured Speed}}{\frac{3200 \times 4\,\text{B}}{2}} = \frac{\text{Measured Speed}}{6400}
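
As a quick check of Equation 1, the following Python sketch computes the copy wire rate and the resulting efficiency for measured bw_mem values. The function and constant names are illustrative assumptions; the 3200 MT/s rate and the 4-byte width are the values used in the equation.

```python
LPDDR4_MT_S = 3200     # LPDDR4 transfer rate from Equation 1, in MT/s
BUS_WIDTH_BYTES = 4    # 32-bit LPDDR4 interface


def copy_wire_rate_mb_s(mt_s=LPDDR4_MT_S, width_bytes=BUS_WIDTH_BYTES):
    """Theoretical copy rate in MB/s: a copy reads and writes, so halve the bus rate."""
    return mt_s * width_bytes / 2


def efficiency(measured_mb_s):
    """Measured copy bandwidth as a fraction of the theoretical copy wire rate."""
    return measured_mb_s / copy_wire_rate_mb_s()


# Measured values taken from the bw_mem printout (MB/s).
for label, measured in [
    ("bw_mem 8M bcopy", 2125.96),
    ("bw_mem -P 4 8M bcopy", 3884.24),
    ("bw_mem 8M cp", 1108.49),
]:
    print(f"{label}: {efficiency(measured):.0%}")
```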

Table 3-1 LMBench Results

Command	Description	Arm Cortex-A53 at 1.25 GHz, LPDDR4-3200 MT/s, 32-bit	LPDDR4 Efficiency at 1.25 GHz	Arm Cortex-A53 at 1.4 GHz, LPDDR4-3200 MT/s, 32-bit	LPDDR4 Efficiency at 1.4 GHz
bw_mem 8M bcopy	Single core, glibc memcpy	2,058 MB/s	32%	2,125 MB/s	33%
bw_mem -P 2 8M bcopy	Dual core, glibc memcpy	3,300 MB/s	52%	3,408 MB/s	53%
bw_mem -P 4 8M bcopy	Quad core, glibc memcpy	3,816 MB/s	60%	3,884 MB/s	61%
bw_mem 8M cp	Single core, inline copy loop	1,076 MB/s	17%	1,108 MB/s	17%
bw_mem -P 2 8M cp	Dual core, inline copy loop	1,659 MB/s	26%	1,671 MB/s	26%
bw_mem -P 4 8M cp	Quad core, inline copy loop	1,952 MB/s	31%	1,976 MB/s	31%

The LMBench benchmark lat_mem_rd measures the observed memory access latency for external memory (LPDDR4 on AM62Ax) and for cache hits. Its two arguments are the maximum size of the memory block to walk, in megabytes (64 in the code block below), and the stride of the reads, in bytes (512). These two values are selected so that the benchmark measures the latency to the caches and external memory rather than the effect of the processor data prefetchers or other speculative execution. For many real access patterns prefetching does work, but this benchmark is most useful for measuring the case when it does not.

The code block below shows the terminal printout of executing the lat_mem_rd command. The left column is the size of the accessed memory block in megabytes, and the right column is the round-trip read latency in nanoseconds. The command was executed at Arm Cortex-A53 clock frequencies of 1.25 GHz and 1.4 GHz; the values shown are consistent with the 1.4 GHz column of Table 3-2.

root@am62axx-evm:~# lat_mem_rd 64 512
"stride=512
0.00049 2.146
0.00098 2.146
0.00195 2.146
0.00293 2.146
0.00391 2.146
0.00586 2.146
0.00781 2.146
0.01172 2.146
0.01562 2.146
0.02344 2.146
0.03125 2.202
0.04688 6.725
0.06250 7.711
0.09375 8.725
0.12500 9.229
0.18750 9.748
0.25000 10.009
0.37500 22.217
0.50000 23.840
0.75000 88.270
1.00000 116.937
1.50000 133.405
2.00000 135.724
3.00000 137.268
4.00000 137.974
6.00000 138.512
8.00000 138.841
12.00000 139.033
16.00000 139.102
24.00000 139.097
32.00000 139.165
48.00000 139.163
64.00000 139.261
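
The reason a read-latency benchmark of this kind is not hidden by overlapping accesses is that each read depends on the result of the previous one. The Python sketch below is a simplified model of such a dependent-read chain, not the LMBench source: build_chain and walk are hypothetical names, and the chain layout (each slot pointing to the next slot one stride away) is an assumption made for illustration. Python timing would not be meaningful here, so the sketch only shows the data structure and the access order.

```python
def build_chain(size_bytes, stride_bytes):
    """Return a list where entry i holds the index of the next slot to read.

    Each slot stands for one location stride_bytes after the previous one.
    The last slot links back to the first, so the walk is a closed loop.
    """
    slots = size_bytes // stride_bytes
    return [(i + 1) % slots for i in range(slots)]


def walk(chain, steps):
    """Follow the chain for a number of steps and return the final slot index."""
    position = 0
    for _ in range(steps):
        # The next index is only known once this read completes,
        # so consecutive reads cannot be issued in parallel.
        position = chain[position]
    return position


chain = build_chain(size_bytes=4096, stride_bytes=512)
print(chain)            # eight slots, each naming its successor
print(walk(chain, 3))   # slot reached after three dependent reads
```

In a native implementation the per-read latency would be the total walk time divided by the number of steps, repeated for each block size in the left column of the printout.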

Figure 3-1 shows connected scatter plots of the memory latency results for both 1.25 GHz and 1.4 GHz. Based on the memory block size (x-axis), the plot can be divided into three regions. In the first region, the accessed memory block is smaller than the L1 cache. It is safe to assume that the data resides completely inside L1, so the latency in this region is a close estimate of the L1 cache latency. In the second region, the accessed memory block is bigger than the L1 cache but smaller than the L2 cache. The latency in this region is a mix of L1, L2, and LPDDR4 latency, and the latency in the middle of the region can be taken as a close representation of the L2 latency. In the third region, the accessed memory block is bigger than the L2 cache. The last reading in this region reflects the LPDDR4 latency.

Figure 3-1 Memory Read Latency

Table 3-2 summarizes the Arm Cortex-A53 read latency.

Table 3-2 Memory Read Latency Results

Memory	Arm Cortex-A53 at 1.25 GHz	Arm Cortex-A53 at 1.4 GHz
L1 cache	2.4 ns	2.1 ns
L2 cache	10.3 ns	9.2 ns
LPDDR4-3200 MT/s	140.2 ns	139.2 ns
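
The three-region reading of the lat_mem_rd results can be sketched in Python as follows. This is one possible interpretation, not the method used to produce Table 3-2: the cache sizes (32 KB L1 data cache, 512 KB L2) are assumptions chosen to match the steps visible in the printout, the "middle" of the L2 region is taken as the geometric midpoint of the two cache sizes, and the names parse and summarize are hypothetical.

```python
import math
import statistics

# lat_mem_rd 64 512 printout: "<block size in MB> <latency in ns>".
PRINTOUT = '''
"stride=512
0.00049 2.146
0.00098 2.146
0.00195 2.146
0.00293 2.146
0.00391 2.146
0.00586 2.146
0.00781 2.146
0.01172 2.146
0.01562 2.146
0.02344 2.146
0.03125 2.202
0.04688 6.725
0.06250 7.711
0.09375 8.725
0.12500 9.229
0.18750 9.748
0.25000 10.009
0.37500 22.217
0.50000 23.840
0.75000 88.270
1.00000 116.937
1.50000 133.405
2.00000 135.724
3.00000 137.268
4.00000 137.974
6.00000 138.512
8.00000 138.841
12.00000 139.033
16.00000 139.102
24.00000 139.097
32.00000 139.165
48.00000 139.163
64.00000 139.261
'''

L1_MB = 32 / 1024    # assumed L1 data cache size (32 KB), in MB
L2_MB = 512 / 1024   # assumed L2 cache size (512 KB), in MB


def parse(text):
    """Return [(size_mb, latency_ns)], skipping the header and blank lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            rows.append((float(fields[0]), float(fields[1])))
    return rows


def summarize(rows, l1_mb=L1_MB, l2_mb=L2_MB):
    """Pick one representative latency per region of the curve."""
    # Region 1: blocks that fit in L1.
    l1_latency = statistics.median(lat for size, lat in rows if size < l1_mb)
    # Region 2: reading closest (on a log scale) to the middle of the L1..L2 range.
    middle = math.sqrt(l1_mb * l2_mb)
    region2 = [(size, lat) for size, lat in rows if l1_mb < size < l2_mb]
    l2_latency = min(region2, key=lambda row: abs(math.log(row[0] / middle)))[1]
    # Region 3: last reading, far beyond the L2 size.
    dram_latency = rows[-1][1]
    return {"L1": l1_latency, "L2": l2_latency, "LPDDR4": dram_latency}


for level, latency_ns in summarize(parse(PRINTOUT)).items():
    print(f"{level}: {latency_ns:.3f} ns")
```

Under these assumptions the selected readings land close to the 1.4 GHz column of Table 3-2. Readings at exactly the assumed cache sizes are excluded from both neighboring regions because they sit on the transition.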