SPRAD45 Application note

SPRAD45A July 2022 – October 2024 AM623, AM625

3.1.1 LMBench

LMBench is a suite of microbenchmarks for processor cores and operating system primitives. The memory bandwidth and latency tests are the most relevant for modern embedded processors. Results vary by less than 10% from run to run.

The LMBench benchmark bw_mem measures achieved memory copy performance. The cp parameter performs an array copy with an inline loop, and the bcopy parameter uses the runtime glibc implementation of the standard memcpy() function. glibc uses a highly optimized implementation that takes advantage of, for example, SIMD instructions, resulting in higher performance. A size parameter equal to or smaller than the cache size at a given level measures the bandwidth that software achieves from that cache level with a typical for loop or memcpy() type operation. The typical use is to measure external memory bandwidth, which requires a size well above the last-level cache size (8M in Table 3-1). The bandwidth is calculated so that each byte read and written counts as one byte, which gives roughly half of the STREAM copy result. Table 3-1 shows the measured bandwidth and the efficiency compared to the theoretical wire rate. The wire rate used is the DDR MT/s rate times the bus width in bytes, divided by two because the read and the write that make up a copy both consume the bus. The benchmark also supports parallel threads with the -P parameter. To get the maximum multicore memory bandwidth, create as many threads as there are cores available to the operating system, which is 4 for AM62x Linux (-P 4).
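
The difference between the bw_mem and STREAM accounting can be sketched as follows. This is an illustration only: the function names and the example timing are assumptions made for this sketch and are not part of LMBench or STREAM, and 1 MB is assumed to be 10^6 bytes.

```python
def bw_mem_copy_mb_s(bytes_copied: int, seconds: float) -> float:
    """bw_mem-style accounting: a byte that is read and then written counts once."""
    return bytes_copied / seconds / 1e6


def stream_copy_mb_s(bytes_copied: int, seconds: float) -> float:
    """STREAM-style accounting: the read and the write are both counted."""
    return 2 * bytes_copied / seconds / 1e6


# Hypothetical example: an 8 MiB buffer copied in 10 ms (not a measured value).
size = 8 * 1024 * 1024
elapsed = 0.010
print("bw_mem style:", bw_mem_copy_mb_s(size, elapsed), "MB/s")
print("STREAM style:", stream_copy_mb_s(size, elapsed), "MB/s")
```

For the same copy and the same elapsed time, the STREAM-style figure is exactly twice the bw_mem-style figure, which is why the two benchmarks should not be compared directly.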

Table 3-1 LMBench Results

Command	Description	Arm Cortex-A53, DDR4-1600 MT/s, 16-bit	DDR4 Efficiency
bw_mem -P 4 8M bcopy	quad core, glibc memcpy	1222MB/s	76%
bw_mem 8M bcopy	single core, glibc memcpy	887MB/s	55%
bw_mem -P 4 8M cp	quad core, inline copy loop	731MB/s	46%
bw_mem 8M cp	single core, inline copy loop	590MB/s	37%
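
The efficiency column can be reproduced from the wire rate definition with a short calculation. The following is a minimal sketch; the function and variable names are chosen for illustration, and the measured values are copied from Table 3-1.

```python
def copy_wire_rate_mb_s(mt_per_s: int, bus_width_bits: int) -> float:
    """Theoretical copy rate: transfers per second times bus width in bytes,
    halved because the read and the write of a copy share the bus."""
    return mt_per_s * (bus_width_bits / 8) / 2


# Measured bandwidth from Table 3-1, in MB/s.
measured = {
    "quad core, glibc memcpy": 1222,
    "single core, glibc memcpy": 887,
    "quad core, inline copy loop": 731,
    "single core, inline copy loop": 590,
}

wire_rate = copy_wire_rate_mb_s(1600, 16)  # DDR4-1600 on a 16-bit bus
efficiency = {name: round(100 * mb_s / wire_rate) for name, mb_s in measured.items()}

print(f"wire rate: {wire_rate:.0f} MB/s")
for name, pct in efficiency.items():
    print(f"{name}: {pct}%")
```

With DDR4-1600 on a 16-bit bus the copy wire rate is 1600 MB/s, and the rounded ratios match the 76%, 55%, 46%, and 37% listed in the table.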

The LMBench benchmark lat_mem_rd measures the observed memory access latency for external memory (DDR4 and LPDDR4 on AM62x) and for cache hits. The two arguments are the maximum array size in megabytes (64 in Table 3-2) and the stride of the reads in bytes (512). These values are selected so that the result reflects the latency to the caches and external memory rather than the effect of the processor data prefetchers or other speculative execution. For many access patterns prefetching is effective, but this benchmark is most useful for measuring the case when it is not. In the output, the left column is the size of the data access pattern in megabytes and the right column is the round-trip read latency in nanoseconds. In summary, the Arm Cortex-A53 read latencies are:

L1D: 2.5 ns
L2: 11.5 ns
DDR4-1600: 209 ns
LPDDR4-1600: 218 ns
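
The dependent-load idea behind this kind of latency measurement can be sketched as follows. This is a simplified illustration and not the LMBench implementation: build_chain and walk are hypothetical helper names, and Python interpreter overhead dominates any timing, so the code shows only the access pattern. Each element holds the index of the element one stride away, so every read depends on the result of the previous one.

```python
def build_chain(size_bytes: int, stride_bytes: int, elem_bytes: int = 8) -> list:
    """Return a chain where entry i holds the index of the entry one stride ahead,
    wrapping around at the end of the array."""
    n = size_bytes // elem_bytes
    step = stride_bytes // elem_bytes
    chain = [0] * n
    for i in range(0, n, step):
        chain[i] = (i + step) % n
    return chain


def walk(chain: list, loads: int) -> int:
    """Follow the chain for a number of dependent loads and return the final index."""
    p = 0
    for _ in range(loads):
        p = chain[p]  # the next index is known only after this load completes
    return p


# A 4 KiB array with a 512-byte stride gives 8 chained elements.
chain = build_chain(size_bytes=4096, stride_bytes=512)
print(walk(chain, loads=3))
print(walk(chain, loads=8))
```

Because the address of each load comes from the previous load, the loads cannot overlap, and the time per load approximates the round-trip latency of whichever memory level holds the array.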

Table 3-2 LMBench Benchmarks for DDR4 and LPDDR4

DDR4-1600:

root@am62xx-evm:~# lat_mem_rd 64 512
"stride=512
0.00049 2.503
0.00098 2.504
0.00195 2.503
0.00293 2.503
0.00391 2.503
0.00586 2.503
0.00781 2.504
0.01172 2.503
0.01562 2.503
0.02344 2.520
0.03125 2.562
0.04688 7.673
0.06250 8.980
0.09375 10.190
0.12500 10.772
0.18750 11.374
0.25000 11.675
0.37500 11.969
0.50000 12.784
0.75000 140.541
1.00000 179.407
1.50000 192.142
2.00000 197.091
3.00000 202.542
4.00000 205.342
6.00000 207.528
8.00000 208.155
12.00000 209.024
16.00000 209.193
24.00000 209.510
32.00000 209.754
48.00000 209.919
64.00000 209.947

LPDDR4-1600:

root@am62xx-lp-evm:~# lat_mem_rd 64 512
"stride=512
0.00049 2.404
0.00098 2.404
0.00195 2.404
0.00293 2.404
0.00391 2.404
0.00586 2.404
0.00781 2.404
0.01172 2.404
0.01562 2.404
0.02344 2.404
0.03125 4.658
0.04688 7.361
0.06250 8.649
0.09375 9.829
0.12500 10.425
0.18750 10.902
0.25000 11.206
0.37500 19.735
0.50000 45.997
0.75000 142.079
1.00000 192.943
1.50000 211.722
2.00000 214.697
3.00000 216.157
4.00000 217.630
6.00000 217.874
8.00000 218.525
12.00000 218.666
16.00000 218.752
24.00000 218.732
32.00000 218.727
48.00000 218.696
64.00000 218.854
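
Output in the format shown in Table 3-2 can be summarized with a short script. The following is a minimal sketch: parse_lat_mem_rd and largest_jump are hypothetical helper names, and the sample is a subset of the DDR4-1600 listing above.

```python
def parse_lat_mem_rd(text: str) -> list:
    """Return (size_mb, latency_ns) pairs, skipping lines that are not data rows,
    such as the shell prompt and the stride header."""
    points = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        try:
            points.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue  # two fields, but not numeric
    return points


def largest_jump(points: list) -> tuple:
    """Return the two neighboring sizes with the largest latency ratio,
    which marks the steepest step up to the next memory level."""
    pairs = zip(points, points[1:])
    (s0, _), (s1, _) = max(pairs, key=lambda p: p[1][1] / p[0][1])
    return s0, s1


sample = '''"stride=512
0.00049 2.503
0.03125 2.562
0.04688 7.673
0.25000 11.675
0.50000 12.784
0.75000 140.541
64.00000 209.947
'''

points = parse_lat_mem_rd(sample)
print("latency at smallest size (ns):", points[0][1])
print("latency at largest size (ns):", points[-1][1])
print("steepest step between sizes (MB):", largest_jump(points))
```

On this subset the steepest step falls between 0.5 MB and 0.75 MB, where the access pattern stops fitting in the caches and reads start going to external memory.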