LMBench is a suite of microbenchmarks for processor cores and operating system primitives. The memory bandwidth and latency related tests are the most relevant for modern embedded processors. The results vary a little (< 10%) from run to run.
The LMBench benchmark bw_mem measures achieved memory copy performance. With the cp parameter it performs a plain array copy, while the bcopy parameter uses the runtime glibc implementation of the standard memcpy() function. In practice glibc uses a highly optimized implementation that exploits, for example, SIMD instructions, resulting in higher performance. A size parameter equal to or smaller than the cache size at a given level measures the bandwidth achievable by software doing a typical for-loop or memcpy() type operation from that cache; a size well beyond the caches, as used here, measures external memory bandwidth, which is the typical use. The bandwidth is calculated so that a byte read and written counts as one byte, which should be roughly half of the STREAM copy result. The table below shows the measured bandwidth and the efficiency compared to the theoretical wire rate. The wire rate used is the DDR MT/s rate times the bus width divided by two, because the read and the write that make up a copy both consume the bus. The benchmark also allows creating parallel threads with the -P parameter. To get the maximum multicore memory bandwidth, create as many threads as there are cores available to the operating system, which is 2 for AM64x Linux (-P 2).
| Benchmark | Arm Cortex-A53, DDR4-1600MT/s-16bit | DDR4 Efficiency | Arm Cortex-A53, LPDDR4-1600MT/s-16bit | LPDDR4 Efficiency |
|---|---|---|---|---|
| bw_mem -P 2 8M bcopy (dual core, glibc memcpy) | 1226 MB/s | 77% | 1100 MB/s | 69% |
| bw_mem 8M bcopy (single core, glibc memcpy) | 1016 MB/s | 64% | 883 MB/s | 55% |
| bw_mem -P 2 8M cp (dual core, inline copy loop) | 628 MB/s | 39% | 756 MB/s | 47% |
| bw_mem 8M cp (single core, inline copy loop) | 528 MB/s | 33% | 599 MB/s | 37% |
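As a worked check of the efficiency figures in the table above, the short C sketch below applies the wire-rate formula from the text (MT/s times bus width in bytes, divided by two because the read and write of a copy share the bus). The specific constants are taken from the DDR4-1600, 16-bit, dual-core bcopy row; MB is treated as 10^6 bytes, which is an assumption made here to match the table's rounding.

```c
/*
 * Worked check of the efficiency numbers in the table above.
 * Assumption (not from LMBench itself): MB means 10^6 bytes, and the
 * theoretical wire rate is MT/s x bus-width-in-bytes / 2, since the read
 * and write of a copy both consume the DDR bus, as described in the text.
 */
#include <stdio.h>

int main(void)
{
    const double mts      = 1600e6;  /* DDR4-1600: 1600 mega-transfers/s */
    const double width_b  = 2.0;     /* 16-bit bus = 2 bytes per transfer */
    const double measured = 1226e6;  /* bw_mem -P 2 8M bcopy result, bytes/s */

    /* Copy wire rate: raw bandwidth halved because read + write share the bus */
    double wire_rate  = mts * width_b / 2.0;           /* 1600 MB/s */
    double efficiency = 100.0 * measured / wire_rate;  /* ~77% */

    printf("wire rate  : %.0f MB/s\n", wire_rate / 1e6);
    printf("efficiency : %.1f %%\n", efficiency);
    return 0;
}
```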
The LMBench benchmark lat_mem_rd is used to measure the observed memory access latency for external memory (DDR4/LPDDR4 on AM64x) and for cache hits. The two arguments are the maximum array size in megabytes (64 in the output below) and the stride of the reads (512 bytes). These values are selected so that the measurement reflects the latency to the caches and external memory rather than the effectiveness of the processor data prefetchers or other speculative execution. For some access patterns the prefetching does work, but this benchmark is most useful for measuring the case when it does not. The left column is the size of the data access pattern in megabytes, the right column is the round-trip read latency in nanoseconds. In summary, for the Arm Cortex-A53 the read latency is approximately 3 ns to L1D cache, 10-14 ns to L2 cache, and 195 ns to external memory.
The output below is from a run with DDR4; for LPDDR4 the result is the same for the L1D and L2 sizes but slightly higher (217 ns) for the largest sizes.
root@am6x-evm:~# lat_mem_rd 64 512
"stride=512
0.00049 3.006
0.00098 3.006
0.00195 3.006
0.00293 3.006
0.00391 3.006
0.00586 3.006
0.00781 3.006
0.01172 3.006
0.01562 3.006
0.02344 3.009
0.03125 3.120
0.04688 9.212
0.06250 10.677
0.09375 12.269
0.12500 12.984
0.18750 13.651
0.25000 14.066
0.37500 115.226
0.50000 168.747
0.75000 189.919
1.00000 192.138
1.50000 193.431
2.00000 194.175
3.00000 194.870
4.00000 195.202
6.00000 195.463
8.00000 195.622
12.00000 195.700
16.00000 195.761
24.00000 195.876
32.00000 195.938
48.00000 196.001
64.00000 196.006
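To illustrate why this measurement is not hidden by prefetching, the sketch below shows a stride-based pointer-chase loop in the spirit of lat_mem_rd. It is not the LMBench source: the chase() helper, its parameters, and the 8 MiB/512-byte example values are illustrative assumptions. Each load depends on the result of the previous one, so the hardware cannot overlap or prefetch them, and the average time per load approximates the round-trip read latency.

```c
/*
 * Minimal sketch of a stride-based pointer-chase latency loop in the
 * spirit of lat_mem_rd (this is NOT the LMBench source code).
 * Each load depends on the previous one, so prefetchers and speculation
 * cannot hide the round-trip latency of the level being measured.
 */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase(size_t bytes, size_t stride, size_t iters)
{
    size_t n    = bytes / sizeof(char *);
    size_t step = stride / sizeof(char *);
    char **buf  = malloc(n * sizeof(char *));
    if (buf == NULL)
        return -1.0;

    /* Build a circular chain of pointers, one hop every 'stride' bytes. */
    for (size_t i = 0; i < n; i += step)
        buf[i] = (char *)&buf[(i + step) % n];

    /* Chase the chain; the data dependency serializes the loads. */
    char **p = (char **)buf[0];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = (char **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Keep 'p' live so the compiler cannot remove the loop. */
    if (p == NULL)
        printf("unreachable\n");
    free(buf);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;   /* average load-to-use latency in ns */
}

int main(void)
{
    /* Example: 8 MiB working set with a 512-byte stride. */
    printf("%.1f ns per load\n", chase(8u << 20, 512, 10u * 1000 * 1000));
    return 0;
}
```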