SPRAD45A July 2022 – October 2024 AM623, AM625
LMBench is a suite of microbenchmarks for processor cores and operating system primitives. The memory bandwidth and latency tests are the most relevant for modern embedded processors. Results vary by less than 10% from run to run.
The LMBench benchmark bw_mem measures achieved memory copy performance. The cp parameter performs an array copy with an inline loop, and the bcopy parameter uses the runtime glibc version of the standard memcpy() function. glibc uses a highly optimized implementation that takes advantage of, for example, SIMD instructions, resulting in higher performance. With a size parameter equal to or smaller than the cache size at a given level, the test measures the memory bandwidth achievable at that level by software doing a typical for loop or memcpy() type operation. The typical use is external memory bandwidth calculation, as with the 8M size used in Table 3-1. bw_mem counts each byte that is read and then written as one byte, so the reported bandwidth is roughly half of the STREAM copy result. The measured bandwidth and the efficiency compared to the theoretical wire rate are shown in Table 3-1. The wire rate used is the DDR transfer rate in MT/s times the bus width in bytes, divided by two, because the read and the write that make up a copy both consume the bus. For DDR4-1600 with a 16-bit bus, this gives 1600 × 2 / 2 = 1600MB/s. The benchmark can also create parallel threads with the -P parameter. To get the maximum multicore memory bandwidth, create the same number of threads as there are cores available to the operating system, which is 4 for AM62x Linux (-P 4).
Command | Description | Arm Cortex-A53, DDR4-1600MT/s-16 Bit | DDR4 Efficiency |
---|---|---|---|
bw_mem -P 4 8M bcopy | quad core, glibc memcpy | 1222MB/s | 76% |
bw_mem 8M bcopy | single core, glibc memcpy | 887MB/s | 55% |
bw_mem -P 4 8M cp | quad core, inline copy loop | 731MB/s | 46% |
bw_mem 8M cp | single core, inline copy loop | 590MB/s | 37% |
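
The efficiency column in Table 3-1 can be reproduced from the wire rate definition given above. The following is a minimal sketch of that arithmetic only; the function names `copy_wire_rate_mb_s` and `efficiency_pct` are hypothetical helpers introduced for illustration and are not part of LMBench.

```python
def copy_wire_rate_mb_s(mt_per_s, bus_width_bits):
    """Theoretical copy rate in MB/s.

    Transfers per second times bytes per transfer, divided by two
    because a copy consumes the bus for both the read and the write.
    """
    return mt_per_s * (bus_width_bits / 8) / 2


def efficiency_pct(measured_mb_s, wire_rate_mb_s):
    """Measured bandwidth as a rounded percentage of the wire rate."""
    return round(100 * measured_mb_s / wire_rate_mb_s)


# DDR4-1600 with a 16-bit bus, as in Table 3-1.
wire = copy_wire_rate_mb_s(1600, 16)

# Measured values copied from Table 3-1 (MB/s).
measured = {
    "quad core, glibc memcpy": 1222,
    "single core, glibc memcpy": 887,
    "quad core, inline copy loop": 731,
    "single core, inline copy loop": 590,
}

for description, mb_s in measured.items():
    print(f"{description}: {efficiency_pct(mb_s, wire)}%")
```

The same helper can be reused for another memory configuration by changing the transfer rate and bus width arguments.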
The LMBench benchmark lat_mem_rd measures the observed memory access latency for external memory (DDR4 and LPDDR4 on AM62x) and for cache hits. The two arguments are the maximum size of the array to read in megabytes (64 in Table 3-2) and the stride of the reads in bytes (512). These two values are selected to measure the latency to the caches and external memory rather than the effect of the processor data prefetchers or other speculative execution. Prefetching is effective for many access patterns, but this benchmark is most useful for measuring the case when it is not. In the results, the left column is the size of the data access pattern in megabytes and the right column is the round-trip read latency in nanoseconds. As a summary, the Arm Cortex-A53 read latency to:
DDR4-1600: | LPDDR4-1600: |
|
|
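
To illustrate why a 512-byte stride exposes raw latency in the lat_mem_rd measurement described above, the following is a simplified sketch of a dependent, strided load chain. It is not the LMBench source. The names `build_chain` and `walk` are hypothetical, and the 64-byte cache line size is an assumption about the Cortex-A53 used only for this illustration.

```python
CACHE_LINE_BYTES = 64  # assumption: cache line size used for illustration


def build_chain(size_bytes, stride_bytes):
    """Map each strided offset to the next one, wrapping at the end."""
    offsets = list(range(0, size_bytes, stride_bytes))
    return {
        off: offsets[(k + 1) % len(offsets)]
        for k, off in enumerate(offsets)
    }


def walk(chain, loads):
    """Follow the chain; each step depends on the result of the previous one."""
    pos = 0
    visited = []
    for _ in range(loads):
        visited.append(pos)
        pos = chain[pos]  # the next address is only known after this load
    return visited


# A 64 KiB working set walked with a 512-byte stride.
chain = build_chain(size_bytes=64 * 1024, stride_bytes=512)
visited = walk(chain, loads=len(chain))
lines_touched = {off // CACHE_LINE_BYTES for off in visited}

# Every load lands on a different cache line, so no load benefits
# from a line that an earlier load already brought in.
print(len(visited), len(lines_touched))
```

Because the stride is larger than the assumed cache line, the number of distinct lines touched equals the number of loads, and because each address comes from the previous load, the loads cannot overlap. Growing the working set past each cache level then moves the measured time per load from cache latency toward external memory latency.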