Optimizations that improve the loading and storing of data are often crucial to the
performance of an application. A detailed examination of useful memory optimizations
on Keystone 3 devices is outside the scope of this document. However, the following
are the most common optimizations used to increase memory system throughput and
reduce memory hierarchy latency.
- Blocking: Input, output,
and temporary arrays/objects are often too large to fit into Multicore Shared
Memory Controller (MSMC) or L2 memory. For example, when performing an algorithm
over an entire 1000x1000 pixel image, the image is too large to fit into most or
all configurations of L2 memory, and the algorithm may thrash the caches,
leading to poor performance. Keeping the data as close to the CPU as possible
improves memory system performance, but how do we do this when the image is too
large to fit into the L2 cache? Depending on the algorithm, it may help to
apply a technique called "blocking," in which the algorithm is modified to
operate on only a portion of the data at a time. Once that "block" of data is
processed, the algorithm moves to the next block. This technique is often paired
with the other techniques in this list.
- Direct Memory Access (DMA): Consider using the asynchronous DMA
capabilities of the device to move new data into MSMC or L2 memory and to move
processed data back out. This frees the C7000 CPU to perform computations
while the DMA engine readies data for the next frame, block, or layer.
- Ping-Pong Buffers: Consider using ping-pong memory buffers so that the
C7000 CPU is processing data in one buffer while a DMA transfer is occurring
to/from another buffer. When the C7000 CPU is finished processing the first
buffer, the algorithm switches to the second buffer, which now has new data as a
result of a DMA transfer. Consider placing these buffers in MSMC or L2 memory,
which is much faster than DDR memory.
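
The blocking technique above can be sketched in portable C. This is a minimal, device-independent illustration, not TI code: the image dimensions, the tile size `BLK`, and the `brighten()` kernel are all illustrative assumptions. The blocked version visits the image tile by tile so the working set of the inner loops is small enough to stay cache-resident.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sizes: a 1000x1000 8-bit image processed in 100x100 tiles,
 * with the tile size chosen so one tile fits comfortably in L2 or MSMC. */
#define IMG_W 1000
#define IMG_H 1000
#define BLK   100   /* tile edge; assumed to divide IMG_W and IMG_H evenly */

/* Hypothetical per-pixel kernel: saturating brighten by 16. */
static uint8_t brighten(uint8_t p)
{
    unsigned v = p + 16u;
    return (uint8_t)(v > 255u ? 255u : v);
}

/* Naive version: one pass over the whole image. The working set is the
 * entire image, which may thrash the caches. */
void process_whole(const uint8_t *src, uint8_t *dst)
{
    for (size_t i = 0; i < (size_t)IMG_W * IMG_H; i++)
        dst[i] = brighten(src[i]);
}

/* Blocked version: identical results, but each pair of inner loops touches
 * only one BLK x BLK tile before moving on to the next block. */
void process_blocked(const uint8_t *src, uint8_t *dst)
{
    for (size_t by = 0; by < IMG_H; by += BLK)
        for (size_t bx = 0; bx < IMG_W; bx += BLK)
            for (size_t y = by; y < by + BLK; y++)
                for (size_t x = bx; x < bx + BLK; x++)
                    dst[y * IMG_W + x] = brighten(src[y * IMG_W + x]);
}
```

Because this kernel is element-wise, blocking changes only the traversal order, not the result; for stencil-style algorithms the block boundaries would also need halo handling, which is omitted here.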
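
The ping-pong pattern can likewise be sketched in portable C, under loud assumptions: `memcpy` stands in for an asynchronous EDMA transfer (and for the wait on its completion), the buffers are ordinary arrays rather than being placed in MSMC or L2 through the linker command file, and every name here (`fake_dma_in`, `process_block`, `process_stream`) is hypothetical rather than a TI driver API. On real hardware, the transfer kicked off for block `b + 1` would overlap with the CPU's processing of block `b`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK_BYTES 256   /* illustrative block size */

/* Two on-chip buffers. A real project would place these in MSMC or L2
 * via the linker command file; here they are plain arrays. */
static uint8_t ping[BLK_BYTES];
static uint8_t pong[BLK_BYTES];

/* Stand-in for submitting an asynchronous DMA transfer. On a real device
 * this would program the DMA engine and return immediately. */
static void fake_dma_in(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Hypothetical per-byte kernel: invert each byte in place. */
static void process_block(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint8_t)(buf[i] ^ 0xFFu);
}

/* Process nblocks blocks from DDR-resident src into dst, alternating
 * between the ping and pong buffers. */
void process_stream(const uint8_t *src, uint8_t *dst, size_t nblocks)
{
    uint8_t *bufs[2] = { ping, pong };

    if (nblocks == 0)
        return;

    /* Prime the pipeline: start the transfer for block 0. */
    fake_dma_in(bufs[0], src, BLK_BYTES);

    for (size_t b = 0; b < nblocks; b++) {
        uint8_t *cur = bufs[b & 1];

        /* Kick off the next transfer into the *other* buffer; on real
         * hardware this runs concurrently with the processing below. */
        if (b + 1 < nblocks)
            fake_dma_in(bufs[(b + 1) & 1],
                        src + (b + 1) * BLK_BYTES, BLK_BYTES);

        /* (Real code would wait here for transfer b to complete.) */
        process_block(cur, BLK_BYTES);

        /* Simulated DMA of the processed block back out to DDR. */
        memcpy(dst + b * BLK_BYTES, cur, BLK_BYTES);
    }
}
```

The buffer index `b & 1` is what makes this "ping-pong": even-numbered blocks land in `ping`, odd-numbered blocks in `pong`, so the CPU and the (simulated) DMA never touch the same buffer at the same time.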