SPRUI04F July 2015 – April 2023
Effective use of the program instruction cache is an important part of getting the best performance from a C6000 device. The dedicated program instruction cache (L1P) provides fast instruction fetches, but a cache miss can be very costly. Some applications (for example, H.264) can spend 30% or more of the processor's time stalled on L1P cache misses. A cache miss occurs when a fetch fails to read an instruction from L1P and the processor must instead fetch the instruction from the next level of memory. A request to L2 or external memory has a much higher latency than an access to L1P.
Careful placement of code sections can greatly reduce the number of cache misses. The C6000 L1P is especially sensitive to code placement because it is direct-mapped.
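To make the direct-mapped property concrete, the following minimal sketch shows how an instruction address selects exactly one cache line. The line size and cache size used here are assumptions chosen for illustration, not values taken from this section; consult the device documentation for the actual L1P geometry. The function names are hypothetical.

```python
# Assumed cache geometry, for illustration only.
LINE_SIZE = 32             # bytes per cache line (assumption)
CACHE_SIZE = 32 * 1024     # total cache bytes (assumption)
NUM_LINES = CACHE_SIZE // LINE_SIZE


def cache_line_index(address: int) -> int:
    """Return the one cache line a byte address can occupy in a direct-mapped cache."""
    return (address // LINE_SIZE) % NUM_LINES


def conflicts(addr_a: int, addr_b: int) -> bool:
    """Return True if the two addresses lie in different memory blocks
    that compete for the same cache line."""
    same_line = cache_line_index(addr_a) == cache_line_index(addr_b)
    same_block = addr_a // LINE_SIZE == addr_b // LINE_SIZE
    return same_line and not same_block
```

Under these assumptions, any two code addresses that differ by a multiple of the cache size map to the same line, so only one of them can be resident at a time.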
Many L1P cache misses are conflict misses. Conflict misses occur when the cache has recently evicted a block of code that is then needed again. In a program instruction cache this often occurs when two frequently executed blocks of code (usually from different functions) interleave their execution and are mapped to the same cache line.
For example, suppose there is a call to function B from inside a loop in function A. Suppose also that the code for function A's loop is mapped to the same cache line as a block of code from function B that is executed every time that B is called. Each time B is called from within this loop, the loop code in function A is evicted from the cache by the code in B that is mapped to the same cache line. Even worse, when B returns to A, the loop code in A evicts the code from function B that is mapped to the same cache line.
Every iteration through the loop will cause two program instruction cache conflict misses. If the loop is heavily traversed, then the number of processor cycles lost to program instruction cache stalls can become quite large.
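The A/B scenario above can be reproduced with a small simulation of a direct-mapped cache. This is a simplified sketch, not a model of the actual hardware: the cache geometry, the addresses, and the helper names are all hypothetical, and each function is reduced to a single fetched block.

```python
LINE_SIZE = 32      # bytes per cache line (assumption)
NUM_LINES = 1024    # lines in the cache (assumption)


def count_misses(fetch_addresses):
    """Count misses for a sequence of fetch addresses in a direct-mapped cache."""
    resident = {}   # cache line index -> memory block currently held
    misses = 0
    for addr in fetch_addresses:
        block = addr // LINE_SIZE
        line = block % NUM_LINES
        if resident.get(line) != block:
            misses += 1              # the block is not resident: fetch and replace
            resident[line] = block
    return misses


def loop_trace(loop_addr, callee_addr, iterations):
    """Fetch sequence for a loop in A that calls B once per iteration."""
    trace = []
    for _ in range(iterations):
        trace.append(loop_addr)      # loop code in function A
        trace.append(callee_addr)    # code in function B executed on every call
    return trace


A_LOOP = 0x8000                                 # hypothetical address of A's loop
B_CONFLICTING = A_LOOP + LINE_SIZE * NUM_LINES  # maps to the same cache line as A_LOOP
B_MOVED = B_CONFLICTING + LINE_SIZE             # maps to the next cache line

misses_conflicting = count_misses(loop_trace(A_LOOP, B_CONFLICTING, 1000))  # 2000
misses_moved = count_misses(loop_trace(A_LOOP, B_MOVED, 1000))              # 2
```

With the conflicting placement, every fetch misses, which gives two misses per iteration. Moving B by a single cache line leaves only the two initial cold misses.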
Many program instruction cache conflict misses can be avoided by placing functions that are active at the same time so that they do not map to the same cache lines. Code placement strategies that use dynamic profile information, gathered during a run of an instrumented application, can significantly improve program instruction cache efficiency.
The program cache layout tool (clt6x) takes dynamic profile information in the form of a weighted call graph and creates a preferred function order command file that can be used as input to the linker to guide the placement of function subsections.
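As a rough illustration of how a weighted call graph can drive function ordering, the sketch below applies a simple greedy heuristic: it visits call edges from heaviest to lightest and emits each function the first time it appears. This is an assumption made for illustration and is not a description of the algorithm that clt6x uses; the function name, the edge format, and the sample weights are hypothetical.

```python
def preferred_order(weighted_edges):
    """Return function names ordered so that the endpoints of heavy call
    edges appear close together.

    weighted_edges: iterable of (caller, callee, weight) tuples, where
    weight is the number of times the call was observed during profiling.
    """
    order = []
    seen = set()
    # Visit the most frequently executed call edges first.
    for caller, callee, _weight in sorted(
        weighted_edges, key=lambda edge: edge[2], reverse=True
    ):
        for fn in (caller, callee):
            if fn not in seen:
                seen.add(fn)
                order.append(fn)
    return order


# Hypothetical profile data: (caller, callee, call count).
edges = [
    ("main", "init", 1),
    ("A", "B", 1000),
    ("main", "A", 10),
    ("B", "C", 200),
]
order = preferred_order(edges)
```

For this sample graph the result is A, B, C, main, init. Placing A and B next to each other means their code occupies consecutive addresses, so they fall on different cache lines as long as their combined size fits within the cache.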
You can use the program cache layout tool to help improve your program locality and reduce the number of L1P cache conflict misses that occur during the run of your application, thereby improving your application's performance.