SPRUIG8J January 2018 – March 2024
The compiler can unroll an outer loop that encloses an innermost loop. Unrolling duplicates the body of the outer loop, which produces a second copy of the inner loop. The compiler then fuses ("jams") this second copy back into the original inner loop, so each execution of the fused inner loop does the work of two outer-loop iterations. This transformation is called "unroll-and-jam" and can increase available parallelism and functional unit utilization.
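The effect of the transformation can be illustrated by applying it by hand to a small loop nest. The following is a minimal sketch, not actual compiler output: the function names matvec_plain and matvec_unroll_jam are hypothetical, and the matrix-vector product is an assumed example chosen because its inner loop carries only one dependent accumulation.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop nest: the inner loop has a single accumulation chain,
   so little independent work is available per inner iteration. */
void matvec_plain(const int *a, const int *x, int *y,
                  size_t rows, size_t cols)
{
    assert(a != NULL && x != NULL && y != NULL);
    for (size_t i = 0; i < rows; i++) {
        int sum = 0;
        for (size_t j = 0; j < cols; j++)
            sum += a[i * cols + j] * x[j];
        y[i] = sum;
    }
}

/* Hand-written unroll-and-jam by a factor of 2 (illustrative only).
   The outer loop advances two rows at a time, and the two copies of
   the inner loop are fused into one loop with two independent sums. */
void matvec_unroll_jam(const int *a, const int *x, int *y,
                       size_t rows, size_t cols)
{
    assert(a != NULL && x != NULL && y != NULL);
    size_t i = 0;
    for (; i + 1 < rows; i += 2) {
        int sum0 = 0;
        int sum1 = 0;
        for (size_t j = 0; j < cols; j++) {
            sum0 += a[i * cols + j] * x[j];        /* row i     */
            sum1 += a[(i + 1) * cols + j] * x[j];  /* row i + 1 */
        }
        y[i] = sum0;
        y[i + 1] = sum1;
    }
    /* Leftover row when the row count is odd. */
    if (i < rows) {
        int sum = 0;
        for (size_t j = 0; j < cols; j++)
            sum += a[i * cols + j] * x[j];
        y[i] = sum;
    }
}
```

In the fused inner loop, sum0 and sum1 do not depend on each other and both reuse x[j], which is the kind of additional independent work the transformation is meant to expose. The code-size cost is also visible: the loop body is duplicated and a remainder loop is needed.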
The compiler can perform unroll-and-jam when it detects that the inner loop does not offer enough parallelism to use the computational resources of the CPU effectively.
This optimization is performed if both of the following are true: the --opt_for_speed (-mf) option is set to level 3 or higher (level 4 is the default), and the --opt_level (-o) option is set to any level other than "off" (off is the default if --vectypes=off). This optimization can improve performance, but it increases code size and reduces debuggability.