SPRUJG0 December 2024 F29H850TU, F29H859TU-Q1

 


32-Bit Variables and Writes Preferred

The ECC bits cover 32 bits of data, so for a write to RAM that is smaller than 32 bits, the memory wrapper performs a read-modify-write operation to patch in the new value and recalculate the ECC for the whole 32-bit word. This leads to stalls when multiple writes of less than 32 bits occur. This is true for most CPUs, including Arm CPUs.

Example: 5 writes take 13 cycles
ST.16 *(ADDR1)(A4+#0x1a),#0x1
ST.16 *(ADDR1)(A4+#0x14),#0x303
ST.8 *(ADDR1)(A4+#0x1e),#0x0
ST.8 *(ADDR1)(A4+#0x16),#0x4
ST.16 *(ADDR1)(A4+#0x1c),#0x0
Note: Application code should minimize writes smaller than 32 bits and, in general, use 32-bit variables where possible.
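One way to follow this guidance is to assemble several narrow fields in a register and store them with a single 32-bit write. The following is a minimal sketch, not code from the SDK: the type packed_status_t, the field layout, and the helper names are all hypothetical. Whether the compiler actually emits one 32-bit store should be confirmed in the generated assembly.

```c
#include <stdint.h>

/* Hypothetical record: three narrow fields kept in one 32-bit word.
 * Assumed layout: bits 15:0 = mode, bits 23:16 = channel, bits 31:24 = flags. */
typedef struct {
    uint32_t word0;
} packed_status_t;

/* Combine the fields in registers; no memory access happens here. */
static inline uint32_t pack_status(uint16_t mode, uint8_t channel, uint8_t flags)
{
    return (uint32_t)mode
         | ((uint32_t)channel << 16)
         | ((uint32_t)flags << 24);
}

/* Update all three fields with one full-word assignment instead of
 * three separate 16-bit and 8-bit stores. */
void status_update(volatile packed_status_t *s,
                   uint16_t mode, uint8_t channel, uint8_t flags)
{
    s->word0 = pack_status(mode, channel, flags);
}

/* Accessors that extract the fields again with shifts and masks. */
static inline uint16_t status_mode(uint32_t w)    { return (uint16_t)(w & 0xFFFFu); }
static inline uint8_t  status_channel(uint32_t w) { return (uint8_t)((w >> 16) & 0xFFu); }
static inline uint8_t  status_flags(uint32_t w)   { return (uint8_t)((w >> 24) & 0xFFu); }
```

The trade-off is that reading a single field now costs a shift and a mask, so this pattern fits best where the fields are usually written together.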
Using 32-bit variables can also avoid extra instructions that the compiler adds to sign-extend 16-bit values. The example below shows the SEXT.16 instructions the compiler inserts to sign-extend 16-bit values to 32 bits.
Example:
int16_t mashup_16(int16_t in_a, int16_t in_b)
{
    int16_t tmp1, tmp2, tmp3, tmp4;
    tmp1 = in_a + in_b;
    tmp2 = in_a - in_b;
    tmp3 = in_b - in_a;
    tmp3 = tmp1 >> (tmp3 & 0x7);
    tmp4 = tmp2 << (tmp1 & 0x7);
    return (tmp3 ^ tmp4);
}
Generated code:
20103420 <mashup_16>:
20103420:  33dd 0004         	MV	A4,D0
20103424:  33dd 0025         	MV	A5,D1
20103428:  3204 18a4         	SUB	A6,A5,A4,#0x0
2010342c:  b2e7 b200 3386 0007 20a4 0007 
                            	MV.S16	A7,#0x7
                             ||	ADD	A8,A5,A4,#0x0
                             ||	AND.U16	A6,#0x7
20103438:  33d2 1d07         	AND	A7,A8,A7
2010343c:  b3e4 3204 0108 1085 
                            	SEXT.16	A8,A8
                             ||	SUB	A4,A4,A5,#0x0
20103444:  33d8 1087         	LSL	A4,A4,A7
20103448:  b3d5 7a09 1506    	ASR	A5,A8,A6
                             ||	RETD
2010344e:  33e6 10a4         	XOR	A4,A5,A4
20103452:  33e4 0084         	SEXT.16	A4,A4
20103456:  33e0 0004         	MV	D0,A4
Note: Because all CPU registers are 32 bits wide and register operations are 32-bit, using 32-bit data variables in time-critical code generally leads to better-performing code.
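For comparison, a 32-bit variant of the function above might look like the following sketch. The name mashup_32 is hypothetical, and the code assumes the inputs stay within the int16_t range so that the additions and subtractions cannot overflow. With 32-bit types the compiler has no 16-bit intermediate or return value to sign-extend, so the SEXT.16 instructions would be expected to disappear, though this should be checked in the generated assembly. The results differ from mashup_16 whenever the 16-bit version would have truncated a value.

```c
#include <stdint.h>

/* Hypothetical 32-bit counterpart of mashup_16.
 * Assumes in_a and in_b are within the int16_t range. */
int32_t mashup_32(int32_t in_a, int32_t in_b)
{
    int32_t tmp1, tmp2, tmp3, tmp4;

    tmp1 = in_a + in_b;
    tmp2 = in_a - in_b;
    tmp3 = in_b - in_a;
    tmp3 = tmp1 >> (tmp3 & 0x7);
    /* Shift through an unsigned type so that a negative tmp2 does not
     * trigger undefined behavior on the left shift. */
    tmp4 = (int32_t)((uint32_t)tmp2 << (tmp1 & 0x7));
    return tmp3 ^ tmp4;
}
```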