SPRUI30H November 2015 – May 2024 DRA745, DRA746, DRA750, DRA756
Cycle | LD stage | OP stage | ST stage |
---|---|---|---|
0 | iter 0; LD0 need 0x10000..0x1000F, read IBUFLA, LD0_buf = {0x10000..0x1001F}; LD2 need 0x0, read WBUF, LD2_buf = {0x0..0x1F} | | |
1 | iter 1; LD0 need 0x10001..0x10010, from LD0_buf; LD2 need 0x1, from LD2_buf | iter 0; MADD \|\| MADD | |
2 | iter 2; LD0 need 0x10002..0x10011, from LD0_buf; LD2 need 0x2, from LD2_buf | iter 1; MADD \|\| MADD | iter 0 |
3 | iter 3; LD0 need 0x10010..0x1001F, from LD0_buf; LD2 need 0x0, from LD2_buf | iter 2; MADD \|\| MADD | iter 1 |
4 | iter 4; LD0 need 0x10011..0x10020, read IBUFLA, LD0_buf = {0x10010..0x1002F}; LD2 need 0x1, from LD2_buf | iter 3; MADD \|\| MADD | iter 2; ST0 store 0x10400..0x1040F, queue the store |
5 | iter 5; LD0 need 0x10012..0x10021, from LD0_buf; LD2 need 0x2, from LD2_buf | iter 4; MADD \|\| MADD | iter 3; store IBUFLA |
6 | iter 6; LD0 need 0x10020..0x1002F, from LD0_buf; LD2 need 0x0, from LD2_buf | iter 5; MADD \|\| MADD | iter 4 |
7 | iter 7; LD0 need 0x10021..0x10030, read IBUFLA, LD0_buf = {0x10020..0x1003F}; LD2 need 0x1, from LD2_buf | iter 6; MADD \|\| MADD | iter 5; ST0 store 0x10410..0x1041F, queue the store |
Here LDB_DINTRLV is used to read 16 data points, so that 16 outputs are computed in parallel. The hardware issues memory reads in iterations 0, 4, 7, and so on, and holds the data read in so that it can supply the subsequent loads; on average there is 1 read every 3 cycles. For the coefficients, since only 3 bytes are used, a single read in iteration 0 is sufficient to supply data for the whole loop. Output is produced in iterations 2, 5, and so on, and the store buffer delays each write until there is no read traffic, avoiding read/write contention and slowdown.
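The read and store pattern described above can be reproduced with a small behavioral model. The following Python sketch is illustrative only and is not a description of the actual hardware: the function names are hypothetical, and the 32-byte buffer size, the 16-byte-aligned refill address, and the 2-cycle LD-to-ST skew are assumptions inferred from the addresses and cycles shown in the table.

```python
def simulate_load_buffer(addresses, width, buf_bytes=32, align=16):
    """Return the iterations in which a memory read must be issued.

    Assumption (inferred from the table): on a miss the buffer is refilled
    with buf_bytes bytes starting at the needed address rounded down to align.
    """
    buf_start = None  # buffer is empty before the first load
    read_iters = []
    for it, start in enumerate(addresses):
        end = start + width - 1  # last byte needed by this load
        hit = (buf_start is not None
               and buf_start <= start
               and end <= buf_start + buf_bytes - 1)
        if not hit:
            buf_start = start - (start % align)
            read_iters.append(it)
    return read_iters


def schedule_stores(store_iters, read_cycles, st_skew=2):
    """Return the cycle in which each queued store is written to memory.

    Assumption: an iteration reaches the ST stage st_skew cycles after its
    LD stage, and the store waits while a memory read occupies the cycle.
    """
    busy = set(read_cycles)
    commit_cycles = []
    for it in store_iters:
        cycle = it + st_skew
        while cycle in busy:  # hold the store in the store buffer
            cycle += 1
        commit_cycles.append(cycle)
    return commit_cycles


BASE = 0x10000
N = 30  # number of loop iterations to model (arbitrary)

# LD0 walks 3 taps per output line; the line pitch of 0x10 is read off the table.
ld0_addrs = [BASE + 0x10 * (i // 3) + (i % 3) for i in range(N)]
# LD2 cycles through the 3 coefficient bytes.
ld2_addrs = [i % 3 for i in range(N)]

ld0_reads = simulate_load_buffer(ld0_addrs, width=16)
ld2_reads = simulate_load_buffer(ld2_addrs, width=1)
store_cycles = schedule_stores(range(2, N, 3), ld0_reads + ld2_reads)

print("LD0 reads in iterations:", ld0_reads[:4])       # [0, 4, 7, 10]
print("LD2 reads in iterations:", ld2_reads)           # [0]
print("cycles per LD0 read:", N / len(ld0_reads))      # 3.0
print("first stores written in cycles:", store_cycles[:3])  # [5, 8, 11]
```

Under these assumptions the model issues data reads in iterations 0, 4, 7, 10, a single coefficient read in iteration 0, and moves the stores that reach the ST stage in cycles 4 and 7 to cycles 5 and 8, matching the table.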
Due to memory read/write timing, delay slots in operations, and the round/saturate feature in the store stage, the actual skew between the load, operation, and store stages is larger than what is shown in Table 8-360. For calculating steady-state performance, however, a simplified analysis like this is sufficient.
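A rough way to see why the skew does not matter in steady state: if the loop starts one iteration per cycle, the skew is a one-time cost added to the iteration count. The sketch below is a simplified model, not a cycle-accurate one. The function name is hypothetical, the skew of 2 is the LD-to-ST distance shown in the table, and the skew of 10 is an arbitrary stand-in for a larger real value.

```python
def total_cycles(iterations, skew):
    """Cycles until the last iteration leaves the final stage,
    assuming one iteration enters the pipeline per cycle."""
    return iterations + skew


for skew in (2, 10):          # 2: as in the table; 10: hypothetical larger skew
    for n in (8, 1024):       # short loop versus long loop
        print(f"skew={skew:2d} iterations={n:4d} "
              f"cycles/iteration={total_cycles(n, skew) / n:.4f}")
```

For long loops the cycles per iteration approach 1 for either skew value, so the simplified table gives the same steady-state estimate as a more detailed one.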