SPRUI30H November 2015 – May 2024 DRA745, DRA746, DRA750, DRA756
Software cannot issue VLDH_DINTRLV on odd halfwords, so use two VLDH_NPT loads instead.
Cycle | LD stage | OP stage | ST stage |
---|---|---|---|
0 | iter 0: LD0 need 0x10000..0x1000F, read IBUFLA, LD0_buf = {0x10000..0x1001F}; LD1 need 0x10080..0x1008F, stalled; LD2 need 0..1, read WBUF, LD2_buf = {0x0..0x1F} | | |
1 | iter 0: LD1 read IBUFLA, LD1_buf = {0x10080..0x1009F} | | |
2 | iter 1: LD0 need 0x10002..0x10011, from LD0_buf; LD1 need 0x10082..0x10091, from LD1_buf; LD2 need 2..3, from LD2_buf | iter 0: MADD \|\| MADD | |
3 | iter 2: LD0 need 0x10004..0x10013, from LD0_buf; LD1 need 0x10084..0x10093, from LD1_buf; LD2 need 4..5, from LD2_buf | iter 1: MADD \|\| MADD | iter 0 |
4 | iter 3: LD0 need 0x10010..0x1001F, from LD0_buf; LD1 need 0x10090..0x1009F, from LD1_buf; LD2 need 0..1, from LD2_buf | iter 2: MADD \|\| MADD | iter 1 |
5 | iter 4: LD0 need 0x10012..0x10021, read IBUFLA, LD0_buf = {0x10010..0x1002F}; LD1 need 0x10092..0x100A1, stalled; LD2 need 2..3, from LD2_buf | iter 3: MADD \|\| MADD | iter 2: ST0 store 0x10400..0x1040F, ST0 queued; ST1 store 0x10480..0x1048F, ST1 queued |
6 | iter 4: LD1 read IBUFLA, LD1_buf = {0x10090..0x100AF} | stalled | stalled |
7 | iter 5: LD0 need 0x10014..0x10023, from LD0_buf; LD1 need 0x10094..0x100A3, from LD1_buf; LD2 need 4..5, from LD2_buf | iter 4: MADD \|\| MADD | iter 3: ST0 write IBUFLA |
8 | iter 6: LD0 need 0x10020..0x1002F, from LD0_buf; LD1 need 0x100A0..0x100AF, from LD1_buf; LD2 need 0..1, from LD2_buf | iter 5: MADD \|\| MADD | iter 4: ST1 write IBUFLA |
9 | iter 7: LD0 need 0x10022..0x10031, read IBUFLA, LD0_buf = {0x10020..0x1003F}; LD1 need 0x100A2..0x100B1, stalled; LD2 need 2..3, from LD2_buf | iter 6 | iter 5: ST0 store 0x10410..0x1041F, ST0 queued; ST1 store 0x10490..0x1049F, ST1 queued |
10 | iter 7: LD1 read IBUFLA, LD1_buf = {0x100A0..0x100BF} | stalled | stalled |
In this loop, two VLDH_NPT loads sustain 16 multiply-accumulates per iteration, and the load stage stalls because both loads contend for IBUFL. In steady state, each read-buffer fill also supplies the data for the next 2 iterations, which leaves 2 cycles free of memory reads in every 3 iterations. Because the output is written back to IBUFL as well, the store buffer delays the memory writes so that they land in these free memory slots. Each i4 loop of 3 iterations thus takes 4 cycles to complete: 3 issue cycles plus 1 stall cycle for the second buffer fill.
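
The load-stage timing in the table can be reproduced with a small model. The following is only a sketch under stated assumptions, not a description of the hardware: it assumes each read buffer holds 32 bytes starting at a 16-byte-aligned address, that each load needs 16 bytes per iteration, that the load addresses follow `base + 0x10*i3 + 2*i4` with 3 iterations per i4 loop (inferred from the addresses in the table), and that IBUFLA serves one buffer fill per cycle. The names `simulate`, `steady_state`, and `load_addr` are hypothetical. LD2 is left out because it reads WBUF and does not contend with LD0 and LD1.

```python
LINE = 16   # assumed alignment of a buffer fill, in bytes
BUF = 32    # assumed read-buffer size, in bytes
NEED = 16   # bytes each load consumes per iteration (8 halfwords)


def load_addr(base, it):
    """First byte needed by a load in iteration `it` (pattern inferred from the table)."""
    i3, i4 = divmod(it, 3)          # 3 iterations per i4 loop
    return base + 0x10 * i3 + 2 * i4


def simulate(n_iters, bases=(0x10000, 0x10080)):
    """Return ([(iteration, start_cycle, buffer_fills)], total_cycles)."""
    bufs = [None] * len(bases)      # start address held by each read buffer
    cycle = 0
    schedule = []
    for it in range(n_iters):
        fills = 0
        for k, base in enumerate(bases):
            lo = load_addr(base, it)
            hi = lo + NEED - 1
            start = bufs[k]
            if start is None or lo < start or hi >= start + BUF:
                bufs[k] = lo & ~(LINE - 1)   # refill from the aligned address
                fills += 1
        schedule.append((it, cycle, fills))
        # Fills to the same memory are serialized: one cycle each,
        # and an iteration with no fill still takes one cycle.
        cycle += max(1, fills)
    return schedule, cycle


def steady_state(schedule, first=4, n=3):
    """(cycles, read cycles, read-free cycles) for n iterations starting at `first`."""
    cycles = schedule[first + n][1] - schedule[first][1]
    reads = sum(f for (_, _, f) in schedule[first:first + n])
    return cycles, reads, cycles - reads


if __name__ == "__main__":
    sched, total = simulate(8)
    for it, start, fills in sched:
        print(f"iter {it}: starts at cycle {start}, buffer fills = {fills}")
    print("steady state (cycles, reads, free):", steady_state(sched))
```

Under these assumptions the model starts iterations 0 through 7 at cycles 0, 2, 3, 4, 5, 7, 8, and 9, matching the LD stage column, and reports 4 cycles per 3 iterations with 2 of them free of IBUFLA reads. Those are the slots the store buffer uses for ST0 and ST1.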