A load data buffer is associated with each VLD instruction in the loop. It reduces memory reads and the contention and slow-down they cause.
- Each buffer is 32 bytes in size and is filled with eight 32-bit words, starting on a word-aligned address.
- When the requested data is not contained in the buffer (missing completely or partially), VCOP reads another eight 32-bit words, starting from the first requested word.
- There is no sharing of data among separate load data buffers.
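The refill rule above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not a description of the hardware implementation: the class name `LoadDataBuffer`, its `access` method, and byte addressing are hypothetical, and the refill base is assumed to be the requested address rounded down to a word boundary.

```python
WORD_BYTES = 4                              # one 32-bit word
BUFFER_WORDS = 8                            # words fetched per refill
BUFFER_BYTES = WORD_BYTES * BUFFER_WORDS    # 32-byte buffer


class LoadDataBuffer:
    """Hypothetical model of the buffer attached to one VLD instruction."""

    def __init__(self):
        self.base = None    # byte address of the first buffered word
        self.refills = 0    # number of 8-word memory reads issued

    def access(self, addr, nbytes):
        """Return True on a buffer hit, False when a refill was needed."""
        # Assumption: a request must fit in one buffer fill after the
        # start address is rounded down to a word boundary.
        if addr % WORD_BYTES + nbytes > BUFFER_BYTES:
            raise ValueError("request does not fit in one buffer fill")
        hit = (
            self.base is not None
            and self.base <= addr
            and addr + nbytes <= self.base + BUFFER_BYTES
        )
        if not hit:
            # Completely or partially missing: refill eight words
            # starting from the first requested word.
            self.base = addr - addr % WORD_BYTES
            self.refills += 1
        return hit
```

In this model, five consecutive 8-byte loads starting at address 0 cause two refills (at addresses 0 and 32). Because each VLD has its own buffer and nothing is shared, each instruction would be modeled with a separate `LoadDataBuffer` instance.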
There is one common store data buffer, essentially a shadow copy of all vector registers, shared by all the VST instructions in the loop. It skews memory write timing to reduce memory contention and the resulting slow-down.
- The store data buffer is 16 × 8 × 40-bit in size.
- Vector registers are copied to the store data buffer at the end of an i4 iteration whenever that iteration has any store to perform. If not all stores have completed at that point, the load and execution stages are stalled until all stores are completed.
- Write data from multiple iterations of the same store instruction are not combined. For example, STB_NPT_ALWS causes a memory store of 8 bytes in each iteration.
- Write data from multiple stores in the same iteration of the loop are not combined.
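The arithmetic behind these points can be made explicit. The sketch below is illustrative only: the reading of 16 × 8 × 40-bit as 16 vector registers × 8 lanes × 40 bits per lane is an assumption, and the function names are hypothetical.

```python
# Assumed reading of "16 x 8 x 40-bit": registers x lanes x bits per lane.
NUM_VECTOR_REGISTERS = 16
LANES_PER_REGISTER = 8
BITS_PER_LANE = 40


def store_buffer_bits():
    """Total size of the store data buffer in bits."""
    return NUM_VECTOR_REGISTERS * LANES_PER_REGISTER * BITS_PER_LANE


def memory_stores(iterations, stores_per_iteration):
    """Memory stores issued by a loop when no write data is combined.

    Every store instruction produces its own memory store in every
    iteration, so the count is a plain product.
    """
    return iterations * stores_per_iteration


def bytes_written(iterations, bytes_per_store):
    """Bytes written by one store instruction over the whole loop."""
    return iterations * bytes_per_store
```

Under that reading the buffer holds 5120 bits, or 640 bytes. A loop of 100 iterations with two VST instructions issues 200 separate memory stores, and a single store that writes 8 bytes per iteration writes 800 bytes in 100 separate stores.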
Memory loads and stores are arbitrated for each memory port. There are three memory ports:
- WBUF: Dedicated to WBUF
- IBUFL: For IBUFLA and IBUFLB
- IBUFH: For IBUFHA and IBUFHB
Load/store priorities are: force store > load > store. A force store is issued when the load and operation stages are stalled, in order to free up the store stage so the hardware can advance to the next iteration. Apart from force stores, loads have priority over normal stores.
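The priority order can be sketched as a per-port arbiter. This is a minimal model under stated assumptions: the request representation as `(port, kind)` pairs, the kind labels, and the `arbitrate` function are hypothetical, and the model selects one winner per port from a set of simultaneous requests without modeling timing.

```python
# Lower number means higher priority: force store > load > store.
PRIORITY = {"force_store": 0, "load": 1, "store": 2}

# The three memory ports; each port is arbitrated independently.
PORTS = ("WBUF", "IBUFL", "IBUFH")


def arbitrate(requests):
    """Pick the winning request kind for each port.

    `requests` is an iterable of (port, kind) pairs pending at the same
    time. Returns a dict mapping each requested port to the kind of the
    request that wins on that port.
    """
    winners = {}
    for port, kind in requests:
        if port not in PORTS:
            raise ValueError(f"unknown port: {port}")
        current = winners.get(port)
        if current is None or PRIORITY[kind] < PRIORITY[current]:
            winners[port] = kind
    return winners


pending = [
    ("WBUF", "store"),
    ("WBUF", "load"),
    ("IBUFL", "store"),
    ("IBUFH", "load"),
    ("IBUFH", "force_store"),
]
winners = arbitrate(pending)
```

With the `pending` list shown, the load wins on WBUF, the normal store proceeds on IBUFL because nothing competes with it, and the force store wins on IBUFH.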