SPRUIE9D May 2017 – May 2024 DRA74P , DRA75P , DRA76P , DRA77P
The VST instruction specifies the data distribution, data type, base address (index to parameter file), agen, the source vector register, the loop level at which stores occur, and the rounding/saturation configuration. The format of VST is:
[pred] VSTtype_distribution_wr_loop vreg, base[agen], RND_SAT: rnd_sat_param
Storing of up to N data points from vector register to local memory is supported, with each data point being a byte, short, or long (32-bit) word, either signed or unsigned. Signed/unsigned information is used in performing saturation.
A loop can contain up to eight VST instructions, and each VST can take its source from anywhere in the vector register file, V0..V15. Source registers may overlap; that is, more than one VST instruction can store the same vector register. In the extreme case where all eight VSTs are interleaving stores, all 16 vector registers can be written out.
There is a further constraint on source registers for vector store. Only V0, V1, V2, and V3 can be stored without being the destination of any operation.
VCOP supports the following store patterns:
[pred] VSTtype_COLLAT_wr_loop vreg, base, RND_SAT: rnd_sat_param
The following data types are supported:
Distribution(1)(2) | vreg[r][0] goes to | vreg[r][1] goes to | vreg[r][2] goes to | vreg[r][3] goes to | vreg[r][4] goes to | vreg[r][5] goes to | vreg[r][6] goes to | vreg[r][7] goes to |
---|---|---|---|---|---|---|---|---|
NPT | dptr[0] | dptr[1] | dptr[2] | dptr[3] | dptr[4] | dptr[5] | dptr[6] | dptr[7] |
1PT | dptr[0] | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
DS2 | dptr[0] | n/a | dptr[1] | n/a | dptr[2] | n/a | dptr[3] | n/a |
INTRLV | dptr[0] = vreg[r][0], dptr[1] = vreg[r+1][0] | dptr[2] = vreg[r][1], dptr[3] = vreg[r+1][1] | dptr[4] = vreg[r][2], dptr[5] = vreg[r+1][2] | dptr[6] = vreg[r][3], dptr[7] = vreg[r+1][3] | dptr[8] = vreg[r][4], dptr[9] = vreg[r+1][4] | dptr[10] = vreg[r][5], dptr[11] = vreg[r+1][5] | dptr[12] = vreg[r][6], dptr[13] = vreg[r+1][6] | dptr[14] = vreg[r][7], dptr[15] = vreg[r+1][7] |
OFFST_NP1 | dptr[0] | dptr[9] | dptr[18] | dptr[27] | dptr[36] | dptr[45] | dptr[54] | dptr[63] |
COLLAT | none or *dptr++ | none or *dptr++ | none or *dptr++ | none or *dptr++ | none or *dptr++ | none or *dptr++ | none or *dptr++ | none or *dptr++ |
SDDA/PDDA | dptr[V0[0]] | dptr[V0[1]] | dptr[V0[2]] | dptr[V0[3]] | dptr[V0[4]] | dptr[V0[5]] | dptr[V0[6]] | dptr[V0[7]] |
SKIP | dptr[0] | dptr[2] | dptr[4] | dptr[6] | dptr[8] | dptr[10] | dptr[12] | dptr[14] |
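
The lane-to-address mappings in the table above can be modeled in a few lines of Python. This is only an illustrative sketch, not VCOP code: the function names (`store_offsets`, `interleave`, `apply_store`) are hypothetical, offsets are counted in elements rather than bytes, and N = 8 is assumed to match the eight columns of the table. COLLAT is left out because its addresses depend on the predicate rather than on the lane number.

```python
N = 8  # number of SIMD ways, assumed from the eight table columns


def store_offsets(distribution, index_vreg=None):
    """Return the dptr element offset for each lane, or None if the lane is not stored."""
    if distribution == "NPT":
        return list(range(N))
    if distribution == "1PT":
        return [0] + [None] * (N - 1)
    if distribution == "DS2":
        # Even lanes are stored back to back; odd lanes are dropped.
        return [lane // 2 if lane % 2 == 0 else None for lane in range(N)]
    if distribution == "OFFST_NP1":
        # Stride of N + 1 elements: 0, 9, 18, ..., 63.
        return [lane * (N + 1) for lane in range(N)]
    if distribution == "SKIP":
        # Every other destination element: 0, 2, ..., 14.
        return [2 * lane for lane in range(N)]
    if distribution in ("SDDA", "PDDA"):
        # Data-driven addressing: the offsets come from the index register (V0 in the table).
        if index_vreg is None:
            raise ValueError("SDDA/PDDA needs an index vector")
        return list(index_vreg)
    raise ValueError(f"unsupported distribution: {distribution}")


def interleave(vreg_r, vreg_r1):
    """INTRLV: dptr[2*i] = vreg[r][i], dptr[2*i + 1] = vreg[r+1][i]."""
    out = []
    for a, b in zip(vreg_r, vreg_r1):
        out.extend((a, b))
    return out


def apply_store(mem, vreg, offsets):
    """Write each stored lane of vreg into mem at its offset."""
    for value, off in zip(vreg, offsets):
        if off is not None:
            mem[off] = value
    return mem
```

For example, `store_offsets("DS2")` returns `[0, None, 1, None, 2, None, 3, None]`, which matches the DS2 row: lanes 0, 2, 4, and 6 land in dptr[0] through dptr[3].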
Stores are predicated to perform at the indicated loop level, with wr_loop selectable from:
Rounding and saturation arithmetic is available in the store pipeline to provide rounding and saturation without taking additional cycles.
Rounding and saturation parameters are provided in the parameter file to allow easy adaptation of EVE functions to different rounding and saturation configurations. The rnd_sat_param field in the instruction points to a parameter (instruction encoding limits this field to 5 bits, thus restricting it to P0..P31) that contains the fields shown in Figure 8-66.
15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sat_mode: 0 = NO_SAT 1 = SYMM 2 = ASYMM 3 = 4PARAM 4 = SYMM32 5 = ASYMM32 | sat_bound_param: index to param that specifies saturation bound | rnd_mode: 0 = no rounding 1 = round 2 = truncate | rnd_shift: number of bits to round/shift down |
The following saturation modes are supported:
The signed/unsigned designation is part of the VST type, and it determines whether the bound parameter(s) are interpreted as signed or unsigned. The comparison is carried out between the signed 40-bit register value and the bounds so interpreted.
When rounding and saturation are not needed, specifying RND_SAT: P0 works, since a constant 0 in these fields means that nothing is done.
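The round-then-saturate step in the store path can be sketched as below. This is a behavioral illustration under stated assumptions, not the hardware definition: the helper names are hypothetical, rnd_mode = 1 is assumed to add half of the discarded range before an arithmetic right shift, rnd_mode = 2 is assumed to be a plain arithmetic right shift, rnd_mode = 0 is assumed to leave the value unshifted, and saturation is shown as a clamp to an explicit (lower, upper) pair rather than modeling the individual sat_mode encodings.

```python
def round_shift(value, rnd_mode, rnd_shift):
    """Apply the rounding field: 0 = no rounding, 1 = round, 2 = truncate."""
    if rnd_mode == 0 or rnd_shift == 0:
        return value  # assumption: nothing to shift or round
    if rnd_mode == 1:
        # Assumed rounding rule: add half an LSB of the result, then shift down.
        return (value + (1 << (rnd_shift - 1))) >> rnd_shift
    if rnd_mode == 2:
        return value >> rnd_shift  # arithmetic shift, rounds toward minus infinity
    raise ValueError(f"unknown rnd_mode: {rnd_mode}")


def saturate(value, lower, upper):
    """Clamp value to the inclusive range [lower, upper]."""
    return max(lower, min(upper, value))


def store_value(value, rnd_mode=0, rnd_shift=0, bounds=None):
    """Round first, then saturate; bounds=None stands in for NO_SAT."""
    shifted = round_shift(value, rnd_mode, rnd_shift)
    if bounds is None:
        return shifted
    return saturate(shifted, bounds[0], bounds[1])
```

With all fields left at zero, as with RND_SAT: P0, `store_value(42)` returns 42 unchanged. With rounding by 2 bits and signed-byte bounds, `store_value(1000, 1, 2, (-128, 127))` rounds to 250 and then clamps to 127.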
Conditional store is specified by the pred field. Omitting it means no predication. Pred = V1, V2, or V3 indicates that the named register predicates the store of each SIMD way; the store is executed when the register is not zero.
With the COLLAT distribution option, all SIMD ways are examined for conditional store. Truth in the predication register leads to the data item being stored and incrementing of the data pointer by size of the store type (1/2/4 bytes). False in the predication register leads to doing nothing. Thus, stored data items are compacted without leaving any gaps for the blocked data items.
With other distribution options, conditional store behaves differently. Agen is incremented as in unconditional stores. Truth in the predication register leads to the data item being stored, and false in the predication register leads to the data item being skipped. Thus, it is possible for the stored data items to contain gaps among them from the blocked data items.
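The difference between the two conditional-store behaviors can be illustrated with a small model. The function names are hypothetical, the pointer is counted in elements rather than bytes, and NPT stands in for the non-collating distributions; this is a sketch of the described behavior, not VCOP code.

```python
def collat_store(mem, start, vreg, pred):
    """COLLAT: store enabled lanes back to back and return the advanced pointer."""
    ptr = start
    for value, p in zip(vreg, pred):
        if p != 0:
            mem[ptr] = value
            ptr += 1  # the pointer advances only when an item is stored
    return ptr


def predicated_npt_store(mem, start, vreg, pred):
    """NPT with predication: each lane keeps its own slot; blocked lanes are skipped."""
    for lane, (value, p) in enumerate(zip(vreg, pred)):
        if p != 0:
            mem[start + lane] = value


vreg = [10, 11, 12, 13, 14, 15, 16, 17]
pred = [1, 0, 0, 1, 1, 0, 1, 0]

compact = [None] * 8
end = collat_store(compact, 0, vreg, pred)
print(compact, end)  # [10, 13, 14, 16, None, None, None, None] 4

gapped = [None] * 8
predicated_npt_store(gapped, 0, vreg, pred)
print(gapped)  # [10, None, None, 13, 14, None, 16, None]
```

The collating store leaves the four enabled items contiguous, while the predicated NPT store leaves untouched slots where lanes were blocked.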
The store stage hardware has a separate resource for each VCOP data memory region (IBUFL, IBUFH, WBUF). A sequential data-driven store executes in X cycles, where X is the number of data items enabled by predication (or N when predication is not used). The other store modes execute in a single cycle per store (note that a collating store also executes in one cycle). Stores to all memory regions are carried out in parallel, but there is potential memory contention with the load stage. Write buffering in the store stage reduces load/store contention to some degree. See Section 8.3.5.6 for load/store buffering details.