SPRUI30H November 2015 – May 2024 DRA745, DRA746, DRA750, DRA756
A table lookup operation is indicated by starting the vector command with:
VLOOP TLU, CL#:cmd_len, PL#: param_len
The table lookup operation uses a subset of the available resources and instructions:
The table pointer is allowed to move via agen, so that the table base can depend on loop variables. For example, software might process three color components (RGB or YUV), each with a different table, in the same command, and use the outer loop variables to index among the color components and their corresponding tables.
Data is loaded into V2 with a normal load, for example:
VLDBU_1PT data_base[A0], V2
The data load using VLD has access to a subset of load distribution options: {NPT, 1PT, DS2, US2}.
A special table load, VTLD, is used for the lookup:
VTLDtype_mTBL_nPT tbl_base[tbl_agen][V2], V0, RND_SAT: rnd_sat
Normally, a lookup fetches only one item. Fetching multiple items per lookup is useful for bilinear and bicubic interpolation. The _mTBL field in VTLD specifies the number of parallel tables (num_par_tbl), and the _nPT field specifies the number of data items per lookup (num_data_per_lu).
There are constraints on num_par_tbl, num_data_per_lu, and the table data size:
Constraints for 8-way SIMD architecture are shown in Table 8-357.
Table type | Num items per lookup, num_data_per_lu | num_par_tbl = 1 | num_par_tbl = 2 | num_par_tbl = 4 | num_par_tbl = 8 |
---|---|---|---|---|---|
Byte | 1 | √ | √ | √ | √ |
Byte | 2 | √ | √ | √ | |
Byte | 4 | √ | √ | | |
Byte | 8 | √ | | | |
Half word | 1 | √ | √ | √ | √ |
Half word | 2 | √ | √ | √ | |
Half word | 4 | √ | √ | | |
Half word | 8 | √ | | | |
Word | 1 | √ | √ | √ | √ |
Word | 2 | √ | √ | √ | |
Word | 4 | √ | √ | | |
Word | 8 | √ | | | |
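The check marks in Table 8-357 appear to follow a single rule: both values are powers of two up to 8, and their product does not exceed the SIMD width. The following Python sketch encodes that reading as a validity check. The function name and the `nway` parameter are hypothetical, and the rule is inferred from the table rather than stated by the source.

```python
NWAY = 8  # SIMD width assumed from the 8-way case covered by the table


def tlu_config_is_valid(num_par_tbl, num_data_per_lu, nway=NWAY):
    """Return True if the combination appears as allowed in the table.

    Assumption: the allowed entries are exactly those where both values
    are in {1, 2, 4, 8} and their product fits in one NWAY-wide vector.
    The table shows the same pattern for byte, halfword, and word tables,
    so the table type is not a parameter here.
    """
    allowed = (1, 2, 4, 8)
    if num_par_tbl not in allowed or num_data_per_lu not in allowed:
        return False
    return num_par_tbl * num_data_per_lu <= nway
```

Under this reading, 4 parallel tables with 2 items per lookup is valid, while 4 parallel tables with 4 items per lookup is not.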
Results are stored back to memory with a normal data store, as in a compute loop, but in simplified form:
VSTtype_distribution_ALWS V0, base[agen], RND_SAT: rnd_sat_param
This VST has access to a subset of store distribution options: {NPT | SKIP}.
The number of data points stored per i4 iteration is num_data_per_lu × num_par_tbl, taken from the table load; this acts as an implicit predication.
For example, with the table load VTLDHU_1TBL_2PT, the store VSTHU_NPT does not store N points, but only 2 points.
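The implicit predication can be illustrated with a small Python model. It is only a sketch: the function names are hypothetical, and it assumes that the stored points occupy the lowest-numbered lanes of the vector, which the text does not state explicitly.

```python
NWAY = 8  # assumed SIMD width


def points_per_iteration(num_par_tbl, num_data_per_lu):
    """Number of data points the store writes per i4 iteration."""
    return num_par_tbl * num_data_per_lu


def implicit_store_mask(num_par_tbl, num_data_per_lu, nway=NWAY):
    """Per-lane flags: True where the store writes a point.

    Assumption: the valid points are packed into lanes 0..count-1.
    """
    count = points_per_iteration(num_par_tbl, num_data_per_lu)
    return [lane < count for lane in range(nway)]
```

For VTLDHU_1TBL_2PT, `points_per_iteration(1, 2)` gives 2, so only two lanes of the NPT store are written in this model.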
When the lookup indices are formed, the data are read in, rounded, and saturated. The rounding and saturation parameters are specified in VTLD’s RND_SAT: rnd_sat field. This field is 5 bits wide (as in VST), so only P0..P31 are allowed. The same set of round/saturate parameters is used as for the normal store, VST. VTLD’s RND_SAT field rounds and saturates the data to form the table lookup index. VST also contains a RND_SAT field, which rounds and saturates the looked-up table entries before they are stored back to data memory.
Various other table lookup parameters are extracted from the instructions:
V0[0] = tbl_base[data0], V0[1] = tbl_base[data0 + 1],
V0[2] = tbl_base[data0 + 2], V0[3] = tbl_base[data0 + 3],
V0[4] = tbl_base[data1], V0[5] = tbl_base[data1 + 1],
V0[6] = tbl_base[data1 + 2], V0[7] = tbl_base[data1 + 3]
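The lane assignment listed above corresponds to two lookups of four consecutive items each. A minimal Python model of that gather is sketched below. The function name is hypothetical, and representing each parallel table as its own Python list is a modeling assumption; it does not reflect the physical bank layout.

```python
def table_lookup(tables, indices, num_data_per_lu):
    """Model of the lookup result vector V0.

    tables:  one list per parallel table (assumed representation)
    indices: one rounded/saturated lookup index per parallel table
    Returns the fetched items, grouped per table, in lane order.
    """
    v0 = []
    for table, index in zip(tables, indices):
        # Each lookup fetches num_data_per_lu consecutive entries.
        for k in range(num_data_per_lu):
            v0.append(table[index + k])
    return v0


# Two parallel tables, four items per lookup, as in the listing above.
table_a = list(range(100, 116))
table_b = list(range(200, 216))
v0 = table_lookup([table_a, table_b], indices=[3, 5], num_data_per_lu=4)
# v0 == [103, 104, 105, 106, 205, 206, 207, 208]
```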
Table organization in memory depends on the table type and num_par_tbl. The N physical banks of 32-bit memory are partitioned into num_par_tbl logical banks, and each parallel table is organized inside its own bank.
For example, when NWAY = 8, you can have 1, 2, 4, or 8 parallel tables. Data organization for these and byte, halfword, and word type entries is shown in Figure 8-67.
A few examples of the table lookup operation follow. The NPT distribution for load and store works for normal lookup scenarios.
; single table, byte data, short table/output
VLOOP TLU, #cmd_len, #param_len
VAGEN A0, ...
VAGEN A1, ...
VAGEN A2, ...
VLDBU_1PT data_base[A0], V2 ; load data to V2
VTLDHU_1TBL_1PT table_base[A1][V2], V0, RND_SAT: rnd_sat ; look up
VSTH_NPT_ALWS V0, outp_base[A2], RND_SAT: rnd_sat_st ; store outcome
; 4 parallel tables, 2 items per table, short data, byte table/output
VLOOP TLU, #cmd_len, #param_len
VAGEN A0, ...
VAGEN A1, ...
VAGEN A2, ...
VLDHU_NPT data_base[A0], V2 ; load data to V2
VTLDBU_4TBL_2PT table_base[A1][V2], V0, RND_SAT: rnd_sat ; look up
VSTB_NPT_ALWS V0, outp_base[A2], RND_SAT: rnd_sat_st ; store outcome
Load with expansion can only be used in a table lookup loop. An example follows.
; load with expansion, short input data, unsigned byte flags, short expanded output
VLOOP TLU, #cmd_len, #param_len
VAGEN A0, ...
VAGEN A2, ...
VLDBU_NPT flag_base[A0], V2 ; load flags to V2
VLDH_EXP input_base, V0 ; load with expansion
VSTH_NPT_ALWS V0, outp_base[A2], RND_SAT: rnd_sat_st ; store outcome
For example, suppose that in a particular iteration the flags are V2 = {0, 0, 1, 0, 1, 1, 0, 0} and the expanding load pointer is 0x100 before the expanding load. After the expanding load:
V0 = {0, 0, mem[0x100], 0, mem[0x102], mem[0x104], 0, 0}
The pointer advances to 0x106: it is incremented by the number of nonzero flags (3) times the size of the data type (2 bytes for a halfword), that is, by 6 bytes.
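The expanding load in this example can be modeled in a few lines of Python. This is a behavioral sketch only: the function name is hypothetical, and memory is represented as a dictionary from byte address to halfword value purely for illustration.

```python
def expanding_load(mem, ptr, flags, elem_size=2):
    """Model of a load with expansion.

    For each lane, a nonzero flag consumes the next element at ptr and
    advances ptr by elem_size bytes; a zero flag yields 0 and leaves
    ptr unchanged. Returns the loaded vector and the updated pointer.
    """
    out = []
    for flag in flags:
        if flag:
            out.append(mem[ptr])
            ptr += elem_size
        else:
            out.append(0)
    return out, ptr


# Illustrative memory contents (values are made up).
mem = {0x100: 11, 0x102: 22, 0x104: 33}
flags = [0, 0, 1, 0, 1, 1, 0, 0]
v0, new_ptr = expanding_load(mem, 0x100, flags)
# v0 == [0, 0, 11, 0, 22, 33, 0, 0] and new_ptr == 0x106
```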
Optionally, a predicated store can be used when writing the expanded output array, leaving data points where flag == 0 unaltered.
VLOOP TLU, #cmd_len, #param_len
VAGEN A0, ...
VAGEN A2, ...
VLDBU_NPT flag_base[A0], V2 ; load flags to V2
VLDH_EXP input_base, V0 ; load with expansion
[V2] VSTH_NPT_ALWS V0, outp_base[A2], RND_SAT: rnd_sat_st ; store where V2 != 0
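The effect of the predicated store on the output array can be sketched in Python as follows. The function name is hypothetical, and the output array is modeled as a plain list indexed by element rather than by byte address.

```python
def predicated_store(dest, offset, values, flags):
    """Write values[lane] to dest[offset + lane] only where flags[lane] != 0.

    Lanes with a zero flag leave the existing contents of dest untouched.
    """
    for lane, (value, flag) in enumerate(zip(values, flags)):
        if flag:
            dest[offset + lane] = value


# Output array pre-filled with a marker value to show what is preserved.
outp = [9] * 8
predicated_store(outp, 0, [0, 0, 11, 0, 22, 33, 0, 0], [0, 0, 1, 0, 1, 1, 0, 0])
# outp == [9, 9, 11, 9, 22, 33, 9, 9]
```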
By pipelining the flag load, data load, and output store, load with expansion achieves NWAY expanded data points per cycle when there is no memory contention.