SPRUI30H November 2015 – May 2024 DRA745 , DRA746 , DRA750 , DRA756
Various arithmetic and logic operations are available in the compute command, which is indicated by starting the command with:
VLOOP COMP, cmd_len, param_len
Vector arithmetic/logic instructions use the following assembly formats, depending on whether the operation is
1-input-1-output, 2-input-1-output, 2-input-2-output, or 3-input-1-output, whether accumulator clearing is enabled, and whether rounding is enabled.
V<op_1i1o> src1, dst
V<op_2i1o> src1, src2, dst
V<op_2i1o> src1, src2, dst, RND: rnd_param
V<op_2i2o> src1/dst1, src2/dst2
V<op_2i2o> src1, src2/dst1, dst2
V<op_3i1o> src1, src2, src3, dst
V<op_3i1o> src1, src2, src3, dst, RND: rnd_param
All operations are 40-bit, except:
The operations in Table 8-354 are supported. See Section 8.3.5.9 for details on each operation.
Operation | #In-Out | #Bits | #Del | Syntax(1) | Note |
---|---|---|---|---|---|
VNOP | | 40 | | VNOP | |
VADD | 2-1 | 40 | | VADD src1, src2, dst | src1 + src2 |
VSUB | 2-1 | 40 | | VSUB src1, src2, dst | src1 - src2 |
VABSDIF | 2-1 | 40 | | VABSDIF src1, src2, dst | \|src1 - src2\| |
VMPY | 2-1 | 17/33 | 1 | VMPY src1, src2, dst, RND: rnd_param | src1 * src2 |
VAND | 2-1 | 40 | | VAND src1, src2, dst | src1 & src2 |
VOR | 2-1 | 40 | | VOR src1, src2, dst | src1 \| src2 |
VXOR | 2-1 | 40 | | VXOR src1, src2, dst | src1 ^ src2 |
VMIN | 2-1 | 33 | | VMIN src1, src2, dst | min(src1, src2) |
VMAX | 2-1 | 33 | | VMAX src1, src2, dst | max(src1, src2) |
VANDN | 2-1 | 40 | | VANDN src1, src2, dst | src1 & (~src2) |
VSHF | 2-1 | 40/6 | | VSHF src1, src2, dst | src1 << src2, or src1 >> (-src2) |
VRND | 2-1 | 40/5 | 1 | VRND src1, src2, dst | (src1 + (1 << (src2 - 1))) >> src2 |
VCMPEQ | 2-1 | 40 | | VCMPEQ src1, src2, dst | (src1 == src2) ? 1 : 0 |
VCMPGT | 2-1 | 40 | | VCMPGT src1, src2, dst | (src1 > src2) ? 1 : 0 |
VCMPGE | 2-1 | 40 | | VCMPGE src1, src2, dst | (src1 >= src2) ? 1 : 0 |
VBINLOG | 1-1 | 32 | 1 | VBINLOG src1, dst | approximate binary log |
VBITC | 1-1 | 32 | 1 | VBITC src1, dst | count one bits |
VNOT | 1-1 | 40 | | VNOT src1, dst | ~src1 |
VMADD | 3-1 | 17/40 | 1 / 2 | VMADD src1, src2, src3, dst, RND: rnd_param | src3 + src1 * src2 |
VMSUB | 3-1 | 17/40 | 1 / 2 | VMSUB src1, src2, src3, dst, RND: rnd_param | src3 - src1 * src2 |
VADD3 | 3-1 | 40 | 1 | VADD3 src1, src2, src3, dst | src1 + src2 + src3 |
VSAD | 3-1 | 40 | 1 | VSAD src1, src2, src3, dst | src3 + abs(src1 - src2) |
VSEL | 3-1 | 40 | | VSEL src1, src2, src3, dst | src1 ? src2 : src3 |
VAND3 | 3-1 | 40 | 1 | VAND3 src1, src2, src3, dst | src1 & src2 & src3 |
VOR3 | 3-1 | 40 | 1 | VOR3 src1, src2, src3, dst | src1 \| src2 \| src3 |
VSHFOR | 3-1 | 40/6 | 1 | VSHFOR src1, src2, src3, dst | src3 \| (src1 << src2), or src3 \| (src1 >> (-src2)) |
VSORT2 | 2-2 | 33 | | VSORT2 src1/dst1, src2/dst2 | dst1 = min(src1, src2), dst2 = max(src1, src2) |
VBITPK | 2-1 | 33 | 1 | VBITPK src1, src2, dst | compare, bit-pack, broadcast |
VBITUNPK | 2-1 | 40 | 1 | VBITUNPK src1, src2, dst | bit unpack |
VEXITNZ | | 40 | | VEXITNZ level, src1 | exit loop at end of iteration when (src1 != 0) |
VCMOV | | 40 | | VCMOV cond, src1, dst | conditional move |
VBITR | 1-1 | 32 | 1 | VBITR src1, dst | bit reverse |
VBITI | 2-1 | 32 | 1 | VBITI src1, src2, dst | bit interleave |
VBITDI | 1-2 | 32 | | VBITDI src1, dst1, dst2 | bit deinterleave |
VABS | 1-1 | 40 | | VABS src1, dst | abs(src1) |
VADDH | 2-1 | 40 | | VADDH src1, src2, dst | src1 + signext(src2[39:32]) |
VLMBD | 3-1 | 40 | 1 | VLMBD src1, src2, dst | left-most-bit-detect |
VBITTR | 1-1 | NSIMD | 1 | VBITTR src1, dst | bit transpose |
VSIGN | 2-1 | 40 | | VSIGN src1, src2, dst | apply sign of src1 on src2 |
VADDSUB | 2-2 | 40 | | VADDSUB src1/dst1, src2/dst2 | dst1 = src1 + src2, dst2 = src1 - src2 |
VINTRLV | 2-2 | 40 | | VINTRLV src1/dst1, src2/dst2 | interleave |
VDINTRLV | 2-2 | 40 | | VDINTRLV src1/dst1, src2/dst2 | deinterleave |
VMINSETF | 2-2 | 33 | | VMINSETF src1, src2/dst1, dst2 | minimum and set flag |
VMAXSETF | 2-2 | 33 | | VMAXSETF src1, src2/dst1, dst2 | maximum and set flag |
VINTRLV2 | 2-2 | 40 | | VINTRLV2 src1/dst1, src2/dst2 | interleave with 2-element frequency |
VDINTRLV2 | 2-2 | 40 | | VDINTRLV2 src1/dst1, src2/dst2 | deinterleave with 2-element frequency |
VINTRLV4 | 2-2 | 40 | | VINTRLV4 src1/dst1, src2/dst2 | interleave with 4-element frequency |
VSHF16 | 1-2 | 33 | | VSHF16 src1, dst1, dst2 | shift up 16 bits into 2 registers |
VADIF3 | 3-1 | 40 | 1 | VADIF3 src1, src2, src3, dst | add difference, dst = src1 - src2 + src3 |
VSWAP | 3-2 | 40 | | VSWAP cond, src1/dst1, src2/dst2 | conditional swap |
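
The Note column of Table 8-354 can be read as per-lane arithmetic. The following Python sketch models a handful of the operations exactly as the notes state them, as an aid to reading the table rather than a description of the hardware. The function names and the `lanes` helper are hypothetical, 40-bit overflow behavior is not modeled (Python integers are unbounded), `vrnd` assumes a shift amount of at least 1, and `vbitc` assumes the count covers the low 32 bits of the source.

```python
# Per-lane reference models for a few Table 8-354 operations (illustrative only).
# Assumption: values are plain Python ints; 40-bit wrap or saturation is NOT modeled.

def vshf(src1, src2):
    """src1 << src2 for non-negative src2, otherwise src1 >> (-src2)."""
    return src1 << src2 if src2 >= 0 else src1 >> (-src2)

def vrnd(src1, src2):
    """(src1 + (1 << (src2 - 1))) >> src2; assumes src2 >= 1."""
    return (src1 + (1 << (src2 - 1))) >> src2

def vandn(src1, src2):
    """src1 & (~src2)."""
    return src1 & ~src2

def vsel(src1, src2, src3):
    """src1 ? src2 : src3."""
    return src2 if src1 else src3

def vsad(src1, src2, src3):
    """src3 + abs(src1 - src2); src3 is the accumulator input."""
    return src3 + abs(src1 - src2)

def vadif3(src1, src2, src3):
    """src1 - src2 + src3."""
    return src1 - src2 + src3

def vsort2(src1, src2):
    """Returns (dst1, dst2) = (min, max)."""
    return min(src1, src2), max(src1, src2)

def vaddsub(src1, src2):
    """Returns (dst1, dst2) = (src1 + src2, src1 - src2)."""
    return src1 + src2, src1 - src2

def vbitc(src1):
    """Count one bits; assumes only the low 32 bits are counted."""
    return bin(src1 & 0xFFFFFFFF).count("1")

def lanes(op, *vectors):
    """Apply a per-lane model to every element position of the input vectors."""
    return [op(*elems) for elems in zip(*vectors)]

if __name__ == "__main__":
    print(lanes(vrnd, [5, 7, -5], [1, 2, 1]))
    print(lanes(vsort2, [5, 1], [2, 9]))
```

Because the two-output models return a `(dst1, dst2)` tuple, `lanes` yields one tuple per lane for operations such as VSORT2 and VADDSUB.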
Operations with two destinations must use two different destination registers, otherwise the outcome is undefined.
The accumulating register (being a source as well as the destination of an operation) must be src1 for
2-input-1-output operations, and must be src3 for 3-input-1-output operations.
The rounding parameter is needed for the VMPY, VMADD and VMSUB instructions, and specifies an index into the parameter file; see Section 8.3.5.9.
EVE hardware executes up to 2 operations in parallel per clock cycle. The assembly program (written by the programmer or generated by the compiler) uses parallel bar notation to indicate whether an instruction executes by itself or in parallel with another instruction.
The VMADD and VMSUB instructions have two delay slots for the multiplication input, and one delay slot for the addition/subtraction input.
VMPY, VRND, VBINLOG, VBITC, VADD3, VSAD, VAND3, VOR3, VSHFOR, VBITPK, VBITUNPK, VBITR, VBITI, VLMBD, VBITTR, VADIF3 instructions have one delay slot.
All other operations do not have delay slots.
The hardware detects and handles write/read dependencies between any two sequential instructions, but not within a pair of instructions executed in parallel. When necessary for correctness, the hardware inserts idle cycles automatically. To achieve good performance, however, software should schedule operations so that automatic idle cycles are avoided.
There is a forwarding path within each functional unit and between the two functional units to forward from destination to source 3 (accumulator input) of 3-input-1-output operations.
Register forwarding, dependency checking and automatic idle cycle insertion work across iterations as well, from end of one iteration to the beginning of the next iteration. For example, the FIR filtering kernel executes in one cycle per iteration with the destination-to-source3 dependency across iterations.
Table 8-355 shows delay slots and automatically inserted idle cycles.
Time | VMPY | VMADD | VSUB | VMSUB | VAND | VOR | VADD |
---|---|---|---|---|---|---|---|
0 | V0*V1 | | | | | | |
1 | V0*V1 => V7 | V2*V3 | V2-V6 => V6 | | | | |
2 | | V2*V3 => p | | V4*V5 | V3 & V6 => V6 | | |
3 | | V7 + p => V7 | | V4*V5 => p | | | wait for V7 |
4 | | | | V7 - p => V7 | | | wait for V7 |
5 | | | | | | V1 \| V8 => V9 | V6 + V7 => V6 |
A zero-delay-slot operation executes in one cycle. VMPY, a one-delay-slot operation, takes two cycles to execute. VMADD and VMSUB, two-delay-slot operations, take 3 cycles to execute, but need their additional input only on the third cycle. The dependency between the two instructions (VMADD-VMSUB) on the additional operand therefore does not introduce idle cycles, but the VMSUB-VADD dependency adds 2 idle cycles to execution per iteration.
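
The timing rules above can be checked with a small issue-time model. The Python sketch below is a simplified illustration and not a cycle-accurate description of EVE: the `schedule` function, the tuple layout, and the issue-group representation are hypothetical. It assumes that a result written in cycle t can be read from cycle t + 1, that ordinary sources are read in the issue cycle, and that only the src3 (accumulator) input of VMADD and VMSUB is read two cycles after issue. Dependencies inside a parallel pair are ignored, as the text states the hardware does not check them.

```python
# Simplified issue-time model of delay slots and automatic idle cycles.
# All names and the data layout are hypothetical; see the stated assumptions.

ONE_DELAY = {"VMPY", "VRND", "VBINLOG", "VBITC", "VADD3", "VSAD", "VAND3", "VOR3",
             "VSHFOR", "VBITPK", "VBITUNPK", "VBITR", "VBITI", "VLMBD", "VBITTR",
             "VADIF3"}
TWO_DELAY = {"VMADD", "VMSUB"}

def delay_slots(op):
    """Delay slots per the lists in the text; every other operation has none."""
    if op in TWO_DELAY:
        return 2
    return 1 if op in ONE_DELAY else 0

def schedule(groups):
    """Return the issue cycle of each group (1 or 2 instructions issued together).

    Each instruction is a tuple (op, sources, destinations) of register names.
    """
    ready = {}        # register -> first cycle in which its newest value is readable
    issue_times = []
    next_slot = 0
    for group in groups:
        t = next_slot
        for op, srcs, _ in group:
            for i, reg in enumerate(srcs):
                # Assumption: src3 of VMADD/VMSUB is needed only on the third cycle.
                late = 2 if (op in TWO_DELAY and i == 2) else 0
                t = max(t, ready.get(reg, 0) - late)
        for op, _, dsts in group:
            for reg in dsts:
                ready[reg] = t + delay_slots(op) + 1
        issue_times.append(t)
        next_slot = t + 1
    return issue_times

# The instruction sequence of Table 8-355, grouped into parallel pairs.
TABLE_8_355 = [
    [("VMPY", ["V0", "V1"], ["V7"])],
    [("VMADD", ["V2", "V3", "V7"], ["V7"]), ("VSUB", ["V2", "V6"], ["V6"])],
    [("VMSUB", ["V4", "V5", "V7"], ["V7"]), ("VAND", ["V3", "V6"], ["V6"])],
    [("VOR", ["V1", "V8"], ["V9"]), ("VADD", ["V6", "V7"], ["V6"])],
]

if __name__ == "__main__":
    times = schedule(TABLE_8_355)
    idle = times[-1] + 1 - len(times)   # cycles in which nothing was issued
    print(times, "idle cycles:", idle)
```

Under these assumptions the model issues the four groups at cycles 0, 1, 2 and 5, which reproduces the two idle cycles caused by the VMSUB-VADD dependency. A chain of VMADD instructions that accumulate into the same register issues on consecutive cycles, in line with the FIR filtering example.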