SSE5, the instruction set extension supported by AMD's K11 (Bulldozer) microarchitecture, which is due to be introduced at the 32nm process generation, adds a pcmov instruction. Like vsel in AltiVec and selb on the Cell BE SPE, it implements a 3-input bitwise select operation. As with the AltiVec and SPE implementations, using it can be expected to reduce the number of logic gates that make up Bitslice DES.
Multiple licenses are unlikely because we do not demand too much current.
Here I focus only on optimization that reduces the number of bitwise logical operations. The operation count does not reflect actual throughput. Performance could even be worse than with conventional Bitslice DES.
Please treat this as reference only.
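As a rough illustration of the 3-input select described above, here is a minimal C sketch that operates on 64-bit words instead of vector registers. The helper names bitselect and bitselect_xor are hypothetical, and the operand order is an assumption made for illustration only; vsel, selb and pcmov each define their own operand order.

```c
#include <stdint.h>

/* Bitwise select: for each bit position, take the bit from a where mask
 * is 1 and the bit from b where mask is 0. The operand order is an
 * assumption for illustration; the real instructions define their own. */
static uint64_t bitselect(uint64_t a, uint64_t b, uint64_t mask)
{
    /* and + andn + or: three two-input operations */
    return (a & mask) | (b & ~mask);
}

/* Equivalent form, also three two-input operations (xor, and, xor). */
static uint64_t bitselect_xor(uint64_t a, uint64_t b, uint64_t mask)
{
    return ((a ^ b) & mask) ^ b;
}
```

Without a select instruction, each such multiplexer costs three two-input operations. A single select instruction replaces all three, which is where the reduction in gate count would come from.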
Future versions of Tripcode Explorer: which is more advantageous, Intel or AMD?
Can pcmov reduce the number of instructions that make up an S-Box?
From the Bitslice DES S-Boxes for SSE5 (trial version), I took function S4, the one with the fewest bitwise logical operations (32 gates after optimization), compiled it with gcc 4.3, and had it emit Linux x86-64 assembly code. To reduce the instruction count further, I removed the redundant movdqa instructions by hand.
Note: xmm0 to xmm5 are taken as the input parameters and the four outputs are taken as destination pointers, so the comparison should be on equal terms.
This will be quite tight unless the throughput of pcmov is the same as that of pand / pandn / por / pxor.
No compiler can generate AVX code at the moment (does the latest Intel C++ support it?), so I worked from the short code of the S4 function in Kwan's nonstd.c. With AVX's three-operand format (two inputs, one output), temporary variables no longer need to be saved, so the required instruction count is easy to calculate: 41 gates plus the XORs with the output values can be done in 49 instructions. (There are enough registers that no variables need to be saved to memory, so of course the trick of saving values in the upper half of the YMM registers is not needed either.)
The ideal is AVX + SSE5
You can cut the register-save instructions by making the destination independent of the sources, or you can cut the number of logical operations, but optimizing for only one of the two does not help much. Being able to use both is not too much to ask.
With AVX plus a true four-operand, VEX-encoded form of the SSE5 pcmov (vpcmov?), it would take 32 + 8 (four XORs with the output values plus the stores), or about 40 instructions at minimum.
However, as I said earlier, the performance only comes out as counted if pcmov has the same throughput as the other two-source logical operations.
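The instruction counts quoted above can be reproduced with a simple tally. This is a sketch of the arithmetic only, not a measurement. The function name sbox_instruction_count is hypothetical, and splitting the extra 8 instructions of the 41-gate AVX case into four XORs and four stores is an assumption carried over from the 32 + 8 breakdown.

```c
/* Hypothetical tally, not a measurement: instructions for one S-box call
 * = logic gates + XORs into the output values + stores. It assumes a
 * non-destructive operand format, so no register copies (movdqa) are
 * needed to preserve inputs. */
static int sbox_instruction_count(int gates, int output_xors, int stores)
{
    return gates + output_xors + stores;
}
```

Under these assumptions, sbox_instruction_count(41, 4, 4) gives the 49 instructions of the AVX-only case and sbox_instruction_count(32, 4, 4) gives the 40 instructions of the AVX + pcmov case.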