Bitslice DES and SSE5



AMD has cancelled SSE5 and announced XOP, an AVX-compatible 128-bit / 256-bit SIMD instruction set extension. XOP is far more convenient, so most of this content has become moot.

New page is here

SSE5 Enhanced Bitslice DES

What is this?

SSE5, the instruction set extension to be supported by AMD's 32nm-generation K11 (Bulldozer) microarchitecture, adds a pcmov instruction. Like vsel in AltiVec and selb in the Cell BE SPE, it implements a 3-input bitwise select operation. As with the AltiVec and SPE implementations, using it can be expected to reduce the number of logic gates needed to build Bitslice DES.
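As a rough illustration of why a 3-input select saves gates, here is a minimal Python sketch that models one 64-bit lane. The function names are made up for this example, and the operand order shown is an assumption; it is not taken from the SSE5 specification.

```python
MASK64 = (1 << 64) - 1  # model a single 64-bit lane; real XMM registers are 128 bits


def bit_select(a, b, sel):
    """Per bit: take the bit of a where sel is 1, the bit of b where sel is 0.

    This is what a single select instruction (vsel / selb / pcmov style) computes.
    The operand order here is an assumption made for illustration.
    """
    return ((a & sel) | (b & ~sel)) & MASK64


def bit_select_two_input(a, b, sel):
    """The same function built only from two-input gates: AND, ANDN, OR."""
    t1 = a & sel              # AND
    t2 = (~sel & b) & MASK64  # ANDN
    return t1 | t2            # OR


# Each bit position is an independent instance, which is the bitslice idea.
print(hex(bit_select(0xAAAA, 0x5555, 0xFF00)))  # 0xaa55
```

With only two-input instructions the select costs three gates; with a dedicated instruction it costs one, which is where the reduction in S-box gate count comes from.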


Multiple licenses are unlikely because we do not demand too much current.


Here we focus only on optimizations that reduce the number of bitwise logical operations. This does not reflect actual throughput; performance may even turn out worse than with conventional Bitslice DES.

Please treat this as reference only.

Future versions of Tripcode Explorer: is Intel or AMD more advantageous?

Does pcmov reduce the number of instructions needed to build the S-boxes?

Let's take the S4 function (the one with the fewest bitwise logical operations) from the Bitslice DES S-boxes for SSE5 (trial version), optimized down to 32 gates, compile it with gcc 4.3, and have it emit Linux x86-64 assembly code. To reduce the instruction count further, superfluous movdqa instructions were then removed by hand.

The instruction scheduling is pretty nasty.

    movdqa xmm10, xmm1
    movdqa xmm9, xmm4
    movdqa xmm6, xmm2
    por xmm10, xmm4
    movdqa xmm11, xmm0
    movdqa xmm7, xmm2
    movdqa xmm13, xmm4
    movdqa xmm12, xmm10
    pcmov xmm7, xmm1, xmm4, xmm7
    movdqa xmm14, xmm3
    pxor xmm12, XMMWORD PTR [ALL_ONE]
    por xmm9, xmm12
    pcmov xmm6, xmm12, xmm1, xmm6
    movdqa xmm8, xmm6
    pandn xmm8, xmm9
    pcmov xmm11, xmm6, xmm8, xmm11
    pxor xmm7, xmm8
    pcmov xmm2, xmm11, xmm0, xmm2
    pxor xmm6, xmm9
    movdqa xmm8, xmm0
    pcmov xmm8, xmm6, xmm7, xmm8
    pcmov xmm7, xmm11, xmm8, xmm7
    pxor xmm9, xmm7
    pcmov xmm7, xmm6, xmm1, xmm7
    pxor xmm13, xmm7
    pcmov xmm14, xmm11, xmm8, xmm14
    pxor xmm8, xmm2
    pcmov xmm9, xmm9, xmm13, xmm3
    pcmov xmm6, xmm7, xmm12, xmm6
    pcmov xmm9, xmm9, xmm14, xmm5
    pxor xmm6, xmm0
    pcmov xmm14, xmm14, xmm9, xmm5
    pxor xmm10, xmm8
    pxor xmm14, XMMWORD PTR [rcx]
    pcmov xmm10, xmm10, xmm11, xmm0
    movdqa XMMWORD PTR [rcx], xmm14
    pcmov xmm4, xmm4, xmm13, xmm10
    pxor xmm2, xmm4
    pcmov xmm8, xmm8, xmm10, xmm3
    pcmov xmm3, xmm6, xmm2, xmm3
    pcmov xmm3, xmm3, xmm8, xmm5
    pcmov xmm8, xmm8, xmm6, xmm5
    pxor xmm8, XMMWORD PTR [rdi]
    pxor xmm3, xmm5
    movdqa XMMWORD PTR [rdi], xmm8
    pxor xmm9, xmm5
    pxor xmm3, XMMWORD PTR [rsi]
    movdqa XMMWORD PTR [rsi], xmm3
    pxor xmm9, XMMWORD PTR [rdx]
    movdqa XMMWORD PTR [rdx], xmm9


Somehow that comes to 50 instructions.

S4 in John the Ripper's x86-64.S takes 63 instructions, so this is a significant reduction.

※ xmm0 through xmm5 are the input parameters, and four pointers are taken as the output destinations, so this should be a like-for-like comparison.

This will be pretty tough unless pcmov has the same throughput as pand / pandn / por / pxor.

No compiler can generate AVX code at the moment (does the latest Intel C++ support it?), so the code is omitted here; the base is the S4 function from Kwan's nonstd.c. With AVX, the three-operand format (two inputs, one output) removes the need to save temporaries into other registers, so the required instruction count is easy to work out: 41 gates plus the XORs into the output values comes to 49 instructions. (This works because there are enough registers that no variable has to be spilled to memory. Naturally, no tricks such as stashing values in the upper halves of the YMM registers are needed.)

The ideal is AVX + SSE5

Whether it is cutting register-save instructions by making the destination independent of the sources, or cutting the number of logical operations, both help optimization. Being able to use both would be best.

If AVX gains a true 4-operand pcmov in a VEX-style encoding of SSE5 (vpcmov?), then 32 + 8 (4 XORs with the output values plus 4 stores), that is about 40 instructions at minimum, will suffice.

That said, as mentioned earlier, the performance only materializes if pcmov has the same throughput as the other two-source logical operations.

Incidentally, it would look like this.

    vpor xmm10, xmm1, xmm4
    vpcmov xmm7, xmm1, xmm4, xmm2
    vpxor xmm12, xmm10, XMMWORD PTR [ALL_ONE]
    vpor xmm9, xmm4, xmm12
    vpcmov xmm6, xmm12, xmm1, xmm2
    vpandn xmm8, xmm6, xmm9
    vpcmov xmm11, xmm6, xmm8, xmm0
    vpxor xmm7, xmm7, xmm8
    vpcmov xmm2, xmm11, xmm0, xmm2
    vpxor xmm6, xmm6, xmm9
    vpcmov xmm8, xmm6, xmm7, xmm0
    vpcmov xmm7, xmm11, xmm8, xmm7
    vpxor xmm9, xmm9, xmm7
    vpcmov xmm7, xmm6, xmm1, xmm7
    vpxor xmm13, xmm4, xmm7
    vpcmov xmm14, xmm11, xmm8, xmm3
    vpxor xmm8, xmm8, xmm2
    vpcmov xmm9, xmm9, xmm13, xmm3
    vpcmov xmm6, xmm7, xmm12, xmm6
    vpcmov xmm9, xmm9, xmm14, xmm5
    vpxor xmm6, xmm6, xmm0
    vpcmov xmm14, xmm14, xmm9, xmm5
    vpxor xmm10, xmm10, xmm8
    vpxor xmm14, xmm14, XMMWORD PTR [rcx]
    vpcmov xmm10, xmm10, xmm11, xmm0
    vmovdqa XMMWORD PTR [rcx], xmm14
    vpcmov xmm4, xmm4, xmm13, xmm10
    vpxor xmm2, xmm2, xmm4
    vpcmov xmm8, xmm8, xmm10, xmm3
    vpcmov xmm3, xmm6, xmm2, xmm3
    vpcmov xmm3, xmm3, xmm8, xmm5
    vpcmov xmm8, xmm8, xmm6, xmm5
    vpxor xmm8, xmm8, XMMWORD PTR [rdi]
    vpxor xmm3, xmm3, xmm5
    vmovdqa XMMWORD PTR [rdi], xmm8
    vpxor xmm9, xmm9, xmm5
    vpxor xmm3, xmm3, XMMWORD PTR [rsi]
    vmovdqa XMMWORD PTR [rsi], xmm3
    vpxor xmm9, xmm9, XMMWORD PTR [rdx]
    vmovdqa XMMWORD PTR [rdx], xmm9
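The instruction counts quoted in this post can be tallied with a small Python sketch. The function name and the breakdown into categories are my own, for illustration only; the figure of 10 register copies comes from counting the register-to-register movdqa instructions in the two-operand SSE5 listing, not from any stated figure.

```python
def instruction_total(gates, output_xors=4, stores=4, register_copies=0):
    """Tally instructions for one S4 variant.

    gates           : logical operations that form the S-box itself
    output_xors     : XORs of the four results into the output values
    stores          : stores of the four updated output values
    register_copies : movdqa copies forced by the destructive two-operand form
    """
    return gates + output_xors + stores + register_copies


# SSE5 pcmov with two-operand SSE logic: 32 gates plus 10 register copies
sse5_two_operand = instruction_total(32, register_copies=10)

# AVX three-operand logic without pcmov, on the 41-gate S4
avx_only = instruction_total(41)

# AVX three-operand logic together with a 4-operand vpcmov
avx_with_vpcmov = instruction_total(32)

print(sse5_two_operand, avx_only, avx_with_vpcmov)  # 50 49 40
```

The three-operand format removes the copies and the select instruction removes gates, so the two savings are independent and add up.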

(Added 2009.05.02)

With AVX + XOP, my personal fantasy has turned into reality. The code above should be usable in practice. Maybe.
