Disclaimer
AMD has cancelled SSE5 and announced XOP, an AVX-compatible 128-bit / 256-bit SIMD instruction set extension. XOP is far more convenient, so most of the content on this page is now moot.
New page is here
SSE5 Enhanced Bitslice DES
What is this?
SSE5, the instruction set extension to be supported by AMD's 32nm-generation K11 (Bulldozer) micro-architecture, adds the pcmov instruction. Like vsel in AltiVec and selb on the Cell BE SPE, it performs a bitwise select with three inputs. As in the AltiVec and SPE implementations, using it should reduce the number of logic gates needed to build Bitslice DES.
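As a rough illustration of why a three-input select saves gates, here is a minimal C sketch. It assumes a 64-bit integer as a stand-in for a SIMD register, and the names slice_t, select_3ops, and select_xor are hypothetical. The argument order is chosen for readability and is not meant to mirror pcmov's actual operand order.

```c
#include <stdint.h>

/* One bitslice word: every bit position is an independent lane.
   Assumption: uint64_t stands in for a 128-bit XMM register. */
typedef uint64_t slice_t;

/* Bitwise select built from ordinary two-input gates.
   For each bit: take a where sel is 1, take b where sel is 0.
   Cost: three logical operations (AND, AND-NOT, OR). */
static slice_t select_3ops(slice_t a, slice_t b, slice_t sel)
{
    return (a & sel) | (b & ~sel);
}

/* Equivalent formulation using XOR; still three operations. */
static slice_t select_xor(slice_t a, slice_t b, slice_t sel)
{
    return b ^ ((a ^ b) & sel);
}
```

Without a select instruction, each select in an S-box costs three two-input operations as above. An instruction such as pcmov, vsel, or selb does the same work in one, which is where the gate-count reduction comes from.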
Download
Multi-licensing is not offered, since there does not seem to be much demand for it at present.
- 2nd edition (BSD-style license) (2009.01.02 update)
Caution
The optimization here focuses solely on reducing the number of bitwise logical operations. It does not reflect actual throughput, and performance may even be worse than with conventional Bitslice DES.
Please treat it as a reference only.
Future versions of Tripcode Explorer - which is more advantageous, Intel or AMD?
How far does pcmov reduce the number of instructions in an S-Box?
Let's take function S4 (the one with the fewest bitwise logical operations, 32 gates after optimization) from the SSE5 Bitslice DES S-Boxes (trial version), compile it with gcc 4.3, and have it emit assembly code for Linux x86-64. To reduce the instruction count further, superfluous movdqa instructions have been trimmed by hand.
The instruction scheduling is fairly rough.
movdqa xmm10, xmm1
movdqa xmm9, xmm4
movdqa xmm6, xmm2
por xmm10, xmm4
movdqa xmm11, xmm0
movdqa xmm7, xmm2
movdqa xmm13, xmm4
movdqa xmm12, xmm10
pcmov xmm7, xmm1, xmm4, xmm7
movdqa xmm14, xmm3
pxor xmm12, XMMWORD PTR [ALL_ONE]
por xmm9, xmm12
pcmov xmm6, xmm12, xmm1, xmm6
movdqa xmm8, xmm6
pandn xmm8, xmm9
pcmov xmm11, xmm6, xmm8, xmm11
pxor xmm7, xmm8
pcmov xmm2, xmm11, xmm0, xmm2
pxor xmm6, xmm9
movdqa xmm8, xmm0
pcmov xmm8, xmm6, xmm7, xmm8
pcmov xmm7, xmm11, xmm8, xmm7
pxor xmm9, xmm7
pcmov xmm7, xmm6, xmm1, xmm7
pxor xmm13, xmm7
pcmov xmm14, xmm11, xmm8, xmm14
pxor xmm8, xmm2
pcmov xmm9, xmm9, xmm13, xmm3
pcmov xmm6, xmm7, xmm12, xmm6
pcmov xmm9, xmm9, xmm14, xmm5
pxor xmm6, xmm0
pcmov xmm14, xmm14, xmm9, xmm5
pxor xmm10, xmm8
pxor xmm14, XMMWORD PTR [rcx]
pcmov xmm10, xmm10, xmm11, xmm0
movdqa XMMWORD PTR [rcx], xmm14
pcmov xmm4, xmm4, xmm13, xmm10
pxor xmm2, xmm4
pcmov xmm8, xmm8, xmm10, xmm3
pcmov xmm3, xmm6, xmm2, xmm3
pcmov xmm3, xmm3, xmm8, xmm5
pcmov xmm8, xmm8, xmm6, xmm5
pxor xmm8, XMMWORD PTR [rdi]
pxor xmm3, xmm5
movdqa XMMWORD PTR [rdi], xmm8
pxor xmm9, xmm5
pxor xmm3, XMMWORD PTR [rsi]
movdqa XMMWORD PTR [rsi], xmm3
pxor xmm9, XMMWORD PTR [rdx]
movdqa XMMWORD PTR [rdx], xmm9
Grumble
Somehow this comes to 50 instructions.
S4 in John the Ripper's x86-64.S takes 63 instructions, so this is a substantial reduction.
※ xmm0 through xmm5 are the input parameters and the four destinations are passed as pointers, so this should be a like-for-like comparison.
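The calling convention described here (six inputs by value, four outputs through pointers, each result XORed into its destination) can be sketched in C as follows. This is only an illustration under stated assumptions: toy_sbox is a hypothetical name, its gate network is a made-up toy and NOT the real DES S4, and uint64_t stands in for the 128-bit vector type.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t slice_t;  /* assumption: stand-in for a 128-bit vector */

/* Three-input bitwise select: a where sel is 1, b where sel is 0. */
static slice_t bitselect(slice_t a, slice_t b, slice_t sel)
{
    return (a & sel) | (b & ~sel);
}

/* Hypothetical S-box-shaped function. The gates below are a toy network,
   not the actual S4; only the interface shape is the point. */
static void toy_sbox(slice_t a1, slice_t a2, slice_t a3,
                     slice_t a4, slice_t a5, slice_t a6,
                     slice_t *out1, slice_t *out2,
                     slice_t *out3, slice_t *out4)
{
    assert(out1 && out2 && out3 && out4);

    slice_t t1 = a1 | a4;
    slice_t t2 = ~t1;                    /* NOT, done as XOR with all-ones */
    slice_t t3 = bitselect(a2, a3, a5);
    slice_t t4 = t3 ^ a6;

    /* Each output is XORed into memory: one XOR plus one store apiece. */
    *out1 ^= t1;
    *out2 ^= t2;
    *out3 ^= t3;
    *out4 ^= t4;
}
```

With a real 128-bit vector type under the System V x86-64 calling convention, the six inputs arrive in xmm0 through xmm5 and the four pointers in rdi, rsi, rdx, and rcx, which matches the registers used in the listing. The final `^=` lines correspond to the pxor-from-memory and movdqa-to-memory pairs at the end of that listing.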
That said, things will be rather tough unless pcmov has the same throughput as pand / pandn / por / pxor.
No compiler can generate AVX code at the moment (does the latest Intel C++ support it?), so I will omit the code based on the S4 function in Kwan's nonstd.c. With AVX, the three-operand format (two inputs, one output) removes the need to save values in temporary variables, so the required instruction count is easy to work out: 41 gates plus the XORs and stores for the output values comes to 49 instructions. (There are enough registers that no variables need to be spilled to memory, and of course this assumes the upper halves of the YMM registers do not have to be saved.)
The ideal: AVX + SSE5
Whether you cut register-saving instructions by making the destination independent of the sources, or cut the number of logical operations, both are useful optimizations and neither can substitute for the other. So wanting both is not being extravagant.
If AVX also offered a VEX-encoded, true four-operand form of SSE5's pcmov (vpcmov?), the whole function would need as few as about 40 instructions: 32 + 8 (four XORs with the output values plus four stores).
However, as I said earlier, any performance gain assumes that pcmov has the same throughput as the other two-source logical operations.
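The instruction counts quoted so far can be tallied with a small C sketch. The function name instruction_count is hypothetical. The gate counts (32 and 41) come from the text, while the figure of 10 register-to-register movdqa copies is obtained by counting them in the SSE5 listing.

```c
#include <assert.h>

enum { OUTPUTS = 4 };  /* S4 writes four output values */

/* Total instructions = logic gates
                      + one XOR with memory per output
                      + one store per output
                      + register-to-register copies forced by the
                        destructive two-operand format. */
static int instruction_count(int gates, int reg_copies)
{
    assert(gates >= 0 && reg_copies >= 0);
    return gates + OUTPUTS + OUTPUTS + reg_copies;
}
```

Under these assumptions, `instruction_count(32, 10)` gives the 50 instructions of the SSE5 version, `instruction_count(41, 0)` gives the 49 of the AVX-only version, and `instruction_count(32, 0)` gives the 40 of the combined version. This is a count of instructions only and says nothing about throughput.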
Incidentally, it would look like this.
vpor xmm10, xmm1, xmm4
vpcmov xmm7, xmm1, xmm4, xmm2
vpxor xmm12, xmm10, XMMWORD PTR [ALL_ONE]
vpor xmm9, xmm4, xmm12
vpcmov xmm6, xmm12, xmm1, xmm2
vpandn xmm8, xmm6, xmm9
vpcmov xmm11, xmm6, xmm8, xmm0
vpxor xmm7, xmm7, xmm8
vpcmov xmm2, xmm11, xmm0, xmm2
vpxor xmm6, xmm6, xmm9
vpcmov xmm8, xmm6, xmm7, xmm0
vpcmov xmm7, xmm11, xmm8, xmm7
vpxor xmm9, xmm9, xmm7
vpcmov xmm7, xmm6, xmm1, xmm7
vpxor xmm13, xmm4, xmm7
vpcmov xmm14, xmm11, xmm8, xmm3
vpxor xmm8, xmm8, xmm2
vpcmov xmm9, xmm9, xmm13, xmm3
vpcmov xmm6, xmm7, xmm12, xmm6
vpcmov xmm9, xmm9, xmm14, xmm5
vpxor xmm6, xmm6, xmm0
vpcmov xmm14, xmm14, xmm9, xmm5
vpxor xmm10, xmm10, xmm8
vpxor xmm14, xmm14, XMMWORD PTR [rcx]
vpcmov xmm10, xmm10, xmm11, xmm0
vmovdqa XMMWORD PTR [rcx], xmm14
vpcmov xmm4, xmm4, xmm13, xmm10
vpxor xmm2, xmm2, xmm4
vpcmov xmm8, xmm8, xmm10, xmm3
vpcmov xmm3, xmm6, xmm2, xmm3
vpcmov xmm3, xmm3, xmm8, xmm5
vpcmov xmm8, xmm8, xmm6, xmm5
vpxor xmm8, xmm8, XMMWORD PTR [rdi]
vpxor xmm3, xmm3, xmm5
vmovdqa XMMWORD PTR [rdi], xmm8
vpxor xmm9, xmm9, xmm5
vpxor xmm3, xmm3, XMMWORD PTR [rsi]
vmovdqa XMMWORD PTR [rsi], xmm3
vpxor xmm9, xmm9, XMMWORD PTR [rdx]
vmovdqa XMMWORD PTR [rdx], xmm9
(Added 2009.05.02)
With AVX + XOP, my personal fantasy has become reality. The code above should be usable in practice. Probably.