I was curious enough to test SHA-1 performance today. As I expected, the benefit of using bitalign is even more noticeable for SHA-1 than for MD5. In theory the speed-up can be as large as 50%; however, as always, there are some details.
First, my SHA-1 in ighashgpu v0.62 wasn't good enough. By slightly changing the algorithm I got 15% better results. Then I added bitalign, which gave another 40% on HD 5XXX, and finally I removed the last 4 rounds from SHA-1 ("reversed" them, in other words). This last optimization had already been done earlier for the CUDA code; now I've just applied it to the ATI code. It's another 5%. So, all in all, performance for single SHA-1 hashes on HD 5XXX is now 71% better than it was. Impressive, isn't it?
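To illustrate why bitalign helps here: SHA-1 rotates 32-bit words in every round, and a rotate written with plain shifts costs two shifts plus an OR. The sketch below is a software model in plain C, not the actual ighashgpu kernel code. It assumes bitalign behaves as a funnel shift, as I understand the documented `amd_bitalign` behaviour: concatenate two 32-bit words and shift right by the low 5 bits of the third operand. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of a funnel shift (assumed bitalign semantics):
 * take the 64-bit value hi:lo, shift it right by (shift & 31),
 * and keep the low 32 bits. */
static inline uint32_t bitalign_model(uint32_t hi, uint32_t lo, uint32_t shift)
{
    uint64_t wide = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(wide >> (shift & 31u));
}

/* Rotate left with ordinary shifts: two shifts and one OR (n in 1..31). */
static inline uint32_t rotl32_plain(uint32_t x, uint32_t n)
{
    return (x << n) | (x >> (32u - n));
}

/* Rotate left as a single funnel shift of x:x (n in 1..31). */
static inline uint32_t rotl32_bitalign(uint32_t x, uint32_t n)
{
    return bitalign_model(x, x, 32u - n);
}

/* Sanity check: both forms agree for every rotation amount. */
static inline void check_rotl(void)
{
    const uint32_t x = 0x12345678u;
    for (uint32_t n = 1; n < 32; n++)
        assert(rotl32_bitalign(x, n) == rotl32_plain(x, n));
}
```

On hardware where the funnel shift is a single instruction, each rotate shrinks from three operations to one, which is at least consistent with the gain being larger for SHA-1 than for MD5.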
As I feel too lazy to test all these major changes and big speed-ups, I've decided to release an intermediate version of ighashgpu (call it alpha?). You can download it here. Not all kernels were changed; basically only single MD5 and all the SHA-1 related ones were updated. By the way, the ATI version now supports passwords (+ optional salt) up to 48 symbols long (== Joomla). The nVidia code wasn't updated for this.
And one more thing about the nVidia CUDA code: I've slightly changed the way passwords are distributed among threads/blocks. As a result there is a small speed-up, around 2-3% for all CUDA kernels. It doesn't look like a huge thing, but when utilization is already over 95%, these 2-3% are actually very nice.
Also, I've finally fixed the /sf + /m usage bug (I hope so, at least).
So, I'm interested in some feedback, especially results on 5970s and GTX285/295.
The 32nm-generation AMD K11 (Bulldozer) microarchitecture supports the XOP instruction set extension, which adds the vpcmov instruction. Like vsel in AltiVec and selb on the Cell BE SPE, it is a select operation that takes three inputs and works bit by bit. As with the AltiVec and SPE implementations, using it can be expected to reduce the number of gates needed to build Bitslice DES out of logical operations.
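As a rough illustration of what such a select operation computes, here is a minimal sketch in plain C. `select_bits` is a hypothetical helper, not an intrinsic, and its operand order is chosen for readability; the real vpcmov, vsel and selb instructions each have their own operand order. Without a select instruction the same result takes an AND, an AND-NOT and an OR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: per-bit 2-to-1 multiplexer.
 * Where a mask bit is 1 the result takes the bit from if_set,
 * where it is 0 the result takes the bit from if_clear. */
static inline uint64_t select_bits(uint64_t mask, uint64_t if_set, uint64_t if_clear)
{
    return (if_set & mask) | (if_clear & ~mask);   /* and, andn, or */
}

/* Equivalent formulation using XOR instead of AND-NOT and OR. */
static inline uint64_t select_bits_xor(uint64_t mask, uint64_t if_set, uint64_t if_clear)
{
    return if_clear ^ ((if_set ^ if_clear) & mask);
}

/* Exhaustively check the eight single-bit input combinations. */
static inline void check_select(void)
{
    for (uint64_t m = 0; m < 2; m++)
        for (uint64_t a = 0; a < 2; a++)
            for (uint64_t b = 0; b < 2; b++) {
                uint64_t expect = m ? a : b;
                assert(select_bits(m, a, b) == expect);
                assert(select_bits_xor(m, a, b) == expect);
            }
}
```

Because a single select covers a sub-circuit that otherwise needs three two-input gates, a gate network searched with select as a primitive can come out smaller, which is the effect the text is counting on.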
Using the three-operand form of the AVX SIMD logical instructions cuts down the extra MOVDQA instructions, and VPCMOV further reduces the number of logical operations, so the computation can be made denser.
Notes on publication: a 2-clause BSD-style license applies. Multi-licensing is unlikely because there is currently not much demand for it. Feel free to use it under this license.
Note that the optimization here focuses only on reducing the number of bitwise logical operations. The operation count does not reflect actual throughput. It may even perform worse than the traditional Bitslice DES. Please treat it as a reference only.
The 32nm-generation AMD K11 (Bulldozer) microarchitecture supports the SSE5 instruction set extension, which adds the pcmov instruction. Like vsel in AltiVec and selb on the Cell BE SPE, it is a select operation that takes three inputs and works bit by bit. As with the AltiVec and SPE implementations, using it can be expected to reduce the number of logic gates needed to build Bitslice DES.
Future versions of Tripcode Explorer: which is more advantageous, Intel or AMD?
Does pcmov reduce the number of instructions needed to build the S-boxes?
I took function S4 (the one with the fewest bitwise logical operations, 32 gates after optimization) from the Bitslice DES S-boxes for SSE5 (trial version), compiled it with gcc 4.3, and had it emit Linux x86-64 assembly code. To reduce the instruction count further, the extra movdqa instructions were trimmed by hand.
※ For a fair comparison, xmm0 through xmm5 are treated as the input parameters and the four output destinations are taken as pointers.
This will be pretty tight unless the throughput of pcmov is the same as that of pand/pandn/por/pxor.
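For readers new to bitslicing, the sketch below shows why the gate count translates directly into an instruction count. Each bit position of a word belongs to a separate, independent instance, so one logical instruction evaluates one gate for all instances at once. The circuit is a toy function invented for this example, not a DES S-box, and `mux` is a hypothetical stand-in for a select instruction such as pcmov.

```c
#include <assert.h>
#include <stdint.h>

/* One 64-bit word holds the same wire for 64 independent instances. */
typedef uint64_t slice_t;

/* Toy circuit, basic gates only: out = c ? a : (a ^ b). Four operations. */
static inline slice_t toy_basic(slice_t a, slice_t b, slice_t c)
{
    slice_t t = a ^ b;      /* 1: xor  */
    slice_t u = a & c;      /* 2: and  */
    slice_t v = t & ~c;     /* 3: andn */
    return u | v;           /* 4: or   */
}

/* Hypothetical select primitive, modelled with ordinary C operators. */
static inline slice_t mux(slice_t mask, slice_t if_set, slice_t if_clear)
{
    return (if_set & mask) | (if_clear & ~mask);
}

/* Same circuit with select as a primitive: two operations. */
static inline slice_t toy_select(slice_t a, slice_t b, slice_t c)
{
    slice_t t = a ^ b;      /* 1: xor    */
    return mux(c, a, t);    /* 2: select */
}

/* Compare every lane of both versions against a scalar reference. */
static inline void check_toy(void)
{
    const slice_t a = 0x0123456789ABCDEFull;
    const slice_t b = 0x0F0F0F0FF0F0F0F0ull;
    const slice_t c = 0x00FF00FF3C3C5A5Aull;
    const slice_t r1 = toy_basic(a, b, c);
    const slice_t r2 = toy_select(a, b, c);
    for (unsigned lane = 0; lane < 64; lane++) {
        unsigned ai = (unsigned)((a >> lane) & 1u);
        unsigned bi = (unsigned)((b >> lane) & 1u);
        unsigned ci = (unsigned)((c >> lane) & 1u);
        unsigned expect = ci ? ai : (ai ^ bi);
        assert(((r1 >> lane) & 1u) == expect);
        assert(((r2 >> lane) & 1u) == expect);
    }
}
```

The saving from four operations to two only turns into speed if the select instruction issues as fast as the plain logical ones, which is the throughput caveat raised above.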
For AVX, no compiler can generate code at the moment (perhaps the latest Intel C++ supports it?), so I will skip the code. The base is the S4 function from Kwan's nonstd.c. With AVX, the two-input, one-output three-operand format means temporary variables no longer need to be saved, so the required number of instructions is easy to work out: the 41 gates plus the XORs with the output values can be done in 49 instructions. (If a variable does need to be saved, spilling it to memory is enough; of course there is no need for tricks such as stashing it in the upper half of a YMM register.)
The ideal is AVX + SSE5
Whether it is reducing register-save instructions by making the destination independent of the sources, or reducing the number of logical operations, both are useful for optimization rather than getting in each other's way. And it is not extravagant to want to use both.
If AVX arrives together with a VEX-encoded, true four-operand form of the SSE5 pcmov (vpcmov?), then 32 + 8 (4 XORs with the output values plus the stores) means it can be done in as few as about 40 instructions.
However, as I said earlier, this only pays off in performance on the assumption that its throughput matches that of the other two-source logical operations.