11/25/2010

Wire-speed processor

The PowerPC A2 is a massively multicore-capable, multithreaded 64-bit Power Architecture processor core designed by IBM, implementing the Power ISA v2.06 specification. IBM calls products based on it PowerEN (Power Edge of Network), or "wire-speed processors": they are designed as hybrids between a regular network processor, which does switching and routing, and a typical server processor, which manipulates and packages data. It was revealed on February 8, 2010, at ISSCC 2010.
Versions of processors based on the A2 core range from a 2.3 GHz part with 16 cores drawing 65 W down to a less powerful four-core version using 20 W at 1.4 GHz. Each A2 core is capable of four-way multithreading. Each chip has 8 MB of cache and, alongside the general-purpose cores, a multitude of task-specific engines such as XML, cryptography, compression and regular-expression accelerators, four 10 Gigabit Ethernet ports and two PCIe lanes. Up to four chips can be linked in an SMP system without any additional support chips.
The chips are said to be extremely complex, using 1.43 billion transistors on a 428 mm² die fabricated on a 45 nm process. The processors are in a late stage of development, and finished products will be available at a later, as yet unannounced date. IBM says it will market the processors to customers.





reference: Wikipedia - http://en.wikipedia.org/wiki/PowerPC_A2
           A Wire-Speed Power Processor: 2.3GHz 45nm SOI with 16 Cores and 64 Threads - Presentation, IBM - http://www.power.org/events/2010_ISSCC/Wire_Speed_Presentation_5.5_-_Final4.pdf

Supercom - The Green500 List - November 2010

Rank 1) Blue Gene/Q: 17-core wire-speed processor (PowerPC A2)
Rank 2) Tsubame 2.0: 2816 - 6-core Xeon, 4200 - NVIDIA M2050, ? - 8-core Xeon



5/03/2010

4/20/2010

Results for SHA1 & MD5 on HD5870 and new version of ighashgpu

AMD HD 5870 - SHA1 : 1350M/sec
AMD HD 5870 - MD5 : 3185M/sec
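Just for a sense of scale, a rough back-of-the-envelope sketch in C (the 62^8 mixed-case alphanumeric keyspace is an illustrative assumption, not the benchmark's actual setup):

    #include <stdio.h>
    #include <math.h>

    /* Rough time to exhaust a keyspace at the hash rates quoted above. */
    int main(void)
    {
        const double keyspace  = pow(62.0, 8.0);  /* 8 chars, a-z A-Z 0-9 */
        const double sha1_rate = 1350e6;          /* HD 5870, hashes/sec */
        const double md5_rate  = 3185e6;

        printf("SHA1: %.1f hours\n", keyspace / sha1_rate / 3600.0);
        printf("MD5 : %.1f hours\n", keyspace / md5_rate  / 3600.0);
        return 0;
    }

At those rates the whole 62^8 space falls in roughly two days for SHA1 and under a day for MD5 on a single card.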




translate.googleusercontent.com/translate_c?hl=ru&ie=UTF-8&sl=ru&tl=en&u=http://www.golubev.com/blog/%3Fp%3D20&rurl=translate.google.com&usg=ALkJrhhhUMCtO7Sn4NE0kFVTz-QZvCYxzA


I was curious enough to test SHA-1 performance today. As I expected, the benefit of bitalign is even more noticeable for SHA-1 than for MD5. Theoretically the speed-up can be as large as 50%, but as always there are some details.
At first, my SHA-1 wasn't good enough in ighashgpu v0.62. By slightly changing the algorithm I got 15% better results. Then I added bitalign, another 40% on the HD 5xxx, and finally I removed the last 4 rounds of SHA-1 ("reversed" it, in other words). That last optimization had already been done earlier for the CUDA code; now I've just applied it to the ATI code. It's another 5%. So, all in all, performance for single SHA-1 hashes on the HD 5xxx is now 71% better than it was. Impressive, isn't it? :)
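A small C model of where the bitalign win comes from (a sketch only, not the actual ighashgpu kernel code): SHA-1 is full of 32-bit rotations; a plain rotate costs two shifts plus an OR, while a funnel-shift / bitalign style instruction, which extracts a 32-bit window from a 64-bit concatenation, does the same rotate in one operation on hardware that has it.

    #include <stdint.h>

    /* Plain 32-bit left rotate (1 <= n <= 31): two shifts plus an OR. */
    static inline uint32_t rotl32(uint32_t x, unsigned n)
    {
        return (x << n) | (x >> (32 - n));
    }

    /* C model of a funnel-shift / "bitalign" primitive: concatenate hi:lo
       and return the 32-bit window starting 'shift' bits from the bottom. */
    static inline uint32_t bitalign(uint32_t hi, uint32_t lo, unsigned shift)
    {
        return (uint32_t)((((uint64_t)hi << 32) | lo) >> shift);
    }

    /* Rotating x left by n is the same as bit-aligning x with itself by
       (32 - n), which is why one such instruction can replace the
       shift/shift/or sequence in every round of SHA-1. */
    static inline uint32_t rotl32_via_bitalign(uint32_t x, unsigned n)
    {
        return bitalign(x, x, 32 - n);
    }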
As these are major changes and big speed-ups, and I feel too lazy to test them all, I've decided to release an intermediate version of ighashgpu (call it an alpha?); you can download it here. Not all kernels have changed: basically only single MD5 and all SHA-1-related ones were updated. By the way, the ATI version now supports passwords (+ optional salt) up to 48 symbols long (== Joomla). The nVidia code wasn't updated for this.
And one more thing about the nVidia CUDA code: I've changed a bit the way passwords are distributed among threads/blocks. As a result there is a small speed-up, around 2-3% for all CUDA kernels. It doesn't look like a huge thing, but when utilization is already over 95%, these 2-3% are actually very nice.
Also, I've finally fixed the /sf + /m usage bug (I hope so, at least).
So, I'm interested in some feedback, especially results on 5970s and GTX 285/295 cards.

4/19/2010

Bitslice DES for AVX / XOP

dango.chu.jp/hiki/?Bitslice+XOP#googtrans/ja/en

AMD-XOP Enhanced Bitslice DES

What is this?

XOP is the instruction-set extension supported by AMD's 32 nm "K11" (Bulldozer) microarchitecture, and among the instructions it adds is vpcmov. This implements the same 3-input select operation as vsel in AltiVec and selb on the Cell BE SPE. As with the AltiVec and SPE implementations, using it can be expected to reduce the number of logic-operation "gates" needed to build the Bitslice DES S-boxes.

On top of that, the three-operand form of the AVX SIMD logic instructions eliminates the extra MOVDQA copies, and VPCMOV further cuts the number of logical operations, so the computation becomes even denser.
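To make the select operation concrete, a minimal C sketch (illustration only: the operand order of the real vsel / selb / vpcmov instructions is as in each ISA manual, and the subexpression below is an arbitrary example, not one of the actual DES S-box gates):

    #include <stdint.h>

    /* 3-input bitwise select: per bit, take 'a' where the mask bit is 1 and
       'b' where it is 0.  This is the operation that AltiVec vsel, SPE selb
       and XOP vpcmov provide as a single instruction.  In a bitslice kernel
       each uint64_t word holds one DES bit position across 64 keys. */
    static inline uint64_t sel(uint64_t mask, uint64_t a, uint64_t b)
    {
        return (a & mask) | (b & ~mask);
    }

    /* Without a select instruction, a subexpression of this shape costs an
       AND, an ANDN and an OR (plus an extra copy with 2-operand SSE2): */
    static inline uint64_t subexpr_plain(uint64_t x0, uint64_t x1, uint64_t x2)
    {
        return (x1 & x0) | (x2 & ~x0);
    }

    /* With a vpcmov-style select it collapses into a single "gate", which is
       where the gate-count reduction in the bitslice S-boxes comes from: */
    static inline uint64_t subexpr_select(uint64_t x0, uint64_t x1, uint64_t x2)
    {
        return sel(x0, x1, x2);
    }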

Download

A 2-clause BSD-style license applies to what is published here. It doesn't ask for much, so multiple licensing shouldn't be necessary; feel free to use it.

Caution

The only thing optimized here is the number of bitwise logical operations; this does not reflect actual instruction throughput. It may even perform worse than conventional Bitslice DES. Please treat it as a reference only.

Reference

Old-specification (SSE5) version

Bitslice DES and SSE5

dango.chu.jp/hiki/?SSE5+and+Bitslice+DES#googtrans/ja/en

Disclaimer

AMD has cancelled SSE5 and announced XOP, an AVX-compatible 128-bit / 256-bit SIMD instruction-set extension. That is a lot more convenient, and it makes most of this page's content meaningless.

The new page is here.

SSE5 Enhanced Bitslice DES

What is this?

SSE5 is the instruction-set extension that was to be supported by AMD's 32 nm "K11" (Bulldozer) microarchitecture, and among the instructions it adds is pcmov. This implements the same 3-input select operation as vsel in AltiVec and selb on the Cell BE SPE. As with the AltiVec and SPE implementations, using it can be expected to reduce the number of logical-operation gates needed to build Bitslice DES.

Download

The license doesn't ask for much, so multiple licensing shouldn't be necessary.

Caution

The only thing optimized here is the number of bitwise logical operations; this does not reflect actual instruction throughput. It may even perform worse than conventional Bitslice DES.

Please treat it as a reference only.

Future versions of Tripcode Explorer: which is more advantageous, Intel or AMD?

How much does pcmov reduce the number of instructions in the S-box code?

Take the function S4 (the one with the lowest count of bitwise logical operations: 32 gates after optimization) from the trial version of the SSE5 Bitslice DES S-Boxes, compile it with gcc 4.3, and have it emit the Linux x86-64 assembly. To push the instruction count down further, the redundant movdqa copies were then trimmed by hand.

The instruction scheduling is pretty awful.

        movdqa xmm10, xmm1
        movdqa xmm9, xmm4
        movdqa xmm6, xmm2
        por xmm10, xmm4
        movdqa xmm11, xmm0
        movdqa xmm7, xmm2
        movdqa xmm13, xmm4
        movdqa xmm12, xmm10
        pcmov xmm7, xmm1, xmm4, xmm7
        movdqa xmm14, xmm3
        pxor xmm12, XMMWORD PTR [ALL_ONE]
        por xmm9, xmm12
        pcmov xmm6, xmm12, xmm1, xmm6
        movdqa xmm8, xmm6
        pandn xmm8, xmm9
        pcmov xmm11, xmm6, xmm8, xmm11
        pxor xmm7, xmm8
        pcmov xmm2, xmm11, xmm0, xmm2
        pxor xmm6, xmm9
        movdqa xmm8, xmm0
        pcmov xmm8, xmm6, xmm7, xmm8
        pcmov xmm7, xmm11, xmm8, xmm7
        pxor xmm9, xmm7
        pcmov xmm7, xmm6, xmm1, xmm7
        pxor xmm13, xmm7
        pcmov xmm14, xmm11, xmm8, xmm14
        pxor xmm8, xmm2
        pcmov xmm9, xmm9, xmm13, xmm3
        pcmov xmm6, xmm7, xmm12, xmm6
        pcmov xmm9, xmm9, xmm14, xmm5
        pxor xmm6, xmm0
        pcmov xmm14, xmm14, xmm9, xmm5
        pxor xmm10, xmm8
        pxor xmm14, XMMWORD PTR [rcx]
        pcmov xmm10, xmm10, xmm11, xmm0
        movdqa XMMWORD PTR [rcx], xmm14
        pcmov xmm4, xmm4, xmm13, xmm10
        pxor xmm2, xmm4
        pcmov xmm8, xmm8, xmm10, xmm3
        pcmov xmm3, xmm6, xmm2, xmm3
        pcmov xmm3, xmm3, xmm8, xmm5
        pcmov xmm8, xmm8, xmm6, xmm5
        pxor xmm8, XMMWORD PTR [rdi]
        pxor xmm3, xmm5
        movdqa XMMWORD PTR [rdi], xmm8
        pxor xmm9, xmm5
        pxor xmm3, XMMWORD PTR [rsi]
        movdqa XMMWORD PTR [rsi], xmm3
        pxor xmm9, XMMWORD PTR [rdx]
        movdqa XMMWORD PTR [rdx], xmm9

Grumble

Somehow it comes to 50 instructions.

S4 in John the Ripper's x86-64.S is on the order of 63 instructions, so this has come down quite a bit.

※ xmm0-xmm5 hold the input parameters and four pointers are taken as the output destinations, so it should be a fair comparison.

This will be pretty tight unless pcmov has the same throughput as pand / pandn / por / pxor.

For AVX there is no compiler that can generate code at the moment (maybe the latest Intel C++ supports it?), so I work it out by hand from the short code of the S4 function in Kwan's nonstd.c. With AVX's 2-input, 1-output, three-operand format the saving of temporaries to spare registers is no longer needed, so the required instruction count is easy to calculate: 41 gates plus the output-value XORs comes to on the order of 49 instructions. (Any spilling that is still needed can go to memory just fine, so of course no tricks like stashing values in the upper half of a YMM register are required.)
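A tiny sketch of why the three-operand encoding removes those saves (an illustrative example; the intrinsic is plain SSE2, and the assembly in the comment is typical compiler output, not taken from the listings on this page):

    #include <emmintrin.h>

    /* Compute t = a ^ b while 'a' must stay live for later use.  Legacy SSE
       encodings are destructive (two operands), so the compiler has to copy:
           movdqa xmm2, xmm0      ; save a
           pxor   xmm2, xmm1      ; t = a ^ b
       The VEX/AVX three-operand form writes a separate destination instead:
           vpxor  xmm2, xmm0, xmm1
       which is the MOVDQA reduction the instruction counts above rely on. */
    static inline __m128i xor_keep_a(__m128i a, __m128i b)
    {
        return _mm_xor_si128(a, b);
    }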

Ideally, AVX + SSE5

Cutting register-save instructions by making the destination independent of the sources, and cutting the number of logical operations, are optimizations that do not get in each other's way; optimizing only one of the two instead of both doesn't help much. And wanting to use both is not asking too much.

And if AVX and SSE5 are combined, that is, a true four-operand, VEX-encoded pcmov (vpcmov?), then 32 gates plus 8 instructions for the output values (4 XORs + 4 stores) would do it, a minimum of about 40 instructions.

Of course, as I said earlier, that performance only materializes if you assume the throughput matches the other two-source logical operations.

By the way, it looks like this.

        vpor xmm10, xmm1, xmm4
        vpcmov xmm7, xmm1, xmm4, xmm2
        vpxor xmm12, xmm10, XMMWORD PTR [ALL_ONE]
        vpor xmm9, xmm4, xmm12
        vpcmov xmm6, xmm12, xmm1, xmm2
        vpandn xmm8, xmm6, xmm9
        vpcmov xmm11, xmm6, xmm8, xmm0
        vpxor xmm7, xmm7, xmm8
        vpcmov xmm2, xmm11, xmm0, xmm2
        vpxor xmm6, xmm6, xmm9
        vpcmov xmm8, xmm6, xmm7, xmm0
        vpcmov xmm7, xmm11, xmm8, xmm7
        vpxor xmm9, xmm9, xmm7
        vpcmov xmm7, xmm6, xmm1, xmm7
        vpxor xmm13, xmm4, xmm7
        vpcmov xmm14, xmm11, xmm8, xmm3
        vpxor xmm8, xmm8, xmm2
        vpcmov xmm9, xmm9, xmm13, xmm3
        vpcmov xmm6, xmm7, xmm12, xmm6
        vpcmov xmm9, xmm9, xmm14, xmm5
        vpxor xmm6, xmm6, xmm0
        vpcmov xmm14, xmm14, xmm9, xmm5
        vpxor xmm10, xmm10, xmm8
        vpxor xmm14, xmm14, XMMWORD PTR [rcx]
        vpcmov xmm10, xmm10, xmm11, xmm0
        vmovdqa XMMWORD PTR [rcx], xmm14
        vpcmov xmm4, xmm4, xmm13, xmm10
        vpxor xmm2, xmm2, xmm4
        vpcmov xmm8, xmm8, xmm10, xmm3
        vpcmov xmm3, xmm6, xmm2, xmm3
        vpcmov xmm3, xmm3, xmm8, xmm5
        vpcmov xmm8, xmm8, xmm6, xmm5
        vpxor xmm8, xmm8, XMMWORD PTR [rdi]
        vpxor xmm3, xmm3, xmm5
        vmovdqa XMMWORD PTR [rdi], xmm8
        vpxor xmm9, xmm9, xmm5
        vpxor xmm3, xmm3, XMMWORD PTR [rsi]
        vmovdqa XMMWORD PTR [rsi], xmm3
        vpxor xmm9, xmm9, XMMWORD PTR [rdx]
        vmovdqa XMMWORD PTR [rdx], xmm9

(Appended 2009.05.02)

With AVX + XOP, a personal delusion has turned into reality. The code above should be usable in practice. Maybe.