Showing posts with label AES. Show all posts

3/22/2010

5x speedup for AES using SSE5?

http://www.ddj.com/hpc-high-performance-computing/201803067


DDJ: What types of applications will benefit from SSE5 extensions?
LV: We see three markets where SSE5 will deliver the most immediate impact: High Performance Computing (HPC), multimedia applications, and security.
HPC workloads are increasing and showing up in non-traditional HPC domains. Examples of this are seismic data processing, financial analysis such as stock trend forecasting, and protein-folding algorithms that are used for medicine development. These algorithms require fast floating-point matrix and vector processing capabilities which SSE5 delivers. A floating-point matrix multiply using the new SSE5 extensions is 30 percent faster than a similar algorithm implemented with the existing SSE instructions.
Multimedia is an increasingly important part of the computing experience. Media processing and encryption (DRM) have become a major part of PC workload; new algorithms and formats have been developed, including MPEG-4 and H.264. SSE5 enables enhanced geometry transforms and physics modeling for scientific simulation and gaming, supports HD Video encoding and decoding and enables image enhancement and MP3 recording and manipulation. For example, Discrete Cosine Transformations (DCT), which are a basic building block for encoders, get a 20 percent performance improvement by using the new SSE5 extensions.
Security remains a top concern for the entire industry. SSE5 enables encryption algorithms to run more quickly, increasing the usability of security features in the platform. For example, the Advanced Encryption Standard (AES) algorithm gets a factor of 5 performance improvement by using the new SSE5 extension compared to an AES implementation that just uses the AMD64 instructions.



5x speedup for AES using SSE5?
The best figure I obtain on an AMD64 system is 11 cycles/byte, which matches your results (you had me worried for a while with 9 cycles/byte!)

To go 5 times faster than this would mean close to 2 cycles/byte, a speed that I find hard to believe without hardware acceleration.

But a fully byte oriented implementation runs at about 140 cycles/byte and here the S-Box substitution step is a significant bottleneck. I too think the PPERM instruction could be used for this and it seems possible that this would produce large savings. So 30 cycles/byte might well be achievable in this case.

I hence wonder whether this is the comparison that AMD are making.

It is also possible that the PPERM instruction could be used to speed up the Galois field calculations to produce the S-Box mathematically rather than by table lookup. I have tried this in the past but it has not proved competitive. But PPERM looks interesting here as well.

   Brian Gladman
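
For readers unfamiliar with what "producing the S-Box mathematically" means, here is a minimal Python sketch. It follows the standard AES definition (inverse in GF(2^8) followed by an affine transform), not Brian Gladman's code or anything PPERM-specific; the function names are made up for illustration, and the slow repeated-multiplication inverse is chosen for clarity, not speed.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:      # reduce when the product overflows 8 bits
            a ^= 0x11B
        b >>= 1
    return result

def gf_inv(a):
    """Inverse computed as a^254; 0 maps to 0, as the AES S-box requires."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF

def sbox(a):
    """AES S-box entry computed without any lookup table."""
    inv = gf_inv(a)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63

print(hex(sbox(0x00)), hex(sbox(0x53)))  # 0x63 0xed
```

A table-free S-box like this does only arithmetic and no data-dependent memory access, which is why it becomes interesting once byte-wise SIMD instructions are available.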

I've only just seen this, but I've been playing with the VIA's AES and looking at Intel's AES instructions. I believe the PPERM instruction will be rather important. Combined with the packed byte rotate and shift, some rather interesting SIMD byte fiddles should be possible. From my initial look, it should be possible to implement AES without tables, doing SIMD operations on all 16 bytes at once.

I've not looked at it enough yet, but currently I'm doing AES in about 140 cycles a block (call it 13 per round plus overhead) on an AMD64 (220e6 bytes/sec on a 2 GHz CPU) using normal instructions. I don't believe they will be taking 30 instructions, so they probably have 4-8 SSE instructions per round; it then comes down to how many SSE execution units there are to execute in parallel.

As for VIA, on a 1 GHz C7 part, CBC mode, 128-bit key, for 16-byte aligned data, I'm getting about 24 cycles per block; for unaligned, about 67 cycles. The chip does ECB mode at 12.6 cycles a block if aligned (2 at a time). It does not handle unaligned ECB, so with manual alignment, 75 cycles. Not bad for a single issue CPU considering the x86 instruction version of AES I have takes 1010 cycles per block.

For the Intel AES instructions, from my readings, it will be able to do a single AES (128-bit) in a bit more than 60 cycles (10 rounds, 6 cycle latency for the instructions). The good part is that they will pipeline. So if you, say, do 6 AES ECB blocks at once, you can get a throughput of about 12 cycles a block (Intel's figures). This is obviously of relevance for counter mode, CBC decrypt and more recent standards like XTS and GCM mode.

Part of the Intel justification for the AES instruction seems to be to stop cache timing attacks. If the SSE5 instructions allow AES to be done with SIMD instead of tables, they will achieve the same effect, but without as much parallel upside. It also looks like the GF(2^8) maths will also benefit.

eric (who has only been able to play with VIA hardware :-(
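
The figures in these messages mix cycles/block, cycles/byte and bytes/sec, so a small conversion helper may be handy. This is only a sketch; the function names are hypothetical, and the only inputs taken from the text are the 16-byte AES block, the 140 cycles/block figure and the 2 GHz clock.

```python
BLOCK_BYTES = 16  # an AES block is always 128 bits

def cycles_per_byte(cycles_per_block):
    """Convert a cycles/block figure to cycles/byte."""
    return cycles_per_block / BLOCK_BYTES

def bytes_per_second(cycles_per_block, clock_hz):
    """Throughput of one core running at clock_hz."""
    return clock_hz / cycles_per_block * BLOCK_BYTES

# 140 cycles/block on a 2 GHz CPU, as quoted above
print(cycles_per_byte(140))                       # 8.75
print(round(bytes_per_second(140, 2e9) / 1e6))    # 229 (million bytes/sec)
```

The result is in the same ballpark as the quoted 220e6 bytes/sec; the small gap is plausibly measurement overhead or rounding in the original message.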


1/27/2010

AES Speed on NVIDIA GPU : 47.1 Gbps

ePrint


2009/501 ( PDF )

Fast Implementations of AES on Various Platforms, Joppe W. Bos, Dag Arne Osvik, and Deian Stefan



NVIDIA GTX 295, 1.24 GHz : 59.6 Gbps - key expansion in texture memory

NVIDIA GTX 295, 1.24 GHz : 47.1 Gbps - key scheduling is on-the-fly



equal to 184M/sec



Emilia Kasper bitslice on 4 Intel cores : 108.38M/sec

Intel AES instructions on 4 Intel cores : 127.66M/sec



key point : proposed the byte-slice method


8/18/2009

FSE 2009 : How Fast is AES?

FSE 2009 : 'How Fast is AES? ' by Emilia Kasper

[Intel]
using AES-NI, AES CBC speed : 70 cycles/block
using AES-NI in parallel mode, 2010' AES speed : 12 cycles/block

[Bernstein]
Asm in AMD Athlon 64 X2 3800+ : 166.88 cycles/block

[Emilia Kasper]
bitslice mode in Intel Core 2 Quad Q9550 : 129.6 cycles/block

[Emilia Kasper]
AES GCM mode in Intel Core 2 Quad Q9550 : 184 cycles/block


8/17/2009

Intel : Advanced Encryption Standard (AES) Instructions Set White Paper

Using AES-NI with Parallel Modes of Operation

This chapter explains how throughput can be enhanced with AES using parallel modes of operation. Consider the code snippet described in Figure 7, for encrypting AES-128 in ECB mode. In this example, there are 8 data blocks in xmm2-xmm9, and a Round Key is loaded into xmm1. For each round, 8 AES round instructions are dispatched, operating on the 8 data blocks with the same Round Key. Then, the next round key is loaded. The encryption results of the 8 blocks are eventually stored into memory, ready to load a new set of 8 data blocks. This way, the program encrypts 8 data blocks simultaneously, but the order is different from the order shown in the previous chapters. Instead of completing the encryption of one block and then continuing to the next block, the code computes one AES round on all 8 blocks, using one Round Key, and then continues to the next round (using the next Round Key). This “loop-reversal” technique is applicable to any parallel mode of operation such as CTR and CBC decrypt (but not for CBC encrypt). The underlying fully-pipelined hardware implies that AES instructions could be dispatched “each cycle” if data is available. In a parallel mode of operation, using the AES instructions and loop-reversing software, data can indeed be made available in (almost) every cycle.
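
The loop-reversal idea can be illustrated without any AES-NI hardware. The sketch below is not the code from Figure 7 of the white paper: `toy_round` is a made-up stand-in for one AES round (it is NOT AES and not secure), and all names are hypothetical. It only shows that swapping the block loop and the round loop yields the same results when the blocks are independent.

```python
def toy_round(block, round_key):
    """Stand-in for one AES round instruction; NOT real AES."""
    return ((block * 0x9E3779B1) ^ round_key) & 0xFFFFFFFF

def encrypt_block_major(blocks, round_keys):
    """Conventional order: finish all rounds of one block, then the next."""
    out = []
    for b in blocks:
        for k in round_keys:
            b = toy_round(b, k)
        out.append(b)
    return out

def encrypt_round_major(blocks, round_keys):
    """Loop-reversed order: apply one round key to every block, then move on."""
    state = list(blocks)
    for k in round_keys:
        # These per-block operations are independent, so pipelined
        # hardware could issue them back to back.
        state = [toy_round(b, k) for b in state]
    return state

blocks = list(range(1, 9))        # 8 independent data blocks
round_keys = list(range(100, 111))  # 11 placeholder round keys
print(encrypt_block_major(blocks, round_keys) == encrypt_round_major(blocks, round_keys))  # True
```

In CBC encryption each block's input depends on the previous block's ciphertext, so the inner list comprehension could not be formed; that dependency is why the technique does not apply there.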

The following rough performance estimate illustrates the performance gain (neglecting loads and stores). Suppose that the latency of the AES instructions is L cycles, and L ≤ 8 (the actual latency of the AES instructions is L=6 cycles and this example uses 8 xmm registers). Then, encryption of the 8 data blocks would be completed after (roughly) 88+L cycles (pxor is done within 1 cycle). Therefore, the obtained throughput is around (88+6)/8≈12 cycles per block (16B), which approaches the theoretical throughput limit. This simplified estimate ignores several factors (e.g., loads/stores), but nevertheless, the measured effect is quite close to the estimated one.
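
The estimate above can be reproduced in a few lines. One assumption that the excerpt does not state: the 88 is read here as 8 blocks times 11 instructions per block (one pxor for the initial key addition plus 10 AES round instructions for AES-128). The function name and parameters are hypothetical.

```python
def pipelined_cycles_per_block(n_blocks=8, instr_per_block=11, latency=6):
    """Rough throughput when one instruction can be issued per cycle.

    Assumes instr_per_block = 1 pxor + 10 AES round instructions (AES-128)
    and ignores loads/stores, as the estimate in the text does.
    """
    total_cycles = n_blocks * instr_per_block + latency
    return total_cycles / n_blocks

print(pipelined_cycles_per_block())  # 11.75, i.e. about 12 cycles per block
```

With a single block the same formula gives 17 cycles, which is why the paper's unpipelined latency-bound figure (about 60 cycles for 10 dependent rounds) is so much worse than the parallel one: the latency term is only amortised when several independent blocks are in flight.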

[result]
  • Intel Nehalem CPU, 3 GHz, 4 cores - 1000M AES/sec

AES SPEED




AES Calculator

You can use the AES Calculator applet displayed below to encrypt or decrypt
using AES the specified 128-bit (32 hex digit) data value with the
128/192/256-bit (32/48/64 hex digit) key, with a trace of the calculations.
Some example values which may be used are given below.




Example AES test values (taken from FIPS-197) are:
Key: 000102030405060708090a0b0c0d0e0f
Plaintext:
00112233445566778899aabbccddeeff
Ciphertext:
69c4e0d86a7b0430d8cdb78070b4c55a

Key: 000102030405060708090a0b0c0d0e0f1011121314151617
Plaintext:
00112233445566778899aabbccddeeff
Ciphertext:
dda97ca4864cdfe06eaf70a0ec0d7191

Key: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f
Plaintext:
00112233445566778899aabbccddeeff
Ciphertext:
8ea2b7ca516745bfeafc49904b496089

Encrypting the plaintext with the key should give the ciphertext,
decrypting the ciphertext with the key should give the plaintext.
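
If the applet is not available, the three test values can also be checked with a short script. The following is a minimal pure-Python sketch of AES block encryption written from the FIPS-197 description; it is not the calculator's code, it is slow, and it is not hardened against timing attacks, so treat it as a checking aid only. All function names are hypothetical.

```python
def xtime(a):
    """Multiply by x in GF(2^8) modulo 0x11B."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a

def gmul(a, b):
    """General multiplication in GF(2^8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r

def _sbox_entry(x):
    inv = 1
    for _ in range(254):          # x^254 is the inverse (0 stays 0)
        inv = gmul(inv, x)
    rot = lambda n: ((inv << n) | (inv >> (8 - n))) & 0xFF
    return inv ^ rot(1) ^ rot(2) ^ rot(3) ^ rot(4) ^ 0x63

SBOX = [_sbox_entry(x) for x in range(256)]

def expand_key(key):
    """Return (list of 4-byte words, number of rounds) for a 16/24/32-byte key."""
    nk = len(key) // 4
    nr = nk + 6
    w = [list(key[4 * i:4 * i + 4]) for i in range(nk)]
    rcon = 1
    for i in range(nk, 4 * (nr + 1)):
        t = list(w[i - 1])
        if i % nk == 0:
            t = [SBOX[b] for b in t[1:] + t[:1]]   # RotWord then SubWord
            t[0] ^= rcon
            rcon = xtime(rcon)
        elif nk > 6 and i % nk == 4:
            t = [SBOX[b] for b in t]               # extra SubWord for AES-256
        w.append([a ^ b for a, b in zip(w[i - nk], t)])
    return w, nr

def mix_columns(s):
    out = []
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gmul(a[r], 2) ^ gmul(a[(r + 1) % 4], 3)
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return out

def aes_encrypt_block(key, block):
    """Encrypt one 16-byte block; the state is kept in column-major order."""
    w, nr = expand_key(key)
    round_key = lambda n: [b for word in w[4 * n:4 * n + 4] for b in word]
    s = [a ^ b for a, b in zip(block, round_key(0))]
    for n in range(1, nr + 1):
        s = [SBOX[b] for b in s]                                   # SubBytes
        s = [s[4 * ((c + row) % 4) + row]
             for c in range(4) for row in range(4)]                # ShiftRows
        if n != nr:
            s = mix_columns(s)                                     # MixColumns
        s = [a ^ b for a, b in zip(s, round_key(n))]               # AddRoundKey
    return bytes(s)

PLAINTEXT = bytes.fromhex("00112233445566778899aabbccddeeff")
VECTORS = [
    ("000102030405060708090a0b0c0d0e0f", "69c4e0d86a7b0430d8cdb78070b4c55a"),
    ("000102030405060708090a0b0c0d0e0f1011121314151617", "dda97ca4864cdfe06eaf70a0ec0d7191"),
    ("000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f", "8ea2b7ca516745bfeafc49904b496089"),
]
for key_hex, expected in VECTORS:
    result = aes_encrypt_block(bytes.fromhex(key_hex), PLAINTEXT).hex()
    print(len(key_hex) * 4, "bit key:", result, "OK" if result == expected else "MISMATCH")
```

Only encryption is sketched; decryption would need the inverse S-box and inverse MixColumns, which are left out to keep the example short.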
