8/17/2009

Intel : Advanced Encryption Standard (AES) Instructions Set White Paper

Using AES-NI with Parallel Modes of Operation

This chapter explains how throughput can be enhanced with AES using parallel modes of operation. Consider the code snippet described in Figure 7, for encrypting AES-128 in ECB mode. In this example, there are 8 data blocks in xmm2-xmm9, and a Round Key is loaded into xmm1. For each round, 8 AES round instructions are dispatched, operating on the 8 data blocks with the same Round Key. Then, the next round key is loaded. The 8 blocks encryption results are eventually stored into memory, ready load a new set of 8 data blocks. This way, the program encrypts 8 data blocks simultaneously, but the order is different from the order shown in the previous chapters. Instead of completing the encryption of one block and then continuing to the next block, the code computes one AES round on all 8 blocks, using one Round Key, and then continues to the next round (using the next Round Key). This “loop-reversal” technique is applicable to any parallel mode of operation such as CTR and CBC decrypt (but not for CBC encrypt). The underlying fully-pipelined hardware implies that AES instructions could be dispatched “each cycle” if data is available. In a parallel mode of operation, using the AES instructions and loop-reversing software, data can indeed be made available in (almost) every cycle.

The following rough performance estimate illustrates the performing gain (neglecting loads and stores). Suppose that the latency of the AES instructions is L cycles, and L ≤ 8 (the actual latency of the AES instructions is L=6 cycles and this example uses 8 xmm registers). Then, encryption of the 8 data blocks would be completed after (roughly) 88+L cycles (pxor is done within 1 cycle). Therefore, the obtained throughput is around (88+6)/8=12 cycles per block (16B), which approaches the theoretical throughput limit. This simplified estimate ignores several factors (e.g., loads/stores), but nevertheless, the measured effect is quite close to the estimated one.

[result]
  • Intel nehalem CPU 3Ghz 4Core - 1000M AES/Sec
[Link]

No comments:

Post a Comment