AES Speed on NVIDIA GPU: 47.1 Gbps


2009/501

Fast Implementations of AES on Various Platforms, Joppe W. Bos, Dag Arne Osvik, and Deian Stefan

NVIDIA GTX 295, 1.24 GHz: 59.6 Gbps - key expansion in texture memory

NVIDIA GTX 295, 1.24 GHz: 47.1 Gbps - key schedule computed on the fly

equal to 184M/sec

Emilia Käsper's bitsliced implementation on 4 Intel cores: 108.38M/sec

Intel AES instructions on 4 Intel cores: 127.66M/sec

Key point: the byte-slice method is proposed




FASTRA II: the world’s most powerful desktop supercomputer

Is it possible to fit the computing power of a large supercomputer cluster in the tight space of a PC case? In our research on image reconstruction we often have to perform large-scale scientific computations, which can easily take weeks on a normal PC. Last year, the FASTRA project was launched to develop a desktop supercomputer based on gaming hardware. Although highly successful, even FASTRA cannot provide the computational power required for our latest research projects. FASTRA needs a successor, which should be much more powerful, while maintaining the favorable properties of its older brother: green, mobile and inexpensive. For just 6000 euros, you can have 12 TFLOPS of computing power at your fingertips.
Part of the Vision Lab of the University of Antwerp, the research group ASTRA focuses on the development of new computational methods for tomography. Tomography is a technique used in medical scanners to create three-dimensional images of the internal organs of patients, based on a large number of X-ray photos that are acquired over a range of angles. ASTRA develops new reconstruction techniques that lead to better reconstruction quality than classical methods.
One of the applications is 3D imaging of bone tissue in mice, which is commonly required in medical research on osteoporosis. The structures of interest are at the resolution limits of current micro-CT scanners. We are working on advanced computational methods that allow higher-resolution images to be computed from the same scanner data. The downside: computation time, which was already a major issue, increases even further.

Fortunately, these computations can be carried out in parallel on graphics hardware, much faster than on normal CPUs. Graphics Processing Units (GPUs) are becoming more and more common for all kinds of scientific computing. For suitable applications, a single GPU already has computing power equivalent to a moderate CPU cluster. In collaboration with Tones.be and ASUS, we have now developed a PC design that incorporates 13 GPUs, resulting in a massive 12 TFLOPS of computing power.
Although the system is up and running, we are still experiencing software stability issues, probably caused by an incompatibility between the video drivers and the BIOS and Linux modifications we had to use. Check out the blog for more details on the current status of FASTRA II.
The FASTRA II design contains six NVIDIA GTX295 dual-GPU cards and one GTX275 single-GPU card. To fit all this hardware in a single PC case, a special cage was designed for the graphics cards, which are connected to the motherboard by flexible riser cables. To supply all 13 GPUs, the system has four power supplies. At full speed, it can outperform a moderately sized cluster of state-of-the-art CPUs. And guess what… this system costs less than 6000 euros!
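The GPU count and the price/performance ratio follow directly from the numbers quoted above. The short C++ sketch below reproduces that arithmetic; the function names are illustrative and not part of any FASTRA code, and the 6000 euro and 12 TFLOPS figures are the rounded values from the text.

```cpp
#include <cassert>

// Number of GPUs for a mix of dual-GPU cards (GTX295) and
// single-GPU cards (GTX275). Names are illustrative only.
int total_gpus(int dual_gpu_cards, int single_gpu_cards) {
    assert(dual_gpu_cards >= 0 && single_gpu_cards >= 0);
    return 2 * dual_gpu_cards + single_gpu_cards;
}

// Hardware cost per TFLOPS, in euros.
double euros_per_tflops(double cost_eur, double tflops) {
    assert(tflops > 0.0);
    return cost_eur / tflops;
}
```

Calling `total_gpus(6, 1)` gives the 13 GPUs mentioned in the text, and `euros_per_tflops(6000.0, 12.0)` gives 500 euros per TFLOPS. Since the system costs less than 6000 euros, that figure is an upper bound.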


Hardware overview
Hardware assembly
Software overview

Hardware overview

Case: Lian-Li PC-P80 Armorsuit

The PC-P80 case, which was also used in the FASTRA I build, provides a massive amount of working space and offers 9 expansion slots at the back of the case. Although the graphics cards in FASTRA II do not fit directly into these slots, the slot area provides large ventilation gaps for releasing exhaust heat from the cards. The case had to be modded slightly for this project by drilling holes for the GPU rack screws.

Motherboard: ASUS P6T7 WS Supercomputer

The ASUS P6T7 motherboard is the only workstation motherboard available that has seven full-size PCI Express slots. The X58 chipset is connected to two additional nForce 200 chips that distribute PCI Express bandwidth between the seven slots.

CPU: Intel Core i7 920

Managing 13 GPUs simultaneously requires heavy multithreading on the host side, and therefore a multicore CPU. As nearly all of the computational load is shifted to the GPUs, we opted for the Core i7 920, which is highly affordable while leaving room for future upgrades.
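To illustrate why many GPUs translate into many host threads, here is a minimal C++ sketch of the one-thread-per-GPU pattern. It is not the ASTRA code: `partition_slices` and `run_workers` are hypothetical names, the even split of volume slices is an assumed work distribution, and the GPU work itself is replaced by a placeholder. In a real CUDA program each thread would first bind to its own device (for example with `cudaSetDevice`) before launching kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Half-open range [begin, end) of volume slices assigned to one worker.
struct SliceRange {
    std::size_t begin;
    std::size_t end;
};

// Split num_slices as evenly as possible over num_workers; the first
// (num_slices % num_workers) workers receive one extra slice.
std::vector<SliceRange> partition_slices(std::size_t num_slices,
                                         std::size_t num_workers) {
    assert(num_workers > 0);
    std::vector<SliceRange> ranges;
    ranges.reserve(num_workers);
    const std::size_t base = num_slices / num_workers;
    const std::size_t extra = num_slices % num_workers;
    std::size_t begin = 0;
    for (std::size_t w = 0; w < num_workers; ++w) {
        const std::size_t count = base + (w < extra ? 1 : 0);
        ranges.push_back({begin, begin + count});
        begin += count;
    }
    return ranges;
}

// Launch one host thread per worker (one worker per GPU). The body is a
// placeholder: it only records how many slices the worker was given.
std::vector<std::size_t> run_workers(std::size_t num_slices,
                                     std::size_t num_workers) {
    const std::vector<SliceRange> ranges =
        partition_slices(num_slices, num_workers);
    std::vector<std::size_t> processed(num_workers, 0);
    std::vector<std::thread> threads;
    threads.reserve(num_workers);
    for (std::size_t w = 0; w < num_workers; ++w) {
        threads.emplace_back([&ranges, &processed, w] {
            // Each thread writes only its own element, so no lock is needed.
            processed[w] = ranges[w].end - ranges[w].begin;
        });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    return processed;
}
```

With 1024 slices and 13 workers, the first ten workers get 79 slices each and the last three get 78. Thirteen concurrently active host threads are more than the four cores of the Core i7 920 can run in parallel, but since the threads mostly wait on GPU work this is usually acceptable.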

Memory: 6×2GB Corsair DDR3 1333MHz

Having as much RAM as possible is crucial for our type of applications. 12GB is usually sufficient to load large 3D volumes (e.g. 1024×1024×1024) completely in memory. We would have loved more memory though. Compared to a “real” supercomputer we have only a tiny amount of memory at our disposal. The Corsair memory has decent timings at an affordable price point. Remarkably, the total amount of GPU memory in the FASTRA II system is about the same as the total amount of system RAM. It is not strictly necessary for all GPU memory to be backed by an equal amount of system RAM, though.
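The memory figures can be checked with a little arithmetic. The sketch below makes two assumptions that are not stated in the text: voxels are stored as 4-byte single-precision floats, and every GPU has 896 MB of memory (the reference specification for both the GTX275 and each half of a GTX295). The function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a dense nx * ny * nz volume.
// bytes_per_voxel = 4 corresponds to single-precision floats (an assumption).
std::uint64_t volume_bytes(std::uint64_t nx, std::uint64_t ny,
                           std::uint64_t nz, std::uint64_t bytes_per_voxel) {
    assert(bytes_per_voxel > 0);
    return nx * ny * nz * bytes_per_voxel;
}

// Total GPU memory in MiB for num_gpus GPUs with mib_per_gpu each.
// 896 MiB per GPU is the assumed reference figure, not taken from the text.
std::uint64_t total_gpu_mib(std::uint64_t num_gpus, std::uint64_t mib_per_gpu) {
    return num_gpus * mib_per_gpu;
}
```

Under these assumptions, `volume_bytes(1024, 1024, 1024, 4)` is 4 GiB, so a 1024×1024×1024 volume fits in 12GB of RAM with room for working buffers, and `total_gpu_mib(13, 896)` is 11648 MiB (about 11.4 GiB), which is consistent with the remark that GPU memory and system RAM are about the same size.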

Hard disk: Samsung Spinpoint F3 1TB

At first this choice may seem somewhat strange. The Spinpoint hard drive is not very fast compared to more expensive models such as the WD Raptors or solid-state disks. However, we observed that in our case, disk access is not a performance bottleneck at all. Moreover, having a single hard disk improves the airflow through the case and keeps the system very tidy.

Power Supply: Thermaltake Toughpower 1500W + 3x Thermaltake PowerExpress 450W

The Thermaltake Toughpower already proved itself in the FASTRA I design. It has four 6-pin and four 8-pin PCI Express power connectors and powers four of the GTX295 cards. However, this power supply cannot power all graphics cards simultaneously. As the bottom of the case is already occupied, we decided to use the special VGA power supply offered by Thermaltake, which fits into a drive bay. Each PowerExpress unit has connectors to power two graphics cards, but we use only one. This PSU takes one drive bay, as opposed to its bigger 650W brother, which takes two drive bays.
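The power layout can be summarised in a few lines of C++. This is only a sketch of the rated capacities listed above: the names are illustrative, and the assumption that each PowerExpress unit feeds exactly one card is a reading of the text rather than a confirmed wiring diagram. Rated wattage is also not the same as actual power draw.

```cpp
#include <cassert>

// Rated capacity and number of graphics cards served by a PSU setup.
struct PowerBudget {
    int total_watts;
    int cards_powered;
};

// One main PSU plus a number of identical auxiliary drive-bay PSUs.
PowerBudget power_budget(int main_psu_watts, int cards_on_main,
                         int aux_psu_count, int aux_psu_watts,
                         int cards_per_aux) {
    assert(aux_psu_count >= 0 && cards_per_aux >= 0);
    PowerBudget budget;
    budget.total_watts = main_psu_watts + aux_psu_count * aux_psu_watts;
    budget.cards_powered = cards_on_main + aux_psu_count * cards_per_aux;
    return budget;
}
```

For the configuration described here, `power_budget(1500, 4, 3, 450, 1)` yields a combined rating of 2850W and covers all seven graphics cards: four GTX295 cards on the Toughpower and one card on each of the three PowerExpress units.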

Graphics Cards: ASUS ENGTX275 + 4x ASUS ENGTX295 (2PCB) + 2x ASUS ENGTX295 (1PCB)


As the system has been assembled over a rather long period of time, two types of GTX295 cards have been used. We started with a series of dual-PCB cards (part no. 90-C3CGX0-K0UAY00T). The more recent single-PCB cards (part no. 90-C3CGX5-K0UAY00T) generate less heat, which can also be observed clearly in our heat camera images. For the single-GPU card, which is connected to the screen, we opted for the GTX275, which is newer than the GTX285 and almost as powerful. This card has to be a single-GPU card for technical reasons, restricting us to a 13-GPU system (and not a 14-GPU one).

Flexible PCI Express risers: Adex Electronics PE-FLEX16 gen. 2 risers

The flexible risers from Adex Electronics can be ordered in different lengths, and allow for sufficient flexibility to connect all seven dual-slot graphics cards to the tightly spaced PCI Express slots of the motherboard.

GPU suspension cage: Custom Design

A strong cage is required to keep all cards in place above the motherboard. In collaboration with Tones.be and the firm LASERTEK N.V., an aluminium cage that meets these requirements was designed and manufactured.

Hardware Assembly


The Belgian computer shop Tones.be provided assistance and support during this project, and performed the assembly of FASTRA II. They managed to deliver a very clean build, despite the vast number of power and riser cables involved.

Software overview

Operating System: Linux, CentOS 5.3

We selected CentOS because it provides a stable environment that doesn’t need much maintenance. Instead of the standard CentOS Linux kernel, we used a custom kernel.

Tomography Code: C++ and MATLAB 2009b

We use portable C++ for the core functionality of our software. On Windows, we use Microsoft Visual Studio 2005, and on Linux, the C++ code can be compiled using the GNU C++ compiler. We’ve also developed a front-end for MATLAB. MATLAB has an easy-to-use interface and thus allows rapid prototyping of new algorithms. All GPU code is developed using the NVIDIA CUDA framework, whose C-like programming language allows for efficient programming of NVIDIA GPUs.