Indianapolis (IN) - As the years spent in multi-core processing begin to fly by, a clear trend is emerging: More can be far better. Applications have access to enormous volumes of affordable parallel compute ability that was literally impossible five years ago. Today's average desktop PC can be equipped with more than 2 TFlops of computing horsepower for $2,000. That would have put your desktop computer in the Top 500 list of supercomputers just two years ago. But with all of the options available today, which do you choose? Here is an overview.
In the not so distant past, extremely high computing throughput required a specialized ASIC or FPGA. These were typically customized to specific tasks and, while they did those things very well, they were not easily scalable or adaptable to new workloads. They were also very costly to set up and required significant development time. Something more was needed.
In the 1990s, the trend toward supercomputing-on-a-desktop began, though the industry had not yet recognized that reality. The early accelerated video cards were thought of as nothing more than ways to make games and Excel spreadsheets draw faster. The cards provided massive amounts of parallel compute ability, but were never applied to general computing (outside of highly focused, specialized applications of that technology). The connection between the additional compute power and real-world workloads just hadn't been made yet.
In the early 2000s that trend began to change. While extreme high-end specialty accelerated cards had been available for a few years, the makers of mainstream high-end video cards began to recognize that they too had many orders of magnitude more compute ability than even the highest-end CPUs. And while their former market was limited primarily to gamers, their price points made such high compute abilities very attractive to general consumers.
We began seeing video cards adapted to new uses through software libraries. No longer would products from ATI or Nvidia simply do better graphics faster; now they would provide real compute abilities and speedups not previously possible without extremely expensive high-end cards.
In this article, we look at five high-end contenders that are likely to appeal to enthusiast, enterprise power and professional users: ATI's highest-end graphics card, the RV770-based 4870; Nvidia's highest-end offerings, the GTX 280 and T10P Tesla; Clearspeed's CSX700; and Tilera's Tile64. All of these products are available today and all of them have proven their ability to significantly increase performance through parallel computing.
Breakdown comparison chart
Available High-end x86-based Coprocessors

| Description | AMD Radeon HD 4870 | NVIDIA GeForce GTX 280 | NVIDIA T10P "Tesla" | Clearspeed CSX700 | Tilera Tile64 |
|---|---|---|---|---|---|
| Available? | Yes | Yes | Yes | Yes | Yes |
| Release Date | Jun 27, 2008 | Jun 16, 2008 | Jun 16, 2008 | Jun 17, 2008 | Aug 20, 2007 |
| Development | C, C++, CAL/Brook+ | C, C++, CUDA | C, C++, CUDA | Eclipse IDE, C, gdb and csprof | MDE 1.x, C, C++ |
| Price | $549 | $449 | $1,699 | $3,570 | $435* |
| Max 32-bit FP | 1,200 GFlops | 950 GFlops | 1,000 GFlops | 192 GFlops | 80.64 GFlops |
| Max 64-bit FP | 240 GFlops | 100 GFlops | 100 GFlops | 96 GFlops | 40.32 GFlops |
| 128-bit FP | Yes, shader, texture | Native, shader | Native, shader | No | No |
| Max Watts | 333 observed | 364 observed | 170 | 12 | 30 |
| Idle Watts | 219 observed | 189 observed | 120 | 2 | 300 mW |
| Core count | 800 | 240 | 240 | 2 x 96 | 8 x 8 |
| 32-bit GFlops/core | 1.5 | 3.96 | 4.17 | 1.0 | 1.26 |
| 64-bit GFlops/core | 0.3 | 0.42 | 0.42 | 0.5 | 0.63 |
| 32-bit GFlops/watt | 3.6 | 2.61 | 5.88 | 16.0 | 2.96 |
| 64-bit GFlops/watt | 0.72 | 0.27 | 0.59 | 8.0 | 1.34 |
| Redundant cores | ? | ? | ? | 8 | 0 |
| Core clock | 750 MHz | 1,296 MHz | 1,296 MHz | 250 MHz | 866 MHz |
| Memory clock | 2,200+ MHz, 256-bit GDDR3,4,5 | 2,214 MHz, 512-bit GDDR3 | 1,600 MHz, 512-bit GDDR3 | 250 MHz DDR2 | 800 MHz DDR2 |
| On-die Cache | ? | ? | ? | 4MB | 5MB |
| Bandwidth | 115.2 GB/s, PCIe-16 | 141.7 GB/s, PCIe-16 | 102 GB/s, PCIe-16 | 2 GB/s, PCIe-8 | 6.25 GB/s, PCIe-8 |
| IEEE 754 | 32,64 | 32,64 | 32,64 | 32,64 | 32,64 |
| IEEE 754R | No | Yes | Yes | No | No |
| ECC | No | No | No | Full | Full |
| ISA | VLIW | VLIW | VLIW | VLIW | VLIW |
| FPUs/core | n/a | n/a | n/a | 2 | 0 |
| Process | 55nm | 65nm | 65nm | 90nm | 90nm |
| Transistors / Die Size | 956M / 260 mm^2 | 1400M | 1400M / ? | 256M / ? | ? / ? |
| Special Features | 1x PCIe, CrossFire | 1x PCIe, SLI 2/3, dedicated video hardware | 1x PCIe, SLI 2/3, dedicated video hardware | 1x PCIe, dual on-die DDR2 memory controllers | 2x 10 GbE, 2x XAUI, 2x PCIe, quad on-die DDR2 memory controller |
| Form factor | BGA | BGA | BGA | 1429 BGA | 1517 BGA |

*CPU only. Fully encapsulated development boards (plus software) are available from Tilera for $18,000.
Redundant cores are spare cores, normally unused, that can be substituted for faulty cores.
On-die cache represents the total amount of L1 instruction, L1 data, L2 and, where present, L3 cache.
VLIW = Very Long Instruction Word, typically 64-bit or 128-bit. Allows several instructions to be executed per word, often increasing parallelism.
Items in italics are estimates.

Performance per watt clearly separates these offerings. Clearspeed's CSX700-based products are priced extremely high, but there is a reason for that: each card delivers a specific amount of focused performance at a very low power draw, which works out to a 32-bit performance-per-watt factor of 16. The next closest is Tesla at about 5.9, hence its higher price relative to the GTX 280, which delivers similar performance.
In the world of massively parallel supercomputing, throwing more watts at an application is not always the best solution. The heat generated by cards consuming 300+ watts results in greater cooling and electricity expenses. There may also be hard limits to efficient cooling, and running into them results in additional errors.
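The performance-per-watt figures can be reproduced from the chart by dividing peak GFlops by maximum watts. The following is a minimal sketch, assuming the chart's peak and max-watt values are taken at face value; the names `CHART`, `gflops_per_watt` and `efficiency_table` are illustrative and not from any vendor toolkit. Recomputed this way, the Tile64's 32-bit figure comes out near 2.69 rather than the 2.96 listed, while every other entry matches the chart.

```python
# Peak figures copied from the comparison chart:
# name -> (32-bit GFlops, 64-bit GFlops, max watts)
CHART = {
    "Radeon HD 4870": (1200.0, 240.0, 333.0),
    "GeForce GTX 280": (950.0, 100.0, 364.0),
    "T10P Tesla": (1000.0, 100.0, 170.0),
    "CSX700": (192.0, 96.0, 12.0),
    "Tile64": (80.64, 40.32, 30.0),
}


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Peak GFlops divided by maximum power draw in watts."""
    return gflops / watts


def efficiency_table(chart: dict) -> dict:
    """Return {name: (32-bit GFlops/watt, 64-bit GFlops/watt)}."""
    return {
        name: (gflops_per_watt(sp, watts), gflops_per_watt(dp, watts))
        for name, (sp, dp, watts) in chart.items()
    }


if __name__ == "__main__":
    # Rank the parts from most to least efficient in 32-bit work.
    ranked = sorted(
        efficiency_table(CHART).items(),
        key=lambda item: item[1][0],
        reverse=True,
    )
    for name, (sp_eff, dp_eff) in ranked:
        print(f"{name:16s} {sp_eff:6.2f} {dp_eff:6.2f}")
```

These are peak-over-peak ratios, so they say nothing about sustained throughput on a real workload.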
Nvidia has already stated publicly that its compute failure rate is around 1%. This is based on a sampling of approximately 500 GPGPU users on Folding@Home: a study showed that approximately 1% of the computations carried out on Nvidia hardware resulted in some form of failed processing. We believe the result is similar for ATI, and the full ECC support present in CSX700 and Tile64 greatly decreases that possibility. While failure rates this high are not an issue for video games, they can be absolutely fatal for CUDA, which is supposed to carry out non-gaming calculations for real-world users.
There are also 64-bit considerations for certain apps, and IEEE 754 compliance, which was a real issue up until the most recent generation of products released by these companies. All of these factor into the choice of which product to purchase. It's not always a matter of looking at the price, and it's not always a matter of looking at performance.
Coprocessor compute engines by ATI, Nvidia, Clearspeed and Tilera represent a significant performance boost over traditional CPU-powered solutions for certain applications, and at much lower prices. An overclocked dual-socket QX9775 Skulltrail system, the fastest platform currently offered by Intel, achieves a maximum sustainable performance of around 100 GFlops, and that machine consumes around 600 watts and would cost the average user over $8,000.
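To put the CPU comparison in numbers, the sketch below computes GFlops per watt and GFlops per dollar for the Skulltrail system alongside the five coprocessors. It is a rough illustration under stated assumptions: the coprocessor entries use the chart's peak 32-bit figures, list prices and maximum watts, and exclude the cost and power of the host PC, while the Skulltrail entry uses the sustained figure quoted above. The `Platform` class and `PLATFORMS` list are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    name: str
    gflops: float      # peak 32-bit GFlops (sustained for the CPU system)
    watts: float       # maximum power draw
    price_usd: float   # list price; coprocessor prices exclude the host PC

    @property
    def gflops_per_watt(self) -> float:
        return self.gflops / self.watts

    @property
    def gflops_per_dollar(self) -> float:
        return self.gflops / self.price_usd


# Figures taken from the article text and the comparison chart.
PLATFORMS = [
    Platform("Skulltrail (dual QX9775)", 100.0, 600.0, 8000.0),
    Platform("Radeon HD 4870", 1200.0, 333.0, 549.0),
    Platform("GeForce GTX 280", 950.0, 364.0, 449.0),
    Platform("T10P Tesla", 1000.0, 170.0, 1699.0),
    Platform("CSX700", 192.0, 12.0, 3570.0),
    Platform("Tile64", 80.64, 30.0, 435.0),
]

if __name__ == "__main__":
    for p in PLATFORMS:
        print(
            f"{p.name:26s} "
            f"{p.gflops_per_watt:7.3f} GFlops/W "
            f"{p.gflops_per_dollar:7.4f} GFlops/$"
        )
```

Because the comparison mixes sustained and peak figures, it is best read as an order-of-magnitude guide rather than a benchmark.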
The fact that we have evolved from video cards into this world of high-end parallel coprocessor cards speaks very strongly of the adaptive nature of this industry. Supercomputers installed literally a year ago differ significantly in design and function from those installed this year.
For example, NCSA's "Abe" was installed in July 2007. It contained no coprocessors and was simply a host of 1200 Dell server blades. NCSA's newly added Lincoln machine will go live in October and comprises 192 Dell server blades and 96 NVIDIA S1070 Tesla cards. All told, Lincoln will deliver approximately two-thirds of the sustained compute ability of Abe while using significantly less equipment, power and cooling.
Changes like these are beginning to happen in all areas of industry. Tilera recently told TG Daily in a phone interview that the new customers it has acquired in the past year have often come from previously specialized equipment arenas, such as ASICs and FPGAs. The much lower costs and the flexibility of common toolsets make products like Tile64, CSX700-based boards, Tesla and even high-end graphics cards much more attractive than regular CPU-based compute engines.
The future is clear. The parallel coprocessor is here to stay. Future offerings from Intel and AMD, Larrabee and Fusion, will be the forerunners of the technology our kids will use to do the things we only see today in Hollywood movies. And massively parallel computing is just the tip of the iceberg. The real boon will be the new software abilities that come from it.
In terms of computing capacity, what was literally impossible to achieve five years ago is now accessible to most everybody. I cannot help but wonder what it will be like in 20 years, about the same amount of time from initial 80386 adoption until now. Care to guess?








