ClearSpeed's massive FPUs will soon head into space

Posted by Rick C. Hodgin

Bristol (UK) - On September 4, 2007, ClearSpeed Technology and BAE Systems jointly announced a licensing agreement which allows BAE to take ClearSpeed's silicon IP and implement it directly in proprietary hardware applications for use in space.  Communications satellites and remote image processing engines could see tremendous power-saving benefits from the technology's massively parallel 64-bit FPU abilities.

 

CSX600
The technology being licensed is a massively parallel FPU engine capable of 50 Gflops per CPU at single precision, or 25 Gflops at double precision.  The board is IEEE-754 compliant with one exception: denormal values are treated as zeros.  Each production board sold by ClearSpeed today consists of two CSX600 CPUs and interconnect hardware, which allows up to 100/50 Gflops in single/double precision per board.  The boards carry their own on-board memory, so they scale linearly as more boards are added.  Power consumption per CPU is rated at 10 watts typical, so the 100 Gflops of sustained throughput costs 20 watts on the CPUs, and typically 33 watts on the card.
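To illustrate what "denormal values are treated as zeros" means in practice, here is a minimal Python sketch of flush-to-zero behavior for double-precision values.  The function name flush_denormal is hypothetical, and keeping the sign of the flushed zero is an assumption; nothing here is taken from ClearSpeed's actual hardware or tools.

```python
import math
import sys

# Smallest positive *normal* double. Any nonzero value with a smaller
# magnitude is a denormal (subnormal) number.
MIN_NORMAL = sys.float_info.min


def flush_denormal(x: float) -> float:
    """Return x unchanged unless it is denormal, in which case return zero.

    Assumption: the sign of the input is kept on the resulting zero.
    """
    if x != 0.0 and abs(x) < MIN_NORMAL:
        return math.copysign(0.0, x)
    return x


tiny = MIN_NORMAL / 2            # a denormal on a fully IEEE-754 host
print(tiny > 0.0)                # True: the host keeps the denormal
print(flush_denormal(tiny))      # 0.0: flushed, as described above
print(flush_denormal(1.5))       # 1.5: normal values pass through
```

The practical consequence is that results very close to zero lose their gradual underflow, which matters mostly for numerically delicate code and rarely for typical signal or image processing.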

ClearSpeed CPUs are sold in boards like the e620.  However, BAE will not be using production ClearSpeed boards for its applications.  BAE has licensed the internal intellectual property (IP) used to create the chips.  It will be able to take the CSX600 design and incorporate it into any application it wants.  These could include stand-alone chips like the CSX600, but will likely be more fully integrated units using BAE's own proprietary packaging for use in space.

 

Terms
ClearSpeed's license grants BAE the ability to take ClearSpeed's VHDL "semiconductor source code" and create its own chips with it.  By using this source code base and compiling it for its own needs, BAE will be able to produce either a stand-alone CPU or custom applications.

BAE will not only have direct access to ClearSpeed's IP, but it will also have full access to all of their developed tools and technologies to make the CPU function.  Libraries, source code, analyzers, basically everything that ClearSpeed offers today is included in the deal, we learned.  The core software includes an open-source driver, runtime libraries, CSXL math library, C compilers with built-in parallel programming extensions, standard C libraries, random number generator, vector math library, GDB debugger, visual profiler and instruction set emulator for development, testing and debugging.

Specs
The CSX600 CPU core itself is laid out like a high performance engine.  Everything is designed for parallel execution and maximum throughput.  Delivering 96 GB/s, the shared, internal 128 KB "scratchpad" memory system allows extremely high-speed temporary storage.  Off-chip memory requests are communicated on DDR-II pathways at 3.2 GB/s.  The CSX600 is also equipped with dual 3.2 GB/s chip-to-chip channels for use in pairs.
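A quick back-of-the-envelope calculation shows why the on-chip scratchpad matters.  The sketch below divides each quoted bandwidth by the chip's quoted 25 Gflops double-precision rate; the helper names are hypothetical and the numbers are simple peak ratios, not measurements.

```python
# Figures quoted in this article for a single CSX600.
SCRATCHPAD_GB_PER_S = 96.0
OFF_CHIP_GB_PER_S = 3.2
DP_GFLOPS = 25.0
BYTES_PER_DOUBLE = 8


def bytes_per_flop(bandwidth_gb_per_s: float, gflops: float) -> float:
    """Bytes the memory path can deliver per floating point operation."""
    return bandwidth_gb_per_s / gflops


def flops_per_double(bandwidth_gb_per_s: float, gflops: float) -> float:
    """Operations needed per 8-byte operand fetched to keep the FPUs busy."""
    doubles_per_ns = bandwidth_gb_per_s / BYTES_PER_DOUBLE
    return gflops / doubles_per_ns


for name, bw in (("scratchpad", SCRATCHPAD_GB_PER_S),
                 ("off-chip", OFF_CHIP_GB_PER_S)):
    print(f"{name}: {bytes_per_flop(bw, DP_GFLOPS):.3f} bytes/flop, "
          f"{flops_per_double(bw, DP_GFLOPS):.1f} flops per double fetched")
```

Under these assumptions the scratchpad supplies about 3.8 bytes per operation, while the off-chip path supplies about 0.13, so code has to perform roughly 60 operations on every double it pulls from external memory to stay near peak.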

The core itself contains 96 high-performance processing units, each with its own dedicated 6 KB memory.  It uses 64-bit virtual addressing and 48-bit physical addressing, which is common for modern 64-bit capable processors.  Its 96 internal processing units are capable of 25 billion sustained double-precision (64-bit) floating point operations per second (25 Gflops) and 50 billion single-precision (32-bit, 50 Gflops).  Up to 250,000 complex FFTs per second with 1,024 points are also sustainable.
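The FFT figure can be put on the same scale as the Gflops ratings.  The sketch below assumes the commonly used count of 5 N log2 N floating point operations for a radix-2 complex FFT of N points; that convention is an assumption here, not a number from ClearSpeed, and it is not stated which precision the FFT figure refers to.

```python
def fft_flops(n: int) -> int:
    """Approximate flop count for one complex FFT of n points.

    Uses the conventional 5 * n * log2(n) estimate (an assumption),
    and requires n to be a power of two.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    log2_n = n.bit_length() - 1
    return 5 * n * log2_n


FFTS_PER_SECOND = 250_000     # quoted in the article
POINTS = 1024                 # quoted in the article

flops_per_fft = fft_flops(POINTS)
sustained_gflops = flops_per_fft * FFTS_PER_SECOND / 1e9

print(flops_per_fft)          # 51200
print(sustained_gflops)       # 12.8
```

By this estimate the quoted FFT rate works out to roughly 12.8 Gflops, which is about half of the 25 Gflops double-precision rating and a quarter of the 50 Gflops single-precision rating.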

 

 

Each CPU has its own on-board DMA controller, as well as on-die instruction and data caches.  Each processing unit is self-contained and communicates externally with the on-chip data network via standard protocols.  This would allow the CSX600 design to be extended to include more processing units without significant difficulty as it's basically cookie cutter.

Each CPU can run as a type of "slave" to the main system's PCI bus, or it can operate on its own as a CPU with its own operating system.

Cost
The CSX600 board costs around $7000 and includes two CSX600 processors.  Considering that the boards deliver scalable 50-100 Gflops of performance per dual-chip card, and that their CPUs consume only 20 watts of power, they have apparently found many homes in high-performance compute environments.  They are much more efficient than general purpose CPUs for highly parallel computation, and their double-precision (64-bit) abilities exceed those of current graphics cards.
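The price and power numbers above can be turned into rough efficiency figures.  This is a small sketch using only values quoted in this article (about $7000 per board, 100/50 Gflops single/double precision, 20 watts on the CPUs, 33 watts typical on the card); the function names are hypothetical and the results are peak ratios rather than benchmarks.

```python
BOARD_PRICE_USD = 7000.0
BOARD_GFLOPS = {"single": 100.0, "double": 50.0}
CPU_WATTS = 20.0     # two CSX600 CPUs at 10 watts typical each
CARD_WATTS = 33.0    # typical for the whole card


def dollars_per_gflop(price_usd: float, gflops: float) -> float:
    """Purchase price divided by peak throughput."""
    return price_usd / gflops


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Peak throughput divided by typical power draw."""
    return gflops / watts


for precision, gflops in BOARD_GFLOPS.items():
    print(f"{precision}: "
          f"${dollars_per_gflop(BOARD_PRICE_USD, gflops):.0f} per Gflop, "
          f"{gflops_per_watt(gflops, CPU_WATTS):.2f} Gflops/W on the CPUs, "
          f"{gflops_per_watt(gflops, CARD_WATTS):.2f} Gflops/W on the card")
```

At these figures a board comes to about $70 per single-precision Gflop and $140 per double-precision Gflop, with roughly 3 single-precision Gflops per watt measured at the card level.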

Big picture
There is a clear movement in the semiconductor world toward massively parallel compute abilities.  When Intel announced a few years ago that they saw a future with 80 cores, even several hundred cores, most people thought they were thinking wishfully.  But as we move forward and see the benefits of specialized compute engines applied generically in this cookie-cutter fashion to be essentially "bolt on extensible", the advantages are clear.  Lower power, faster throughput, and the ability to leverage more of what is needed for many applications:  parallel compute abilities.

I believe it won't be too many more years before we begin to see a truly modular approach to CPU component designs.  A common set of on-die interconnect protocols will be created which truly allow cookie-cutter compute blocks to be added onto a single silicon die, thereby extending the base abilities of the CPU with specialized processing units.  AMD has referred to this as the Torrenza platform, though eventually they will see GPU integration and non-GPU processing through open-source socket protocols.  Intel's Tera-scale processor is ideally suited for this kind of expansion as it already has the on-die communication infrastructure to allow massive routing and bandwidth between compute cores.

Things are changing very quickly in the semiconductor landscape.  Fundamental components are now wielded logically via a computer model, with the computer doing the hard part of taking the model and converting it into something which actually works in silicon for the designer.  The physical has become the abstract.  The abstract has taken the shape of an idea.  And the idea is limited only by how far the human mind can push it.

As we stand today, products like the CSX600, Tera-scale and Torrenza/Fusion will be those which pave the way for future compute abilities.  Individually they may or may not make it in the marketplace, but it will be their design which wins in the end:  shared, corporate, executive computing in a common framework supporting all internal processor cores.