Santa Clara (CA) - Last week, Intel released eight technical papers providing details about its Tera-scale project. TG Daily had an opportunity to discuss the technology with Jerry Bautista, director of technology management at Intel. Could Tera-scale become the x86 killer?
The Tera-scale project currently has over 100 separate teams working on it. Intel is working on everything from electrical foundations all the way up to the software. Some of the research Bautista was able to share with us indicated how powerful this project is and why Intel is throwing so many resources at it.
In February 2007, a prototype chip was built on 65nm process technology. It was clocked at roughly 3.16 to 5.8 GHz, had 80 separate compute cores operating internally, and ran through six different customized benchmarks, each using traditional compute workloads. The result was a remarkable 1.01 Teraflops of parallel computing on just 62 watts of input power (1.63 Teraflops at 5.1 GHz and 175 watts, and 1.81 Teraflops at 5.7 GHz and 265 watts). While that level of computing for a single chip is impressive in and of itself, the process and mechanics of how Intel got there are at least as impressive.
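To put the three quoted operating points side by side, the short sketch below divides throughput by power. It uses only the teraflops and wattage figures given above; the function name `gflops_per_watt` is ours, and the calculation is a plain ratio, not a metric Intel reported.

```python
# Operating points quoted in the article: (teraflops, watts).
operating_points = [
    (1.01, 62),
    (1.63, 175),
    (1.81, 265),
]

def gflops_per_watt(teraflops, watts):
    """Return throughput per watt, in gigaflops per watt."""
    return teraflops * 1000.0 / watts

for teraflops, watts in operating_points:
    efficiency = gflops_per_watt(teraflops, watts)
    print(f"{teraflops:.2f} TFLOPS at {watts} W -> {efficiency:.1f} GFLOPS/W")
```

By this simple ratio, the 62-watt operating point is the most power-efficient of the three, and efficiency falls as the chip is pushed to higher throughput.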

Off-the-shelf logic
Intel used mostly off-the-shelf logic components for its prototype. This means that arithmetic units, memory controllers, internal routing technology, caching, and everything else were either used exactly as they had already been developed, or with the barest minimum of customized changes. This technology re-use enabled Intel to take a research project from drawing board to prototype in less than a year. The Tera-scale project was first announced publicly in March 2006.
Tile design
One of the most powerful features of Tera-scale is the cookie-cutter nature of its design. We were told by Bautista that it does not really matter what compute engines are inside each core. In fact, when Intel was designing the overall system, the actual contents of the compute cores were of almost no importance. First and foremost was the scalable bus architecture, which allowed any one of the cores to communicate directly with any of the others. Bautista called this a "one to any" communication method.

The prototype itself used 80 homogeneous cores. We were told it could have used any number, and they did not have to be homogeneous. Intel chose 80 cores because the design specs allowed for a certain number of transistors. Given the memory/logic tradeoff it had in mind, the company settled on 80 cores because that provided enough memory and compute cores to prove the new idea works. It could just as easily have been 200 cores, 50 cores, or any other number because of the on-board communication system, Bautista said.

Communication
Intel uses a tier-based communications system for Tera-scale. The overall design consists of ten blocks of eight cores each, called nodes. Each node can communicate with any other node, and through it with any other core within the whole CPU. Each core communicates directly with every other core in its node, but uses the node-to-node routing system for everything external to its node.
Bautista stressed that this generic routing system is the highlight of Tera-scale. It allows anything within a node to communicate with anything else on chip.
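The two-tier scheme described above can be illustrated with a small sketch. It assumes a flat core numbering from 0 to 79 and a single router per node; both are our own assumptions for illustration, and the function names `node_of` and `route` are hypothetical, not part of Intel's design.

```python
CORES_PER_NODE = 8   # from the article: eight cores per node
NODE_COUNT = 10      # from the article: ten nodes

def node_of(core_id):
    """Map a flat core id (0..79, a hypothetical numbering) to its node index."""
    if not 0 <= core_id < CORES_PER_NODE * NODE_COUNT:
        raise ValueError(f"core id out of range: {core_id}")
    return core_id // CORES_PER_NODE

def route(src, dst):
    """Return the hops a message takes from core src to core dst."""
    src_node, dst_node = node_of(src), node_of(dst)
    if src == dst:
        return [f"core {src}"]
    if src_node == dst_node:
        # Same node: cores talk to each other directly.
        return [f"core {src}", f"core {dst}"]
    # Different nodes: the message passes through each node's router.
    return [f"core {src}", f"router {src_node}",
            f"router {dst_node}", f"core {dst}"]

print(route(3, 5))    # both cores sit in node 0
print(route(3, 42))   # core 42 sits in node 5
```

In this model, a core never needs to know what sits inside another node; it only needs the destination's address and a router that can reach it.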
In addition, cores do not have to be just general compute cores. They can be of any specialized design. In fact, a single node could contain an array of heterogeneous cores. Moreover, a node does not need anything except the external communication router technology that allows it to communicate with every other node. This means that a node does not have to consist of compute cores at all. As far as the design goes, each node could be anything: DSPs, compute cores, parallel FP engines, or something else.
Take special note of what this means and just how powerful it is. Intel has designed a system that encapsulates all of the components necessary for a multi-computer system, but the company has done so on a single die. A powerful node-to-node communication protocol holds it all together, which lets Tera-scale CPUs come off the assembly line with widely varying compute abilities. As long as each node is built to spec, it can be dropped in, along with as many others as are necessary, resulting in widely varied compute abilities for whatever task is targeted.
Does your application need lots and lots of parallel floating point compute abilities? Add some extra FP nodes. What about high-speed integer processing? Add some specialized RISC cores. Or maybe high-end memory bandwidth and very large caches? Then add a high-cache node to support block operations.
The cookie-cutter nature of this design allows a flexibility in compute abilities that we are not used to: new processors with more FP, more integer, or more fusion-like technology combining GPU and CPU compute abilities. All of these are possible very quickly if they are designed to Tera-scale's protocols.
Implementation
The Tera-scale prototype was built in layers. The memory sits on the bottom of the chip. It communicates vertically with the core, which is above. Each core has a dedicated 64 MB of RAM to itself. With 80 cores, that was only 5.12 GB, but Intel was limited by the design specs and a certain number of transistors.
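The memory total follows directly from the per-core figure. The sketch below reproduces the arithmetic using only the two numbers quoted above; it also shows the binary-unit reading, since the article does not say which convention it uses.

```python
CORES = 80          # from the article
MB_PER_CORE = 64    # from the article

total_mb = CORES * MB_PER_CORE
total_gb_decimal = total_mb / 1000   # counting 1 GB as 1000 MB
total_gb_binary = total_mb / 1024    # counting 1 GB as 1024 MB

print(f"{total_mb} MB = {total_gb_decimal:.2f} GB (decimal) "
      f"or {total_gb_binary:.2f} GB (binary)")
```

The 5.12 GB figure matches the decimal convention; under the binary convention the same memory comes to 5 GB.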

The communication system was implemented quickly using off-the-shelf components that Intel had already developed. Its bandwidth was somewhat limited, yet even without the customization a final product would receive, it achieved an aggregate bandwidth of 1.2 TB/s.
The prototype also used a reduced clock signal distribution system. Whereas traditional CPUs assign about 30% of their power budget to clock distribution, according to Bautista, Tera-scale uses only about 10%, which was enabled by having fewer "repeaters" throughout.
This allowed the repeaters Intel had in place to carry their signals to more distant locations with less power. The problem the company had to overcome with this solution, though, was that the clock signals arrived at their more distant destinations nearly a full cycle out of phase. To accommodate this, Intel just did the math: clock signals were mathematically adjusted for how far out of phase they would be at certain points. This allowed all clock signals to be properly in sync with much less power and greater flexibility.
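The article does not say how Intel performed the adjustment, so the following is only a sketch of what "doing the math" could look like: given a propagation delay to some point on the die, work out how far into a cycle the clock edge arrives and how much extra delay would line it up with the next cycle. The 4 GHz clock and the delay values are illustrative assumptions, and both function names are hypothetical.

```python
def phase_offset_cycles(delay_ps, clock_ghz):
    """Fraction of a cycle (0.0 up to 1.0) by which a clock edge arrives late."""
    period_ps = 1000.0 / clock_ghz      # one clock cycle in picoseconds
    return (delay_ps % period_ps) / period_ps

def compensation_ps(delay_ps, clock_ghz):
    """Extra delay that would align the late edge with the next cycle."""
    period_ps = 1000.0 / clock_ghz
    return (period_ps - delay_ps % period_ps) % period_ps

# Hypothetical example: a 4 GHz clock (250 ps period) and made-up delays.
for delay in (50.0, 225.0, 250.0):
    print(f"delay {delay:5.1f} ps: "
          f"{phase_offset_cycles(delay, 4.0):.2f} cycle out of phase, "
          f"add {compensation_ps(delay, 4.0):.1f} ps to realign")
```

With these made-up numbers, a 225 ps delay lands the edge 0.9 of a cycle late, which is the "nearly a full cycle out of phase" situation described above, and a known offset of that kind can be corrected by a fixed adjustment.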