Opinion – As you walk along the three levels of West Moscone Center in downtown San Francisco, you’ll eventually come across five areas Intel has set up at IDF 2007 that relate directly to the Tera-Scale Project. Some of these are showcase research efforts where the public can speak directly with the engineers working on this technology. And one of them shows the actual Tera-Scale machine in full swing.
Big machine, little package
I don’t think many consumers, or even members of the press, have yet realized the full potential or significance of Tera-Scale. In fact, I believe this research effort to be the biggest news coming out of Fall IDF 2007. While there are no new bits of information being showcased, Tera-Scale now encompasses over 100 separate in-house projects and is the single biggest research effort at Intel for upcoming products.
Intel also recognizes Tera-Scale’s significance, because it devoted a rather noticeable amount of floor space to this unmarketable product at its event. Even so, I believe it deserves more.
I had the opportunity to sit in on a session entitled “Path to Petascale and Beyond”. The information conveyed in that session related to future multi-core, multi-processor efforts, and specifically those which will get us to the petaflop level (1000 teraflops, roughly 1000x more powerful than the fastest supercomputers today).
The speaker was talking about 1,000,000 CPU packages, and “10s of millions of threads” running simultaneously. He said that the limiting factors will not be the compute abilities, but rather the interconnect technology.
Increasingly, the interconnect fabric is proving its significance. One slide in the presentation demonstrated this very clearly: The “ASCI Red” supercomputer, which is composed of 10,000 Pentium Pro processors clocked at around 233 MHz, used an interconnect architecture which is notably different from that of today’s supercomputers. That slide showed how performance scales, and how much more powerful that very old ASCI Red machine is when the workloads become very large. In fact, it outpaces much faster machines today on extremely large workloads, and that is owed to one thing: The interconnect.
Tera-Scale is no different in its design. The biggest limiting factor for Tera-Scale has been its message routing system. Basically, the many CPU cores inside of Tera-Scale are laid out in a grid of nodes. Each node contains several CPUs (eight were used in the demo, though it can be any number). Each CPU has a crossbar routing system which allows it to speak directly to any other CPU in its node. And to speak to a CPU in any other node, it uses a separate lane of communications.
By operating in this way, common workloads can be coordinated by the OS to operate physically in the same node. In addition, because any CPU can talk to any other CPU through this system, workloads can be shifted around as necessary to other cores, without anything outside of the CPU being aware of what’s happened.
But it’s the interconnect technology that makes it all possible. And this is the area where Intel is devoting significant research. The question “How can Tera-Scale communicate efficiently with its many other cores, and at very high speeds?” is the same kind of question being asked in the highest-end supercomputing world.
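To make the node-and-grid layout described above more concrete, here is a minimal Python sketch. It is not Intel's routing logic. The names `Cpu`, `route` and `node_distance` are hypothetical, the path labels are illustrative, and the Manhattan-distance metric between nodes is purely an assumption, since the session did not describe how messages travel between nodes.

```python
from dataclasses import dataclass

# The demo used eight CPUs per node; the design reportedly allows any number.
CPUS_PER_NODE = 8


@dataclass(frozen=True)
class Cpu:
    node_x: int  # column of the CPU's node in the grid (hypothetical coordinate)
    node_y: int  # row of the CPU's node in the grid (hypothetical coordinate)
    index: int   # position of the CPU inside its node, 0..CPUS_PER_NODE-1


def route(src: Cpu, dst: Cpu) -> str:
    """Classify which communication path a message would take."""
    if src == dst:
        return "local"
    if (src.node_x, src.node_y) == (dst.node_x, dst.node_y):
        # Same node: the crossbar lets the two CPUs talk directly.
        return "crossbar"
    # Different node: the separate lane of communications is used.
    return "inter-node"


def node_distance(src: Cpu, dst: Cpu) -> int:
    """Assumed cost metric: Manhattan distance between the two nodes."""
    return abs(src.node_x - dst.node_x) + abs(src.node_y - dst.node_y)


if __name__ == "__main__":
    a = Cpu(0, 0, 1)
    b = Cpu(0, 0, 5)
    c = Cpu(2, 1, 1)
    print(route(a, b), node_distance(a, b))
    print(route(a, c), node_distance(a, c))
```

Under this model, a scheduler that keeps cooperating threads on CPUs of the same node would stay on the crossbar path and avoid the inter-node lane entirely.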
Memory per FLOP
A surprising bit of information was given in the petascale presentation. It related to the amount of data typically needed from memory, on average, for each FLOP of performance. For x86 it is about half a byte. So in order to sustain one gigaflop, you’ll need memory bandwidth of about 500 MB/s. And for four gigaflops, you’ll need about 2 GB/s.
The L1 cache is more than able to satisfy the bulk of this need. The L2 is slower and offers lower throughput. This is one area where the speaker believed Intel needed to go back to the drawing board for future, large-scale implementations.
Since a computational engine needs data to perform, it has to be fed at a rate commensurate with its need. And while for x86 it is half a byte per FLOP, other implementations and workloads require more or less. I was told that Tera-Scale requires approximately one third of a byte per FLOP.
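The arithmetic behind these figures is simple enough to sketch in a few lines of Python. The function name `required_bandwidth` and the constants are hypothetical, and the sketch assumes decimal units (1 GB = 10^9 bytes) and a sustained rate. The one-teraflop example for Tera-Scale is my own illustration, not a number given in the session.

```python
GIGA = 1e9  # decimal giga; assumed unit convention
TERA = 1e12

# Bytes of memory traffic per floating-point operation, as quoted in the session.
X86_BYTES_PER_FLOP = 0.5
TERA_SCALE_BYTES_PER_FLOP = 1 / 3


def required_bandwidth(flops_per_second: float, bytes_per_flop: float) -> float:
    """Return the memory bandwidth in bytes/second needed to feed the given rate."""
    return flops_per_second * bytes_per_flop


if __name__ == "__main__":
    for gflops in (1, 4):
        bw = required_bandwidth(gflops * GIGA, X86_BYTES_PER_FLOP)
        print(f"x86 at {gflops} GFLOPS needs about {bw / GIGA:.1f} GB/s")

    # Illustrative only: one teraflop at the Tera-Scale ratio.
    bw = required_bandwidth(1 * TERA, TERA_SCALE_BYTES_PER_FLOP)
    print(f"Tera-Scale ratio at 1 TFLOPS needs about {bw / GIGA:.0f} GB/s")
```

The same multiplication shows why the feeding problem grows with scale: at a fixed bytes-per-FLOP ratio, every thousandfold jump in compute demands a thousandfold jump in sustained bandwidth.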
It was very interesting listening to the direction Intel is taking with petascale efforts. There are some current industry beliefs that need to be shaken up before these research efforts can move forward and produce real-world products, according to the speaker. And Intel is moving forward in these areas and carrying out very wide-ranging research efforts into the petascale realm, as well as the Tera-Scale Project.
I believe that barring some kind of engineering breakthrough, we’ll be seeing the biggest advances in the years to come from these kinds of research efforts. The speaker indicated that CPUs aren’t getting faster any more. They’re getting wider. We’ll see 8 cores, 16 cores, 32 cores, 64 cores and even more. That’s how this industry is going to get more compute abilities. And it will be the research efforts like these which make it possible.