Tech Tour Terascale: Developing Terascale 2

Posted by Rick C. Hodgin

Santa Clara (CA) and Hillsboro (OR) – TG Daily editor Rick Hodgin caught up with Intel's Terascale development teams in Silicon Valley and Oregon to get an update on the future of the company's 80-core processor: Terascale is quickly evolving into a system that can achieve 2.04 teraflops at 6.26 GHz in high-performance mode, or hit 1 teraflops at just 55 watts.

The Terascale Project is the blanket name for the single largest future-product research effort taking place at Intel right now.  It was originally created to find a way to put supercomputing abilities on a single chip.  Since then, however, the project has grown in scope and size:  Intel has begun the fundamental research work on the next version.  Given the traditional 24-month cycle, the follow-on project could be available as soon as March 2009, though Intel gave no specific dates or timelines.  Many of the architecture's details were outlined in a previous article, which is worth reviewing first for an understanding of Terascale.

When I spoke with the Terascale folks at the recent Fall IDF 2007, we were told that approximately 100 separate teams were involved with Terascale research in one way or another.  Today that number has been pared down a bit.  In addition, the current research efforts have shifted focus slightly, heading more toward developing usable software algorithms and software models.  These efforts will eventually feed not only the real products that will ship for Terascale, but also the internal hardware design, since Terascale was designed with heavy consideration for the software side.  The team looked at what was needed there, and then built the machine to carry it out.

The team previously identified many of the problems associated with Terascale computing, chief among them the interconnect.  Its high volume of data must be moved efficiently, and that requires a very high-speed, complex, on-die network.  In fact, during the Fall IDF we were told in one of the classes that the computing cores included in the design were almost a secondary consideration.  It was the interconnect framework which received primary focus, and it's what really made Terascale possible.

The Prototype

The Terascale Project's prototype came with 80 discrete cores and, when running at 3.13 GHz, achieved a sustainable 1.02 teraflops of throughput from a single silicon die and package.  It did so with just under 90 watts of power consumption on a 65 nm process.  When I visited the lab last week, I saw the same demonstration consuming only 55 watts on the same 65 nm technology, indicating just how mature the Terascale research effort is at this point.
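
As a rough sanity check, the quoted figures line up with simple peak-throughput math, assuming the publicly reported layout of two floating-point multiply-accumulate units per core (a MAC counting as two FLOPs per cycle). This is my own back-of-the-envelope sketch, not a calculation provided by Intel:

```python
# Back-of-the-envelope check of the quoted throughput figures.
# Assumption (widely reported for the 80-core prototype, not stated in the
# demo itself): two FP multiply-accumulate units per core, 2 FLOPs per MAC.
CORES = 80
FLOPS_PER_CYCLE_PER_CORE = 2 * 2  # 2 MAC units x 2 FLOPs each

def peak_tflops(clock_ghz: float) -> float:
    """Peak single-precision throughput in teraflops at a given clock."""
    return CORES * FLOPS_PER_CYCLE_PER_CORE * clock_ghz / 1000.0

print(peak_tflops(3.13))  # ~1.00 TFLOPS, in line with the quoted 1.02
print(peak_tflops(6.26))  # ~2.00 TFLOPS, in line with the quoted 2.04
```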

The 80-core chip was demoed and put thoroughly through its paces before my very eyes.  A few dozen algorithms were used to carry out various functions, mostly mathematical programs like data modeling, convergence and complex formula solving.  It also included a generic spreadsheet, ironically enough.  The Terascale team told me they wanted to make sure the chip could handle anything that was thrown at it.

Terascale 2, a programmable IA core machine

The original 80-core chip was composed entirely of non-IA cores.  These cores are not like the general-purpose cores seen today in modern x86 CPUs.  In fact, they weren't even as complex as the original 80486DX.  They were more generic, small math cores which required special attention and hand-coding to program.

However, we've learned that Intel's next-gen Terascale chip will be built from programmable IA cores.  This nifty attribute will allow Terascale 2 to use much more common development and debugging tools, making it easier to eventually release a marketable product and to demonstrate the algorithms the teams have been developing in the labs.  These will all feed into what will likely be the Terascale 3 revision, which could be the first commercially released Terascale product.  Note:  We were never told Intel is using the Terascale 2 or Terascale 3 naming conventions.  We are using them here simply to indicate products and revisions.

Prototype lab

While touring the Hillsboro, Oregon, lab, I saw the actual chip in use.  In addition, I spotted a station where the team physically assembled the chip packages containing the 80-core Terascale die.  We were told that the team physically created about 30-40 Terascale chips, which were sent around the world for various testing purposes.  I also saw several Terascale wafers with many of the dies and memory cut out, and several more remaining.

The team had set up a demo which allowed me to witness the various effects of high speed computing on Terascale.  The tools they had developed demonstrated each core's activity via a type of color-coded on-screen status display.  It served as a control panel of sorts, developed just for Terascale.  From there, applications could be sent to the chip and computed, and results gathered.  Approximately 40 different programs had been developed to test all aspects of Terascale's design.

The control panel also had the ability to throttle the CPU's clock speed with a vertical slider.  Instead of adjusting in MHz, the slider was calibrated in teraflops, from 0 to 2.  Based on whatever throughput was selected, the tool would automatically do the math and pick the optimum settings which, through observation and experimentation, had been shown to achieve that throughput.  It basically adjusted the CPU's clock rate and input power as necessary for the workload.
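
A minimal sketch of what such a control path might look like, assuming a simple experimentally derived table mapping target throughput to clock and supply voltage. All names and values here are illustrative; this is not Intel's actual tool:

```python
import bisect

# Hypothetical calibration table built up "through observation and
# experimentation": target teraflops -> (clock in GHz, core voltage in V).
# The values are invented for illustration, not Intel's measured settings.
CALIBRATION = [
    (0.5, (1.60, 0.80)),
    (1.0, (3.13, 0.95)),
    (1.5, (4.70, 1.10)),
    (2.0, (6.26, 1.30)),
]

def settings_for(target_tflops: float) -> tuple[float, float]:
    """Pick the lowest calibrated operating point that meets the target."""
    keys = [t for t, _ in CALIBRATION]
    i = bisect.bisect_left(keys, target_tflops)
    i = min(i, len(CALIBRATION) - 1)  # clamp to the top of the table
    return CALIBRATION[i][1]

clock_ghz, vcore = settings_for(1.0)
print(f"program board: {clock_ghz} GHz at {vcore} V")  # 3.13 GHz at 0.95 V
```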

In one extreme example, the team was able to ramp the clock up to 6.26 GHz, delivering a full 2.04 teraflops of computing power on a single die.  Of course, when they did this the power jumped enormously, to over 225 watts.  To keep the chip from immediately destroying itself from the excessive heat, they had a very large copper water block connected to a huge and powerful chiller, about the size of a college refrigerator and very loud.  It kept the chip quite cool at around 10°C.  Under extreme operations, the temperature would momentarily blip up to nearly 40°C.  The cooler would sense this and kick in with extra cooling capacity, bringing it back down to around 20°C in just a few seconds.

At one point the team shut down the water cooler to demonstrate that, at 3.13 GHz and only 55 watts, the chip could be adequately cooled by the copper block alone; in a real production system, it could have been cooled by a regular heatsink and fan.  55 watts is a fairly typical power consumption for modern x86 CPUs during normal use.

Physical setup

The team had the prototype motherboard set up with very large wire connectors, about 5/16" in diameter each.  It included a custom-developed interface between a Windows XP box and the Terascale motherboard, using 25-pin ribbon cables for data.  Those connections operated at 12 MHz and were used only to send and receive data, including program instructions, to and from the Terascale motherboard.  There were a dozen or so separate debug ports used for monitoring the system, most of them tapped and connected to various machines nearby.  In addition, there were numerous inputs for the various voltages, amperages and clock speeds used to power the chip and board.  Each of those values was programmed externally by the control panel software on the Windows XP box, with the values themselves determined through experimentation over time.

Intel was using very large power supplies, about the size of a half-height toaster oven.  These could ramp very quickly to the various power loads required to make Terascale work, delivering huge amperages at low voltages with high precision.  All in all, the Terascale setup they had in place appeared to be very mature at the time of my visit.  Everything had been set up to work with just a few clicks, and all of the readouts were shiny, to borrow a Firefly phrase.

The 80-core prototype chip is no longer an active research chip.  It's now pretty much a finished core.  I got the impression it's being used today only to show off to pesky journalists like me and others touring the facility.  While the demonstration is effective, the real question is what's to come.  The direction the Terascale team has in mind will bring us the Terascale 2 chip.

Terascale 2

I learned that Terascale 2 is currently about six months into its development cycle; in about 18 months we could see the final version.  It will arrive on 45 nm process technology.  I was told the design Intel is using can be scaled to any process Intel has defined.  These could include 65 nm, 90 nm, 130 nm, etc., as well as future 32 nm and below.

The team would not disclose many details about what Terascale 2 might entail, since many of those details are not yet known even to the research team themselves and are still in flux or being worked out, but I was able to find out a couple of things.  First, Terascale 2 will be populated with an undisclosed number of IA cores.  None of the original prototype cores will be used in Terascale 2, nor will there be any other types of cores.  This means no form of heterogeneous computing is currently planned for Terascale 2.

The other thing I learned was that Intel is working very closely with reconfigurable hardware.  The idea is to create something which, based on how data is sent to it, can take on various computing abilities.  The details of how this works are easy to understand from a high-level point of view, but from a CPU design perspective they can be very tricky.  Basically, think of the traditional black box.  When configured a particular way, that box will add two numbers together.  By changing some internal setting within the box, it takes on a new ability: it reconfigures itself so that rather than adding two numbers, it now multiplies them, or performs some other operation.
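
To make the black-box idea concrete, here is a toy software analogue of my own, not Intel's design: a unit whose behavior is selected by a configuration setting, much as reconfigurable hardware routes its data path differently depending on how it has been programmed.

```python
from operator import add, mul, sub

# Toy analogue of a reconfigurable "black box": the same unit performs a
# different operation depending on a configuration value written into it.
class ReconfigurableUnit:
    _OPS = {"add": add, "mul": mul, "sub": sub}

    def __init__(self, config: str = "add"):
        self.configure(config)

    def configure(self, config: str) -> None:
        """Reprogram the unit; in hardware this would rewire the data path."""
        self._op = self._OPS[config]

    def compute(self, a: float, b: float) -> float:
        return self._op(a, b)

unit = ReconfigurableUnit("add")
print(unit.compute(3, 4))   # 7
unit.configure("mul")       # same box, new behavior
print(unit.compute(3, 4))   # 12
```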

Terascale 2 may include this kind of programmable IA engine.  The specific details of how far this will go were not disclosed.  However, the interest came through both in an interview with Justin Rattner and in follow-up email after the visit.

As of right now, the team is focused on setting hard goals for Terascale 2.  They're working with CPU models and cycle-accurate simulations, trying to find the correct balance between core count, computational goals, power budgets and so on, using the lessons learned from the first Terascale.

Seeking powerful software

There is a tremendous effort underway to develop easy-to-code parallel computing methods and models.  I interviewed several software developers working on the team during my two-day visit, and they all told me that many large questions remain unanswered about how the problems can be solved, or best be solved.  Everyone I asked indicated Intel is looking for the best long-term solutions, not just short-term fixes.  Still, I was told command decisions might change that.

The software teams definitely have a plan in place moving forward, but the details of everything that effort entails are still under wraps.  The goals they're pursuing center on scalability:  Intel wants to create software models which take code written for our existing dual-, quad- and future eight-core systems and, with only a recompile, scale it to future Terascale and Larrabee-like chips with their dozens to hundreds of cores.
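
The shape of that goal is familiar from today's parallel programming models: express the work as independent chunks and let the runtime spread it over however many cores exist. A minimal sketch in that spirit, my own illustration rather than Intel's programming model:

```python
from multiprocessing import Pool, cpu_count

def work(chunk: range) -> int:
    """Stand-in for a per-chunk computation; here, just sum the values."""
    return sum(chunk)

def parallel_sum(n: int, chunk_size: int = 100_000) -> int:
    # The code never hard-wires a core count; it simply asks the system,
    # so the same source scales from a dual-core laptop to a many-core chip.
    chunks = [range(i, min(i + chunk_size, n)) for i in range(0, n, chunk_size)]
    with Pool(processes=cpu_count()) as pool:
        return sum(pool.map(work, chunks))

if __name__ == "__main__":
    print(parallel_sum(1_000_000))  # 499999500000, regardless of core count
```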

Programmable Terascale 2

When I visited the Hillsboro lab, I was told the machine they use for simulating Terascale 2 was currently in Germany.  It is the same field-programmable gate array (FPGA) machine I saw at Fall IDF, operating at approximately 1 MHz.  That machine faithfully emulated the full number of cores, the interconnect and everything else in Terascale 1.  While this 1 MHz figure might seem amazingly slow, and it truly is compared to the 3.13 - 6.26 GHz of the silicon version, it's still much faster than computerized cycle-accurate simulations, which typically run at around 1 kHz.

I was told that the team, when developing the 80-core prototype, worked in something like "stages" for development cycles.  They would test something in software first using the cycle-accurate simulation.  These were the fastest and least expensive changes to make, often consuming only minutes.  Many changes could be tried before the final answer was found.  Once simulated and tested, and once the team was satisfied they'd found the solution they were after, those changes would be incorporated into the code used to program the FPGA.  I was told the FPGA was only updated once a day or so, on average.  And, once updated, it would execute tests much faster than the simulation, providing more throughput for better data analysis and testing.
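
The appeal of the staged workflow is easy to quantify: a test that takes hours in a cycle-accurate simulator finishes in minutes on the FPGA and in a blink on silicon. A rough comparison using the clock rates quoted above; the test length is an illustrative assumption of mine, not a figure from Intel:

```python
# How long a hypothetical 1-billion-cycle test takes at each stage,
# using the clock rates quoted in the article. The cycle count is an
# illustrative assumption, not an Intel figure.
TEST_CYCLES = 1_000_000_000

stages = {
    "cycle-accurate simulation (~1 kHz)": 1_000,
    "FPGA emulation (~1 MHz)": 1_000_000,
    "silicon prototype (3.13 GHz)": 3_130_000_000,
}

for name, hz in stages.items():
    seconds = TEST_CYCLES / hz
    print(f"{name}: {seconds / 3600:.2f} hours" if seconds > 3600
          else f"{name}: {seconds:.2f} seconds")
# ~278 hours in simulation, ~17 minutes on the FPGA, ~0.3 s on silicon
```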

The same kind of machine is being used for Terascale 2.  However, I did not get a chance to see it, nor did the team discuss the methodologies they're using right now for development.  In fact, from what I was able to gather from our conversations, it seems the team is not yet finished with the base Terascale 2 design.  They seem to have most of the major details hammered out, such as using only IA cores and sticking with a similar memory model which physically stacks the memory above the CPU in 3D fashion, but they're still working on some aspects of the design.  Questions like how many cores, what they include, and whether they should be programmable are still being worked out.

Demonstrations, early products and software directions

Intel's Santa Clara facility has a demonstration lab which allows Terascale algorithms to be put on display.  There are half a dozen computers and several large monitors which show the fruit of various software algorithms being developed for Terascale.  While many of these ran on quad- or eight-core machines with high-end GPUs, the idea is to create algorithms that can be easily carried over to Terascale 2 and its successors.  The demonstrations I saw were often running in real time or near real time.  I was told that when running on Terascale 2, the performance is expected to be "super real-time", meaning that many seconds of video, audio or data can be processed in every real second.  This would allow high-definition video to be processed, for example, in much the same way MP3 encoders today can encode at 20x or greater throughput from the source WAV file.
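
The arithmetic behind "super real-time" is straightforward: at an MP3-encoder-like 20x rate, hours of content collapse into minutes. The speed-up factors below are illustrative, borrowed from that analogy rather than measured on Terascale hardware:

```python
# "Super real-time": seconds of content processed per wall-clock second.
# The speed-up factors are illustrative, not measured Terascale numbers.
def processing_minutes(content_hours: float, speedup: float) -> float:
    return content_hours * 60 / speedup

print(processing_minutes(2, 20))  # a 2-hour recording at 20x -> 6.0 minutes
print(processing_minutes(2, 1))   # at plain real-time -> 120.0 minutes
```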

Demo: Soccer game

According to the demos, Terascale 2 is already being prepared to bring some amazing abilities to the table.  In one example, I saw a previously recorded soccer game.  The machine had been loaded with special software which took a model of a soccer game, its players, the ball, and various correlations about how all of those components relate to one another, and analyzed the two-hour game in super real-time using only an eight-core machine.  The entire game was scanned for key events, such as goals, face-offs, out-of-bounds plays and so on.  The software looked at multiple components to determine what events were taking place at any given second.  This even included the "roar of the crowd" as an aid for determining the "excitement level".
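
As a purely illustrative sketch of the idea, and not Intel's actual model, an "excitement level" can be formed by fusing several per-second cues, with crowd audio energy as one of them. The features, weights and threshold below are invented for illustration:

```python
from dataclasses import dataclass

# Toy illustration of fusing per-second cues into an "excitement level".
# Features, weights and threshold are invented, not Intel's model.
@dataclass
class SecondOfPlay:
    crowd_energy: float       # normalized loudness of the crowd, 0..1
    ball_near_goal: float     # how close play is to a goal mouth, 0..1
    commentator_pitch: float  # rise in commentator pitch, 0..1

WEIGHTS = {"crowd_energy": 0.5, "ball_near_goal": 0.3, "commentator_pitch": 0.2}

def excitement(s: SecondOfPlay) -> float:
    return (WEIGHTS["crowd_energy"] * s.crowd_energy
            + WEIGHTS["ball_near_goal"] * s.ball_near_goal
            + WEIGHTS["commentator_pitch"] * s.commentator_pitch)

def highlights(timeline: list[SecondOfPlay], threshold: float = 0.7) -> list[int]:
    """Return the second offsets whose excitement clears the threshold."""
    return [t for t, s in enumerate(timeline) if excitement(s) >= threshold]

game = [SecondOfPlay(0.2, 0.1, 0.1), SecondOfPlay(0.9, 0.8, 0.7)]
print(highlights(game))  # [1] -- the probable goal
```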

After the scan was complete, a list of identified highlights was shown.  The user could click on each one and see that part of the game.  Obviously, if models for other sports were created, such a design would be very desirable for scanning through many saved games from a weekend's worth of TiVo recordings.  It was also discussed that when these kinds of abilities are truly available, broadcasters will begin to send out real-time metadata or annotation for the broadcast, allowing for even faster, more accurate manipulation.

Demo: Video cleanup

Another video demo involved enhancement technology.  This one used the phenomenal abilities of massively parallel computing, specifically algorithms geared toward the future Terascale 2 chip, to enhance regular video in super real-time.

In this example, a 320 x 200, 15 fps cell-phone video was shown without any software correction.  It looked just like what you'd expect.  Next to it, following a complex and compute-intensive analysis, a fixed-up version was shown.  The fixed-up version had been corrected for aliasing, frame rate, jitter, color oddities and several other graphical attributes.  The end result was a video that looked like it had been captured at a much higher frame rate and resolution.

Intel explained that with such technology available, specifically through the massive compute potential of a future Terascale-based product, no additional data storage or bandwidth would be required.  A video stored at its existing quality could be displayed at a better quality at view time simply by applying Terascale computing algorithms.  These would result in significant, and correct, video enhancements which make hardware limitations less visible, because the software can correct for what the hardware can't do.  And this is made possible in real time only by low-power, massively parallel computing.

Primitive demos

Intel representatives also explained that the algorithms currently employed are still relatively primitive.  We discussed several other possibilities they're working on which would extend those abilities to include some amazing enhancements.  One such example was video smoothing: taking a video which is badly shaken and making it appear still around a central object of focus.

While this kind of video smoothing is possible today, such as centering each frame around a particular object like a child skiing downhill, the video is typically cropped so that only those portions visible in all frames are shown.  With Terascale, however, it would be possible to analyze the video and determine, frame to frame, what material from the surrounding frames could be used to fill in the portions missing from a given frame.  This would make the video appear much closer to its original size, a desirable visual trait.

Future algorithms will include the ability to take information from various parts of a longer video and reassemble them out of time.  Suppose, for example, a sign was visible in one part of a video but the resolution or focus wasn't sharp enough to read it.  If, at another point in time in the video, the sign was clearly displayed, that data could be brought back to the earlier section and inserted as an enhancement or fix-up.  The sign would then be readable at all points, because the software used information available from the video itself to make the video better.

This would work for many other aspects of the video as well, including scenes, colors, people and objects: anything Terascale can identify as a component that appears in multiple places throughout the video and is perhaps only partially shown in others.

How will this kind of very intense analysis and fix-up be carried out?  It requires applying those low-power, massively parallel compute abilities to something called "model-based computing".

Model-based computing and Terascale 2

One of the biggest focuses on the software side today is the model-based computing engine.  It is really only made possible by massively parallel computing, as there are enormous numbers of computations which must be made in real time, or super real-time, for it to be effective.  Model-based computing basically looks at recognizable patterns or traits, and then applies those observed attributes to a list of associated meanings or other sets of data.

To give an example, consider Google's recent attempt at language translation using a form of model-based computing.  Basically, Google told its computers to look at known language translations and observe patterns.  From those patterns it developed computer models which were then applied to the test translations.  The Google tool had no idea it was translating languages.  As far as the engine was concerned, it simply consulted the models: when it saw a particular grouping of words, a certain percentage of the time that group translated to a particular group of words in the other language.  The sample base was large enough that the models were very accurate and complete.  In the end, the model-based engine Google created was far more accurate than any other translator out there.
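
A toy illustration of the principle, vastly simplified and in no way Google's actual system: a phrase table of observed translation frequencies, consulted with no notion of grammar at all. The phrases and counts are invented:

```python
# Toy phrase-based "model": observed phrase pairs with how often each target
# phrase appeared for a given source phrase. The counts are invented; a real
# statistical system learns millions of such pairs from parallel text.
PHRASE_TABLE = {
    "guten morgen": {"good morning": 972, "good day": 28},
    "wie geht es dir": {"how are you": 941, "how goes it": 59},
}

def translate(phrase: str) -> str:
    """Pick the target phrase seen most often for this source phrase."""
    candidates = PHRASE_TABLE.get(phrase.lower())
    if not candidates:
        return phrase  # unknown phrase: pass it through untranslated
    return max(candidates, key=candidates.get)

print(translate("Guten Morgen"))     # "good morning"
print(translate("wie geht es dir"))  # "how are you"
```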

It's just too difficult to design rules for every possible situation or exception, and doing so is very costly in terms of development time.  If existing data can be sampled, especially large bodies of existing data, and models built from it, then the same abilities emerge in a more generic manner, one grounded in real-life usage, and the result is much better computing.

This same kind of modeling can be used for a much wider range of software needs, from gaming to daily use.  And eventual products like those based on Terascale can make it real for the mainstream.  They allow currently unfathomable ideas to become possible, and then real.  Some of these efforts are beginning to look toward the situationally aware computer: future computers that will identify not only where you are and what you're doing, but the context in which you're doing it.

Situationally aware computer assistance

Intel gave an interesting example of a situationally aware computer.  Suppose you're flying to a town you've never been to before.  The computer would have a model of your daily life and normal activities.  It would be able to determine from a GPS signal, for example, that you're not at any of your normal places of operation.  The device might also be tied in to your calendar or itinerary which, through model recognition and personal association, allows it to identify that you've scribbled down a particular hotel.  It can then look up the hotel by name and city, from the GPS or flight data, and find its address.  The computer, by using these massively parallel compute abilities and applying several model-based algorithms, essentially begins to interpret things which stand out from the ordinary.  It applies the basic model-mapping technology to those out-of-the-ordinary things and is able to do something like automatically presenting the shortest route to the hotel.  Or maybe it will present the best method of transportation to get there.  It could even be programmed to contact the local taxi companies and have a cab waiting for you.  It would do all of this from the very instant you get off the plane.

Another example would be the ability to analyze intent in real-time.  Imagine, for example, holding up a Mobile Internet Device (MID), which is kind of like a small tablet PC, in front of you on a dark night.  The MID would know where you are from the GPS signal.  It would know which way you're pointing the device, from several cues including compass direction as well as the ability to analyze and parse surrounding buildings and features from the on-board camera.  The parallel compute abilities could then perform a remote lookup with a future version of Google Maps, one which presents you with a street-level view of the very area you're looking at, but one which corrects it for your point of view and the time of day.  It displays an immediate overview of the surrounding area.

You could find where you want to go by holding up the MID.  It would present an image which, from your eye's perspective, makes it look like you're holding up a sheet of glass that, through a simple command, renders everything bright and sunny using previously recorded images displayed on the screen.  As you move it up, down, right, left, turn it and so on, the massively parallel computing abilities inside automatically update the display in real time, so it tracks constantly with your literal surroundings.  It becomes a glass view into a virtual world which augments your physical world.

All of these abilities, and so many more, are made possible by model-based computing and the low-power, massively parallel computing abilities of something like Terascale.  Always-on high-speed internet, be it WiMAX or whatever, would also be required to provide access to the right databases and tools for such devices to thrive.  Right now, all of these technologies are in their infancy.  They are available, however, even today at the ultra high end.  And in 10-20 years, they will be standard fare on our future cell phones or iPods.

Conclusion and Author's opinion

Intel's Terascale development effort is continuing in full force.  It is the single largest future-product research effort taking place at Intel right now.  According to CTO Justin Rattner, over half of the Terascale research budget is being spent on finding software solutions and computing models for future products.

Terascale is the first physical product seen in silicon which truly has the potential to change our world.  The demonstrations Intel has already developed, which show what is possible with so much parallel computing potential, have a real opportunity to make a difference to each individual person, and not just to mankind.  There is no doubt that computers have affected mankind in every way.  But tomorrow, thanks to efforts like Terascale, individual people will see their daily lives directly impacted by the capabilities of a massively parallel computing engine at their personal disposal.  The unimaginable will become the standard in the not-too-distant future.

Terascale represents, in this author's opinion, the potential for the true beginnings of what mankind will ultimately record as "true computing".  Everything we've seen up until now will later be viewed the way we now view the typewriter next to the color laser printer abilities that are coming.

Those children being born today will use computers when they enter the workforce, computers as different from ours as those we use today are from the manual typewriters of the 1940s.  And I believe it is exactly an applied product like Terascale, basically the low-power, massively parallel computing engine and all of the requisite software which will be developed over time, which will make it all possible.

TG Daily attended an Intel-sponsored event to facilitate this article.