Quietly but surely, we are heading into a new computing era that will bring one of the most dramatic changes the IT industry has seen. Acceleration technologies will inject lots of horsepower into the CPU, increasing the performance capability of the microprocessor not just by 10 or 20%, but in some cases by up to 100x. These new technologies, which are expected to become widely available through heterogeneous multi-core processors, create challenges for software developers – but Intel claims to have found a way to make the transition easy.
Accelerators, which most commonly provide additional floating point capability, have been discussed for some time. Most recently, ATI (now AMD) released its stream processor card and Nvidia is leveraging its GeForce 8 to make its graphics cards available for general-purpose processing. Both AMD and Intel are working on CPUs that will integrate graphics cores as well as other “accelerators” in the future. While these integrated processors will open the door to much more demanding applications, including physics processing and simulation, they will bring a whole new set of requirements to programmers. Knowledge of multi-threaded programming, the expertise to fine-tune threading and the know-how to exploit the hidden capability of hardware appear to require a whole new approach to developing applications – in some way a whole new generation of programmers.
Intel’s recent disclosure that it is working on a pain-free multi-thread programming model has prompted us to dig deeper. Is there a secret sauce to make the horsepower of future integrated heterogeneous processors available to every developer – without asking for specialty knowledge?
Let’s have a closer look.
What developers are dealing with today
The multi-thread waters of programming are very troubled; it’s as simple as that. Much of the problem stems not from hardware, but rather from an inefficient and very difficult-to-use software model of that hardware. Without diving into very low-level forms of code, which are costly, time-consuming and prone to error, it is difficult to create an efficient multi-threaded environment. This reality is becoming ever more pronounced as we begin the migration from homogeneous to heterogeneous programming, where non-x86-based processors are being called upon to do work in parallel.
The solutions using today's hardware are often extremely difficult to code or coordinate efficiently, requiring special drivers and tools. One scenario utilizes the GPU for parallel processing, in which the operating system (OS) calls for special drivers and runtime code packages must be linked to the application targeted for acceleration.
In such an environment, we have to pay attention to some serious roadblocks. First, a different version of the toolset must be distributed for each OS. Each version costs money, which keeps the product from reaching all OS platforms. Second, that very reality limits accelerated parallel processing to the supported operating systems. This not only makes the lag time from idea to product often unjustifiable on alternative platforms, but also makes it very clear that something more is needed to bring this high power to everyone.
We need something that serves not only the parallel abilities we already have today, but also the growth curve of the accelerator products we'll have tomorrow. A new software model is required to keep up with current and future hardware advancements. And it needs to be one which operates as globally as possible.
The dual-core lesson
Looking back at the dual-core and quad-core evolutionary steps, we have found two main problems that halted or hindered early adoption of multi-thread programming. These also limited performance potential and throughput to something much less than the machine itself was capable of.
First, many applications use algorithms which do not work for multi-thread processing. This immediate stumbling block, where A must be computed before B can be processed, removes most of the advantage a multi-thread programming model might ever offer to those applications. Even so, nearly all applications can, in at least some way, take advantage of parallel processing.
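The "A before B" problem can be sketched in a few lines of Python. The function names and the choice of hashing as the workload are our own illustrative assumptions, not anything from Intel. In the first function every round needs the output of the round before it, so extra threads have nothing to do. In the second, every block is independent and can be handed to a pool of workers.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_chain(seed: bytes, rounds: int) -> bytes:
    """Serial dependency: round B cannot start until round A has finished."""
    digest = seed
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest


def hash_one(block: bytes) -> bytes:
    """Independent work: no block depends on any other block."""
    return hashlib.sha256(block).digest()


def hash_blocks_in_parallel(blocks, workers=4):
    """Hand independent blocks to a pool of worker threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(hash_one, blocks))
```

Adding cores does nothing for the first function; the second can be spread across as many workers as there are blocks.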
However, the reality is that if multi-thread programming were easier to use and understand, then it might already be incorporated into even those applications which won't see much benefit. This would be true just because those resources are physically there and, in the case of being globally accessible, would be easy enough to use.
Second, and probably much more commonly, no one gave any thought to multi-threading when the software was originally developed. We all were using single-core processors until a few years ago. The goal of developing software might only have been to get it to work. In those cases, no concern was given to high efficiency, let alone multiple cores.
In addition, the performance benefits of re-engineering existing software, especially programs that already work properly, might not justify the expense of doing so. There are undoubtedly users who would benefit, but if there is no real financial incentive to port a functioning application to multiple cores via a multi-threaded model, then why do it?
But the realization that third-party processors are becoming more and more available is growing. The inertial mindsets of the past are slowly fading away as benchmark data for multi-thread apps is seen more often. For example, ATI's CTM/Stream Processing and Nvidia's CUDA both show a future wide open with potential due to their massively parallel floating point abilities. When they are added to existing software encoders, for example, performance increases of several hundred percent are a common sight. And thanks to recent software libraries, those abilities are now exposed to the general developer and can be harnessed by CPU-based software for non-graphics processing. Still, they are highly specialized.
Projects like AMD's Torrenza initiative demonstrate exactly how great the need for high-performance, efficient multi-thread programming for heterogeneous processors will be. The truth is, it's not only becoming a reality, but in order to stay ahead in the game such models will soon be considered a necessity.
Recognizing these limitations has placed a demand upon the software community. While we do have the hardware resources available to carry out very efficient multi-thread programming on multi-core architectures, the reality is that developers themselves need a high level of skill to make it work. This often leaves a skillset gap between desires and practical realities. No efficient software solution has been proposed that allows the average higher-end developer to create code that maximizes the use of hardware facilities with minimal coding expense. Only when this problem is solved will the real benefits we're all hoping to see from multiple cores and parallel processing become a reality.
Intel’s solution: As good as it gets?
With today’s problems in mind, here is a potential answer Intel’s engineers have come up with. The hope is that their model will address the needs of software developers for future multi-threading tasks.
Intel’s proposed method of achieving efficient use of multi-core processors easily finds some parallels to virtualization. The proposed solution requires minimal OS support which makes the technology almost immediately available to any platform using the x86 processor. Something called an Exoskeleton is loaded at bootup. Once that's done, all facets of whatever co-processors are available will be directly exposed to the application programmers, and without application-specific runtime libraries or version conflicts.
So, you may wonder, is it a good solution? Does it address much with little?
Hong Wang, senior principal engineer with Intel’s Microarchitecture Research Lab, told us that the Exoskeleton employs technology which allows it to operate primarily outside of the OS. The Exoskeleton operates via opcodes inserted directly into the binary executable. In this way, coordinating between external accelerator resources and the main software program is handled directly by the CPU in its own native language - binary code.
Intel CPUs with this new ability will directly recognize those new opcodes. They will immediately instruct the accelerator to handle whatever is required. They do this via something called an Accelerator Exoskeleton software layer, which runs transparently to the OS, yet is visible to the application and communicates with the external resources.
This new ability allows whatever parallel resources there are to be visualized by the software developer as mere extensions to the CPU itself. There are no complex relationships between specialized pieces of accelerator hardware. It can be visualized in much the same way as the MMX or SSE engines. In fact, Wang told us these new accelerators will operate in the same virtual memory space each task is using in the CPU, thereby sharing resources.
According to Intel, this new technology will not use “escape sequences”. Some background: in the past, the FPU was not integrated into the CPU. In the 8086 through 80386 processors, and again with the 80486SX, the co-processor was implemented externally to the CPU and even had its own socket. As such, the CPUs of those days were sending every instruction byte to the FPU in parallel. Each time an escape sequence was encountered by the FPU, it began processing. At the same time, the CPU itself went into a wait state until the FPU was done.
This escape sequence methodology was obviously extremely inefficient. But the FPU was able to compute in hardware many hundreds of times faster than could be emulated by the integer-only CPU via software. So, even when the CPU sat there waiting for the FPU to complete its work, and even if it took hundreds of clock cycles (which it did), the end result was still much, much faster computing on 32-bit, 64-bit and 80-bit floating point as well as large integers.
The model still used by the FPU and CPU is called SISD, or “Single Instruction Single Data”. In such a model, only one instruction stream executes at any one time and on only one piece of data. Intel and AMD extended their processors with MMX, SSE/2/3/4 to include the SIMD model, which is “Single Instruction Multiple Data”. This model uses one instruction stream but physically processes multiple data items in parallel. This mild parallel example shows us why SSE2 apps are typically much faster than FPU-only apps. SSE2 can process up to four 32-bit floating point values simultaneously.
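To make the difference concrete, here is a conceptual sketch in Python. It is only a model of the two shapes of execution, with function names of our own invention; real SIMD happens in hardware through packed instructions, and Python still performs each addition one at a time. The step counter stands in for the number of instructions issued.

```python
LANES = 4  # four 32-bit floating point values per packed operation


def add_sisd(a, b):
    """One instruction, one data item per step."""
    out = []
    steps = 0
    for x, y in zip(a, b):
        out.append(x + y)
        steps += 1
    return out, steps


def add_simd(a, b, lanes=LANES):
    """One instruction applied to up to `lanes` data items per step.

    Assumes `a` and `b` have the same length.
    """
    out = []
    steps = 0
    for i in range(0, len(a), lanes):
        # In hardware this whole slice would be a single packed add.
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        steps += 1
    return out, steps
```

For eight pairs of values the first function counts eight steps and the second counts two, which is where the speedup comes from.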
Wang said that Intel’s new model will no longer have the CPU waiting. We'll now have what is called MIMD, or “Multiple Instruction Multiple Data”. This allows the CPU to "kick off" an external instruction stream, or even multiple streams, written in whatever form the accelerator requires. The CPU will then immediately continue on without missing a beat. This means the accelerator(s) and the CPU will be running different instruction streams at the same time, computing on multiple data items.
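The contrast between the old "wait for the co-processor" style and the "kick off and continue" style can be sketched with a worker thread standing in for the accelerator. This is our own illustration under that assumption, using Python's standard thread pool; it says nothing about how Intel's opcodes actually look, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def accelerator_stream(data):
    """Stand-in for work an accelerator would run on its own instruction stream."""
    return sum(x * x for x in data)


def cpu_stream(data):
    """Different work the CPU keeps doing in the meantime."""
    return max(data)


def run_blocking(data):
    """Old co-processor style: start the offloaded work and wait for it."""
    offloaded = accelerator_stream(data)   # nothing else happens until this returns
    local = cpu_stream(data)
    return offloaded, local


def run_mimd_style(data):
    """Kick off the offloaded stream, keep working, collect the result later."""
    with ThreadPoolExecutor(max_workers=1) as accelerator:
        pending = accelerator.submit(accelerator_stream, data)  # returns at once
        local = cpu_stream(data)           # the caller continues with its own stream
        offloaded = pending.result()       # wait only when the value is needed
    return offloaded, local
```

Both functions return the same values; the difference is that in the second one the two streams can be in flight at the same time.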
And as far as the OS is concerned, Intel has informed us that the OS's involvement with this entire new form of processing is minimal. It will require the barest minimum of saving and restoring context states when task switching. This could be handled in parallel outside of the CPU, thereby making it a cost-free operation, we learned.
All of these disclosed features answer the question above - yes, Intel's solution does do much with little. But what about flexibility?
Is Intel’s solution ready for the future?
The heterogeneous processors won't always be just GPUs. The reality is that future co-processors will take on many forms and process all kinds of data. In what manner does the Intel solution address these future processors - perhaps even those using the unknown technologies of tomorrow?
According to Wang, there are several ways the Exo-sequencer could be upgraded or altered to expose new technologies. The update would happen like this: The user would install a new accelerator card of some kind, upgrade or extend the Exo-sequencer's software layer by loading the non-OS specific portion, and then reboot to expose the new abilities to any application wishing to use them.
There should be plenty of flexibility in Intel’s technology as a result.
There are also questions surrounding the cost of this technology. Will it add too much expense to be useful? And what if it's not used? Will the extra cost in internal processing overhead slow down existing software?
Creating a virtualization engine involved very few technological difficulties, so it can reasonably be concluded that the new Exo-sequencer technology should be comparable. This is true largely because there isn't much new hardware required to coordinate external processors. All of the facilities to communicate with these resources are already available today in the CPU. It would only require that extended or virtualized communication abilities be added to what's already present in the CPU core itself. And everything added there would be for only one purpose: Software communication directly instructing external resources.
When you boil it all down, the only real new thing the CPU would do is allow the external resources to appear as mere extensions to the x86 ISA. That would, in fact, be its role in this parallel model – that of the coordinator.
There will also be hardware signaling available back to the CPU from the Exo-sequencer. As a result, multi-threaded components (like semaphores and critical section updates) would be available to software directly via hardware. This means the CPU could fire off some task to process and when that task completes it would automatically signal back. This takes some of the more difficult aspects of coordination out of multi-thread programming. By doing this transparently in hardware, it will greatly speed up coordination between multiple threads, especially when you compare the speed of immediate hardware responses to that of the CPU-driven software solutions in use today.
Looking at the potential new abilities, one could easily expect the added complexity to weigh heavily. However, the truth is that it is really the opposite. Nearly everything necessary to handle these new ideas is already there inside the CPU. It just needs to be used differently.
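The "fire off a task and get a signal back" pattern is what software does today with primitives such as events and semaphores. A minimal sketch of the software version, with a worker thread standing in for the accelerator and helper names that are purely our own, looks like this:

```python
import threading


def offload(task, args, done, result):
    """Run `task` on a worker and raise the `done` signal when it finishes."""
    def worker():
        try:
            result.append(task(*args))
        finally:
            done.set()             # completion signal back to the caller
    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def fire_and_wait_for_signal(task, *args):
    """Start `task`, then sleep until the completion signal arrives."""
    done = threading.Event()
    result = []
    thread = offload(task, args, done, result)
    # The caller is free to do other work here; it never polls the worker.
    done.wait()                    # wakes as soon as the signal is raised
    thread.join()
    return result[0] if result else None
```

Everything above runs in software on the CPU. The promise described in the article is that the equivalent of `done.set()` and `done.wait()` would be delivered by hardware instead.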
Intel has indicated there will be relatively simple extensions to the existing architecture. This should keep the cost factor relatively low. And if we also conclude that in the very near future we simply must have these new abilities to even keep pace with performance growth curves, then it becomes academic. The necessity will mandate that the cost be shared by all, which will also keep it low.
But what if these abilities are not utilized because accelerators are not available? Or what if the Exo-sequencer software layer is not turned on or installed? Will the system be slower in that case?
Intel's solution indicates that no practical software slowdown will be perceptible. Whenever something is virtualized there will almost always be new delays. Still, these delays should amount to individual clock cycles here and there over minutes or hours of processing. That amount of slowdown is something that might only be observed as a second or two loss over months of continuous use. Therefore, existing software should not run slower on an Intel CPU with these abilities than it would without these abilities.
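To put "a second or two over months" into perspective, the figure can be turned into clock cycles. The numbers used below (two seconds lost, a 90-day period and a 3 GHz clock) are our own assumptions for the sake of the arithmetic, not figures supplied by Intel.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def overhead_fraction(lost_seconds, period_days):
    """Share of total run time that is lost to the extra delays."""
    return lost_seconds / (period_days * SECONDS_PER_DAY)


def lost_cycles_per_second(lost_seconds, period_days, clock_hz):
    """Average number of clock cycles lost in each second of run time."""
    return overhead_fraction(lost_seconds, period_days) * clock_hz


# Assumed figures: 2 seconds lost over 90 days on a 3 GHz processor.
fraction = overhead_fraction(2, 90)
cycles = lost_cycles_per_second(2, 90, 3e9)
print(f"overhead: {fraction:.2e} of run time, about {cycles:.0f} cycles per second")
```

Under those assumptions the loss works out to roughly 770 cycles out of three billion every second, or about 0.00003% of the machine, which is far below anything a user could perceive.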
In addition, as far as the physical cost of manufacturing goes, we anticipate the die size growing only slightly to accommodate the new virtualization and logic. We would estimate an increase of less than 2%. This small increase comes because much of coordinating the workload will be handled by the Exo-sequencer software layer and not by new hardware.
All of this means that pretty much everything needed to expose Intel's new solution is already present. It will simply provide a way outside of the OS to make visible any accelerator resources which are present and accounted for. It should, therefore, be no more costly to the consumer, either in terms of dollars or performance hits regardless of whether or not the technology is used.
And finally, we need to look at the overall breadth of this solution. Are there alternative solutions which offer better pathways to multi-thread processing on multi-core architectures? Perhaps one which doesn't require new hardware?
The concept of multi-thread programming has been around for decades. The reality is that many software solutions exist today. There are many which can even utilize the hardware very efficiently. But it is extremely difficult to realize the full potential of multi-thread programming in multi-core architectures in a more global scope, especially those where heterogeneous cores are involved. It all stems back more to developer inabilities and software tools than any hardware limitations. That fact results in the largest chunk of expense seen in efficient multi-thread software today. The goals therefore of any solution hoping to tackle these problems should be to make it cost less. A solution must be efficient enough to make it worthwhile to include while also being easy enough to implement.
There are also hardware solutions which provide greater potential for parallel processing. Various processor models attempt to address these concerns by the way they expose their hardware. However, the value is not found in these specialized solutions because x86 processing is ubiquitous. In reality, what the designers of these x86-based parallel processing extensions are hoping for is to remove the need for specialized CPU hardware. They want a more common CPU, one which is less specialized to particular markets, but one which includes powerful abilities to coordinate new variable resources. Those variable resources, by the way, will likely accomplish the same performance goals as the more specialized solutions, but will do so with less expense and validation time. Plus they will have the opportunity to reach more potential users.
Software solutions also exist today in every popular operating system. Server versions of operating systems are often geared extensively toward the needs of multi-threading and multi-tasking, even going so far as to have special builds. Most modern software developers have at least casual knowledge of the nuances of multi-thread programming. Still, apart from the more intense, complex software-based solutions out there, there are no real efforts which operate at the hardware level with minimal regard to the OS to carry out these tasks. Every solution we see, no matter how pronounced its benefits might be, requires tight integration with either the OS, the application, or both. And such a solution automatically limits itself before it gets out of the gate.
What has been proposed by Intel is a solution which leverages the complete x86 base of existing software. It allows every application to extend its hardware abilities transparently to the OS. It incorporates a software layer which operates almost entirely outside of the OS's awareness, thereby allowing extended abilities to operate on any platform with only a small OS patch applied. The opcodes used to coordinate the extended hardware are incorporated directly into the binary executable. This will allow software developers to ship a single version of their program, one which incorporates both the accelerated and non-accelerated versions side-by-side with one install.
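Shipping both versions side by side usually comes down to a check at run time followed by a fallback. The sketch below shows that general pattern only. The capability check and the function names are hypothetical, since the article does not describe how an application would actually detect an accelerator.

```python
def scale_plain(values, factor):
    """Non-accelerated path: always available."""
    return [v * factor for v in values]


def scale_accelerated(values, factor):
    """Stand-in for the accelerated path; a real one would hand off to hardware."""
    return [v * factor for v in values]


def accelerator_present(capabilities):
    """Hypothetical capability check; the real mechanism is not described."""
    return "accelerator" in capabilities


def scale(values, factor, capabilities=frozenset()):
    """Pick the accelerated path when it is available, else fall back."""
    if accelerator_present(capabilities):
        return "accelerated", scale_accelerated(values, factor)
    return "plain", scale_plain(values, factor)
```

Calling `scale([1, 2], 3)` takes the plain path, while `scale([1, 2], 3, {"accelerator"})` takes the accelerated one. The results are identical, and the user installs one program either way.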
In conclusion, Intel's solution allows extension abilities to be added to any OS with minimal effort. Their claim of a learning curve of weeks instead of years seems justified. The new hardware compensates for many of the largest limitations seen today. Applications using these extended abilities need only include new opcodes directly in their own binary to begin using multi-thread programming and multi-core processing via multiple instruction streams on heterogeneous processors (say that three times fast :)).
The only real requirements are these: You must have an Intel CPU with this technology. You must have a version of BIOS which supports it. And a minor software patch will be required for the OS to handle the additional task switching overhead. If those factors are present, then any application taking appropriate advantage of accelerator extensions will provide performance speedups to the user.
Author Opinion: As good as it gets - or is there more?
Intel's Exoskeleton solution is certainly the most elegant we've seen. They've met the requirement of exposing accelerated abilities to as many software programs as possible with the absolute minimum OS intrusion. Their solution is the most universal and will likely become the de facto model for other x86-based hardware vendors, including AMD. In fact, we look for something similar to be announced by AMD for their Torrenza and Fusion technologies.
Still, there is one facet we would like to see added. Just as the hypervisor can work entirely outside of OS awareness, it would be very nice if these new hardware extensions were provided, possibly even optionally, as mere logical extensions to the CPU. In that model they would literally be completely OS-unaware and all software would operate in a new CPU state which simply executes software code to perform its new abilities. All hardware task switching would be handled directly by the Exo-sequencer software layer and not by the OS. It would be triggered from within the CPU by a new hardware signal which instructs all external accelerators that a hardware task switch is occurring.
This one small component would be the icing on what already appears to be an idea with a promising future.