Follow-up: Has Intel found the key to unlock supercomputing powers on the desktop?
Opinion – Last week we posted an article entitled “Analysis: Has Intel found the key to unlock supercomputing powers on the desktop?” in which I discussed several facets of a potential Intel technology without going into too many of the technical details. At the time of our posting, Intel had not yet publicly released the paper describing the technology, which prevented me from going into certain details. However, you can now find published papers using a model Intel is calling “EXOCHI” or Exoskeleton Sequencer C for Heterogeneous Integration. For example, here's a good link that takes you directly to a page with a PDF written by Perry H. Wang, et al, from Intel.
This PDF is available for download and explains the inner workings of EXOCHI and will answer many of the questions raised in the reader comments. But I have taken some time to address some of the key concerns raised in our original article.
One of the strongest concerns in the reader responses seemed to be that Intel is in some way trying to undo the OS industry, or to introduce something new which operates outside of the OS. While the Exoskeleton software layer definitely exists outside of the OS and would, by acting as an executive layer of sorts, operate almost entirely outside of OS control or driver library support, the OS interface is still a requirement. While I personally believe a strictly non-OS approach would be a great solution, it's just not the one Intel is offering.
What Intel would like to do is make the process of using heterogeneous cores as painless as possible for all involved. The company’s solution includes a software model that will target as many operating systems as possible, but without tightly bound OS drivers or requirements outside of minor patches to increase the amount of data captured during a task switch.
A picture included in Intel’s paper demonstrates that a software layer will still be required. However, if you look closely, it is OS independent in both operation and function. It will provide a single binary that will communicate with the application running within the OS for all OS-related service requests, thereby requiring only that the application have special knowledge of the Exoskeleton software layer, not the OS.
As you can see, the EXOCHI model removes the need for OS-coupled device drivers. It still allows a CHI runtime library to exist and be linked to your application. To allow for a more traditional approach when creating applications for Exo-sequencers, a software developer would write code against the runtime library. That runtime library would then, in turn, handle all of the actual instrumentation and stream scheduling.
To the applications programmer, the new Exoskeleton software layer will be a black box with only the API provided by the CHI, should it be used. The exposed API can, therefore, be nearly as straightforward as it is today. The only real difference is that it won't have the OS dependency or the abstraction layers seen in today's GPGPU model. This simplifies things greatly for the developer while targeting as many operating systems as possible right out of the gate.
My personal opinion is that this could turn out to be a brilliant move by Intel, and one which keeps any new hardware facilities extremely close to the hardware, but with the added flexibility of being implemented as a software layer that is not OS dependent. This speaks to one of the proposed technology's greatest strengths.
Another common response was that the real problem, that of software models and developer skill sets, isn't addressed. Several comments indicated these are the real issues with multi-thread programming and that nothing Intel is adding will solve those current problems.
Today's multi-thread programming is extremely OS dependent. The application starts a new thread and, depending on the platform the OS is running on, the OS will either schedule the new thread in the application's allotted time slice, or allow it to run on a new core, or some combination thereof. All of this is required because we're working on homogeneous cores where each core can only do one thing at a time.
With the Exoskeleton software layer and the Exo-sequencers, we will now have the ability to have many instruction streams running at the same time. The advantage of not having strong OS dependency is that the OS is already burdened enough with task scheduling on homogeneous cores. If it were to attempt to schedule tasks on heterogeneous cores, the result would be a much more complex tasking model for every OS.
Intel's solution addresses that weakness by providing to the application developer a model which would allow them to schedule threads themselves, without OS support, and with a much smaller learning curve due to the strength of Intel's EXOCHI tools in the Exoskeleton software layer and Exo-sequencer hardware layers.
The only thing the OS has to worry about is storing some additional task switching information. This results only in a slightly larger memory block being switched out each time a task switch occurs. The result is a software model which, so far as the main OS is concerned, is still a single-threaded app (or, if it's already multi-threaded, then it's a multi-threaded app). Any new threads launched on the Exo-sequencers happen without the OS knowing about it. There is still an OS layer, of sorts, which is not the main OS. It is the Exoskeleton software layer's communication protocols with the calling apps so that multiple tasks, multiple threads and multiple callers are all handled correctly.
All of this means that what Intel has done is basically introduce a new OS which is transparent and serves only one function, no matter what platform it's running on.
It does, via software, what the OS would otherwise have to do in a specific-to-every-OS model, though it does it one time for all by having a software layer which exists closer to hardware than the OS. This benefit cannot be emphasized enough. Intel is offering a way to create multiple threads on disparate pieces of hardware outside of OS awareness. If your application, for example, compiles correctly, runs under Windows and uses the Exo-sequencers properly, then the same code will immediately port to Mac OS X, Linux, UNIX, Solaris, or anything else that runs on x86, because everything is specific only to your app and the x86 hardware layer it's running on. From a straightforward processing point of view (one outside of GUI requirements, for example), it will not require any changes. The software you write once, once recompiled and put into the appropriate binary form, will work on all x86-based platforms.
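To make the idea of scheduling work "outside of OS awareness" a little more concrete, here is a minimal sketch of an application-owned dispatcher. It is only a toy model under my own assumptions: the names `ExoSequencer` and `UserLevelScheduler` are hypothetical, they are not part of Intel's CHI runtime, and the "sequencers" here simply run Python callables one after another rather than in parallel. The point is only that the dispatch decision lives inside the application, while the OS would see nothing more than one ordinary process.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List, Tuple


@dataclass
class ExoSequencer:
    """Hypothetical stand-in for one accelerator instruction stream."""
    name: str
    completed: List[Tuple[str, int]] = field(default_factory=list)

    def run(self, label: str, work: Callable[[], int]) -> None:
        # Real hardware would execute its own ISA; this toy just calls Python.
        self.completed.append((label, work()))


class UserLevelScheduler:
    """Application-owned scheduler (illustrative, not Intel's runtime)."""

    def __init__(self, sequencers: List[ExoSequencer]) -> None:
        self.sequencers = sequencers
        self.pending: Deque[Tuple[str, Callable[[], int]]] = deque()

    def submit(self, label: str, work: Callable[[], int]) -> None:
        self.pending.append((label, work))

    def drain(self) -> None:
        # Round-robin dispatch, decided entirely inside the application.
        index = 0
        while self.pending:
            label, work = self.pending.popleft()
            self.sequencers[index % len(self.sequencers)].run(label, work)
            index += 1


def demo() -> Dict[str, List[Tuple[str, int]]]:
    sequencers = [ExoSequencer("exo0"), ExoSequencer("exo1")]
    scheduler = UserLevelScheduler(sequencers)
    for n in range(4):
        # Bind n as a default argument so each work item keeps its own value.
        scheduler.submit(f"block{n}", lambda n=n: n * n)
    scheduler.drain()
    return {s.name: s.completed for s in sequencers}
```

Calling `demo()` spreads four small work items across two simulated sequencers without ever asking the OS for a new thread, which is the property the EXOCHI model is after.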
This new ability means that software developers will see a notably smaller learning curve. Developers will no longer have to turn to OS-specific models, or books on theory only to then apply them to specific OSes. They will now be able to target the hardware itself from their point of view. While it will physically be implemented in a more virtual manner, from the developer’s point of view they now only have one thing to address to use any added Exo-sequencer. The developers of today who can look at the x86 ISA and use it will be able to write impressively engineered multi-thread code which can take advantage of disparate hardware solutions.
The software knowledge requirements will still be there, but the target for understanding will be much smaller and much easier to code for. And therein lies its strength.
Some posters asked what problem this solution really solves. How would its availability in software provide real speedups for common operations like encoding and decoding? I believe the MIMD (Multiple Instruction Multiple Data) model would address that. Accelerator hardware could be used simultaneously with the CPU-based portion of a codec. This would allow workloads to be “shunted” to the accelerator hardware, thereby utilizing high-speed resources for computation while minimizing the CPU requirements.
Intel's EXOCHI paper by Wang, et al, reported speedups of 1.41x to 10.97x (141% to 1097% of CPU-only performance) in video and image processing tools. While this kind of integration is possible today using current software models, the limitations of this technology again stem back to driver support. With Intel's proposed solution, the ability to include all required technology directly within x86-based binary executables results in a single-source solution which will operate on any platform.
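A rough way to reason about what "shunting" part of a codec to an accelerator buys you is Amdahl's law: only the offloaded fraction gets faster, so the overall gain depends on how much of the work can actually move. The sketch below is a back-of-the-envelope calculation, not something taken from Intel's paper. The 80% offloaded fraction is a made-up figure for illustration, and I am reading the quoted 141% and 1097% as 1.41x and 10.97x multipliers on the offloaded kernel.

```python
def overall_speedup(offloaded_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only part of the work is accelerated."""
    if not 0.0 <= offloaded_fraction <= 1.0:
        raise ValueError("offloaded_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining_on_cpu = 1.0 - offloaded_fraction
    return 1.0 / (remaining_on_cpu + offloaded_fraction / kernel_speedup)


if __name__ == "__main__":
    # Hypothetical: 80% of a codec's run time is spent in offloadable kernels.
    for kernel_speedup in (1.41, 10.97):
        total = overall_speedup(0.8, kernel_speedup)
        print(f"kernel {kernel_speedup:.2f}x -> whole program {total:.2f}x")
```

The takeaway is that the headline kernel numbers are an upper bound. Whatever stays on the CPU still limits the end-to-end result, which is why keeping the CPU and accelerator busy at the same time matters.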
Intel's EXOCHI PDF lays down a solid explanation for the benefits and pitfalls of the EXOCHI model. It introduces or extends some existing basic concepts which allow the heterogeneous nature of disparate cores to work easily within the x86 ISA and with minimal OS intrusion. These are briefly explained here. For a full understanding, please refer to the 10-page PDF file, section 3.0.
First is the Exoskeleton. The Exoskeleton is a type of hardware wrapper which enables x86 to work with an accelerator that uses a different internal architecture or ISA. This allows the accelerator to communicate back and forth with the x86 CPU via various instructions. The advantage here is that it's done directly by the application and not through OS service requests.
Next is an ability which enables the accelerators to process data in the same blocks of memory which the CPU itself might also be working on simultaneously. This is the Address Translation Remapping (ATR) mechanism. It allows shared virtual memory to be mapped correctly to physical memory via a translation mechanism between the CPU's Translation Lookaside Buffers (TLBs) and the accelerator. The mechanisms which keep the virtual addresses in sync are designed to work correctly. However, there are no mechanisms for cache coherency between the accelerator and the x86 CPU. It is still the responsibility of the application developer to maintain cache coherency in critical sections or wherever cache coherency might become an issue. I believe mechanisms which address this will eventually be present in the architecture, though not initially, due primarily to development and testing time.
Lastly, we have Collaborative Exception Handling (CEH). With CEH the main x86 CPU will receive and process all interrupts caused by the accelerator hardware. This allows any fault occurring on the accelerator to be directed back to the x86 CPU for proper handling. The mechanics of how this operates are similar in scope to exception handling models today. The primary difference is that any replaying of the faulting instruction is handled by proxy through the CEH module from the x86 CPU's exception handler algorithms.
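As a way of picturing the remapping step, here is a small toy model. It assumes, purely for illustration, a flat CPU page table held in a dictionary, a 4 KB page size and an invented accelerator TLB entry format. None of these details come from Intel's paper, and the class name `AddressTranslationRemapper` is hypothetical. On an accelerator TLB miss, the translation is looked up on the CPU side, rewritten into the accelerator's format and cached. Cache coherency is deliberately not modeled, since that remains the application's job.

```python
from typing import Dict

PAGE_SIZE = 4096  # assumed page size for this sketch


class AddressTranslationRemapper:
    """Toy model: fill an accelerator TLB from the CPU's page table on a miss."""

    def __init__(self, cpu_page_table: Dict[int, int]) -> None:
        # Maps virtual page number -> physical frame number (CPU's view).
        self.cpu_page_table = cpu_page_table
        # Accelerator-side cache of translations, in a made-up entry format.
        self.accel_tlb: Dict[int, Dict[str, int]] = {}
        self.misses = 0

    def translate(self, virtual_address: int) -> int:
        page, offset = divmod(virtual_address, PAGE_SIZE)
        entry = self.accel_tlb.get(page)
        if entry is None:
            self.misses += 1
            # A KeyError here stands in for a fault on an unmapped page.
            frame = self.cpu_page_table[page]
            # "Remap" the CPU translation into the accelerator's entry format.
            entry = {"frame": frame, "valid": 1}
            self.accel_tlb[page] = entry
        return entry["frame"] * PAGE_SIZE + offset
```

Because both sides resolve the same virtual address to the same physical frame, the CPU and the accelerator can pass ordinary pointers to each other instead of copying buffers back and forth.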
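The fault-then-replay flow can be sketched in a few lines. This is an illustrative simulation only: `Accelerator`, `cpu_proxy_handler` and `run_with_ceh` are hypothetical names of my own, and a missing page stands in for whatever fault real hardware might raise. The accelerator hits a fault, the CPU-side handler services it on the accelerator's behalf, and the faulting operation is then replayed.

```python
from typing import List, Set


class AcceleratorFault(Exception):
    """Raised by the simulated accelerator when it touches an unmapped page."""

    def __init__(self, page: int) -> None:
        super().__init__(f"fault on page {page}")
        self.page = page


class Accelerator:
    def __init__(self) -> None:
        self.mapped_pages: Set[int] = set()

    def load(self, page: int) -> str:
        if page not in self.mapped_pages:
            raise AcceleratorFault(page)
        return f"data@{page}"


def cpu_proxy_handler(accel: Accelerator, fault: AcceleratorFault, log: List[str]) -> None:
    # The CPU services the fault by proxy, here by mapping the missing page.
    log.append(f"cpu handled fault on page {fault.page}")
    accel.mapped_pages.add(fault.page)


def run_with_ceh(accel: Accelerator, page: int, log: List[str], max_replays: int = 1) -> str:
    for attempt in range(max_replays + 1):
        try:
            return accel.load(page)  # first try, or a replay after handling
        except AcceleratorFault as fault:
            if attempt == max_replays:
                raise  # give up rather than replaying forever
            cpu_proxy_handler(accel, fault, log)
    raise RuntimeError("unreachable")
```

The replay limit is a design choice for the sketch. It simply keeps a fault that the handler cannot fix from looping indefinitely.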
The overview of this communication between the x86 CPUs and accelerators is shown here:
In closing, the information Intel has now released publicly about this technology answers a lot of questions. It also raises a few more.
Intel has been able to demonstrate a working prototype using Core 2 processors coupled to a Graphics Media Accelerator X3000. Their tests showed speedups on video and image processing ranging from a minimum of 1.41x (a 41% improvement) to a maximum of 10.97x. These tests were conducted on non-integrated hardware which was emulating or mimicking the abilities EXOCHI will finally see, if implemented in hardware. As a result, we should see even greater speedup potential for all kinds of graphics-based algorithms, heavy FP computational algorithms and any workload which can be broken down into parallel pieces.