Santa Clara (CA) – In a world dominated by multi-threaded ambitions and, all too often, single-threaded limitations, hardware advancements can make the biggest difference in performance. AMD has released a new extension to x86 hoping to address at least part of that. Dubbed SSE5, this newest generation adds power to x86 by introducing a whole new instruction class as well as powerful multiply-accumulate instructions. Both of these advancements should deliver notable savings in compute time.
SSE5 stands for “Streaming SIMD Extensions version 5”. SIMD is a compute philosophy that differs greatly from the rest of the x86 engine. When x86 was first created, it operated on what’s called the SISD model, which means “Single Instruction, Single Data”: the computer would execute one instruction on one piece of data. SIMD extended that model by allowing a single instruction to compute on more than one piece of data at the same time. It does this in parallel, allowing 2, 4, 8 or 16 computations to be carried out where only one was possible before.
As you might guess, SIMD stands for “Single Instruction, Multiple Data”. It relates to the concept of packed values: several small operands stored side by side in a single wide register, and SIMD supports a wide range of data types packed this way.
SIMD was first introduced for integers only with MMX. It was then extended to 32-bit floating-point values with SSE. SSE2 brought 64-bit floating-point abilities and more parallel 32-bit operations. SSE3, SSSE3 and SSE4 all brought additional and/or wider compute abilities.
The entire SIMD engine today is very wide and capable. Integer operands can be 8, 16, 32, 64 or 128 bits wide, allowing 16, 8, 4, 2 or 1 simultaneous parallel operations in a 128-bit register, respectively. Floating-point operands are either 32 or 64 bits wide, allowing 4 or 2 parallel operations, respectively.
The concept of horizontal instructions was also added with SSE3. Where ordinary SIMD instructions operate vertically, combining lane 0 of one register with lane 0 of another, horizontal instructions combine neighboring elements within a register.
The Floating Point Unit (FPU) of the x86 architecture also allows for 80-bit floating-point values and is “almost” fully IEEE-754 compliant: the FPU tries to maintain additional accuracy by not rounding values internally until data is stored. While this extra precision might actually be desirable for many computations, it does not behave predictably when compared to architectures that are fully IEEE-754 compliant. This reality has forced compiler writers to introduce flags which, on x86, store and re-load values in the middle of computations to ensure rounding is correct, steps that would be unnecessary on other architectures.
The SIMD engines in MMX and SSE/2/3/4/5 were built with somewhat different design goals than the FPU. One reality of packed computation is that results often land in overflow or underflow conditions where the exact result cannot be stored. Different modes were therefore introduced to either wrap the result around or saturate it at its maximum or minimum value when overflow or underflow occurs. Three ways to handle such operations exist: wrap-around, signed saturation and unsigned saturation.
For example, if two 8-bit values of 200 each were added together, the result would be 400. That’s too big to fit in a single 8-bit destination, which can only hold a maximum of 255. So the SIMD saturation engine kicks in and stores the maximum allowable result of 255. The rest of the x86 engine would handle an addition of 200 + 200 differently: it would set the carry flag and store only the last 8 bits (400 mod 256 = 144). Saturation allows very fast parallel compute operations, but it is not exact. For operations where saturation would lose needed precision, the next largest operand size must be used (such as 16 bits instead of 8 bits for these computations).
Both AMD and Intel look primarily at future software needs when considering which way to move with hardware advancements, and recognizing that future software will benefit from parallel operations is paramount. AMD is targeting compute-intensive, multimedia and security applications with SSE5, and is pursuing wide industry adoption through many software vendors. Full tool support is expected to be available in 2008, including a fully-supported GCC compiler.
What you can expect from SSE5
All of the existing SIMD instructions (from MMX and SSE/2/3/4) involve at most two operands and some operation, such as add, subtract, multiply and divide, as well as logical operations like AND, OR and XOR. AMD’s new class of operations involves three input values, with the third also serving as the destination. These relate to permutation, conditional move, vector compare/test, as well as precision control, rounding and conversion instructions.
In one example given by AMD, the addition of a third operand reduced the required number of instructions for a common 4×4 matrix multiply by over 40%. The same work is completed under the new model, but the third operand allows the output of a computation to be directed somewhere new. This speeds processing up, as data no longer has to be shuffled around between registers in between operations.
SSE5 also introduces new multiply-accumulate abilities for both floating-point and integer data. A multiply-accumulate instruction takes what would ordinarily be two operations, a multiply followed by an add, and combines them into a single operation. One computer instruction, more than one thing happening.
All told, 46 new compute abilities are introduced with SSE5. These account for a total of 170 new computer instructions once you consider all operand sizes, plus the variants which read from and write to memory as well as the internal registers.
These new SSE5 extensions will not be seen in silicon until AMD’s upcoming Bulldozer microprocessor, which is due out in 2009. These extensions will also not be present in Bobcat, AMD’s low-wattage version of Bulldozer, due out about the same time. It’s somewhat of an enigma to us why AMD has released these specs at this time, close to two years before silicon is actually available for general use.
New Tools – SimNow, CodeAnalyst and AMD Core Math Library
AMD has released its SimNow emulation software – the firm’s support software for any early adopters – with updates to support SSE5. SimNow requires an AMD processor, but it runs an emulated environment on either Linux or Windows. It is useful for writing and testing code as the SimNow engine is expected to behave identically to the processor when it is released. SimNow is not a suitable replacement for a real processor though, as it is emulated and notably slower. It is a good platform for development, but not for production.
AMD also released its CodeAnalyst software, which includes SSE5 support. This program is part of a suite of tools designed to help software developers find trouble spots in their programs. It looks at the code and, knowing what it does about the way AMD64 processors compute, identifies places where operations might be arranged slightly differently for greater performance. Code analysis tools like these can be invaluable to the programmer. There is a 5% / 95% rule in programming, which says that only 5% of a program is responsible for 95% of its computing time. If code analysis tools can make that 5% even 1% faster, it makes a big enough difference to be worthwhile.
AMD also released its updated Core Math Library, which includes support for SSE5 instructions. Using this library together with the SimNow engine and the CodeAnalyst analyzer, better and faster ways of carrying out the operations developers need today can be tested, optimized and made ready for the day when AMD processors are actually released with this technology in silicon.
Programmers can download these new tools from www.developer.amd.com/sse5. Support is provided for RedHat and SuSE Linux with either a 2.5 or 2.6 kernel, as well as for 32-bit and 64-bit versions of Windows.
SSE5 extends performance boundaries with new, combined and three-operand instructions. AMD’s efforts in the x86 arena reveal a very clear focus and intent: AMD is looking at the needs of the software industry and bringing forth hardware which addresses many of those needs. AMD’s Light-Weight Profiling (LWP) initiative is another example, providing useful tools via hardware. LWP gives software developers a way to learn things they might not otherwise be able to know, and without custom-developed, complex and costly runtime analysis add-ons.