Sunnyvale (CA) - AMD has unveiled a new supercomputing-targeted stream processor card that not only is the first of its kind to deliver double-precision capability, but also is tied to new low-level and high-level SDKs that promise to simplify general purpose GPU (GPGPU) application development.
AMD's new 64-bit product is a stream processing engine called FireStream 9170. It's built atop AMD's Close to Metal (CTM) technology with a software layer called CAL, or Compute Abstraction Layer. CAL takes the lower-level functionality and provides a way to use it for higher-level constructs, like 64-bit computing. AMD will be providing a kit and products in the first quarter of 2008. The card will retail for $1,999 and provide 500 gigaflops of throughput. It is built on a 55nm process and consumes 150 watts per card. Each card supports 2 GB of GDDR3 memory and DMA (Direct Memory Access, which allows the card itself to read from and write to memory without needing the CPU to copy data).
AMD has also committed to bringing developers additional tools which will help them write and debug usable code more quickly. These tools sit at a higher level of abstraction. The Brook Project, from Stanford University, for example, is now going to be marketed by AMD as Brook+. In addition, AMD is planning a new release of its AMD Core Math Library (ACML) with GPU-accelerated math functions built in. Its COBRA video library will also provide acceleration for video algorithms. AMD also announced that several third-party tools will provide support, including products from RapidMind and Microsoft.
Background on floating point
32-bit floating point values are highly desirable for many applications because they can be computed very quickly, typically in one clock cycle, with about 7 to 8 significant digits in base-10 math (the numbers we all use). Such math is more than adequate for 3D games and fast graphical algorithms, hence its wide use in graphics cards. Jumping to 64-bit math greatly increases computing time. 64 bits aren't necessarily required for graphics, as high frame rates are usually more desirable than perfect pictures per frame, and over several dozen frames of movement the limitations of 32-bit computations will be averaged out by the eye. Still, 64-bit computations are notably more accurate.
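To make the precision gap concrete, here is a minimal Python sketch. Python's own float is a 64-bit double, so the helper to_float32 (a hypothetical name used only for illustration, not part of any AMD SDK) round-trips a value through the standard struct module to imitate 32-bit storage. The step size and the number of additions are arbitrary choices, not figures from AMD.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step: float, count: int, single: bool) -> float:
    """Add `step` to a running total `count` times.

    When `single` is True, every intermediate total is rounded to
    32-bit precision, imitating a single-precision accumulator.
    """
    total = 0.0
    for _ in range(count):
        total += step
        if single:
            total = to_float32(total)
    return total


if __name__ == "__main__":
    print("0.1 stored in 32 bits:", repr(to_float32(0.1)))
    print("0.1 stored in 64 bits:", repr(0.1))
    # The exact answer for both sums below is 10000.
    print("32-bit running sum:", accumulate(to_float32(0.1), 100_000, single=True))
    print("64-bit running sum:", accumulate(0.1, 100_000, single=False))
```

Neither sum lands exactly on 10000, but the 32-bit total drifts much further from it than the 64-bit one. That kind of accumulated error is harmless in a single rendered frame and far less welcome in a long-running scientific computation.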
x86 FPU
There are three basic types of floating point formats supported in x86's FPU. These are 32-bit single precision (7-8 significant digits), 64-bit double precision (15-16 significant digits), and what is called 80-bit extended double precision (18-19 significant digits). 80-bit precision is only available on the FPU, and the engine actually uses a larger-than-80-bit internal representation, making computations very accurate up to about 19 significant digits. The FPU is seldom used for new code today because its architecture is somewhat antiquated and clunky compared to the streaming SIMD instructions present in today's ISA extensions, like MMX and SSE, SSE2, SSE3, SSE4 and SSE5.
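The digit counts for the three formats follow from how many binary digits each one keeps in its significand. As a rough sketch, the Python snippet below converts bits to decimal digits by multiplying by log10(2). The significand widths (24, 53 and 64 bits) are taken from the IEEE 754 and x87 format definitions rather than from AMD's announcement, and decimal_digits is a hypothetical helper name.

```python
import math

# Significand widths in bits, counting the leading integer bit.
# Assumption: values taken from the IEEE 754 / x87 format definitions.
SIGNIFICAND_BITS = {
    "single (32-bit)": 24,
    "double (64-bit)": 53,
    "extended (80-bit)": 64,
}


def decimal_digits(bits: int) -> float:
    """Approximate how many decimal digits `bits` binary digits can carry."""
    return bits * math.log10(2)


if __name__ == "__main__":
    for name, bits in SIGNIFICAND_BITS.items():
        digits = decimal_digits(bits)
        print(f"{name}: {bits} significand bits, about {digits:.1f} decimal digits")
```

The results are approximations, which is why precision is usually quoted as a range of digits rather than a single number.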
When the FPU was originally designed for the PC, it was built to solve a different kind of problem than today's streaming extensions. The original FPU operated on a machine where the need for parallel computing wasn't really a priority. It needed to crunch the kind of data that was required at that time quickly, using high precision. However, cost was a factor. When the FPU was engineered, computers were not cheap. They would be the equivalent of $20,000 or more today. As such, the FPU designers built their product as a co-processor add-on, something that could be purchased separately, the 8087 math co-processor chip.
The 8087 chip ran alongside the 8086 CPU. It physically sat in its own socket and monitored the CPU's instruction stream. It decoded every instruction just like the CPU. Whenever it found a CPU instruction it simply waited for the CPU to carry out its work and continue to the next instruction. Whenever it found an FPU instruction it carried out its own work while the CPU waited. In this way, both were working in harmony.
It wasn't until the 80486DX that the FPU was brought onto the same die. Until then, the math co-processor evolved from the 8087 to the 80287 and 80387. The 80486SX could use an 80487SX math co-processor, which was really an 80486DX with a different pinout. It had the consumer-annoying trait of disabling the 80486SX whenever it was present, and actually carrying out all of the CPU and FPU work in the other socket. Ever since the 80486DX the FPU has been on the same die as the CPU. This has introduced several speedups and made the cost of ownership notably lower.
Because the original 8087 was a separate chip, and because the 8086 had a need to provide floating point math abilities even though it was only an integer engine, the 8086 designers introduced a "math co-processor not present" interrupt. This interrupt would halt the CPU's execution of the normal software program whenever a floating point math instruction was found. It would branch to some appropriate code whereby the math abilities were emulated. While this worked, and was ultimately used, by the way, to find the original Pentium FDIV bug, it was very slow, sometimes on the order of 10x to 100x slower than dedicated silicon.
Still, some needs exist today for 80-bit math. Most modern compilers support only 32-bit and 64-bit (single and double) floating point formats because these are the only ones typically supported in other hardware. Compiler writers like to keep things simple, so they make a compiler engine which targets all of the available platforms they're after, leaving out powerful features like the 80-bit floating point format in the FPU. Some compilers still support it, and all modern x86 PCs still support the 80-bit format, though there have been some speedup considerations given over time. For example, the Pentium 4 introduced some legacy support bits which, if enabled, provided full legacy support so the FPU would act just the way it used to. However, if they were disabled (which they were by default from the factory), then the FPU could process data more quickly. This caused some problems for DOS programmers who relied on the legacy behavior in their programs. It required them to alter the state of the FPU to enable the legacy support bits.
The FPU has typically been viewed as a very difficult engine to program. It was not built to allow direct access to data items. Instead, everything operates through something called a stack. You push data onto the stack, let the FPU compute it, and then pop it back off. This requires some careful analysis to determine which operands, results and data items will be in which registers during various operations. The FPU comes with POP and non-POP instructions. These allow the programmer to work more efficiently with the FPU's stack nature.
An example could be envisioned as a two-drawer filing cabinet. We'll call them drawers 0 and 1. If a value is loaded onto the FPU stack, it's like putting it into drawer 0. If another value is loaded onto the stack, this is where it gets tricky. The contents of drawer 0 are moved to drawer 1, while the new value is put in drawer 0. So, if we loaded 3 and then 5, drawer 1 would hold 3 and drawer 0 would hold 5. Confused yet? Now, we do a math operation. If we add the two values together with a non-POP operation, then we could have 8 and 5, or 8 and 3, depending on which drawer receives the result. If we used a POP operation, then we'll end up with 8 in drawer 0 and nothing in drawer 1.
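The filing cabinet can be sketched as a toy Python model. This is an illustration of the drawer analogy only, not an emulator: the class and method names (DrawerStack, load, add, add_pop) are hypothetical, and the sketch does not enforce the limit of eight registers that the real x87 stack has.

```python
class DrawerStack:
    """Toy model of the FPU register stack; index 0 is drawer 0 (the top)."""

    def __init__(self):
        self.drawers = []

    def load(self, value):
        """Push a value: existing drawers shift down, the new value lands in drawer 0."""
        self.drawers.insert(0, value)

    def add(self, dest, src):
        """Non-POP add: drawer[dest] receives drawer[dest] + drawer[src]; nothing is removed."""
        self.drawers[dest] += self.drawers[src]

    def add_pop(self):
        """POP add: sum drawers 0 and 1 into drawer 1, then discard drawer 0,
        so the result ends up as the new drawer 0."""
        self.drawers[1] += self.drawers[0]
        self.drawers.pop(0)


def loaded():
    """Return a stack after loading 3 and then 5."""
    stack = DrawerStack()
    stack.load(3.0)
    stack.load(5.0)
    return stack


if __name__ == "__main__":
    s = loaded()
    print("after loads:           ", s.drawers)

    s = loaded()
    s.add(dest=0, src=1)
    print("non-POP add to drawer 0:", s.drawers)

    s = loaded()
    s.add(dest=1, src=0)
    print("non-POP add to drawer 1:", s.drawers)

    s = loaded()
    s.add_pop()
    print("POP add:                ", s.drawers)
```

The two non-POP variants are what leave "8 and 3" or "8 and 5" behind, while the POP variant leaves a single 8 in drawer 0.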
This kind of programming model has caused many developers to look for higher-level C or C++ libraries which allow them to work with symbols and let the compiler handle the stack.




