How to program unreliable chips
As transistors get smaller, they also become less reliable. So far, computer-chip designers have been able to work around that problem, but in the future, it could mean that computers stop improving at the rate we’ve come to expect.
A third possibility, which some researchers have begun to float, is that we could simply let our computers make more mistakes. If, for instance, a few pixels in each frame of a high-definition video are improperly decoded, viewers probably won’t notice — but relaxing the requirement of perfect decoding could yield gains in speed or energy efficiency.
In anticipation of the dawning age of unreliable chips, Martin Rinard’s research group at MIT’s Computer Science and Artificial Intelligence Laboratory has developed a new programming framework that enables software developers to specify when errors may be tolerable. The system then calculates the probability that the software will perform as it’s intended.
“If the hardware really is going to stop working, this is a pretty big deal for computer science,” says Rinard, a professor in the Department of Electrical Engineering and Computer Science. “Rather than making it a problem, we’d like to make it an opportunity. What we have here is a … system that lets you reason about the effect of this potential unreliability on your program.”
Last week, two graduate students in Rinard’s group, Michael Carbin and Sasa Misailovic, presented the new system at the Association for Computing Machinery’s Object-Oriented Programming, Systems, Languages and Applications conference, where their paper, co-authored with Rinard, won one a best-paper award.
On the dot
The researchers’ system, which they’ve dubbed Rely, begins with a specification of the hardware on which a program is intended to run. That specification includes the expected failure rates of individual low-level instructions, such as the addition, multiplication, or comparison of two values. In its current version, Rely assumes that the hardware also has a failure-free mode of operation — one that might require slower execution or higher power consumption.
A developer who thinks that a particular program instruction can tolerate a little error simply adds a period — a “dot,” in programmers’ parlance — to the appropriate line of code. So the instruction “total = total + new_value” becomes “total = total +. new_value.” Where Rely encounters that telltale dot, it knows to evaluate the program’s execution using the failure rates in the specification. Otherwise, it assumes that the instruction needs to be executed properly.
Compilers — applications that convert instructions written in high-level programming languages like C or Java into low-level instructions intelligible to computers — typically produce what’s called an “intermediate representation,” a generic low-level program description that can be straightforwardly mapped onto the instruction set specific to any given chip. Rely simply steps through the intermediate representation, folding the probability that each instruction will yield the right answer into an estimation of the overall variability of the program’s output.
“One thing you can have in programs is different paths that are due to conditionals,” Misailovic says. “When we statically analyze the program, we want to make sure that we cover all the bases. When you get the variability for a function, this will be the variability of the least-reliable path.”
“There’s a fair amount of sophisticated reasoning that has to go into this because of these kind of factors,” Rinard adds. “It’s the difference between reasoning about any specific execution of the program where you’ve just got one single trace and all possible executions of the program.”
The researchers tested their system on several benchmark programs standard in the field, using a range of theoretically predicted failure rates. “We went through the literature and found the numbers that people claimed for existing designs,” Carbin says.
With the existing version of Rely, a programmer who finds that permitting a few errors yields an unacceptably low probability of success can go back and tinker with his or her code, removing dots here and there and adding them elsewhere. Re-evaluating the code, the researchers say, generally takes no more than a few seconds.
But in ongoing work, they’re trying to develop a version of the system that allows the programmer to simply specify the accepted failure rate for whole blocks of code: say, pixels in a frame of video need to be decoded with 97 percent reliability. The system would then go through and automatically determine how the code should be modified to both meet those requirements and maximize either power savings or speed of execution.
“This is a foundation result, if you will,” says Dan Grossman, an associate professor of computer science and engineering at the University of Washington. “This explains how to connect the mathematics behind reliability to the languages that we would use to write code in an unreliable environment.”
Grossman believes that for some applications, at least, it’s likely that chipmakers will move to unreliable components in the near future. “The increased efficiency in the hardware is very, very tempting,” Grossman says. “We need software work like this work in order to make that hardware usable for software developers.”