Until now, programmers have been looking up to Moore’s law to speed-up their sequential program with each subsequent processor. But due to the physical limitations in chip design, the programmers’ “free lunch” is over: Multi-core architectures – multiple CPUs on a single chip – bring improved power efficiency, while each of the cores could be slower than the latest unicore processor when running traditional applications. This not only means that the acceleration of sequentially written applications may stagnate – they might even slow-down1. Rajesh Karmani discusses challenges, dead-ends and possible directions for developers.
One important question, of course, is what kinds of applications we can and need to parallelize. How much can we parallelize our word processor or the virus scanner? Scientific applications, simulations, numerical computations have been the focus of parallel computing community for some time. But as parallel platforms (like multi-core CPUs) become mainstream, mainstream applications and their programmers have to rise to the challenge. It is an interesting discussion in itself and demands a separate article which I’ll leave aside for later.
Unfortunately, the problem gets worse when chips with hundreds of cores (sometimes called manycore) are rolled out, as is being widely predicted. Two potential consequences for programmers are:
1) Programs should continue to scale (run faster) with the increasing number of cores.
2) Programs based on shared memory model will suffer serious bottlenecks on memory bus.
Shared-memory concurrent programming (in Java)
Java is a widely-used programming language. It supports concurrent programming through threads and it seems that Java’s concurrency features are an additional feature instead of setting out with the goal of being an elegant concurrent programming language. This shows up in quite a few of its constructs. For example, consider the following statement in Java:
….
x++;
….
Depending on whether x is a local variable or class/instance variable, this statement gets compiled into 1 or 3 bytecode instructions. Crucially, in case x is an instance variable, the sequence of 3 instructions read the value from the heap, increment it, and write it back to the heap. If the code gets called from two different threads, there is a small window such that one increment is observed instead of two increments, which is incorrect. Of course, they told us in their specification what can go wrong, but I am arguing for constructs and abstractions which are simple, intuitive and elegant.
Similarly, the locking mechanism in Java is bug-prone as illustrated by a commonly referred example. This piece of code was part of the StringBuffer class:
public synchronized StringBuffer append (StringBuffer sb)
{
int len = sb.length();
…
…
sb.getChars(0,len,value,count);
…
}
Although the method is declared as synchronized, another thread can modify ‘sb’ object after the first statement such that its length decreases to less than ‘len’. Therefore, when getChars is called on ‘sb’ with the old value of length, the program raises an exception.

Once again, the semantics of synchronized methods in Java are not intuitive enough. On entering a synchronized method, the runtime obtains a lock for the object on which the method is being called. These semantics guarantee isolation of this block of code (method body) with respect to other blocks capturing lock on the same object. This raises multiple problems; what if the programmer forgets to obtain the lock at some place and what about the consistency of other objects (‘sb’ here) being accessed (and modified) inside the synchronized method.
In order to fix this code, programmer should have obtained another lock on ‘sb’ by using the synchronized block construct:
public synchronized StringBuffer append (StringBuffer sb)
{
synchronized (sb)
{
int len = sb.length();
…
…
sb.getChars(0,len,value,count);
…
}
}
But things are not as simple as they seem. Such an implementation could result in deadlock. Consider the following two threads:

There is a window where Thread 1 obtains the lock for sbObj and Thread 2 for sb. Each of the threads will keep waiting for the other thread to release the lock for respective object before it can proceed forward. In fact, the deadlock could occur trivially without the modification since the methods length() and getChars() are declared as synchronized in StringBuffer. For a more detailed discussion on this problem, see references 6 and 7 below.
To conclude, shared memory concurrent programs can have data races, which are conflicting accesses to shared variables. Low-level mechanism such as locks, semaphores provided in languages like C/C++/Java to prevent data-races put too much burden on programmers and yet can cause errors like deadlocks. Reference 8 details the problems with shared memory programs.
Read on the next page: Software Transactions, Conclusion