Nvidia GPU failures caused by material problem, sources claim

Posted by Wolfgang Gruener

Chicago (IL) – When Nvidia announced in early July that it had noticed a higher-than-normal failure rate in some of its notebook chips, investors reacted with concern, sending the company’s stock down 22%. The stock recovered after Nvidia apparently demonstrated good control of the issue and took a one-time charge of almost $200 million. But what seems to be a closed chapter and a black eye for the company could be a much more serious problem that is just taking off. Several industry sources confirmed to TG Daily what some publications have been reporting for some time: in contrast to Nvidia’s claims that only a limited number of GPUs are affected, “most” recent Nvidia GPUs carry the problem and a chance of failure, pushing the potential damage into stratospheric regions.

We have been chasing the Nvidia GPU problem for quite some time, trying to shed more light on an issue about which Nvidia refuses to release any meaningful information other than the statement that a limited number of notebook GPUs are affected. Charlie Demerjian from The Inquirer has been reporting for some time that Nvidia’s problem may be much larger than the company admits. Demerjian wrote that, in addition to the notebooks currently being repaired, G84 and G86 GPUs may show failures and even G92 and G94 chips could be affected. After several weeks of digging, it seems that Demerjian’s claims may not be as far from the truth as some have claimed. There is a lot of speculation in the market, fueled by Nvidia’s decision not to reveal any details about the source of the problem. But the general consensus across the industry sources we talked to is that a material problem may be the reason for the trouble and, depending on whom you believe, between 15 and 75 million GPUs could be affected.

According to our sources, the failures are caused by a solder bump that connects the I/O termination of the silicon chip to the pad on the substrate. In Nvidia’s GPUs, this solder bump is made of high-lead solder. The thermal mismatch between the chip and the substrate has grown substantially in recent chip generations, apparently leading to fatigue cracking. Add into the equation a growing chip size (double the chip dimension, quadruple the stress on the bump) as well as generally hotter chips and you may have the perfect storm to take high-lead solder beyond its limits. Apparently, problems arise at what Nvidia claims to be “extreme temperatures”, which, from what we hear, may be temperatures not much above 70 degrees Celsius.

What supports the theory that a high-lead solder bump is in fact at fault is that Nvidia ordered an immediate switch from high-lead to eutectic solders in the last week of July. Eutectic solders are believed to solve the problem of fatigue cracking, and they are often chosen in such cases because chip designers already have experience with them. Further out in the future, chip designers will have to consider RoHS exclusions and a transition to lead-free bumps using materials such as tin-silver. We are speculating here, but a sudden switch of the material could bring additional problems for Nvidia, as such a switch raises electro-migration concerns and requires substantial design work and testing. At a minimum, Nvidia would have to review its power delivery to the chip to avoid high-current bumps. We were not able to obtain any information on whether this has been done.

As far as we are told, ATI has been using eutectic solders for some time and appears not to be experiencing a similar problem. However, Nvidia’s sudden switch to eutectic solders may have limited the availability of the material, impacting AMD production and putting actual chip fabs in the middle. There are questions as to why Nvidia may have missed potential high-lead issues – and may have missed them for quite some time. There is no doubt that all Nvidia chips were tested according to JEDEC rules. Only Nvidia knows why this issue, if high-lead solder is actually the problem, slipped through.

If we assume for a moment that high-lead solder is the cause, then the question is: Which chips are affected, and are only notebook GPUs affected? According to our sources, both desktop and notebook chips are affected, but the issue is most likely to pop up in notebook chips because of the greater material stress, amplified by frequent on-and-off cycling. We heard that G84, G86 and G92 GPUs could show failures, but we were not able to confirm G94s. Technically, Nvidia would have to replace all those GPUs, and the total number is somewhere north of 70 million. But since the issue tends to show up only in notebooks, it is unlikely that there will be any desktop replacements, and therefore we are looking at a number closer to 15 million (notebook) GPUs. Take into account that the repair of such a notebook will cost Nvidia at least $150-$250 and the damage could easily be in the billions of dollars.
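The estimate above is simple multiplication, and a short sketch makes the ranges explicit. The unit counts (15 million notebook GPUs, 70 million in total) and the $150-$250 repair cost are the figures quoted in this article, which describes them as lower bounds; the function name is only illustrative.

```python
# Rough exposure estimate using the figures quoted in the article.
# All inputs are the article's own estimates, not confirmed by Nvidia.

def exposure_range(units, low_cost_usd, high_cost_usd):
    """Return (low, high) total repair cost in dollars for a number of units."""
    return units * low_cost_usd, units * high_cost_usd

NOTEBOOK_GPUS = 15_000_000   # "closer to 15 million (notebook) GPUs"
ALL_GPUS = 70_000_000        # "somewhere north of 70 million"

notebook_low, notebook_high = exposure_range(NOTEBOOK_GPUS, 150, 250)
all_low, all_high = exposure_range(ALL_GPUS, 150, 250)

print(f"Notebook only: ${notebook_low / 1e9:.2f}B to ${notebook_high / 1e9:.2f}B")
print(f"All chips:     ${all_low / 1e9:.2f}B to ${all_high / 1e9:.2f}B")
```

Under these assumptions, a notebook-only replacement lands between $2.25 billion and $3.75 billion, while replacing every potentially affected chip would exceed $10 billion.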

At this time we only know that Nvidia has made a switch from high-lead to eutectic solder; everything else is speculation as long as it is not confirmed by Nvidia. However, the level of detail in the information relating to the material switch is surprising and lends a certain credibility to these sources.

The other question, of course, is how often and in which cases those GPUs actually fail. If Nvidia is right and failure rates are in fact low, then the $200 million that was allocated to repair affected notebooks should be appropriate. If we assume that Nvidia pays about $200 per repair and that repairing 100% of the roughly 15 million notebook GPUs would cost in the neighborhood of $3 billion, then Nvidia’s $200 million allocation suggests that substantially less than 10% of (notebook) GPUs are showing failures.
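The implied failure share can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch using only the article's round numbers ($200 per repair, about 15 million notebook GPUs, a charge rounded to $200 million); the variable and function names are illustrative.

```python
# Back-of-the-envelope check of the failure share implied by Nvidia's charge.
# Inputs are the article's round figures, not confirmed data.

REPAIR_COST_USD = 200          # assumed average cost per notebook repair
NOTEBOOK_GPUS = 15_000_000     # estimated potentially affected notebook GPUs
CHARGE_USD = 200_000_000       # one-time charge, rounded ("almost $200 million")

def implied_failure_share(charge_usd, units, cost_per_repair_usd):
    """Fraction of units that the charge could pay to repair."""
    full_exposure = units * cost_per_repair_usd
    return charge_usd / full_exposure

full_exposure = NOTEBOOK_GPUS * REPAIR_COST_USD
repairs_funded = CHARGE_USD // REPAIR_COST_USD
share = implied_failure_share(CHARGE_USD, NOTEBOOK_GPUS, REPAIR_COST_USD)

print(f"Full exposure:  ${full_exposure / 1e9:.1f}B")
print(f"Repairs funded: {repairs_funded:,}")
print(f"Implied share:  {share:.1%}")
```

With these inputs the charge covers about one million repairs, or roughly 6.7% of the notebook GPUs in question, which is consistent with the "substantially less than 10%" reading.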

A big problem would arise if failure rates are in fact higher than expected and Nvidia is trying to contain the problem by playing it down and avoiding a massive recall that could inflict a lot of damage on the company’s finances: $3 billion is almost twice what Nvidia currently has in the bank.

So, what does this mean for you? Obviously, only Nvidia knows how serious the problem really is, and there is virtually no way of telling whether your Nvidia-based notebook with an affected GPU will show failures or not, as this will depend on the temperatures the GPU reaches. If it does show failures, however, you should contact your vendor and ask for a replacement, provided you are still covered by a warranty.