Skype outage: An unprecedented wake-up call

Posted by Rick C. Hodgin

Opinion - There are certain pieces of technology that in today's world we all pretty much rely on. Cellphones, PDAs, notebooks, the Internet.  Whenever any of these are down, our lives are notably impacted. For many, the world of IM/VoIP is just such a tool. Now that Skype has gone down without warning and stayed down for over 36 hours, many users are beginning to think it's a bit much, and unacceptable.

Many Internet companies, especially services we love to use, have a get-out-of-jail-free card. We tend to forgive their failures much more easily than we would those of any non-Internet company. This continuous outage of the Skype service could be a powerful enough event to change this scenario in less than two days.

Skype's online blog has a running status of the system state. So far they have indicated only two main systems are having problems:  Sign in/registration, and peer-to-peer.  That page also has a running blog of their efforts to restore the system.  It reads, "The Skype system has not crashed or been victim of a cyber attack. This problem occurred because of a deficiency in an algorithm within Skype networking software.  This controls the interaction between the user's own Skype client and the rest of the Skype network."  That entry was posted yesterday at 10 pm GMT.  They updated their blog again 4 hours later to say they were well on their way toward a fix. Then 15 hours later to say "...we are gradually moving to new shifts of fresh brains to help out those getting to well deserved few hours of sleep," indicating no real progress because they were still down.  Then they posted again a few hours later.  And again.

The most recent entry (posted at 11 am GMT) now reads:

"We're on the road to recovery. Skype is stabilizing, but this process may continue throughout the day.  An encouraging number of users can now use Skype once again. We know we're not out of the woods yet, but we are in better shape now than we were yesterday.  Finally, we'd like to dispel a couple of theories that we are still hearing. Neither Wednesday's planned maintenance of our web-based payment services nor any form of attack was related to the current sign-on issues in any way.  We'll update you again as soon as we can. Thanks for hanging tight."

As of the time of this writing, Skype has been down for about 36 hours.  We've seen intermittent access, but nothing reliable. No calls, no sign-ins, no chats, no ability to keep in close contact with anyone we regularly reach through SkypeIn or SkypeOut. And it's beginning to cause a burn that may be irreparable.

Let's be realistic.

It's not like Skype is a tool some of us rarely use.  For many Skype In/Out users it's an ongoing component of our business model.  Some even use it as their primary means of keeping in contact with customers, family members and others.  We're at a point where we rely on such technology. We expect it to just work. Yes, it is cheap or even free, but does that matter? It's a communication tool – communication is essential to our lives and if communication does not work as we are used to, there is a serious and disruptive problem. 

I don't think anyone who might be upset today would be out of line for thinking that way. All of us could forgive the occasional 4-hour outage, maybe even 8-10 hours provided it occurred outside our normal peak usage. But for 36 hours?  In today's world?  With software redundancy and backup servers? I mean, what is this problem at Skype all about?

It seems very clear this is more than just an "algorithm deficiency."  If the entire Skype network can be taken down for 36 hours for some internal reason, then it's either outside of Skype's control or it would almost have to be by design.  The networking admin folks, completely independently of any software dev teams, should've been able to go back to yesterday's backup and get it back up and running after the first 8-10 hours went by without a software fix/solution.  In that model, maybe we would've lost our most recently added contacts, but it still would've worked.

It really makes one wonder just how fragile the system is and what eBay is doing to keep it up and running.  Due to the length of this outage I'm beginning to wonder in the back of my mind if there isn't some kind of political struggle taking place within eBay between the Skype folks (based mostly in England) and the eBay folks (based mostly in the U.S.).  I doubt that's happening, but one can only speculate.

And even if eBay has no involvement in this whole outage, I am certain that eBay is watching this thing with a very concerned mind. Skype is virtually self-destructing in front of eBay's eyes. Skype might finally turn out to be the $4 billion mistake many thought it was when eBay bought it.

These kinds of software outages are extremely rare.  Even when contractors accidentally sever Internet backbone connections with big backhoes and such, the connections are usually back up in a few hours.  And this "algorithm deficiency" is just amazing in its scope.  I have been trying to think of any large service like Skype, with its 10+ million active users online at any given time, that has been down for 36+ hours without any solution in sight.  I can't think of any, and certainly none which have affected me personally like this.

We've heard those rumors that a malicious attack could have taken Skype down. We don't know if it did or not. And it doesn't really matter. Skype should be able to keep its service up no matter what happens in the backend. We shouldn't even have to discuss what may have happened.

That brings us to Skype's information policy. The last post updating us on what's going on was at 6 am EST today – almost 12 hours ago at the time of this writing. I believe that people who rely on this service have a right to find out what is going on more than twice a day. It clearly looks like Skype may not understand just how serious this problem is or even how to fix it.  And the frightening part of that reality is it's their system.

I have no doubt that this outage will be a wakeup call for many users, small businesses and service providers. This outage showcases the vulnerability of VoIP overall, a young technology that wants to replace the good old phone. Its effects can reach well beyond Skype; it can impact other services such as Microsoft's messengers, AOL's messengers and Yahoo's messenger as well.

Businesses will certainly think this whole thing over, and think twice before repeating what was obviously a mistake. In a best case scenario, only Skype will lose users and the trust it's built over the past few years.  In a worst case scenario, the confidence in all VoIP messengers will be damaged.  And I think that confidence component will be in the back of our minds for some time to come.

I wonder what the analysts will say once Skype is back up and the dust settles over the reasons why this has happened.  I'd like to hear your thoughts and comments.  Please post them below.

Update:  August 18, 2007 - 6:00pm EDT

Several commenters have indicated that Skype is free and there is no justification for complaining about a free service.  Skype has more than one model of use.  Computer to computer is free.  Land line to computer is free (SkypeIn).  And computer to land-line (SkypeOut) is a pay-service and one of high value.  Many companies rely on SkypeOut for their primary mode of communication and that's just a fact, be it right or wrong.  Skype offers great advantage in cost savings, costing about 2 U.S. cents per minute.  Skype Unlimited allows potentially even better savings.  Skype also utilizes a computer resource that an employee or user would already have at their desk, along with a common shared Internet access which would also already be wired.  There are several viable reasons to use something like Skype in a business model.

The primary complaint of this article is that in today's day and age, where we have redundancy, hot swappable components in servers, such as memory, processors, disks, networking cards, etc., a multi-billion dollar corporation should have a network admin team which would be able to redirect anything that would fail in their system in just a matter of minutes.  In this case, Skype indicated it was a software error.  And after months and months of use with nearly 10 million users per day, for any software algorithm deficiency to rear its head to such a degree that the entire system goes down for two business days is, to my knowledge, unprecedented and it should be unacceptable.  If it truly was a software deficiency, then Skype's software developers weren't doing their job in the first place.

In speaking with various associates about this matter, the general consensus was that a one to four hour window of downtime would be acceptable for such a failure.  Anything beyond that, and in today's day and age there really is no excuse for it. As this particular event unfolded, two solid business days were affected.  Sixteen hours of downtime can be significant depending on what your business does.  A few hours here and there is pretty much acceptable and forgivable in any business model.  And if it's not, then you're paying a premium for the five nines of up-time.  But apart from that premium, sixteen contiguous business hours moves away from the forgivable and into the realm of "okay, it's time to rethink".  And I think that's where many people should be about now.
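
To put the "five nines" figure next to this outage, here is a rough back-of-the-envelope sketch in Python. It assumes a 365-day year and, for the second calculation, that the 36-hour outage is the only downtime in that year; the function names are made up for illustration and have nothing to do with Skype's own systems.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted at a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR * 60


def availability_after_outage(outage_hours: float) -> float:
    """Yearly availability if a single outage of this length is the only downtime."""
    return 1.0 - outage_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    # Five nines = 99.999% availability.
    print(f"Five nines allows {allowed_downtime_minutes(0.99999):.2f} minutes/year")
    # The roughly 36-hour outage described in this article.
    print(f"A 36-hour outage leaves {availability_after_outage(36) * 100:.3f}% for the year")
```

Under those assumptions, five nines leaves a budget of a little over five minutes of downtime per year, while a single 36-hour outage alone puts the year at roughly 99.6 percent, which is between two and three nines.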

Relying on any single-source vendor, be it for VoIP or anything else, is something that's proving undesirable regardless of the cost savings.  At the very least, many people should be looking at what their back-up contingency plan will be.