Is Microsoft’s Azure permanently broken?

There appear to be some serious issues with Microsoft’s Azure cloud services, and some experts suggest the problems might be difficult, if not impossible, to fix.

Last month we reported that Azure was having problems. According to the Microsoft Azure status page, there were 38 separate incidents between July 15 and August 15, and apparently things haven’t improved since. In fact, the problems have gotten worse: since August 15 there have been 51 additional incidents, ranging from login issues to full service interruptions.

Obviously, Azure isn’t the only cloud game in town, and as soon as we posted the story we started getting feedback from some of Microsoft’s competitors – companies with cloud services that have been around a lot longer than Azure and haven’t had nearly the same number of issues.

We had the opportunity to interview CloudSigma’s CEO, Robert Jenkins, and we asked him some tough questions. Here’s what he had to say about the Azure issues.

What do you see as the main problem(s) with Microsoft’s approach?

Microsoft takes a very hands-on approach to managing their customers’ computing, as well as offering a lot of PaaS products alongside their IaaS offering. The promise is to deliver less hassle for users, but the flip side of that is a greater abstraction for users from the core underlying computing. This means that it’s harder for customers to alter their computing to achieve the best performance and stability.

Do you think these disruptions will be an ongoing/diminishing/increasing problem for Azure?

It seems that Azure has some system-wide issues that are going to be very difficult to back out of now that it is a live running service with a lot of customers relying on it. Changing the fundamental building blocks around storage, networking and computing is fiendishly difficult in a multi-tenant environment. Even if Microsoft Azure comes up with solutions for the problems they have been facing, fixing them will be a long process that won’t resolve itself overnight.

One of the key challenges I see facing Microsoft Azure is the way they have created a lot of connections and correlations between their services on a global basis. Azure offers a lot of features to automatically manage things like failover and load balancing between locations. These are great features if you have a small outage affecting a subset of your users. However, when you have cloud-wide or system-wide failures, it means customers can’t create truly independent failover solutions. If all customers in Azure are using the same failover mechanism (which they generally are), then a large outage quickly spreads and causes issues across their many locations.

This is exactly what we have seen. It means users can’t really escape an outage to gain high availability, if the outage is coming from Microsoft. This is a serious issue for the enterprise customers that Microsoft has been targeting for Azure.

Could they have mitigated the problems by doing something differently?

Yes, if Microsoft Azure created independence between its different locations and systems, then its customers could deploy across those independent systems and locations. In other words, the correlation between problems in one location would be very low compared to problems in another location. That’s how to achieve true high availability and quality service delivery. If Microsoft Azure had taken the approach of giving greater visibility to its customers on connections between services and locations, then customers would be able to avoid downtime.
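
To put rough numbers on the point about correlation, here is a minimal sketch of how a shared failure mode changes the odds that two locations are down at the same time. The function name and the probabilities are our own hypothetical values for illustration; they are not figures from Microsoft or CloudSigma.

```python
def both_down_probability(p_local: float, p_shared: float) -> float:
    """Probability that two locations are unavailable at the same time.

    p_local:  chance that one location fails on its own (independent cause).
    p_shared: chance of a platform-wide fault that takes down both locations.
    Both values are illustrative assumptions, not measured data.
    """
    # Either the shared fault happens, or it does not and
    # both locations happen to fail independently.
    return p_shared + (1.0 - p_shared) * p_local * p_local


# Hypothetical inputs: each location is down 1% of the time on its own.
p_local = 0.01

# Fully independent locations: no shared failure mode.
independent = both_down_probability(p_local, p_shared=0.0)

# Coupled locations: a 0.5% chance of a fault that hits both at once.
coupled = both_down_probability(p_local, p_shared=0.005)

print(f"independent locations: {independent:.6f}")
print(f"coupled locations:     {coupled:.6f}")
print(f"coupled is roughly {coupled / independent:.0f}x more likely to lose both")
```

Under these assumed numbers, even a small shared failure mode dominates the result, which is the argument being made here: failover across locations only helps if the locations do not fail together.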

All systems fail, and customers understand this. But when customers deploy thinking they have proper failover in place and don’t, that is when you see problems.

What do you do differently than Azure (or AWS or any of the other cloud service providers) that avoids these sorts of problems? And on average what are your disruption/slowdown numbers compared to Azure?

We concentrate on transparency of system makeup and architecture. We also offer customers powerful tools so they can create resilient cloud infrastructure that can weather any outage. Our API and Web app have specific features that enable our customers to create separation and failover relationships with their storage and computing elements. Additionally, we specifically silo individual locations so we don’t have contagion if/when any system issues occur.

As a result of this approach, we deliver a very high level of uptime. For example, we have never had a multi-location outage in the history of our operation. Even within a single cloud location, systems are modular and independent. It means that when we do have failures, they have a smaller blast radius that rarely affects more than a small subset of our customers’ infrastructure. This allows them to always stay up and running even through outages within our cloud.

I’d like to thank Mr. Jenkins for his time and candor about the Azure situation. As our own Massoud Marzban, VP of Business Development at Burnside Digital, said at the time, “A lot is riding on Microsoft’s continued push into the cloud space.”

Let’s hope Microsoft can get things straightened out soon – assuming things can be straightened out.