When any service fails, customers suffer disruption. When one of the highest-profile services – Skype – failed before Christmas, it affected millions of customers. The cause, we now know, was partly a failure of the way the Skype network is architected and partly buggy software.
How Skype chooses to deal with these issues will have a significant impact on its plans to become a listed company and to attract paying business customers, who will expect resilience and stability from the network.
Lars Rabbe, CIO, Skype (http://blogs.skype.com/en/2010/12/cio_update.html) has produced a good post-mortem of the problem, outlining what went wrong and how Skype intends to prevent a repetition. On the whole, the blog post is a frank and open document, but it leaves open some serious questions for Skype over its software testing and delivery processes.
In his blog, Rabbe explains how the network architecture uses supernodes to route traffic. He also admits that 25–30% of those nodes were running a known buggy version of Skype's own software. When these machines went down, they dumped their traffic onto the remaining supernodes, and that triggered a cascade failure which took the network down. This form of cascade failure occurs in any kind of networked system, including railways, roads and data networks.
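The cascade mechanism Rabbe describes can be illustrated with a toy model. The sketch below is purely hypothetical – the node counts, per-node loads and capacity headroom are invented for illustration and are not Skype's actual figures – but it shows the threshold behaviour: once the traffic dumped by failed supernodes pushes the survivors past their capacity, the whole network unravels.

```python
def simulate_cascade(n_nodes, load_per_node, capacity, initial_failures):
    """Toy model of a cascade failure in a peer-routed network.

    All traffic from failed nodes is redistributed evenly across the
    surviving nodes; any node pushed over its capacity also fails,
    and the loop repeats until the load stabilises or nothing is left.
    Returns the number of nodes still running at the end.
    """
    alive = n_nodes - initial_failures
    total_load = n_nodes * load_per_node  # the traffic still has to go somewhere
    while alive > 0:
        per_node = total_load / alive
        if per_node <= capacity:
            return alive  # survivors can absorb the load: network stabilises
        alive -= 1        # an overloaded node drops out, worsening the load
    return 0              # total outage

# Hypothetical network: 100 supernodes, each carrying 1.0 unit of
# traffic, with 30% spare capacity (capacity = 1.3 units).
print(simulate_cascade(100, 1.0, 1.3, 20))  # 80 survivors: stable
print(simulate_cascade(100, 1.0, 1.3, 25))  # 0 survivors: full cascade
```

Losing 20 of the 100 nodes leaves the survivors within capacity, but losing 25 – the lower end of the proportion Rabbe cites as buggy – tips every remaining node over its limit and the network collapses entirely. Real networks are messier, but the knife-edge between recovery and total failure is the point.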
The big concern is that Skype knew about the bugs in the software and had not forced those machines to update to a shipping version that fixed them. Although a software fix was available, it was not an automatic download installed when users next connected to Skype. This means that the outage was self-inflicted.
To admit that it knew about the bug yet did not push the fix out is a significant concession for any software vendor.
Rabbe has said that Skype will look at how it delivers automatic downloads in the future, but he has avoided committing to enforcing patches on users. This makes no sense, especially given the high-profile failure, the admission that Skype knew about the bug, and the fact that the latest version could have prevented the problem.
In his blog, Rabbe also says that Skype will now review its testing processes to see whether it could do a better job of detecting bugs. It is rare for any company to make that sort of public statement, even if we all hope this is what vendors are continually doing. The real challenge for Rabbe and Skype will be how they follow through.
Having made this statement, people will be watching to see what Skype does next. At the very least, Skype will need some form of public bug-tracking process to prove that it is doing better. If it fails to do that, then the next failure – and it is odds-on that there will be one – will undo all its efforts to improve the quality of its software.
But what of that all important group – business users?
Rabbe’s blog sends out a very mixed set of messages. The first is that Skype knew it had a problem and did not take it seriously enough to prevent a network failure. The second is that even when it has software updates that fix serious bugs, it will not force users to accept them, and is prepared to continue carrying the risk to the network. The third, implicit, message is that Skype is not yet ready to be a high-quality replacement for your landline services and should only be used as a backup, or where you are prepared to live with service interruptions.
On the positive side, the admission of failure, the fairly detailed post-mortem blog and the commitment to do better show that Skype has, at least, recognised there is an issue.
Will Skype avoid another failure? Given the history of software vendors, it is unlikely, but who knows. Maybe it can find the magic that turns average software into a gold-plated, quality solution, but I’m not holding my breath.