Upgrades are a failure opportunity

Some of my favorite stories about software or systems failures happen when a system is upgraded. My personal low happened when I worked at LAN Magazine in the mid-1990s. I was working in our LAN Lab, and noticed that the firmware on our main NetWare 3.x infrastructure server, called FS1, was very out of date. So, I installed new firmware, and rebooted the machine. It didn’t come up. We tried everything, and it still didn’t come up until we dug deeper and ultimately rebuilt the boot partition.

FS1 had been running continuously for a number of years; it was on a big UPS, it was a super-stable machine (a Compaq ProLiant, if memory serves) and there was absolutely no reason, until my upgrade, to restart NetWare. Nobody even remembered the last time it had been restarted. At some point over the years, the disk partition that contained the bootstrap code had become corrupted, and because the server never rebooted, nobody noticed. The result: a machine that wouldn’t restart, all because I did an (unnecessary) upgrade.

Remember the BlackBerry system failure in mid-April, just about a month ago? Caused by an upgrade that went wrong.

Remember the big crash of the AT&T telephone network in January 1990? Caused by an upgrade that went wrong.

Yesterday, I received an email from a friend of mine, who works at a Web design shop. She, her programmers and her company’s QA team were stymied by intermittent failures of a new app that they were deploying to customers. She wrote:

“Our first thought is that this is browser related, as some updates have been made to IE recently. There is nothing in common between those who are getting errors, and all are in separate locations. Do you know of anything on the wire that would cause a site to not open in some places but open everywhere else? Firewall issues? Windows updates?”

I couldn’t think of anything, but today she reported that it was the fault of a mismatch between a couple of load-balanced servers:

“The problem was solved. The team was able to replicate the problem and the problem was fixed. One configuration file was corrupted on one of our servers (i.e. the hosted servers). We don’t know when this happened but is possible that happened when the Microsoft update was installed on these servers on Thursday. We have to load balanced servers for the database access, i.e. there is a router that sends the requests from the application to one or the other. For all these people that had problems, the requests were sent to the server with problems for those who didn’t experience problems were sent to the first. This is one of those cases when many things happened the same day and we associated this with the Microsoft upgrade on the client computers, which was not the case.”
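That failure mode is easy to sketch. The Python below is only an illustration under my own assumptions, not what my friend's team actually ran: the server names (`db-1`, `db-2`), the config contents and the round-robin routing are all hypothetical. It fingerprints each server's configuration file, flags any server that differs from a known-good one, and shows why only some requests fail.

```python
import hashlib
from itertools import cycle


def fingerprint(config_text: str) -> str:
    """Return a SHA-256 hex digest of a config file's contents."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


# Hypothetical config contents for two load-balanced servers.
# The second copy contains a corrupted byte.
configs = {
    "db-1": "pool_size=20\ntimeout=30\n",
    "db-2": "pool_size=20\ntime\x00ut=30\n",
}


def find_drift(configs: dict, reference: str) -> list:
    """List servers whose config differs from the reference server's."""
    expected = fingerprint(configs[reference])
    return [name for name, text in configs.items()
            if fingerprint(text) != expected]


def simulate_requests(configs: dict, reference: str, n: int) -> list:
    """Round-robin n requests; a request fails if it lands on a drifted server."""
    bad = set(find_drift(configs, reference))
    servers = cycle(sorted(configs))
    return ["error" if next(servers) in bad else "ok" for _ in range(n)]


print(find_drift(configs, "db-1"))
print(simulate_requests(configs, "db-1", 4))
```

With two servers and one bad config, every other request fails, which from the outside looks like a random problem on the client side. Comparing fingerprints before and after an update is a cheap way to catch that kind of drift.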

Was it the software upgrade? Was it the load balancer? Hard to know, and this certainly isn’t a knock on Microsoft. However, whenever software upgrades happen… bad things often happen.

Z Trek Copyright (c) Alan Zeichick