Polonius

To thine own self be true

Single point of failure

Posted by Polonius on 5 October, 2006

Yesterday’s edition of The Daily WTF told a tale of some HVAC maintainers who managed to close down a bank’s wire transfer capability by taking down both air conditioners for preventive (sic) maintenance. It seems slightly implausible, because hardware is usually significantly more tolerant of high temperatures than wetware, but never underestimate the stupidity of HVAC maintainers. I remember, many years ago, two people staggering out of a computer room with splitting headaches after a trained monkey had left a hose flapping about while doing some maintenance. Apparently the people in the room hadn’t acquired the knack of living on an atmosphere consisting largely of freon.

But I digress. One of the bank’s major errors was of course having a Single Point Of Failure (SPOF). For all their redundant systems, the servers were still all in one room. It is bizarre that many of the UK’s ISPs have their main server farms in one building, at Telehouse Docklands. I hope the ISPs have mirrored facilities on other sites. I’m sure Telehouse has no end of redundant electrical supplies, UPSes, generators, inert gas extinguishers, etc., etc., but it’s still one building.
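A back-of-the-envelope sketch of why redundancy inside one room only gets you so far. It assumes failures are independent, and every figure is invented purely for illustration; the function names are hypothetical and not taken from the bank or from Telehouse.

```python
def parallel_availability(unit_availability: float, copies: int) -> float:
    """Probability that at least one of `copies` independent units is up."""
    return 1.0 - (1.0 - unit_availability) ** copies


def system_availability(unit_availability: float, copies: int,
                        shared_availability: float) -> float:
    """Redundant units that all depend on one shared thing (the room, the
    building, the air conditioning). The shared dependency sits in series
    with the redundant group, so its availability caps the whole system."""
    return parallel_availability(unit_availability, copies) * shared_availability


if __name__ == "__main__":
    # Made-up numbers: each server is up 99% of the time,
    # the single room they all live in is usable 99.9% of the time.
    for copies in (1, 2, 4, 8):
        total = system_availability(0.99, copies, 0.999)
        print(f"{copies} server(s): {total:.6f}")
```

However many servers you add, the result never rises above the figure for the shared room. That is the arithmetic behind a SPOF.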

You can design systems to withstand all predictable failure modes, but if you have a SPOF, you can be confident that an event, or combination of events, that you didn’t predict will catch you unawares, and nobody knows that better than Telehouse themselves. It is claimed that in September 2001, 70% of all Internet traffic between the US and Europe passed through Telehouse’s premises at 25 Broadway. When the World Trade Center collapsed, it took out Telehouse’s mains electrical supply. Telehouse’s backup systems operated as designed. The UPSes picked up the load without missing a beat and kept everything ticking over until the generators could be brought on-line. They had enough diesel to keep the generators running for three days. Nobody had predicted a scenario where the electricity supply would be cut off and fuel delivery trucks would be banned from Manhattan Island for over three days. But it happened.

Strings were pulled; a tanker got through. But then another unpredicted event occurred. Just as people can’t breathe freon for more than a few minutes, diesel generators can’t breathe smoke and dust for more than a few days. Telehouse Manhattan was off-line for over 30 hours.
