Horse Sense #104

A Lesson in Business Continuity

 

Friday, 6/29/2012, a derecho, a thunderstorm with sustained high straight line winds, hit Virginia, Maryland, DC, Ohio, and some other states.  Large trees and limbs were toppled taking power and communication lines down with them.  Millions were without power and numerous deaths were attributed to the storm.  Few businesses were *directly* impacted, though their electric, Internet, and phone providers were.

 
I am not a fan of the term disaster recovery.  When you hear that, you think about a meteor taking out the whole building, not a key person having appendicitis or a derecho.  I like to think in terms of business continuity.  Business continuity is all about being able to operate, perhaps in a degraded fashion, when things go wrong.

 
We did not have a whole lot of warning.  The timing for many businesses in the Washington, DC area was "good."  The storm hit my home and office at 10:30PM on Friday 6/29 and was gone by 11:30PM.  The next week included the July 4 holiday on a Wednesday, so lots of people were taking time off that week anyway, though Iron Horse was supposed to be open and all its personnel were in town.

 
This taught me the following lessons:

 
(1) Even what you think is a well laid plan will not work if something else it depends on is not available.  The basement of my home flooded because the torrential rains filled up the basement stairwell. Unfortunately, the sump pump at the bottom could not pump the water out because it had no power.  I also could not see what was happening because my six year old had had fun with some of our flashlights and they had run down, the bulbs had burnt out, or he had forgotten where he put them after taking them from where they were supposed to be.

 
(2) Even if you test your plan, and everyone knows the plan, some of it might fail.  We had those flashlights working a couple of months before.  Though my six year old knew about the plan, those flashlights were just an irresistible draw.

 
(3)  Assessment is a necessary first step before you attempt to remedy a problem, but remedies and even attempts at assessment can create further problems...  When I first opened the stairwell door to check on the water, the power was operating and I could see everything was OK.  The second time, the power was out and I could not see clearly.  I opened the door and a wall of water came in.  Oops!

 
(4) Be safe.  Take care of your people.  I wanted to see how bad stuff was and maybe bail out the stairwell, but my wife (wisely) did not let me go out in the storm.  I was upset at having to mop up the basement and did not want any more flooding, but going out in the storm would not have been one of my brighter moves.

 
(5)  Think creatively.  I had two uninterruptible power supplies in my house.  One was connected to my TV, cable, and Internet.  They continued working when the lights went out so we found out what was happening. But that one ran down quickly.  The other one was connected to a computer that was powered off.  I picked up that UPS and put it on a stable stool next to the stairwell door and plugged it into the GFI circuit (no, I do not have a death wish and am very careful with electricity!) on the wall that had the sump pump plugged in.  Then I plugged the sump pump into it.  Yeah!  The water immediately pumped out of the stairwell!

 
(6)  Give up and take care of yourself.  It was late at night.  The storm had passed.  I had sucked the water out of the stairwell.  We had no power and I could not see whether other damage had been done.  So the wife, Fluppy the Puppy, my six year old, and I went to sleep in one bed.  Maybe sleep is not the right word to use with a six year old and a dog in your bed....

 
(7)  Use what works.  I could have used my watch or my phone to work as an alarm clock, but six year olds tend to get up early anyway....

 
(8)  Luck does not hurt.  Our power was on again by 7AM, but many people in our area did not have power for a week.  It was also the weekend, so work was not an issue.

 
(9)  Gather information and reassess as necessary.  By searching the Internet, I found out the area around my office had been especially hard hit.  One man was killed by a falling tree on a major road nearby.  I knew the entire area would be a mess.  It being Saturday at that point, I resolved not to even try to get to work even though my web site and e mail were down.  Getting in the way of the work and emergency crews was a bad idea.  By Sunday morning, the connection to the office was back up and so were the web site and e mail.  Reports from the area were still bad, so my staff and I stayed away on Monday and worked from home.

 
(10)  Fixes can cause their own problems or require new plans.  When our street got its power I started hearing a loud hum like that of a high power piece of equipment.  It was not in my house.  At the end of my street, a downed wire was arcing and sending flames 20 or more feet into the air.  The fire department came and closed off the street, but the power crews were not able to cut the power for over an hour.  Cutting that power blacked out part of my neighborhood for many days.

 
(11)  Pool your resources.  One of our friends was in an area without power for days.  The following week was unbearably hot, so we invited her family over and they charged up their cell phones and tablets, slept in our air conditioning, and used our Internet connection (she often works from home, but it had no power).

 
(12)  Travel may not be an option, so teleworking can save the day. Trees were down everywhere.  Power lines were down.  Stoplights were dark.  Travel was iffy, especially in the area near the Iron Horse offices.

 
(13)  Other people can make their problems yours.  When I finally got back in to work, I found a very large tree had snapped off about 20 feet up and fallen on the roof of the business located directly above Iron Horse.  We had to move our cars out of the way of the crane coming in and prepare for the possibility of a flood upstairs flooding us as well, or of the tree or part of it falling onto our brand new heat pump when the crew tried to remove it.  Fortunately none of that happened, but we prepared for it.

 
(14)  It may not be over when it is over.  Power blinked on and off at the office multiple times, as it did at my house, as the storm went through.  There were also overvoltages, which my uninterruptible power supplies (UPSs) handled.  Unlike at my house, the power never failed for an extended period, so my UPSs kept all the equipment working.  However, the power kept blinking on and off for days after the storm.  I observed it happening at my desk, but at least the UPSs kept us working.  There were brownout (low power) events as well.  These mini outages and brownouts probably occurred as crews in the area powered on parts of nearby grids.  Brownouts, blackouts, and overvoltages can cause hardware damage or data corruption.  Just a few days ago, the UPS that had protected the machine at my desk registered a battery failure, so now I need to replace its batteries.  Still, it did its job.  On occasion, I have seen UPSs fail because the electronics read the batteries as being OK when they were not.  [If you are wondering at this point whether your UPS is up to the task, ask us.]

 
(15)  Backup and restore is not just about computers; it is about people, communications, places to work, and other resources as well. When our Internet link failed, our phone lines went with it, but they were automatically redirected to cell phones and we were able to do much of our work at home.

 
(16)  Emergency services may not be available.  When an entire area gets clobbered, you cannot count on emergency services being available to you.  They may be otherwise engaged, be working a larger or more urgent issue, or may have their own issues.  Fairfax County citizens not only lost their dial tone, but the 911 service and its backup also got knocked out.  Yes, it was not supposed to happen....  This is a very good reason for intentionally delaying recovery efforts as part of a continuity plan.  If you get in the way of emergency crews or you need their assistance, you have a big problem.  Sometimes just sitting back and saying, "We're hosed.  Everyone's hosed.  Let's relax." is best.

 
(17)  Sometimes issues cascade and you must deal with those.  Right after the power went down the temperatures jumped up.  This not only hampered emergency crew response times, but also made it impossible to safely use many electronic devices.  It was simply too hot and humid for them to operate.  Computers do not behave well in hot, humid, un-air conditioned environments.  The heat was so high it bent some train tracks and took out some electrical switch gear.  High heat and lack of air conditioning made it imperative that many people find cool shelter.

 
There were some other failures of equipment.  A local movie theater with power had a lot of disgruntled customers after the storm because they could not show some of their movies.  That was because movie theaters now get digital copies of the movies they show and their servers and networks were "acting up."  In other words, they did not have proper UPS protection or equipment.  Even after they got power back, recovery procedures and smaller power glitches kept them from being able to do business.

 
(18)  Help your neighbors.  Later, they might be able to help you. After checking out Iron Horse, I got a frantic call from a neighboring business.  Their computers were "making terrible noises."  Turns out they had failed during the power outage and the terrible noises were coming from their speakers.  Turning those computers off solved the problem until the user could return.

 
I then decided to check out other businesses in my complex.  One doctor's office was completely out of business.  Though they had power, they had Verizon DSL Internet access and that was out.  Since they had converted to electronic record keeping and the records were centralized in another office, they could not treat any patients.  Furthermore, they could not call many of their patients because many phone lines were also down.

 
DSL typically costs less than other connections, but the phone companies do not promise the reliability for it that they do for other connections.  Consumer grade DSL connections, like those you might have at your house, have even less of a reliability promise than business grade DSL connections.  None of the people on Verizon DSL in that complex had Internet access for days, and when it came back up, it flickered up and down due to both power and connectivity issues. Millions of Verizon FiOS and traditional land line customers had no dial tone either.  [If you want to talk about ways to keep your Internet and phones up and running, just ask.]

 
I offered what help I could (for free) to get my neighbors back on track.

 
(19)  Your business continuity plan might have to take into account that you might have more business than usual.  Many businesses had to shut down due to a lack of power, but some restaurants and gas stations had to stay closed because their registers or credit card machines would not work.  Those restaurants and gas stations that could stay open had tremendous amounts of business because people could not cook.  A couple of nearby restaurants ran out of food and had to close!

 
(20)  Sometimes alternate procedures are simple.  For those businesses in my complex trying to reach clients, I told them that Verizon land line numbers might not work.  Even if a call rang on their end, it might not ring on the other end.  If they were able to leave a message, it was fairly likely that the person would not be aware of the message for days.  If they could not get through on the land lines, I advised them to call the cell phone numbers they had on file.

 
I also explained to them that e mail works like picking up and sending mail at the post office.  Your e mail client on your machine posts a message to your e mail server.  That "post office" then contacts other post offices down the line until it can deliver it to the destination post office of your recipient.  At that point, the mail is considered "delivered," though the recipient still has to pick up the mail. However, if one of those handoffs between "post office" servers cannot be made because the connection is broken, the sending server just waits and tries again.  It keeps extending the wait period between retries until it finally decides (usually after 5 or more days) that it cannot get through and sends a message back to the sender.  Many clients will keep trying to send to a server (deliver a message to the post office) until they succeed or they time out (admit failure) as well.
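
For the technically inclined, the "wait and try again" behavior can be sketched in a few lines of Python.  This is only an illustration, not how any particular mail server is written.  The function name retry_schedule and every number in it except the 5 day give-up point mentioned above (the 30 minute first wait, the doubling, and the 8 hour cap) are assumptions chosen for the example.

```python
from datetime import timedelta


def retry_schedule(first_wait=timedelta(minutes=30),
                   factor=2,
                   max_wait=timedelta(hours=8),
                   give_up_after=timedelta(days=5)):
    """Return the elapsed times at which a sending server would retry.

    All defaults are illustrative assumptions, not taken from a real server.
    """
    attempts = []
    elapsed = timedelta(0)
    wait = first_wait
    # Keep scheduling retries until the next one would fall past the deadline.
    while elapsed + wait <= give_up_after:
        elapsed += wait
        attempts.append(elapsed)
        # Extend the wait before the next retry, up to a ceiling.
        wait = min(wait * factor, max_wait)
    return attempts


if __name__ == "__main__":
    for when in retry_schedule():
        print("retry after", when)
    print("give up and send a failure notice back to the sender")
```

The waits start short, so a brief glitch costs only a few minutes, and grow longer so a server does not hammer a link that is down for days.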

 
While most people think that e mail is instantaneous and an assured delivery mechanism, it is not.  E mail may take days to deliver.  My record is having an e mail returned as being undeliverable 35 days after I sent it.  If something happens in transit an e mail may get corrupted or, more likely, disappear entirely.  Anti-spam measures often make valid messages disappear entirely with neither the sender nor recipient knowing those messages did not get through.  Even if you send a message and it gets all the way through to the recipient, it still does not mean that they have actually seen it.

 
This last point was especially important to the businesses I talked to.  Because messages may have been sent days ago and were being retried at varying intervals, when a link goes down you are almost certain to receive messages out of order.  So, a message sent 5 days ago might arrive today along with something sent 5 minutes ago.  But, you would be more likely to notice the 5 minute message because almost everyone looks at their inbox in terms of time and checks their most recent messages. The time stamp on a message is not the time of reception, but the time the message was first sent.  I had to warn these businesses that they needed to check for "new" messages that had been sent days in the past.
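
If you are comfortable with a little scripting, you can compare the time a message was sent with the time it actually arrived.  The sketch below uses Python's standard email module.  The sample message, its addresses, and the function names delivery_delay and arrived_late are made up for illustration, and it assumes the topmost "Received" header ends with a time stamp after its last semicolon, which is the usual layout but is not guaranteed for every message.

```python
from datetime import timedelta
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical message: sent Saturday, delivered the following Thursday.
SAMPLE = (
    "Received: from mail.example.com by mx.example.org;"
    " Thu, 05 Jul 2012 10:00:00 -0400\n"
    "Date: Sat, 30 Jun 2012 10:00:00 -0400\n"
    "From: client@example.com\n"
    "To: office@example.org\n"
    "Subject: Appointment\n"
    "\n"
    "Please call me back.\n"
)


def delivery_delay(raw_message):
    """Return how long a message spent in transit, as a timedelta."""
    msg = message_from_string(raw_message)
    # "Date" is stamped when the message is first sent.
    sent = parsedate_to_datetime(msg["Date"])
    # Each server that handles the message adds a "Received" header at the
    # top, so the first one listed was added last.
    newest_hop = msg.get_all("Received")[0]
    received = parsedate_to_datetime(newest_hop.rsplit(";", 1)[1].strip())
    return received - sent


def arrived_late(raw_message, threshold=timedelta(hours=1)):
    """True if the message took longer than the threshold to arrive."""
    return delivery_delay(raw_message) > threshold


if __name__ == "__main__":
    print("In transit for:", delivery_delay(SAMPLE))
    print("Arrived late:", arrived_late(SAMPLE))
```

Sorting or flagging by arrival time instead of sent time is one way to make those old "new" messages stand out.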

 
(21)  Recovery is not really possible.  You will never really fully make up for the time lost and the extra pain and suffering you had to go through.  That is life.

 
(22)  Learn something from the experience.  In writing this e mail, I have considered adjusting some of my plans.  For example, I need to buy more flashlights and batteries and hide them.  I have yet to have one of my neighbors ask me about their issues.  I offered to help, but I cannot help anyone who does not believe they need to do something. [Fortunately, most of my regular clients seemed to have weathered the storm nicely.]

 

**********

 

If you have to implement your business continuity plan, you are going to be in some sort of pain.  A key to alleviating that pain is to think ahead and plan for contingencies.  Iron Horse can help if you call on us.

 
And, do not think that you are immune.  You are not.  I recently advised a federal government client of mine whose budget had been unexpectedly cut to implement his business continuity plans and declare a "disaster."  Such a declaration would allow him the option of taking "extraordinary" measures like curtailing services to business units because they could no longer be paid for with a decreased budget.  I advised another client whose key IT person got sick to implement their business continuity plan as well.

 
You do not know what is coming, but you do know that life is always a bumpy ride.  Be ready and get your "shock absorbers" in place.  If you have not done any business continuity planning or built reliability features into your workplace, maybe we need to talk.

 
If you have any pointers you would like to share, please e mail us back!


©2012 Tony Stirk, Iron Horse tstirk@ih-online.com