A Lesson in Business Continuity
On Friday, 6/29/2012, a derecho (a thunderstorm with sustained, high
straight-line winds) hit Virginia, Maryland, DC, Ohio, and some other
states. Large trees and limbs were toppled, taking power and
communication lines down with them. Millions were without power and
numerous deaths were attributed to the storm. Few businesses were
*directly* impacted, though their electric, Internet, and phone
providers were.
I am not a fan of the term disaster recovery. When you hear that, you
think about a meteor taking out the whole building, not about a derecho
or a key person having appendicitis. I like to think in terms of business
continuity. Business continuity is all about being able to operate,
perhaps in a degraded fashion, when things go wrong.
We did not have a whole lot of warning. The timing for many businesses
in the Washington, DC area was "good." The storm hit my home and office
at 10:30PM on Friday 6/29 and was gone by 11:30PM. The next week
included the July 4 holiday on a Wednesday, so lots of people were
taking time off that week anyway, though Iron Horse was supposed to be
open and all its personnel were in town.
This taught me the following lessons:
(1) Even what you think is a well-laid plan will not work if something
else it depends on is not available. The basement of my home flooded
because the torrential rains filled up the basement stairwell.
Unfortunately, the sump pump at the bottom could not pump the water out
because it had no power. I also could not see what was happening
because my six year old had had fun with some of our flashlights and
they had run down, the bulbs had burnt out, or he forgot where he put
them after taking them from where they were supposed to be.
(2) Even if you test your plan, and everyone knows the plan, some of it
might fail. We had those flashlights working a couple of months
before. Though my six year old knew about the plan, those flashlights
were just an irresistible draw.
(3) Assessment is a necessary first step before you attempt to remedy a
problem, but remedies and even attempts at assessment can create further
problems... When I first opened the stairwell door to check on the
water, the power was operating and I could see everything was OK. The
second time, the power was out and I could not see clearly. I opened
the door and a wall of water came in. Oops!
(4) Be safe. Take care of your people. I wanted to see how bad stuff
was and maybe bail out the stairwell, but my wife (wisely) did not let
me go out in the storm. I was upset at having to mop up the basement
and did not want any more flooding, but going out in the storm would not
have been one of my brighter moves.
(5) Think creatively. I had two uninterruptible power supplies in my
house. One was connected to my TV, cable, and Internet. They continued
working when the lights went out so we found out what was happening. But
that one ran down quickly. The other one was connected to a computer
that was powered off. I picked up that UPS and put it on a stable stool
next to the stairwell door and plugged it into the GFI circuit (no, I do
not have a death wish and am very careful with electricity!) on the wall
that had the sump pump plugged in. Then I plugged the sump pump into
it. Yeah! The water immediately pumped out of the stairwell!
(6) Give up and take care of yourself. It was late at night. The
storm had passed. I had sucked the water out of the stairwell. We had
no power and I could not see whether other damage had been done. So the
wife, Fluppy the Puppy, my six year old, and I went to sleep in one
bed. Maybe sleep is not the right word to use with a six year old and a
dog in your bed....
(7) Use what works. I could have used my watch or my phone to work as
an alarm clock, but six year olds tend to get up early anyway....
(8) Luck does not hurt. Our power was on again by 7AM, but many people
in our area did not have power for a week. It was also the weekend, so
work was not an issue.
(9) Gather information and reassess as necessary. I found out the area
around my office had been especially hard hit by searching the
Internet. One man was killed by a falling tree on a major road nearby.
I knew the entire area would be a mess. It being Saturday at that
point, I resolved not to even try to get to work even though my web site
and e mail were down. Getting in the way of the work and emergency
crews was a bad idea. By Sunday morning, the connection to the office
was back up and so was the web site and e mail. Reports from the area
were still bad, so my staff and I stayed away on Monday and worked from
home.
(10) Fixes can cause their own problems or require new plans. When our
street got its power I started hearing a loud hum like that of a high
power piece of equipment. It was not in my house. At the end of my
street, a downed wire was arcing and sending flames 20 or more feet into
the air. The fire department came and closed off the street, but the
power crews were not able to cut the power for over an hour. Cutting
that power blacked out part of my neighborhood for many days.
(11) Pool your resources. One of our friends was in an area without
power for days. The following week was unbearably hot, so we invited
her family over and they charged up their cell phones and tablets, slept
in our air conditioning, and used our Internet connection (she often
works from home, but her home had no power).
(12) Travel may not be an option, so teleworking can save the day.
Trees were down everywhere. Power lines were down. Stoplights were
dark. Travel was iffy, especially in the area near the Iron Horse
offices.
(13) Other people can make their problems yours. When I finally got
back in to work, I found a very large tree had snapped off about 20 feet
up and fallen on the roof of the business located directly above Iron
Horse. We had to move our cars out of the way of the crane coming in.
We also had to plan for a flood upstairs flooding us as well, and for
the tree or part of it falling onto our brand new heat pump when the
crew tried to remove it. Fortunately none of that happened, but we were
ready for it.
(14) It may not be over when it is over. Power blinked on and off at
the office multiple times like it did at my house as the storm went
through. There were also overvoltages which my uninterruptible power
supplies (UPSs) handled. Unlike at my house, the power never failed for
an extended period, so my UPSs kept all the equipment working. However,
the power did blink on and off for days after the storm. I
observed it happening at my desk, but at least the UPSs kept us
working. There were brownout (low power) events as well. These mini
outages and brownouts probably occurred as crews in the area powered on
parts of nearby grids. Brownouts, blackouts, and overvoltages can cause
hardware damage or data corruption. Just a few days ago, the UPS that
had protected the machine at my desk registered a battery failure, so
now I need to replace its batteries. Still, it did its job. On
occasion, I have seen UPSs fail because the electronics read the
batteries as being OK when they were not. [If you are wondering
at this point whether your UPS is up to the task, ask us.]
(15) Backup and restore is not just about computers, it is about
people, communications, places to work, and other resources as well.
When our Internet link failed our phone lines went with it, but they
were automatically redirected to cell phones and we were able to do much
of our work at home.
(16) Emergency services may not be available. When an entire area gets
clobbered, you cannot count on emergency services being available to
you. They may be otherwise engaged, be working a larger or more urgent
issue, or may have their own issues. Fairfax County citizens not only
lost their dial tone, but the 911 service and its backup also got
knocked out. Yes, it was not supposed to happen.... This is a very
good reason for intentionally delaying recovery efforts as part of a
continuity plan. If you get in the way of emergency crews or you need
their assistance, you have a big problem. Sometimes just sitting back
and saying, "We're hosed. Everyone's hosed. Let's relax." is best.
(17) Sometimes issues cascade and you must deal with those. Right
after the power went down the temperatures jumped up. This hampered
emergency crew response times, but also made it impossible to safely use
many electronic devices. It was simply too hot and humid for them to
operate. Computers do not behave well in hot, humid, un-air conditioned
environments. The heat was so high it bent some train tracks and took
out some electrical switch gear. High heat and lack of air conditioning
made it imperative that many people find cool shelter.
There were some other failures of equipment. A local movie theater with
power had a lot of disgruntled customers after the storm because they
could not show some of their movies. That was because movie theaters
now get digital copies of the movies they show and their servers and
networks were "acting up." In other words, they did not have proper UPS
protection or equipment. Even after they got power back, recovery
procedures and smaller power glitches kept them from being able to do
business.
(18) Help your neighbors. Later, they might be able to help you. After
checking out Iron Horse, I got a frantic call from a neighboring
business. Their computers were "making terrible noises." Turns out
they had failed during the power outage and the terrible noises were
coming from their speakers. Turning those computers off solved the
problem until the user could return.
I then decided to check out other businesses in my complex. One
doctor's office was completely out of business. Though they had power,
they had Verizon DSL Internet access and that was out. Since they had
converted to electronic record keeping and the records were centralized
in another office, they could not treat any patients. Furthermore, they
could not call many of their patients because many phone lines were also
down.
DSL typically costs less than other connections, but the phone
companies do not promise the reliability that they do with other
connections. Consumer grade DSL connections, like those you might have
at your house, have even less of a reliability promise than business
grade DSL connections. None of the people on Verizon DSL in that complex
had Internet access for days, and when it came back up, it
flickered up and down due to both power and connectivity issues.
Millions of Verizon FiOS and traditional land line customers had no dial
tone either. [If you want to talk about ways to keep your Internet and
phones up and running, just ask.]
I offered what help I could (for free) to get my neighbors back on
track.
(19) Your business continuity plan might have to take into account that
you might have more business than usual. Many businesses had to shut
down due to a lack of power, but some restaurants and gas stations had
to stay closed because their registers or credit card machines would not
work. Those restaurants and gas stations that could stay open had
tremendous amounts of business because people could not cook. A couple
of nearby restaurants ran out of food and had to close!
(20) Sometimes alternate procedures are simple. For those businesses
in my complex trying to reach clients, I told them that Verizon land
line numbers might not work. Even if a call did ring on their end, it might
not ring on the other end. If they were able to leave a message, it was
fairly likely that that person might not be aware they had a message for
days. If they could not get through on the land lines, I advised them
to call the cell phone numbers they had on file.
I also explained to them that e mail works like picking up and sending
mail at the post office. Your e mail client on your machine posts a
message to your e mail server. That "post office" then contacts other
post offices down the line until it can deliver it to the destination
post office of your recipient. At that point, the mail is considered
"delivered," though the recipient still has to pick up the mail.
However, if one of those handoffs between "post office" servers cannot
be made because the connection is broken, the sending server just waits
and tries again. It keeps extending the wait period between retries
until it finally decides (usually after 5 or more days) that it cannot
get through and sends a message back to the sender. Many clients will
keep trying to send to a server (deliver a message to the post office)
until they succeed or they time out (admit failure) as well.
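For the technically inclined, the retry behavior described above can be
sketched in a few lines of Python. This is only an illustration, not how
any particular mail server is configured. The function name
retry_schedule, the 15 minute first wait, the doubling, the 8 hour cap,
and the 5 day limit are all assumptions picked for the example.

```python
from datetime import timedelta


def retry_schedule(first_wait=timedelta(minutes=15),
                   growth=2,
                   longest_wait=timedelta(hours=8),
                   give_up_after=timedelta(days=5)):
    """Return the elapsed time (since the first failed handoff) of each retry.

    All defaults are assumed values for illustration only.
    """
    attempts = []
    elapsed = timedelta(0)
    wait = first_wait
    # Keep retrying until the next attempt would fall past the give-up limit.
    while elapsed + wait <= give_up_after:
        elapsed += wait
        attempts.append(elapsed)
        # Extend the wait before the next retry, up to a ceiling.
        wait = min(wait * growth, longest_wait)
    return attempts


if __name__ == "__main__":
    schedule = retry_schedule()
    for number, elapsed in enumerate(schedule, start=1):
        print(f"retry {number}: {elapsed} after the first failure")
    print("No more retries: a failure notice goes back to the sender.")
```

The early retries come close together and the later ones are hours
apart, which is why a message can sit "in the mail" for days before you
ever hear that it failed.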
While most people think that e mail is instantaneous and an assured
delivery mechanism, it is not. E mail may take days to deliver. My
record is having an e mail returned as being undeliverable 35 days after
I sent it. If something happens in transit an e mail may get corrupted
or, more likely, disappear entirely. Anti-spam measures often make
valid messages disappear entirely with neither the sender nor recipient
knowing those messages did not get through. Even if you send a message
and it gets all the way through to the recipient, it still does not mean
that they have actually seen it.
This last point was especially important to the businesses I talked to.
Because messages may have been sent days ago and were being retried at
varying intervals, when a link goes down you are almost certain to
receive messages out of order. So, a message sent 5 days ago might
arrive today along with something sent 5 minutes ago. But, you would be
more likely to notice the 5 minute message because almost everyone looks
at their inbox in terms of time and checks their most recent messages.
The time stamp on a message is not the time of reception, but the time
the message was first sent. I had to warn these businesses that they
needed to check for "new" messages that had been sent days in the past.
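If you or your IT person want to spot these late arrivals automatically,
here is a rough Python sketch using only the standard library. It
compares the Date header (set when the message was sent) with the time
stamp on the topmost Received header (normally added by the last server
to handle the message). The function names, the one hour threshold, and
the sample message are assumptions made up for illustration, and real
mail headers can be messier than this.

```python
from datetime import timedelta
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical message: sent Saturday, not delivered until Thursday.
SAMPLE = (
    "Received: from mail.example.org by mx.example.com;"
    " Thu, 5 Jul 2012 09:00:00 -0400\n"
    "Date: Sat, 30 Jun 2012 08:30:00 -0400\n"
    "From: client@example.org\n"
    "Subject: Are you open Monday?\n"
    "\n"
    "Please call me back.\n"
)


def delivery_delay(raw_message):
    """Return arrival time minus sent time, or None if a header is missing."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []
    if msg["Date"] is None or not received:
        return None
    sent = parsedate_to_datetime(msg["Date"])
    # The time stamp is the text after the last semicolon of the header.
    # Assumes both headers carry a time zone offset, as in SAMPLE.
    stamp = received[0].rsplit(";", 1)[-1].strip()
    arrived = parsedate_to_datetime(stamp)
    return arrived - sent


def is_late(raw_message, threshold=timedelta(hours=1)):
    """True if the message took longer than the (assumed) threshold."""
    delay = delivery_delay(raw_message)
    return delay is not None and delay > threshold


if __name__ == "__main__":
    print("Delay:", delivery_delay(SAMPLE))
    print("Needs a second look:", is_late(SAMPLE))
```

A message flagged this way is one that will be buried days down in an
inbox sorted by sent date, even though it only just showed up.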
(21) Recovery is not really possible. You will never really fully make
up for the time lost and the extra pain and suffering you had to go
through. That is life.
(22) Learn something from the experience. In writing this e mail, I
have considered adjusting some of my plans. For example, I need to buy
more flashlights and batteries and hide them. I have yet to have one of
my neighbors ask me about their issues. I offered to help, but I cannot
help anyone who does not believe they need to do something.
[Fortunately, most of my regular clients seemed to have weathered the
storm nicely.]
If you have to implement your business continuity plan, you are going
to be in some sort of pain. A key to
alleviating that pain is to think ahead and plan for contingencies.
Iron Horse can help if you call on us.
And, do not think that you are immune. You are not. I recently advised
a federal government client of mine whose budget had been unexpectedly
cut to implement his business continuity plans and declare a
"disaster." Such a declaration would allow him the option of taking
"extraordinary" measures like curtailing services to business units
because they could no longer be paid for with a decreased budget. I
advised another client whose key IT person got sick to implement their
business continuity plan as well.
You do not know what is coming, but you do know that life is always a
bumpy ride. Be ready and get your "shock absorbers" in place. If you
have not done any business continuity planning or built reliability
features into your workplace, maybe we need to talk.
If you have any pointers you would like to share, please e mail us back!
©2012 Tony Stirk, Iron Horse
tstirk@ih-online.com