IT Preppers! Are You Ready for the Datacenter Apocalypse?

By Mike Richardson, on May 27th, 2012

Tornadoes : Hurricanes : Earthquakes : Black-outs : Floods
Fires : Tsunamis : Terrorism : Riots : Zombies

At any moment, your business data could be lost due to any number of natural or unnatural disasters. The continuation of your business following one of these apocalyptic events depends not on what you do in the hours and days following a disaster, but rather on what you do now to prepare.

This post is your IT Prepping Guide to ensuring your career and business survive a data-compromising event in one of your facilities.

Step 0: Updated Resume “Your Own, Personal, DR Plan”

This first step will be required if you do nothing else. Once the company you work for determines your lack of planning is responsible for their lack of recovery, you will be looking for new work. A resume, however, is not foolproof—as word of your negligence will most likely surface on the Twitter. To be safe, you may also plan for a career change and a new state or country to live in.

Step 1: Define Business Service Levels

Once you have decided you are serious about disaster recovery (DR) planning, you must start by setting your DR goals for the business. This begins with mapping out which applications are critical for the business to survive and how long they can be offline in the event of a disaster.

Revenue, brand, customer and other losses during downtime must be clearly understood in order to accurately determine what an acceptable tolerance for loss is for your business. These service levels dictate the strategies required to recover applications while maintaining acceptable levels of loss. Even if you don’t go on to implement a DR strategy, at least you will know how screwed your business really is!
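To make the downtime arithmetic concrete, here is a minimal Python sketch. The application names and dollar figures are invented for illustration, and it assumes losses grow linearly with downtime, which real losses rarely do. It simply divides the loss the business says it can absorb by the estimated loss per hour to get a rough ceiling on time offline.

```python
from dataclasses import dataclass


@dataclass
class Application:
    name: str
    hourly_loss: float     # estimated combined loss per hour offline (hypothetical figure)
    tolerable_loss: float  # largest total loss the business can absorb (hypothetical figure)

    def max_downtime_hours(self) -> float:
        # Assumes loss accrues linearly with downtime; a simplification.
        return self.tolerable_loss / self.hourly_loss


# Hypothetical applications, for illustration only.
apps = [
    Application("intranet-wiki", hourly_loss=500, tolerable_loss=36_000),
    Application("order-entry", hourly_loss=25_000, tolerable_loss=100_000),
]

# Most time-sensitive application first.
ranked = sorted(apps, key=Application.max_downtime_hours)
for app in ranked:
    print(f"{app.name}: can be offline for about {app.max_downtime_hours():.0f} hours")
```

The ranking, not the exact numbers, is the useful output: it tells you which applications need the tightest service levels.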

Step 2: Design and Implement a Disaster Recovery Strategy

A good DR strategy aligns the business needs with the technologies required to meet them. You should focus on methodologies as you build out your DR strategy. Methodologies are the glue that binds one or more technologies in a way that supports a particular business SLA. Products should be a footnote in a DR strategy, not the focus.

For example, you may have an SLA that supports 24-hour Recovery Points within a 4-hour Recovery Window. You may create a methodology that incorporates NetBackup backups stored and replicated with a Data Domain to accomplish this. The methodology might be called “Replicated Disk-Based Backups” while the technologies that support it would be NetBackup and Data Domain. You then map business applications to methodologies as you build out your strategy. This gives you the flexibility to re-orient applications between methodologies as SLAs change, and also allows for new technologies to replace existing technologies within a methodology as refreshes occur.
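The mapping of applications to methodologies can be sketched in a few lines of Python. This is only an illustration: the "Offsite Tape Backups" methodology, its numbers, and the application names are assumptions added here, and the rule of listing methodologies from least to most capable and taking the first fit is one possible policy, not a prescribed one.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SLA:
    rpo_hours: float  # Recovery Point: maximum age of the recovered data
    rto_hours: float  # Recovery Window: maximum time to restore service


@dataclass(frozen=True)
class Methodology:
    name: str
    delivers: SLA
    technologies: Tuple[str, ...]  # products are a footnote and can be swapped

    def satisfies(self, required: SLA) -> bool:
        return (self.delivers.rpo_hours <= required.rpo_hours
                and self.delivers.rto_hours <= required.rto_hours)


# Ordered from least to most capable (an assumption of this sketch).
METHODOLOGIES = [
    # Hypothetical methodology, added for contrast.
    Methodology("Offsite Tape Backups", SLA(24, 72), ("tape library",)),
    # The example from the text.
    Methodology("Replicated Disk-Based Backups", SLA(24, 4), ("NetBackup", "Data Domain")),
]


def pick_methodology(required: SLA) -> Optional[Methodology]:
    """Return the first methodology that meets the SLA, or None if none does."""
    for methodology in METHODOLOGIES:
        if methodology.satisfies(required):
            return methodology
    return None


# Hypothetical applications and their required SLAs.
APPLICATIONS = {"billing": SLA(24, 4), "archive-search": SLA(48, 96)}
mapping = {app: pick_methodology(sla) for app, sla in APPLICATIONS.items()}

for app, methodology in mapping.items():
    print(app, "->", methodology.name if methodology else "NO METHODOLOGY MEETS SLA")
```

Because applications point at methodologies rather than products, changing an SLA only changes the mapping, and a technology refresh only changes the `technologies` tuple.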

The product of this step is a written and executed DR strategy that includes:

A) Business Applications and Dependencies

B) SLAs Required for Business Applications

C) Methodologies Supporting Required SLAs

D) Technology Architectures Enabling Methodologies

Whenever I ask someone what their DR strategy is and they respond with “We use product X”, I know immediately that they have no strategy. Don’t be that guy!

Step 3: Develop a Detailed DR Plan

Once you implement a strategy for how you deal with disasters, you must document how to execute it. This is a critical step due to the state of confusion disasters cause. Good DR plans document and properly order the steps required for restoring access to business applications in a variety of full or partial DR scenarios. There are two main parts to a DR plan:

A) High-level orchestration plan that controls the recovery process flow and determines the order and timing in which teams should be engaged to perform specific recovery activities.

B) Detailed sub-plans that document the specific recovery steps each involved team must perform to fulfill their portion of the recovery.

For example, you may maintain individual DR plans for Storage, Backup and Application teams, each documenting the steps a team must perform before a “hand-off” occurs to the next team. The orchestration plan documents the teams to be involved, the order in which they participate and the time expected for each team to perform its tasks.
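An orchestration plan of this kind can be sanity-checked with a short Python sketch. The expected hours per team below are made-up numbers, and the sketch assumes teams work strictly one after another, with each hand-off starting the next team's clock. It adds up the expected times and checks the total against the Recovery Window.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stage:
    team: str
    expected_hours: float  # hypothetical estimate, to be refined by testing


# Teams in hand-off order; the hours are illustrative assumptions.
ORCHESTRATION = [
    Stage("Storage", 1.0),
    Stage("Backup", 1.5),
    Stage("Application", 1.0),
]


def schedule(stages: List[Stage], recovery_window_hours: float):
    """Return (timeline, fits): each team's start/end hour and whether the
    whole sequence finishes inside the recovery window."""
    timeline: List[Tuple[str, float, float]] = []
    clock = 0.0
    for stage in stages:
        end = clock + stage.expected_hours
        timeline.append((stage.team, clock, end))
        clock = end  # hand-off to the next team
    return timeline, clock <= recovery_window_hours


timeline, fits = schedule(ORCHESTRATION, recovery_window_hours=4)
for team, start, end in timeline:
    print(f"{team}: hour {start} to hour {end}")
print("Fits the recovery window" if fits else "Exceeds the recovery window")
```

If the total does not fit, either a team's steps need to get faster or the application belongs on a different methodology.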

It is very important the plan be in writing. If it is not in writing, it is not a plan. I can’t tell you how many customers I’ve asked “Do you have a DR plan?” who have answered an enthusiastic “Yes!”, only to look puzzled when I ask to see it. The plan must also be written clearly enough that team members who are less familiar with the environment can perform the steps. It should be detailed enough that even 3rd parties can perform the activities, should your employees be unreachable.

See, disasters have a way of redefining individual priorities. Let’s say, for example, you pay Backup Administrator, Fred, to be familiar with and perform restores of critical applications in the event of a disaster. When the Zombie Apocalypse strikes, do you think you will find Fred taking his chances with raised-floor datacenter zombies? Or, holed up in his basement, behind a hastily constructed, makeshift mattress barricade? I know where I would be.

You must prepare for the possibility your best talent may be unavailable during an actual disaster—especially if they work in the facility where the disaster occurs. Use their talents now to build a plan that even hired help can execute later. People aren’t the plan; people make the plan!

Step 4: Plan Regular DR Tests

After you complete your DR plan, you can rest assured that it is wrong. That’s right, it’s wrong. Your employees half-assed it together only because you made it a bullet point on their yearly review. In order to get it right, it must be practiced. If your employees are new to DR planning, you should try for quarterly tests and work your way back to semi-annually as some successful tests start to accumulate.

Testing serves three main purposes:

A) Plan Refinement: A DR plan is complex and there is a lot of room for error. Through regular testing, errors can be corrected, and the plan refined to improve accuracy and add new applications.

B) Employee Training: Regular tests help employees become comfortable with the process, and reduce the margin of error when performing an actual recovery.

C) Automation: Encourage employees to use testing as a way to automate the plan as much as possible. This reduces the effort required for them to participate during tests, reduces the chance of human error for automated tasks, increases the likelihood 3rd parties can assist in the recovery, and speeds the execution of the plan as a whole.

Be sure to set goals for both accuracy and timing, and celebrate when groups meet or exceed expectations. The only way you can know if you are improving is by measuring. Practice makes perfect!
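Measuring can be as simple as recording each test and comparing it to the goals. The Python sketch below is one way to do that; the step counts, hours, and goal thresholds are invented, and "accuracy" is assumed here to mean the fraction of plan steps that worked as written.

```python
from dataclasses import dataclass


@dataclass
class DrTestResult:
    label: str
    steps_total: int
    steps_correct: int   # steps that worked exactly as documented
    hours_taken: float

    @property
    def accuracy(self) -> float:
        return self.steps_correct / self.steps_total


def met_goals(result: DrTestResult, min_accuracy: float, max_hours: float) -> bool:
    """True when the test hit both the accuracy goal and the timing goal."""
    return result.accuracy >= min_accuracy and result.hours_taken <= max_hours


# Hypothetical test history, oldest first.
history = [
    DrTestResult("test 1", steps_total=40, steps_correct=30, hours_taken=6.0),
    DrTestResult("test 2", steps_total=40, steps_correct=38, hours_taken=3.5),
]

for result in history:
    status = "met goals" if met_goals(result, min_accuracy=0.9, max_hours=4) else "missed goals"
    print(f"{result.label}: {result.accuracy:.0%} accurate in {result.hours_taken}h, {status}")

# Improvement means more accurate and faster than the first test.
improving = (history[-1].accuracy > history[0].accuracy
             and history[-1].hours_taken < history[0].hours_taken)
print("Improving" if improving else "Not improving yet")
```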

Step 5: Perform Unscheduled Disaster Recovery Tests

Only God and terrorists can plan disasters. Unless you are in one of these two groups, you will most likely be surprised when a disaster occurs. Since disasters happen unexpectedly, DR tests should also, occasionally, happen unexpectedly. This keeps your employees on their toes and helps you find gaps that planned tests won’t reveal.

For example, you may find that when Fred is on vacation, his replacement cannot complete his portion of the plan. This is a red flag that allows you to focus on training for his replacement and update the plan documentation so it is easier for others to execute. You may also discover that employees are stacking the deck by taking precautions before scheduled tests that unplanned disasters would not allow for.

Step 6: Celebrate!

Becoming an “IT Prepper” takes years to master, and the work never really ends. If you have done the hard work and are successfully performing tests, make sure your management and customers know! There are very few organizations that perform at this level and doing so can really differentiate you and your company.

Final Thoughts:

As critical applications begin to shift into cloud architectures hosted by 3rd party companies, it is important for businesses to know how those 3rd parties are ensuring access to those applications. If you are looking at cloud services, don’t be caught off guard by cloud companies who aren’t IT Prepping. Ask the questions and demand to see the strategies, plans and test results before trusting someone else with the future of your business.

If you are an executive, make sure to be aware of, and involved in, IT Prepping for your business. Few businesses I know perform satisfactorily at the above steps and most executives are completely unaware. You will be blamed when your business fails to recover, so you might as well be involved now!
