Asset Management in IT Disaster Recovery

Unlike the response to almost any other crisis, Disaster Recovery (DR) depends upon the ability to recover damaged assets, because assets are needed to recover function, and restoring functionality is the goal of DR. Yet the skills of Asset Managers are often neglected by DR Planners, and many Asset Managers do not fully understand their role in Disaster Recovery Planning. Where to start? Understanding your role as an Asset Manager entails knowing the scope and some of the details of recovery planning. For scope, it’s useful to see DR in the context of crisis management.

Almost any type of corporate crisis, such as cash flow, sudden market shift, executive succession, public perception, product failure, labor relations and regulatory changes, can be managed by a small team of experts with minimal pre-arranged resources. The notable exception is an operational crisis, especially one involving a data center. Without resources standing by – like computers, data, and networking – a recovery attempt is bound to fail. Business processes supported by information technologies have acceptable downtimes ranging from weeks to days to microseconds. Information Technology (IT) executives must be prepared to recover functionality within the acceptable downtimes of the processes they support, and most recovery-time objectives (RTOs) cannot be met without substantial assets ready to be used.

IT assets typically depreciate, so, like most assets, the value we get from them is in their use. We usually don’t want to purchase an IT asset (or any asset) and have it sit idle. Yet recoverability requires a margin for error, so that the function of failed equipment can be moved to other equipment that isn’t being fully utilized. Generally, aside from accounting for growth, we don’t want more IT assets than we need, but assets allocated for recovery must be, by their nature, underutilized at the least, or even entirely unused except for testing. Therefore, conscientious Asset Managers need to help keep the purchase and maintenance of recovery assets to a minimum.

IT assets include: (1) Hardware, (2) Software, (3) Data and (4) People. Each of these assets has a financial worth, but together they’re more than that. Combinations of these things give you the functionality you need. For example, networking and communications require a combination of all four.

A disaster, defined for the purpose of IT recovery as something beyond an emergency or component failure, is an event that forces movement of IT functionality to an alternative off-site location. Recovery of a company’s technology infrastructure is central to Business Continuity (BC), the overall picture of which also includes recovery of function outside of IT, though that is not a focus of this discussion.

As indicated previously, the primary purpose of the BC/DR discipline is to be able to recover functionality, not “things,” after a disaster. Of course, you need “things” (assets) in order to recover function. We speak of those things as “recovery resources”. If you have the right resources in place to recover, including human resources, you then have the ability to recover function. BC/DR, therefore, is all about recovering assets – hardware, software, data and people.

Which assets do we need for recovery of a data center and how soon do we need them following a disaster? The answer lies in understanding the business impact. In most disasters, recovering people is mostly a matter of providing workspace and tools. A good deal of the equipment and data recovery assessment is based upon what people need to do their jobs. If an impact analysis were to show that the requirement is to recover everybody and everything immediately, and if we can afford any expense, it might make sense to duplicate the data center in its entirety in another location and set up fully redundant networking. Such requirements apply to organizations like stock exchanges, brokerage firms, cloud providers, and Internet service providers. Most other firms do not require such expense. Some firms need immediate availability for some functions and not for others. That implies that those others have acceptable downtimes and maybe some tolerance for data loss.

Certainly, we want to recover, but we also want to minimize the expense of having assets sit idle until a disaster occurs. Asset recovery is the critical factor in the recovery of functionality, and recovery of functionality within acceptable downtimes is the goal of BC/DR. If we can have fewer assets standing by, but just enough to recover on time, we can minimize pre-disaster expense while assuring appropriate recovery of function within acceptable time frames.

Acceptable downtimes are determined during a Business Impact Analysis (BIA). If you must recover the entire data center immediately (high availability), there’s no need for a BIA for DR, though you will still need one to recover the rest of the business. Under all other circumstances, that is, if you don’t require high availability, good financial practice demands a BIA.

A BIA examines potential losses and the business consequences of those losses, not just in terms of financial worth, but also in terms of corporate image, potential liabilities, and customer service. It’s that examination that ultimately gives us acceptable downtimes for individual business processes, and we use those downtimes to determine how much of each recovery asset is needed over various periods of time following a disaster. Why is that important? Why spend thousands of dollars and months of time conducting a BIA? And why do we need to look at business processes when all we’re concerned about is recovering IT functionality?

The simple answer is that an organization can spend millions of dollars on recovery assets, but it can substantially reduce that amount by limiting what it buys before a disaster. A BIA tells us how to limit those expenditures. The longer we can allow ourselves to wait before recovering an asset post-disaster, the less money we need to spend pre-disaster on that asset. For example, if maximum downtime for a business process is 4 weeks, we know we don’t need assets to recover it in 4 hours. Recovering the IT function that supports that process at an alternative location in 4 hours means having everything in place, including networking, at that location before the red flag goes up. Recovering that same process in 4 weeks means that you have time to purchase equipment and install lines, assuming you’ve considered lead times. If we can avoid purchasing assets pre-disaster, we never have to purchase those assets unless we have a disaster, which, of course, is improbable. However, even if the probability is very low, it’s the impact that concerns us! And, if we do have a disaster, asset procurement can and should be covered by extra expense insurance.
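The trade-off between acceptable downtime and pre-disaster spending can be sketched in a few lines of Python. This is only an illustration: the asset names, the lead times, and the names RecoveryAsset and split_by_timing are hypothetical, not figures from any real BIA. The sketch assumes that an asset has to be standing by only when its procurement lead time exceeds the acceptable downtime of the process it supports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryAsset:
    name: str
    lead_time_hours: float  # assumed time to procure and install after a disaster


# Hypothetical assets and lead times, for illustration only.
ASSETS = [
    RecoveryAsset("servers", 72),         # about 3 days
    RecoveryAsset("network lines", 504),  # about 3 weeks
    RecoveryAsset("desktop PCs", 48),     # about 2 days
]


def split_by_timing(assets, max_downtime_hours):
    """Split asset names into (buy pre-disaster, buy post-disaster).

    An asset must be standing by only if it cannot be procured and
    installed within the acceptable downtime.
    """
    pre = [a.name for a in assets if a.lead_time_hours > max_downtime_hours]
    post = [a.name for a in assets if a.lead_time_hours <= max_downtime_hours]
    return pre, post


# A 4-hour downtime forces every asset to be in place beforehand;
# a 4-week downtime (672 hours) lets all of them wait.
four_hours = split_by_timing(ASSETS, 4)
four_weeks = split_by_timing(ASSETS, 4 * 7 * 24)
```

With these made-up lead times, a 4-hour downtime puts every asset in the pre-disaster list, while a 4-week downtime puts none there, which is the saving a BIA is meant to uncover.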

Because money is an asset not to be used frivolously, if we determine that we need a BIA, we need to consider the cost. So, how can we conduct a BIA cost-effectively? Asset Managers can help both in the identification of critical assets and in quantifying the cost over time of recovering those assets within acceptable downtimes.

In order to fully understand how many of which assets we need to recover a business process within its acceptable downtime, we need two things: (1) the acceptable downtime and (2) the identification of assets needed for recovery. Knowing whom to ask for this information is key to an efficient, cost-effective BIA that achieves valid results. Acceptable downtimes for business processes are obtained from Process Owners, whereas identification and quantification of assets needed for recovery are obtained from Functional Work Unit Managers with the assistance of Asset Managers. Here’s why…

Process Owners are usually able to estimate corporate bottom line losses due to the impairment of their process, but they probably cannot estimate reliably the resources needed to restore the process. That’s the bailiwick of the Functional Work Unit Managers who support the process. Yet those managers can’t tell you how long the whole process can be down. The person who runs Billing, for example, may think that process must be up within two days, since cash flow depends upon it. However, the CFO or director in charge of Payables will tell you that Billing can be down for a month, all things considered. Some of the things to be considered are the company’s credit rating, vendor confidence, interest on borrowing, and asset liquidity. The person in charge of Billing isn’t concerned with those things. Instead, what that person sees is immediate complaints from superiors (probably the Process Owner) when the process is down. The point: the people you need to estimate acceptable downtimes are not the same people who can tell you resource needs to bring a process back up, and vice versa.

Knowing whom to ask for what information is issue #1. Issue #2 is about how to ask.

Let’s look first at getting Maximum Acceptable Downtimes (MAD). This information can be obtained in a facilitated session with Business Process Owners. Those folks understand what their process is for, even though they may not fully comprehend the details of how it functions. With guidance from a facilitator, Process Owners can be directed to determine a truly workable acceptable downtime for their process. This facilitated process is fully described in the book Knowledge at Risk, available on Amazon.com.

Having the Maximum Acceptable Downtime (MAD) that we just determined for any given Business Process, we are now positioned to speak with the people who know how to achieve the MAD, the folks who understand the details of how resources (starting with personnel) are used to fulfill process objectives. These are the people who run the Functional Work Units (FWU) that support the Business Process. We use the MAD as a “stake in the ground” to determine, for example, how many people in the Functional Work Unit are needed in the first 4 hours, next 8 hours, 48 hours, 1 week, 2 weeks, etc. following a disaster. These people also have a better understanding of data loss tolerances than the Process Owner, because the FWU Head knows how data can be retrieved from a variety of sources.
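As a rough sketch of how those answers might be recorded, the Python below stores one Functional Work Unit’s staffing needs by time window and looks up the headcount required at any hour after the disaster. The window boundaries follow the example above (4 hours, the next 8 hours, 48 hours, 1 week, 2 weeks); the headcounts and the names STAFFING, staff_required and added_per_window are hypothetical.

```python
from bisect import bisect_left

# Hypothetical answers from one FWU manager: (window end in hours after the
# disaster, people needed at work during that window).
STAFFING = [(4, 2), (12, 5), (48, 8), (168, 12), (336, 20)]


def staff_required(hours_since_disaster, schedule=STAFFING):
    """Return the headcount needed at a given hour after the disaster."""
    window_ends = [end for end, _ in schedule]
    i = bisect_left(window_ends, hours_since_disaster)
    # Past the last window, assume the final headcount still applies.
    return schedule[min(i, len(schedule) - 1)][1]


def added_per_window(schedule=STAFFING):
    """Return how many additional people must arrive for each window."""
    added, previous = [], 0
    for _, headcount in schedule:
        added.append(headcount - previous)
        previous = headcount
    return added
```

The per-window increments are what drive workspace, PC and network requirements over the recovery timeline, so the same table can be reused for each asset type the unit depends on.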

Once we understand recovery assets (resources for recovery), it’s a matter of developing a viable recovery strategy. That means making choices such as hotsite vs. buildout, PC purchase vs. quick-ship, network rerouting vs. redundant networking, and tape backup vs. virtual tapes vs. VMware solutions, along with choices on any number of other issues that together make up a total strategy. Participating in this analysis is one way that Asset Managers can assist in the planning process.

Having a strategy, we then need to turn that general strategy into a viable plan, investigating options and determining which are most cost-effective, including present-value analysis – a key area in which Asset Managers should be involved.
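A present-value comparison of two options might look like the following minimal Python sketch. Every figure here is an assumption made up for illustration (the 8% discount rate, the purchase price, the maintenance and subscription fees), as are the names present_value, owned and subscribed. It uses the standard discounting formula, in which a cash flow in year t is divided by (1 + rate) raised to the power t.

```python
def present_value(cash_flows, rate):
    """Discount yearly cash flows (year 0 first) back to today."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows))


RATE = 0.08  # assumed discount rate

# Option A (hypothetical): buy standby equipment now, then pay yearly maintenance.
owned = [500_000] + [50_000] * 5
# Option B (hypothetical): pay a yearly hotsite subscription instead.
subscribed = [0] + [150_000] * 5

pv_owned = present_value(owned, RATE)
pv_subscribed = present_value(subscribed, RATE)

# The option with the lower present value of costs is the cheaper one.
cheaper = "subscription" if pv_subscribed < pv_owned else "ownership"
```

Under these invented numbers the subscription has the lower present value, but the result can flip with a longer horizon or different fees, which is why the comparison should be rerun with the organization’s own figures.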

Next, we need to assign people to an IT Recovery Team structure and make them accountable by assigning functions and developing specific instructions for them to follow over a timeline, so that recovery assets are used in the most efficient manner.

For the most part, that’s your Disaster Recovery Plan. Once it’s done, it may look like just another book on a shelf (actually I recommend a plan that fits on your smartphone), but it’s really the documentation of an IT recovery strategy that you know will support the business appropriately, that is, with each critical process recovered within its acceptable downtime. Such a strategy is achieved by thoroughly understanding asset recovery. Asset recovery is the key to Disaster Recovery, and that’s why Asset Managers are so important to the DR planning process.

About Marv Wainschel

Marv Wainschel is the Director of McWains Chelsea