How does consideration of disaster recovery impact data center planning?

How does consideration of disaster recovery impact data center planning, especially one large single vs. multiple data centers?

This question was asked during the Focus webcast "Achieve a Flexible Data Center that Can Scale"

2
John McCoy
Solutions Architect, Perceptive Software
Posted on Sept. 2, 2011

First of all, what an excellent question!

The first thing I like to do when discussing disaster recovery (DR) is to define what it means. DR is often confused or used interchangeably with fault tolerance. DR refers to a system’s ability to resume operation after a disaster. “Disaster” is defined generally as a major catastrophe like a fire, flood, major storm, earthquake, etc. that renders a whole site inoperable. Fault tolerance, on the other hand, refers to a system’s ability to keep operating through failures of the system or its components. This subtle but important difference must be made clear to all parties.

Planning DR should always begin by determining sufficient “geographic diversity” between sites. Geographic diversity usually (but not always) boils down to physical distance. This means that whatever disaster strikes one location should be unlikely to affect the other as well. The ability to relocate critical personnel to the DR location also has to be accounted for.

When considering DR, the architect(s), engineer(s), and business users must also work together to balance the business needs against the costs of DR. The two primary requirements that will drive design and planning are the recovery time objective (RTO) and the recovery point objective (RPO).

RTO refers to the amount of time that can elapse before the DR site is online and serving traffic. Typically, the shorter the RTO, the more expensive the deployment. Further, once the RTO gets below a few minutes, you’re usually looking at an active/active configuration, which has a significant impact on application design and performance.

RPO refers to the point in time from which the DR site picks up, that is, the time since the last refresh from the primary site, and therefore how much recent data can be lost. Like RTO, the smaller this number, the more expensive and complex the deployment will be.
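To make the two objectives concrete, here is a minimal sketch in Python. The class, the function, and every number in it are hypothetical. It assumes that worst-case data loss equals one full replication interval and that recovery time is simply detection time plus failover time, which a real environment may not match.

```python
from dataclasses import dataclass


@dataclass
class DrDesign:
    """Hypothetical description of a DR setup (all values in minutes)."""
    replication_interval_min: float  # time between refreshes from the primary site
    detection_min: float             # time to notice and declare the disaster
    failover_min: float              # time to bring the DR site online


def meets_objectives(design: DrDesign, rto_min: float, rpo_min: float) -> dict:
    """Compare a design against RTO/RPO targets.

    Assumption: worst-case data loss is one full replication interval,
    and recovery time is detection time plus failover time.
    """
    worst_case_data_loss = design.replication_interval_min
    recovery_time = design.detection_min + design.failover_min
    return {
        "recovery_time_min": recovery_time,
        "worst_case_data_loss_min": worst_case_data_loss,
        "meets_rto": recovery_time <= rto_min,
        "meets_rpo": worst_case_data_loss <= rpo_min,
    }


# Illustrative only: nightly replication checked against an 8-hour RTO and a 1-hour RPO.
nightly = DrDesign(replication_interval_min=24 * 60, detection_min=15, failover_min=240)
result = meets_objectives(nightly, rto_min=480, rpo_min=60)
print(result)
```

In this made-up case the design recovers within the RTO but misses the RPO, because up to a day of data could be lost between refreshes.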

Determining how much money should be spent on how much capability is the name of the game. If you’re dealing with a critical application where every minute of downtime can result in thousands of dollars of lost revenue, the expense of low or no RTO/RPO is justified. However, if some downtime can be tolerated, costs can be significantly reduced.
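The trade-off can be sketched as simple arithmetic. The Python below is only an illustration: the option names, yearly costs, recovery times, outage frequency, and revenue figures are all invented, and the model assumes lost revenue grows linearly with recovery time.

```python
def annual_cost(yearly_dr_cost, recovery_hours, outages_per_year, loss_per_hour):
    """Yearly DR spend plus expected yearly revenue lost while recovering."""
    expected_downtime_loss = outages_per_year * recovery_hours * loss_per_hour
    return yearly_dr_cost + expected_downtime_loss


def cheapest_option(options, outages_per_year, loss_per_hour):
    """Return the name of the option with the lowest total yearly cost."""
    return min(
        options,
        key=lambda name: annual_cost(*options[name], outages_per_year, loss_per_hour),
    )


# Hypothetical options: name -> (yearly cost of the DR solution, hours to recover)
options = {
    "no DR site": (0, 72),
    "cold standby": (50_000, 24),
    "warm standby": (200_000, 4),
    "active/active": (900_000, 0.05),
}

# Critical application: roughly $10,000 of lost revenue per minute.
print(cheapest_option(options, outages_per_year=0.5, loss_per_hour=600_000))

# Application that tolerates downtime: $5,000 of lost revenue per hour.
print(cheapest_option(options, outages_per_year=0.5, loss_per_hour=5_000))
```

With these assumed figures the critical application justifies the active/active option, while the tolerant one is cheapest on a cold standby.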

1
John Bagdanov
Chief Technology Advisor, IT Answers 4U
Posted on Aug. 31, 2011

The days of traditional Disaster Recovery (DR) sites should be relegated to Mainframe applications. With today’s technologies a company should never separate data center planning from DR. With clustering available across server, database, and storage technologies, every capital expenditure should begin with some discussion of DR. DR and BC (Business Continuity) come in numerous flavors and can be provided to almost any application with a justifiable cost structure. Whether a company uses on site DR capabilities or off site BC, the costs are significantly less than traditional DR/BC solutions of the past.

1
Andrew Baker
Director, Service Operations, SWN Communications Inc.
Posted on Sept. 2, 2011

Business Continuity, of which Disaster Recovery is but a part, must be the ultimate goal of any infrastructure deployment.

-- How can I build it securely AND with high availability AND with easy access for employees, customers and partners AND with flexibility to address different market realities, including disasters and growth?

This is the question that has to be asked. Separating the infrastructure planning from the ongoing needs of the operation will result in much higher operational costs, and an inability to move quickly when market conditions change, or when disasters strike.

There are lots of ways to go about this, but the bottom line is that the whole organization needs to be involved in the planning, and the planning must account for business needs and goals.

-ASB: http://XeeMe.com/AndrewBaker

1
James Myers
President & CEO, Contingency Now Inc.
Posted on Sept. 7, 2011

I have enjoyed reading the responses to this question. They all have great input and are spot on regarding some key issues with DR and data center planning. From a Contingency Now perspective we always start at the business layer and work our way down to the systems/application level. The type of business with its associated revenue streams will dictate DR requirements, which will then dictate the data center metrics. Hence the RTO and RPO metrics stated by Mr. McCoy. The majority of our past customers ended up with some form of hybrid approach to their DR solutions. No two business operations are exactly alike.

New technologies can and do support diversification across data centers (clustering). Within the SMB market space, the majority of business owners either don't need data center diversification or are not willing to pay for the added expense, unless of course their business model demands an active/active environment. So, without having to write a small book on this topic, here are some pointers when discussing, reviewing or investing in a DR solution for a single or multiple data center environment. Most of the line items should be included in a Business Impact Analysis (BIA):

Identify the key products and/or services that drive the most revenue. The 80/20 rule applies here: 80% of your revenues come from 20% of your customers.
Identify the key business processes that support the above listed products and services.
Identify the underlying technologies that support the above listed processes.
Identify the type and level of knowledge workers to support all the above.
Assess the required up-time and down-time for each key product/service. You can usually get this information from your key customers.
From all the above, analyze and assess the technical infrastructure and architecture that is required to support this business environment. Don't forget electronic and physical security!
Weigh the type and amount of risk inherent with outsourcing to a third party data center.
Invest in an overall technology and work group recovery solution that best fits the company's needs and customers' expectations.
Validate your solution through DR/BC exercises.
Invest in business recovery continuous improvement - your customers, suppliers, investors, employees and extended family members will love you for it.

If the company goes through an M&A, develops a new product/service or is looking at selling, then you may have to do everything listed above "again".
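As a rough illustration of the first step in the list above, the Python sketch below picks the smallest set of products that covers a chosen share of revenue. The product names and revenue figures are invented, and applying the 80% cutoff to products rather than customers is an assumption made for the example.

```python
def key_revenue_drivers(revenue_by_product, share=0.8):
    """Return the top-earning products that together cover `share` of revenue."""
    total = sum(revenue_by_product.values())
    ranked = sorted(revenue_by_product.items(), key=lambda item: item[1], reverse=True)
    selected, running = [], 0
    for name, revenue in ranked:
        if running >= share * total:
            break  # the products picked so far already cover the target share
        selected.append(name)
        running += revenue
    return selected


# Hypothetical yearly revenue per product or service.
revenue = {
    "online store": 6_000_000,
    "support contracts": 2_500_000,
    "training": 800_000,
    "consulting": 500_000,
    "merchandise": 200_000,
}

print(key_revenue_drivers(revenue))
```

The resulting short list is where the remaining steps (supporting processes, technologies, knowledge workers, and required up-time) would be examined first.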

I trust this helps.
Thank you,
James M. Myers

0
Dan Snyder
Director of Technical Operations
Posted on Sept. 1, 2011

The key to this is application design.

If your application is nicely distributed, i.e. knows how to use multiple databases fairly seamlessly, then it makes all the sense in the world to design multiple data centers out of the gate for redundancy purposes.

However, most applications are designed needing to be very close (i.e. very low latency) to a single core database server, and that means that you are stuck with one ever-growing single data center until your team figures out how to distribute your application.

Using Amazon's EC2 or Rackspace's Cloud Servers as an "on demand" or "temporary failover" DR site is a good idea if you're stuck in the single data center world due to application design or database constraints.

0
Art Carapola
President and CTO, NewVista Advisors
Posted on Sept. 8, 2011

I’m going to take a slightly different, more hardware specific perspective in answering this question since the Webinar was about building a data center.

Regarding support for Disaster Recovery in a NEW data center, I begin the process by focusing on redundancies and diversities to minimize the potential for outage. I am not jumping straight to “build a tier 4 data center”, just making some good choices. Examples include multiple risers for power and communications, and paying attention to potential building maintenance. Most of this is the obvious stuff.

Proceeding up the technology layers, I look at such items as electrical panel layout in relation to the cabinets. Within a data center I try to separate servers in a cluster into cabinets fed by different electrical panels, so that losing a main breaker on one panel does not bring down the entire cluster. There are many different design rules such as this that will help you avoid a disaster in the first place, which is the obvious goal.
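A design rule like this one can be checked mechanically. The sketch below, in Python, flags any cluster whose servers all draw power from a single electrical panel. The cluster, server, cabinet, and panel names are hypothetical, and a real check would read them from an inventory system rather than hard-coded dictionaries.

```python
def clusters_without_panel_diversity(cluster_members, server_cabinet, cabinet_panel):
    """Return clusters whose servers are all fed by the same electrical panel."""
    at_risk = []
    for cluster, servers in cluster_members.items():
        # Follow each server to its cabinet, then to the panel feeding that cabinet.
        panels = {cabinet_panel[server_cabinet[server]] for server in servers}
        if len(panels) < 2:
            at_risk.append(cluster)
    return sorted(at_risk)


# Hypothetical layout.
cluster_members = {"db": ["db1", "db2"], "web": ["web1", "web2"]}
server_cabinet = {"db1": "A01", "db2": "B07", "web1": "A02", "web2": "A03"}
cabinet_panel = {"A01": "P1", "A02": "P1", "A03": "P1", "B07": "P2"}

print(clusters_without_panel_diversity(cluster_members, server_cabinet, cabinet_panel))
```

Here the "web" cluster would be reported, since losing the main breaker on panel P1 takes down both of its servers.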

At a higher level I would ask if there is a disaster recovery plan and site in place. If there is, this data center will need to fit into the overall DR architecture that is part of the plan. There are a few challenges here, including changes to data replication, especially during the migration period to the new data center. If there are multiple data centers, you need to look to see if DR is achieved through paired data centers, each data center acting as a DR site for the other data center. If so, and if the new data center is planned as a consolidation of the two paired data centers, you now need to define a new strategy for a fail-over site.

If the new data center is a single data center for the company and there is no current DR site, then you will need to go through the classic Disaster Recovery planning that has been discussed so far.

Whether the DR environment is a shared (paired) data center or a dedicated DR site, many implementations are now virtualized, allowing the DR environment to be quickly implemented on repurposed hardware (hardware moved from low criticality production into DR support for the failed site). As the organization is developing a virtualization strategy, the requirements for DR should be considered.

Communications capacity is an issue to focus on. If the fail-over site is one of multiple company data centers, the communications environment will need to support both the normal operational communications traffic from its primary customer base and the communications traffic from the site being failed-over. This includes the circuits themselves, as well as the network edge infrastructure (firewalls, VPN devices, routers, etc.) that secure the environment and support the line.
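A back-of-the-envelope version of this capacity check might look like the Python sketch below. The component names and throughput figures are made up, and it makes the simplifying assumption that every component has to carry the combined peak traffic of both sites.

```python
def capacity_shortfalls(component_capacity_mbps, own_peak_mbps, failed_site_peak_mbps):
    """Return the shortfall in Mbps for each component that cannot carry both loads."""
    combined_peak = own_peak_mbps + failed_site_peak_mbps
    return {
        name: combined_peak - capacity
        for name, capacity in component_capacity_mbps.items()
        if capacity < combined_peak
    }


# Hypothetical capacities at the fail-over site, in Mbps.
capacities = {"WAN circuits": 1000, "firewall": 2000, "VPN devices": 800}

print(capacity_shortfalls(capacities, own_peak_mbps=450, failed_site_peak_mbps=400))
```

In this example the circuits and firewall have headroom, but the VPN devices fall short once the failed site's traffic arrives.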

Regarding the discussion around applications and application distribution, the comments are of course completely valid. If you have an application that is clustered in some geographically distributed fashion, Disaster Recovery becomes more of an application and network level exercise than a data center specific planning effort. As valid as this discussion is, I see it as wishful thinking in the vast majority of the environments I have seen and been involved in. In order to achieve this clustered operation and failover, changes to the core architecture of business apps are usually required and that is just not on anyone’s radar when DR is considered an insurance policy.

Finally, pay attention to data center monitoring and alerting, first to understand when something is going bad and address it before it becomes a disaster, and second to alert on failed environments quickly. In addition to the IT alerts, there should be a robust environmental monitoring facility. This should include temperature, humidity, CRAC availability, fluid leak detection, power/UPS monitoring, etc. All of this monitoring is important but it needs to be supported by well defined and embraced processes.