Failover Solutions

What is Failover?

Failover is the operational process of switching from a primary system or system component (a server, processor, network, or database) to a secondary one in the event of downtime. Such downtime can be caused by scheduled maintenance or by an unexpected system or component failure.

In either case, the objective is to create fault tolerance: to ensure that mission-critical applications or systems remain available, regardless of the type or extent of the fault. In the larger picture, failover is a key component of business continuity plans, especially for businesses that depend heavily on computing.

Failover - Availability Through Redundancy

High Availability and Web Server Failover

In the case of web-centric businesses or enterprise applications, web server uptime is mission-critical as it affects all aspects of activity. From business performance to client trust, retention and relations, high availability is the one core metric online businesses can’t afford to ignore.

“High availability” refers to a system design approach or service level agreement that aims to guarantee a predefined level of operational performance. In simpler terms, the term is applied to any system or process whose degradation or stoppage would negatively affect revenue, customer satisfaction, employee productivity, or brand.
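
To make a "predefined level" concrete, an availability target can be translated into a downtime budget. The sketch below is illustrative only: the function name and the sample targets are assumptions, not figures from any particular service level agreement.

```python
def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = 365 * 24) -> float:
    """Return the downtime budget (in hours) implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    # The budget is the share of the period the service may be unavailable.
    return period_hours * (1 - availability_percent / 100)


# Sample targets chosen for illustration.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% availability -> {allowed_downtime_hours(target):.2f} hours of downtime per year")
```

Each additional "nine" cuts the yearly downtime budget by a factor of ten, which is one reason the cost of a failover strategy tends to rise steeply with the availability target.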

To maintain high availability during both scheduled maintenance and unscheduled downtime, an online business needs a solid web server recovery strategy.

Ideally, web server failover should be a seamless process, insulating the end user from any degradation of service.

However, maintaining absolute seamlessness can be complex and expensive. Therefore, organizations usually evaluate the cost-benefit of system importance versus overall downtime impact, and choose one of three possible web server failover levels:

  • Cold Failover – the simplest and generally least expensive approach involves maintaining a parallel standby system, which remains dormant in the absence of failure. In the event of failure, the secondary system is started up, brought online, and assumes the production workload.
    The downside of a cold failover strategy is that downtime, while likely much lower than in a “no failover” scenario, can be significant, with a notable negative impact on quality of service. Because of the costs associated with maintaining and synchronizing dedicated servers, a cold failover strategy is most suitable for implementation in data centers.
  • Warm Failover – this approach leverages a standby system that is rapidly available in the event of production system failure. Although some lag may be experienced by users in the event of failure, warm failover minimizes the negative performance impact of system or component failure, ensuring cost-effective continuity of service.
    Similar to cold failover, this strategy is most suitable for implementation in data centers owing to the high costs associated with synchronizing and maintaining dedicated servers.
  • Hot Failover – this more complex and costly approach comes as close as possible to ensuring 100% uptime. In this scenario, two parallel systems or system components run continuously: one production, one backup. The backup system is constantly synchronized, and provisioned to take over from the production system at any time.
    A hot failover strategy is generally applied to individual servers, since the cost of applying the method to an entire data center would be prohibitive, except in the most extreme high availability cases.
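
The switching logic common to warm and hot failover can be sketched as a small controller that promotes the standby once the active system fails several consecutive health checks. This is a minimal sketch, not a production design: the class name, the `is_healthy` probe, the host names, and the threshold of three failures are all assumptions made for illustration.

```python
from typing import Callable


class FailoverController:
    """Promote the standby after several consecutive failed health checks."""

    def __init__(self, primary: str, standby: str,
                 is_healthy: Callable[[str], bool],
                 failure_threshold: int = 3) -> None:
        self.primary = primary
        self.standby = standby
        self.is_healthy = is_healthy            # hypothetical probe supplied by the caller
        self.failure_threshold = failure_threshold  # assumed value, tune per deployment
        self.active = primary
        self._failures = 0

    def tick(self) -> str:
        """Run one health-check cycle and return the system that should serve traffic."""
        if self.is_healthy(self.active):
            self._failures = 0
        else:
            self._failures += 1
            # Only fail over from the primary; there is nothing behind the standby.
            if self._failures >= self.failure_threshold and self.active == self.primary:
                self.active = self.standby
                self._failures = 0
        return self.active


# Usage: the primary is marked as down, so the third failed check triggers failover.
down = {"primary.example"}
controller = FailoverController("primary.example", "standby.example",
                                lambda host: host not in down)
print([controller.tick() for _ in range(4)])
```

Requiring several consecutive failures is a common way to avoid failing over on a single dropped probe; the trade-off is that a higher threshold lengthens the time before the standby takes over.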

Determining Web Server Failover Strategy

When choosing between the various web server failover strategies, organizations generally attempt to calculate RTO (Recovery Time Objective) and RPO (Recovery Point Objective). Taking these two values into account, it is simpler to choose the most cost-effective web server failover approach:

  • RTO is the time period within which service must be restored to avoid unacceptable consequences. By calculating how rapid recovery from web server failure should be, organizations can know what level of preparation is required. If RTO is 10 minutes, then significant investment in disaster recovery would be required. For an RTO of 36 hours, a significantly lower investment would be warranted.
  • RPO is the maximum tolerable period in which data may be lost. Basically, RPO defines how much data an organization can afford to lose (an hour, a day, a week?). Based on this, optimum backup frequency and recovery speed can be determined, part of which is, of course, web server failover level.
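
As a rough illustration of how these two values feed the decision, the sketch below maps an RTO to a failover level and checks a backup schedule against an RPO. The cut-off values are assumptions chosen for the example; the right thresholds depend on each organization's own cost-benefit analysis.

```python
from datetime import timedelta


def suggest_failover_level(rto: timedelta) -> str:
    """Map a Recovery Time Objective to a failover level (illustrative thresholds)."""
    if rto <= timedelta(minutes=1):
        return "hot"    # near-zero tolerance for downtime
    if rto <= timedelta(hours=1):
        return "warm"   # a short lag is acceptable
    return "cold"       # there is time to start a dormant standby


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss is the time since the last backup,
    so the backup interval must not exceed the RPO."""
    return backup_interval <= rpo


print(suggest_failover_level(timedelta(minutes=10)))
print(suggest_failover_level(timedelta(hours=36)))
print(meets_rpo(timedelta(hours=24), timedelta(hours=4)))
```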

Hardware vs. Cloud Web Server Failover Options

Traditionally, web server failover solutions were hardware-based. However, appliance-based solutions are inevitably more costly and demanding from a setup and maintenance point of view, and thus have been out of reach for many organizations.

Now, the advent of widespread cloud computing is changing all that – providing a better next generation alternative to on-premises appliances, and offering several notable advantages:

  • No Colocation Issues – unlike on-premises appliances, cloud failover solutions don’t share physical space with the production server, providing a much safer alternative. In the event of a failure, be it a power outage, DDoS attack, earthquake, or any other catastrophic event, the cloud disaster recovery mechanism remains unaffected and active when it is needed most.
  • Economy of Scale Pricing – cloud failover is far more cost-effective, since shared resources enable “economy of scale” pricing, and cloud resources are externally managed, markedly lowering maintenance overhead.
  • Seamless – modern cloud solutions enable nearly seamless transfer of traffic from a point of failure in one data center to a healthy parallel system in another data center. Such services are not Time to Live (TTL)-reliant and provide excellent RTO, offering nearly uninterrupted service to customers.

DNS Failover Versus Cloud Failover

DNS failover is an alternative to appliance-based failover solutions. It is designed so that if a pre-defined website, service, or internet connection goes offline, traffic is automatically re-routed to a secondary IP address, server, or provider.

DNS failover solutions generally include health-checking agents which monitor the availability of each application location (also referred to as application "endpoint"). In the event that an endpoint fails, these solutions route traffic away from the failed endpoint to one or more other, healthy endpoints, generally using a simple “round robin” methodology.
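
The routing behavior described above can be sketched as a round-robin rotation that skips endpoints failing their health check. This is a simplified model rather than any DNS provider's implementation: the function name, the `is_healthy` callback, and the sample addresses (taken from a documentation-only IP range) are assumptions.

```python
from itertools import cycle
from typing import Callable, Iterable, Iterator


def healthy_round_robin(endpoints: Iterable[str],
                        is_healthy: Callable[[str], bool]) -> Iterator[str]:
    """Yield endpoints in rotation, skipping any that fail the health check."""
    pool = list(endpoints)
    rotation = cycle(pool)
    while True:
        # Try each endpoint at most once per request before giving up.
        for _ in range(len(pool)):
            candidate = next(rotation)
            if is_healthy(candidate):
                yield candidate
                break
        else:
            raise RuntimeError("no healthy endpoints available")


# Usage: the second endpoint is marked as failed and is skipped in the rotation.
failed = {"203.0.113.20"}
picker = healthy_round_robin(
    ["203.0.113.10", "203.0.113.20", "203.0.113.30"],
    lambda ip: ip not in failed,
)
print([next(picker) for _ in range(4)])
```

Note that this only models which address the DNS service would hand out for new lookups. It says nothing about clients that already hold a cached answer.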

Although DNS failover solutions can be effective for simpler, less-trafficked applications or websites, they have a few crucial limitations that lower their overall efficacy for mission-critical deployments. Most importantly, in a DNS failover scenario one cannot determine in advance the percentage of users who will continue to receive cached DNS data, with varying amounts of Time to Live (TTL) left.

The DNS Compromise: Costly appliances, split architecture, upstream caching issues

As a result, until the TTL expires, visitors holding a cached record will still be directed to the dead server. Even if the TTL is set to a low value, which can negatively impact performance, the possibility of some users “getting lost” still exists, which is unacceptable for business-critical applications.
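
The effect of TTL can be approximated with a deliberately simple model. Assuming that cached records were fetched at times spread uniformly across one TTL window before the DNS record changed, and that every resolver honors the TTL exactly, the share of clients still holding the old record falls linearly to zero over one TTL. Both assumptions are idealizations, and the function name is hypothetical.

```python
def stale_fraction(seconds_since_update: float, ttl_seconds: float) -> float:
    """Estimate the share of caches still pointing at the failed server.

    Assumes cache fill times are uniformly distributed over one TTL window
    and that resolvers expire records exactly on time.
    """
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    remaining = 1 - seconds_since_update / ttl_seconds
    # Clamp to the valid range for a fraction.
    return min(1.0, max(0.0, remaining))


# Thirty seconds after the record changes, for three sample TTL values.
for ttl in (60, 300, 3600):
    print(f"TTL {ttl}s: about {stale_fraction(30, ttl):.0%} of caches still stale")
```

Under this model a shorter TTL shrinks the affected share faster, at the cost of more frequent DNS lookups. In practice the exact share cannot be known ahead of time, which is the limitation described above.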

In contrast, cloud-based web server failover solutions deliver all the advantages of remote deployment, with none of the faults inherent in DNS-based alternatives.

Cloud is not TTL-reliant and is flexible enough to manage complex policies, marrying the best of appliance and DNS solutions.

Cloud-based services also offer far more sophisticated load balancing management: policies can be optimized on the fly and on a session-by-session basis, as opposed to DNS failover’s simplistic “round robin” management, which relies on rigid, predetermined policies.