
VMware High Availability vs Fault Tolerance vs Disaster Recovery

by Graham Barker
Published on October 31, 2022

VMware offers several great products with a key focus on keeping your workloads online. In this post, we’ll discuss which technologies we can utilize to maintain high levels of availability and zero downtime, as well as our options when a disaster strikes.

VMware’s High Availability Solution Included With vSphere

Both the Standard and Enterprise Plus licensed versions of VMware’s vSphere solution include High Availability options.

High Availability kicks in when an ESXi host fails. Many things happen behind the scenes, but the main point is that virtual machines that were running on the failed host are automatically restarted on other hosts within the same cluster.

This feature is not enabled by default. To use it, you need to create a cluster containing more than one ESXi host and enable the High Availability option in the cluster's configuration settings.
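To make the restart behavior concrete, here is a minimal sketch in plain Python. It does not call any VMware API. The `Host` class, the `restart_after_failure` function, and the "most free memory first" placement rule are all assumptions made for illustration; real HA uses its own placement and admission control logic.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Host:
    """Hypothetical model of an ESXi host: total memory and the VMs it runs."""
    name: str
    memory_gb: int
    vms: Dict[str, int] = field(default_factory=dict)  # VM name -> memory in GB

    @property
    def free_gb(self) -> int:
        return self.memory_gb - sum(self.vms.values())


def restart_after_failure(hosts: List[Host], failed: str) -> Dict[str, Optional[str]]:
    """Return {vm_name: new_host_name or None} for every VM on the failed host."""
    if len(hosts) < 2:
        raise ValueError("HA needs a cluster with more than one host")
    failed_host = next(h for h in hosts if h.name == failed)
    survivors = [h for h in hosts if h.name != failed]
    placement: Dict[str, Optional[str]] = {}
    # Place the largest VMs first so they are less likely to be left stranded.
    for vm, mem in sorted(failed_host.vms.items(), key=lambda kv: -kv[1]):
        target = max(survivors, key=lambda h: h.free_gb)
        if target.free_gb >= mem:
            target.vms[vm] = mem
            placement[vm] = target.name
        else:
            placement[vm] = None  # no surviving host has enough free memory
    failed_host.vms.clear()
    return placement
```

A VM mapped to `None` is one the surviving hosts cannot fit, which is why spare capacity in the cluster matters as much as switching the feature on.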

vSphere High Availability Configuration is Simple

There are a series of configurable options for vSphere High Availability that are designed to be easy and quick to roll out. 

  • Failure Response. Should an ESXi host fail, you can choose to restart VMs on other operational hosts in the cluster or turn off the feature altogether.

  • Response for Host Isolation. If a host becomes isolated from other ESXi hosts on the network but is still online, you can choose to do nothing, power off and restart VMs on other hosts or attempt a clean shutdown of the VMs and restart them on other hosts.

  • Datastore with PDL. vSphere can also monitor datastore health and act on it. If a host permanently loses access to a datastore (Permanent Device Loss), vSphere can attempt to restart the affected virtual machines on other hosts that still have access to the datastore.

  • Datastore with APD. Should virtual machines be running on an ESXi host that suffers an All Paths Down event (APD), then similar options are available to configure as with PDL. Here you can choose how long to wait before vSphere responds to the APD event. This option is available because APD issues can recover automatically.

  • VM Monitoring. Virtual machine monitoring lets you automatically reset a virtual machine if the host has not received a VMware Tools heartbeat within a configurable period. This can be useful if a virtual machine becomes unresponsive or has a BSOD event. VM Monitoring should be configured with care to prevent false positives.
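The VM Monitoring decision in the last bullet can be sketched as a small rule. This is an illustration in plain Python, not VMware's implementation. The `VmMonitoringPolicy` fields and their default values are assumptions chosen to show how a grace period and a reset cap help avoid false positives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmMonitoringPolicy:
    """Hypothetical policy; the default values are illustrative only."""
    failure_interval_s: int = 30  # heartbeat silence tolerated before acting
    min_uptime_s: int = 120       # grace period after power-on
    max_resets: int = 3           # cap on automatic resets per VM


def should_reset(policy: VmMonitoringPolicy, uptime_s: int,
                 seconds_since_heartbeat: int, resets_so_far: int) -> bool:
    """Decide whether a VM that has gone quiet should be reset."""
    if uptime_s < policy.min_uptime_s:
        return False  # still booting, VMware Tools may not be running yet
    if resets_so_far >= policy.max_resets:
        return False  # stop resetting in a loop and leave it for an admin
    return seconds_since_heartbeat >= policy.failure_interval_s
```

A short failure interval reacts faster but is more likely to reset a VM that is merely busy, which is the trade-off behind the advice to configure this option with care.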

One thing to take into consideration is that while VMware’s High Availability feature is for “high” availability, it is not continuous, 100% availability. For this, we need to use something else. Fault tolerance is one such option provided by VMware.

What is VMware Fault Tolerance?

HA is a great vSphere feature, but a reactive one. You'll be able to maintain reasonable virtual machine availability, but depending on your infrastructure, virtual machines could take 15 minutes or more to come back online. This is where the power of fault tolerance comes in.

VMware’s Fault Tolerance solution creates a copy of a virtual machine on another host in the same cluster, ideally on separate storage. Everything that happens within the virtual machine, including all storage, memory pages, and CPU cycles, is transferred rapidly to the secondary virtual machine, maintaining an exact replica of the primary at all times.

Should the host running the primary virtual machine suffer an outage, the secondary virtual machine is instantly made primary and continues as if nothing happened.

What is vSphere Fault Tolerance Useful for?

This kind of technology is best suited to applications that do not have built-in clustering support. Most often these are legacy applications that require the highest levels of availability but cannot be redeveloped to provide it themselves.

What are the VMware Fault Tolerance Requirements?

Fault Tolerance can be complex to configure as there are several requirements and limits. For vSphere 7.x environments the following need to be taken into account:

  • CPUs must be Intel Sandy Bridge or later OR AMD Bulldozer or later

  • Low-latency networking of at least 10Gbit is required for FT. While not required, a dedicated network is recommended for best FT performance.

  • A recommended maximum of 4 FT VMs per host is in place but can be overridden.

  • A recommended maximum of 8 FT-enabled vCPUs per host is advised.

  • On the licensing side, Standard and Enterprise versions of vSphere support up to 2 vCPUs per FT-protected VM. For Enterprise Plus, this is increased to 8 vCPUs per FT-protected VM.

  • Hosts need to be certified for Fault Tolerance. You can check this in the VMware Hardware Compatibility List.
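The numeric limits above lend themselves to a quick pre-flight check. The sketch below is plain Python with hypothetical names (`ft_problems`, `LICENSE_VCPU_LIMIT`). It assumes the license limit is a cap on the vCPUs of a single FT-protected VM, and it treats the recommended per-host maximums as hard limits even though they can be overridden.

```python
from typing import List

# Assumption: the license caps the vCPU count of one FT-protected VM.
LICENSE_VCPU_LIMIT = {"standard": 2, "enterprise": 2, "enterprise_plus": 8}
MAX_FT_VMS_PER_HOST = 4      # recommended maximum, can be overridden
MAX_FT_VCPUS_PER_HOST = 8    # recommended maximum
MIN_NIC_GBIT = 10


def ft_problems(edition: str, vm_vcpus: int,
                host_ft_vm_vcpus: List[int], nic_gbit: float) -> List[str]:
    """List the reasons a VM should not be FT-protected on this host.

    host_ft_vm_vcpus holds the vCPU count of each FT VM already on the host.
    An empty result means none of the checks modelled here failed.
    """
    problems = []
    if vm_vcpus > LICENSE_VCPU_LIMIT[edition]:
        problems.append(f"{edition} license allows {LICENSE_VCPU_LIMIT[edition]} vCPUs")
    if nic_gbit < MIN_NIC_GBIT:
        problems.append("FT needs at least 10Gbit networking")
    if len(host_ft_vm_vcpus) + 1 > MAX_FT_VMS_PER_HOST:
        problems.append("host would exceed 4 FT VMs")
    if sum(host_ft_vm_vcpus) + vm_vcpus > MAX_FT_VCPUS_PER_HOST:
        problems.append("host would exceed 8 FT vCPUs")
    return problems
```

For example, `ft_problems("standard", 4, [], 10)` reports only the license problem, because a 4-vCPU VM fits the host limits but not the assumed 2-vCPU license cap. CPU generation and hardware certification are not modelled here and still need checking against the compatibility list.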

VMware Disaster Recovery Options Enable Complete Recovery

When we encounter localized failures, HA and FT can work very well. However, what if there is a complete disaster at the data center or storage layer?

To fully protect workloads, it’s important to consider an off-site replica of virtual machines. This is beneficial because, providing you have ESXi hosts and storage at the secondary site, VMware’s DR solutions can replicate your virtual machines to the other location. With the click of a button, a full DR invocation can be started.

There are two VMware technologies that we can review when it comes to VMware DR. The first is vSphere Replication and the second is Site Recovery Manager.

What is vSphere Replication?

VMware’s vSphere Replication is included with vCenter Server Standard and replicates virtual machines to the same vCenter server, or another vCenter server in another location. vSphere Replication can recover virtual machines one at a time but not in batches.

vSphere Replication can sync a virtual machine to the secondary “DR” site and also enable point-in-time recovery snapshots. This is particularly useful if you need to recover VMs to an earlier point in time. 

vSphere Replication is also customizable in that you can change the frequency between syncs, also referred to as the RPO (Recovery Point Objective). You can select an RPO anywhere between five minutes and 24 hours, depending on how much data loss you can tolerate during a full DR event. Point-in-time recovery is also a quick way to protect somewhat against CryptoLocker-style attacks, although there are more bespoke solutions for this.
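The relationship between the RPO, data loss, and point-in-time recovery can be sketched in a few lines of Python. The function names are hypothetical, and the sketch treats the RPO as a fixed sync interval, which is a simplification.

```python
from datetime import datetime, timedelta
from typing import List, Optional

MIN_RPO = timedelta(minutes=5)
MAX_RPO = timedelta(hours=24)


def validate_rpo(rpo: timedelta) -> timedelta:
    """Reject an RPO outside the five-minute to 24-hour range."""
    if not MIN_RPO <= rpo <= MAX_RPO:
        raise ValueError("RPO must be between 5 minutes and 24 hours")
    return rpo


def data_lost(last_sync: datetime, disaster_at: datetime) -> timedelta:
    """Changes made after the last completed sync are lost in a failover."""
    return disaster_at - last_sync


def pick_recovery_point(points: List[datetime],
                        compromised_at: datetime) -> Optional[datetime]:
    """Choose the newest point-in-time copy taken before the compromise."""
    clean = [p for p in points if p < compromised_at]
    return max(clean) if clean else None
```

With syncs every 15 minutes, a disaster seven minutes after the last sync loses seven minutes of changes, and in the worst case the loss approaches the full RPO. `pick_recovery_point` shows why retained point-in-time copies help with ransomware: you recover to the last copy that predates the infection rather than the most recent one.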

The requirements for vSphere Replication are straightforward.

  • Enough bandwidth between site A (Protected) and B (secondary) for the replication traffic

  • A vCenter Server on each of the sites

  • One or more hosts on each site

  • Storage Array, or vSAN, or a combination at both sites

  • vSphere Replication appliance installed on both sites

  • For licensing, vSphere Replication is included with vCenter Server Standard
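The "enough bandwidth" requirement can be estimated with simple arithmetic: the data that changes during one RPO window has to cross the link within that window. The sketch below is a rough sizing aid. The function names and the 70% usable-link figure are assumptions, and it ignores compression, protocol overhead, and bursts in change rate.

```python
def required_mbps(changed_gb: float, rpo_minutes: float) -> float:
    """Average throughput needed to ship one RPO window's changes in time."""
    if changed_gb < 0 or rpo_minutes <= 0:
        raise ValueError("changed_gb must be >= 0 and rpo_minutes must be > 0")
    megabits = changed_gb * 8 * 1000  # decimal gigabytes to megabits
    return megabits / (rpo_minutes * 60)


def link_is_sufficient(link_mbps: float, changed_gb: float,
                       rpo_minutes: float, usable_fraction: float = 0.7) -> bool:
    """Compare the need with the share of the link assumed to be usable."""
    return link_mbps * usable_fraction >= required_mbps(changed_gb, rpo_minutes)
```

For example, 9 GB of changes per 15-minute window works out to 80 Mbit/s on average, so a 100 Mbit/s link with 70% usable capacity would fall behind, while a 1 Gbit/s link would cope comfortably.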

What is the Difference Between vSphere Replication and Site Recovery Manager?

Site Recovery Manager (SRM) is the solution used to orchestrate virtual machine recovery. With SRM you can invoke DR, which powers on all replicated VMs (or a group of VMs) in a specific order. Advanced features let you change IP addresses and other virtual machine characteristics during the recovery event where necessary.
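The idea of an ordered recovery can be modelled in a few lines. This is a plain-Python illustration of the concept, not the SRM API. The `ProtectedVm` class, its `priority` field, and the optional `recovery_ip` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ProtectedVm:
    """Hypothetical entry in a recovery plan."""
    name: str
    priority: int                      # lower numbers power on first
    recovery_ip: Optional[str] = None  # new address at the DR site, if any


def power_on_batches(vms: List[ProtectedVm]) -> List[List[str]]:
    """Group VM names into batches, one batch per priority, in start order."""
    ordered = sorted(vms, key=lambda v: (v.priority, v.name))
    return [[v.name for v in group]
            for _, group in groupby(ordered, key=lambda v: v.priority)]


def ip_changes(vms: List[ProtectedVm]) -> Dict[str, str]:
    """Collect the VMs that need a different IP address after recovery."""
    return {v.name: v.recovery_ip for v in vms if v.recovery_ip is not None}
```

Putting the database in priority 1, application servers in priority 2, and web servers in priority 3 yields batches that start in dependency order. An orchestration tool spares you from doing exactly this by hand during a disaster.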

Another useful feature of SRM is the ability to run non-disruptive DR tests. A DR invocation test can be initiated, but this time VMs are powered on in an isolated test network so that they can be tested but without connecting back to live systems. Reports can be created after a test to show management the outcome and the recovery time required.

As well as SRM, you need a tool to copy the data from the primary to a secondary site. This is where vSphere Replication comes in. Alternatively, you can use storage array replication for this task. However, many admins prefer the granular approach of vSphere Replication, which lets you choose individual VMs for replication and configure them separately.

For Site Recovery Manager you will need at least:

  • A supported replication solution installed (Supported array-based replication or vSphere Replication) 

  • An SRM installation at both sites

Wrapping Up

As discussed, HA allows for a reasonably good level of availability for virtual machines. VMs are automatically rebooted on other hosts if their host should fail. HA can further protect VMs by reacting to storage failures in various ways.

In contrast, FT syncs a virtual machine’s CPU, memory, and storage to a secondary virtual machine on another host in the same cluster.

Should there be a host issue or a similar infrastructure failure, the secondary VM can be made primary at a moment’s notice, ensuring excellent availability for FT-protected virtual machines.

HA is simple to configure and comes with both the Standard and Enterprise Plus editions of vSphere. It’s worth noting that FT is more complex and requires a certain type of hardware and configuration to operate correctly.

The other main difference between HA and FT is the licenses required. A Standard vSphere license limits each FT-protected VM to only 2 vCPUs, whereas an Enterprise Plus vSphere license raises the limit to 8 vCPUs per FT-protected VM.

When it comes to Disaster Recovery, VMware’s vSphere Replication option, combined with Site Recovery Manager, allows for a point-in-time recovery to another data center in just a few clicks.

DR is therefore different from FT and HA because it involves some data loss. The amount of data lost in a full-scale disaster can be minimized by reducing the time between synchronizations.
