VMware HA and FT are often misunderstood. Here I take a deep dive into what FT provides; read this article for a deep dive into HA.
What is vSphere FT
vSphere Fault Tolerance (FT) is intended for your most mission-critical virtual machines. FT provides continuous availability for a virtual machine by creating and maintaining a second, identical VM that is continuously available to replace it in the event of a failover.
The protected virtual machine is called the Primary VM. The duplicate virtual machine, the Secondary VM, is created and runs on another host in the same vSphere HA cluster. The Primary VM is continuously replicated to the Secondary VM so that the Secondary VM can take over at any point, thereby providing Fault Tolerant protection.
The Primary and Secondary VMs continuously monitor the status of one another to ensure that Fault Tolerance is maintained. A transparent failover occurs if the host running the Primary VM fails, or encounters an uncorrectable hardware error in the memory of the Primary VM, in which case the Secondary VM is immediately activated to replace the Primary VM. A new Secondary VM is started and Fault Tolerance redundancy is reestablished automatically. If the host running the Secondary VM fails, it is also immediately replaced. In either case, users experience no interruption in service and no loss of data.
A fault tolerant virtual machine and its secondary copy are not allowed to run on the same host. This restriction ensures that a host failure cannot result in the loss of both VMs.
Fault Tolerance avoids “split-brain” situations, which can lead to two active copies of a virtual machine after recovery from a failure. Atomic file locking on shared storage is used to coordinate failover so that only one side continues running as the Primary VM and a new Secondary VM is respawned automatically.
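The general idea of using an atomic lock on shared storage to pick exactly one winner can be illustrated with a small sketch. This is not how vSphere implements it; it only demonstrates the technique, using exclusive file creation in a local temporary directory that stands in for shared storage. The function name `try_become_primary` and the lock file name are hypothetical.

```python
import os
import tempfile


def try_become_primary(lock_path, owner):
    """Atomically create the lock file; only one caller can succeed."""
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so file
        # creation acts as an atomic test-and-set.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(owner)  # record who holds the lock
    return True


def run_demo():
    """Two candidates race for the lock; return who won."""
    with tempfile.TemporaryDirectory() as shared:  # stand-in for shared storage
        lock = os.path.join(shared, "ft-failover.lock")  # hypothetical name
        results = {}
        for name in ("vm-a", "vm-b"):
            results[name] = try_become_primary(lock, name)
        with open(lock) as f:
            winner = f.read()
    return results, winner
```

Whichever side creates the file first continues as the Primary VM; the other side sees the existing lock and stands down, so two active copies cannot coexist.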
Use Cases
Fault Tolerance provides a higher level of business continuity than vSphere HA. When a Secondary VM is called upon to replace its Primary VM counterpart, it immediately takes over the Primary VM’s role with the entire state of the virtual machine preserved. Applications are already running, and data stored in memory does not need to be reentered or reloaded. By contrast, failover provided by vSphere HA restarts the virtual machines affected by a failure. This makes Fault Tolerance a good fit for:
- Applications that must always be available, especially applications that have long-lasting client connections.
- Applications that have no other way of doing clustering.
Another key use case for protecting a virtual machine with Fault Tolerance can be described as On-Demand Fault Tolerance. In this case, a virtual machine is adequately protected with vSphere HA during normal operation. During certain critical periods, you might want to enhance the protection of the virtual machine. For example, you might be running a quarter-end report which, if interrupted, might delay the availability of critical information. With vSphere Fault Tolerance, you can protect this virtual machine before running this report and then turn off or suspend Fault Tolerance after the report has been produced.
Requirements
CPU Requirements:
- Intel Sandy Bridge or later. Avoton is not supported.
- AMD Bulldozer or later.
Network:
- Use a 10-Gbit network for FT logging.
- The logging network should have low latency.
- A dedicated FT network is highly recommended.
Limits
das.maxftvmsperhost
- This value sets the maximum number of fault tolerant VMs allowed on a host in the cluster. The default value is 4. There is no hard maximum for FT VMs per host; you can use a larger number if the workload performs well in FT VMs. You can disable checking by setting the value to 0.
das.maxftvcpusperhost
- This value sets the maximum number of vCPUs aggregated across all fault tolerant VMs on a host. The default value is 8. There is no hard maximum for FT vCPUs per host; you can use a larger number if the workload performs well. You can disable checking by setting the value to 0.
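How the two limits combine can be sketched as a simple admission check. This is an illustrative model only, not vSphere's placement logic: the function name `can_place_ft_vm` and its parameters are hypothetical, and the defaults mirror the default values quoted above, with 0 meaning "checking disabled".

```python
def can_place_ft_vm(existing_vcpus, new_vm_vcpus, max_ft_vms=4, max_ft_vcpus=8):
    """Return True if one more FT VM fits on a host under both limits.

    existing_vcpus: vCPU counts of the FT VMs already on the host.
    max_ft_vms:     models das.maxftvmsperhost (0 disables the check).
    max_ft_vcpus:   models das.maxftvcpusperhost (0 disables the check).
    """
    vm_count_ok = max_ft_vms == 0 or len(existing_vcpus) + 1 <= max_ft_vms
    vcpu_total_ok = (
        max_ft_vcpus == 0
        or sum(existing_vcpus) + new_vm_vcpus <= max_ft_vcpus
    )
    return vm_count_ok and vcpu_total_ok
```

For example, a host that already runs two 2-vCPU FT VMs can accept a 4-vCPU FT VM under the defaults (3 VMs, 8 vCPUs), but a host with three 2-vCPU FT VMs cannot, because the vCPU total would reach 10.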
License limitations
The number of vCPUs supported by a single fault tolerant VM is limited by the level of licensing that you have purchased for vSphere. Fault Tolerance is supported as follows:
- vSphere Standard and Enterprise: up to 2 vCPUs.
- vSphere Enterprise Plus: up to 8 vCPUs (note that in vSphere 6.5 the limit was 4 vCPUs).
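The licensing rule above amounts to a lookup table. A minimal sketch, assuming the edition is given as a plain string; the names `FT_VCPU_LIMIT_BY_EDITION` and `license_allows_ft` are hypothetical and the limits are the ones listed in this article.

```python
# Per-VM vCPU limits by license edition, as listed in this article.
FT_VCPU_LIMIT_BY_EDITION = {
    "standard": 2,
    "enterprise": 2,
    "enterprise plus": 8,
}


def license_allows_ft(edition, vcpus):
    """Return True if a VM with `vcpus` vCPUs can be FT-protected under `edition`."""
    limit = FT_VCPU_LIMIT_BY_EDITION.get(edition.strip().lower())
    if limit is None:
        raise ValueError(f"unknown edition: {edition!r}")
    return vcpus <= limit
```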
vSphere features not supported
The following vSphere features are not supported for fault tolerant virtual machines.
- Snapshots. Snapshots must be removed or committed before Fault Tolerance can be enabled on a virtual machine. In addition, it is not possible to take snapshots of virtual machines on which Fault Tolerance is enabled.
- Storage vMotion. You cannot invoke Storage vMotion for virtual machines with Fault Tolerance turned on. To migrate the storage, you should temporarily turn off Fault Tolerance, and perform the storage vMotion action. When this is complete, you can turn Fault Tolerance back on.
- Linked clones. You cannot use Fault Tolerance on a virtual machine that is a linked clone, nor can you create a linked clone from an FT-enabled virtual machine.
- Virtual Volume datastores.
- Storage-based policy management. Storage policies are supported for vSAN storage.
- I/O filters.
- Disk encryption.
- TPM.
- VBS-enabled VMs.
- VMDK files of 2TB+
- Physical CD-ROM or floppy disk devices (mounting an ISO is supported)
- NIC passthrough
- Hot plugged devices
- Serial or parallel ports
- Video devices with 3D support
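Part of the list above can be turned into a pre-flight check before enabling FT. The sketch below works on a plain dictionary describing a VM; the dictionary keys (`snapshot_count`, `linked_clone`, and so on) are hypothetical and do not correspond to any vSphere API, and only a subset of the unsupported features is covered.

```python
# Hypothetical inventory keys mapped to the reason each one blocks FT.
BLOCKING_FEATURES = {
    "linked_clone": "linked clones are not supported",
    "encrypted_disks": "disk encryption is not supported",
    "tpm": "TPM is not supported",
    "vbs": "VBS-enabled VMs are not supported",
    "nic_passthrough": "NIC passthrough is not supported",
    "video_3d": "video devices with 3D support are not supported",
}


def ft_blockers(vm):
    """Return the reasons why FT cannot be enabled on `vm` (empty list if none)."""
    reasons = []
    # Snapshots must be removed or committed before FT is enabled.
    if vm.get("snapshot_count", 0) > 0:
        reasons.append("snapshots must be removed or committed first")
    # Every flag that is set adds its own reason, in table order.
    for key, reason in BLOCKING_FEATURES.items():
        if vm.get(key, False):
            reasons.append(reason)
    return reasons
```

Returning a list of reasons rather than a single boolean makes it easier to fix every blocker in one pass.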
Best practice FT Configuration
Hosts running the Primary and Secondary VMs should operate at approximately the same processor frequencies, otherwise the Secondary VM might be restarted more frequently. Platform power management features that do not adjust based on workload (for example, power capping and enforced low frequency modes to save power) can cause processor frequencies to vary greatly. If Secondary VMs are being restarted on a regular basis, disable all power management modes on the hosts running fault tolerant virtual machines or ensure that all hosts are running in the same power management modes.
Host Networking Configuration
The following guidelines allow you to configure your host’s networking to support Fault Tolerance with different combinations of traffic types (for example, NFS) and numbers of physical NICs.
- Distribute each NIC team over two physical switches ensuring L2 domain continuity for each VLAN between the two physical switches.
- Use deterministic teaming policies to ensure particular traffic types have an affinity to a particular NIC (active/standby) or set of NICs (for example, originating virtual port-id).
- Where active/standby policies are used, pair traffic types to minimize impact in a failover situation where both traffic types will share a vmnic.
- Where active/standby policies are used, configure all the active adapters for a particular traffic type (for example, FT Logging) to the same physical switch. This minimizes the number of network hops and lessens the possibility of oversubscribing the switch to switch links.
FT logging traffic between Primary and Secondary VMs is unencrypted. It contains guest network and storage I/O data, and the memory contents of the guest operating system. This traffic may include sensitive data. Utilizing a private network for this communication is recommended.
Homogeneous Clusters
vSphere Fault Tolerance can function in clusters with nonuniform hosts, but it works best in clusters with compatible nodes. When constructing your cluster, all hosts should have the following configuration:
- Common access to datastores used by the virtual machines.
- The same virtual machine network configuration.
- The same BIOS settings (power management and hyperthreading) for all hosts.
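A homogeneity check along these lines can be sketched as a comparison of every host against the first one. The host description format (`datastores`, `vm_networks`, `bios`) and the function name `cluster_inconsistencies` are assumptions made for illustration; in practice the data would come from your own inventory tooling.

```python
def cluster_inconsistencies(hosts, required_datastores):
    """Compare hosts and report deviations from the three guidelines above.

    hosts: dict of host name -> {"datastores": set, "vm_networks": set, "bios": dict}
    required_datastores: datastores used by the FT-protected VMs.
    """
    problems = []
    names = sorted(hosts)
    if not names:
        return problems
    reference = hosts[names[0]]  # every host is compared against the first one
    for name in names:
        host = hosts[name]
        missing = set(required_datastores) - set(host["datastores"])
        if missing:
            problems.append(f"{name}: missing datastores {sorted(missing)}")
        if set(host["vm_networks"]) != set(reference["vm_networks"]):
            problems.append(f"{name}: VM networks differ from {names[0]}")
        if host["bios"] != reference["bios"]:
            problems.append(f"{name}: BIOS settings differ from {names[0]}")
    return problems
```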
Performance
To increase the bandwidth available for the logging traffic between Primary and Secondary VMs, use a 10-Gbit NIC and enable jumbo frames.
Store ISOs on Shared Storage for Continuous Access
Store ISOs that are accessed by virtual machines with Fault Tolerance enabled on shared storage that is accessible to both instances of the fault tolerant virtual machine. If you use this configuration, the CD-ROM in the virtual machine continues operating normally, even when a failover occurs.
Avoid Network Partitions
A network partition occurs when a vSphere HA cluster has a management network failure that isolates some of the hosts from vCenter Server and from one another. See Network Partitions. When a partition occurs, Fault Tolerance protection might be degraded.
In a partitioned vSphere HA cluster using Fault Tolerance, the Primary VM (or its Secondary VM) could end up in a partition managed by a primary host that is not responsible for the virtual machine. When a failover is needed, a Secondary VM is restarted only if the Primary VM was in a partition managed by the primary host responsible for it.
To ensure that your management network is less likely to have a failure that leads to a network partition, follow the recommendations in Best Practices for Networking.
Using vSAN Datastores
vSphere Fault Tolerance can use vSAN datastores, but you must observe the following restrictions:
- A mix of vSAN and other types of datastores is not supported for both Primary VMs and Secondary VMs.
- vSAN metro clusters are not supported with FT.
To increase performance and reliability when using FT with vSAN, the following conditions are also recommended.
- vSAN and FT should use separate networks.
- Keep Primary and Secondary VMs in separate vSAN fault domains.
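Two of these vSAN rules lend themselves to a quick placement check. The sketch below is a simplified model: the dictionary layout and the function name `vsan_placement_issues` are hypothetical, and it treats any combination of vSAN with another datastore type across the FT pair as a mix.

```python
def vsan_placement_issues(primary, secondary):
    """Check an FT pair against the vSAN restriction and recommendation above.

    primary / secondary: {"datastore_types": set of str, "fault_domain": str or None}
    """
    issues = []
    types = set(primary["datastore_types"]) | set(secondary["datastore_types"])
    # Restriction: vSAN must not be combined with other datastore types.
    if "vsan" in types and len(types) > 1:
        issues.append("mixing vSAN with other datastore types is not supported")
    # Recommendation: keep the two VMs in separate vSAN fault domains.
    p_fd = primary.get("fault_domain")
    s_fd = secondary.get("fault_domain")
    if types == {"vsan"} and p_fd is not None and p_fd == s_fd:
        issues.append("Primary and Secondary share a vSAN fault domain")
    return issues
```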
Conclusion
Using FT as part of your infrastructure architecture requires planning. You should be aware of the limitations the solution imposes. There are far fewer limitations in vSphere 7, so the product keeps improving. To reduce VMware complexity, application-level load balancing can be used instead.