Understanding vSAN Storage Policies, Their Strengths, and Their Trade‑offs

Storage Policy–Based Management (SPBM) is the backbone of how VMware vSAN delivers predictable, workload‑aligned outcomes. Instead of carving LUNs or managing fixed RAID groups the old-fashioned way, policies define the storage behavior of each VM and each VMDK—granular, dynamic, and automated. This approach simplifies operations, enables workload‑specific tuning, and eliminates the rigidity of traditional storage constructs.

Below, we’ll explore the key vSAN storage policy components, along with their advantages and disadvantages, from an architectural design perspective.

Primary Level of Failures to Tolerate (PFTT)

The Primary Level of Failures to Tolerate (PFTT, usually displayed simply as FTT) dictates how many concurrent host, disk, or fault-domain failures a VM object can survive. Values range from 0 to 3, depending on cluster size and hardware availability.

How it works

  • FTT=0: No redundancy; best for non‑critical or ephemeral workloads.
  • FTT=1: Survives one failure, via either a mirror copy or single-parity protection.
  • FTT=2 or 3: Increasingly higher tolerance, requiring more hosts and capacity.

Advantages

  • Granular protection that can be tailored per VM.
  • Easy to modify without storage migration—vSAN automatically reconfigures components.

Disadvantages

  • Higher FTT increases storage consumption significantly (e.g., FTT=1 mirroring effectively doubles capacity).
  • Higher FTT requires larger clusters and more fault domains.
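The mirroring math behind these trade-offs is simple enough to sketch in Python (the function names here are illustrative helpers, not a vSAN API):

```python
def raw_capacity_needed(usable_gb: float, ftt: int) -> float:
    """Raw capacity a RAID-1 (mirrored) object consumes: FTT + 1 full
    copies of the data. Small witness components are ignored here."""
    if not 0 <= ftt <= 3:
        raise ValueError("vSAN supports FTT values 0 through 3")
    return usable_gb * (ftt + 1)

def min_hosts_for_mirroring(ftt: int) -> int:
    """Minimum hosts for mirrored protection: 2*FTT + 1
    (replicas plus witness components to maintain quorum)."""
    return 2 * ftt + 1
```

For example, a 500 GB VMDK at FTT=2 consumes 1.5 TB of raw capacity before any space-efficiency features, and needs at least five hosts.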

Failure Tolerance Method (FTM): Mirroring vs. Erasure Coding

vSAN supports two primary protection mechanisms under FTM: Mirroring (RAID-1) and Erasure Coding (RAID-5/6). Here is how they differ:


Mirroring is a data‑protection technique where full, identical copies of data are stored across multiple hosts or storage devices. In vSAN, this is implemented through RAID‑1 (Mirroring), the performance‑optimized protection model. With FTT=1, vSAN stores two full copies across different hosts. With FTT=2, vSAN stores three full copies, and so on.


Erasure coding is a data‑protection technique that breaks data into chunks, adds parity information, and distributes those chunks across multiple hosts or storage devices. In vSAN, it is implemented through RAID‑5 (single parity) and RAID‑6 (double parity) policies, both of which provide storage efficiency by using parity instead of full mirroring.
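The parity principle is easy to demonstrate with a toy single-parity (RAID-5-style) sketch. Real erasure coding operates on fixed-size on-disk stripes distributed across hosts, but the XOR reconstruction idea is the same:

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(data_blocks: list[bytes]) -> bytes:
    """Single parity block: XOR of all data blocks in the stripe."""
    parity = data_blocks[0]
    for block in data_blocks[1:]:
        parity = xor_blocks(parity, block)
    return parity

def rebuild_lost_block(surviving_blocks: list[bytes], parity: bytes) -> bytes:
    """Reconstruct one lost data block by XOR-ing survivors with parity."""
    lost = parity
    for block in surviving_blocks:
        lost = xor_blocks(lost, block)
    return lost
```

Losing any one chunk is recoverable: XOR-ing the parity with the surviving chunks yields the missing data, which is why single parity tolerates exactly one failure.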

Comparison of the three options (mirroring and the two erasure-coding variants):

| Attribute | RAID‑1 (Mirroring) | RAID‑5 (Single‑Parity Erasure Coding) | RAID‑6 (Double‑Parity Erasure Coding) |
|---|---|---|---|
| Protection Method | Full data mirroring across replicas. | Data striping with one parity block per stripe. | Data striping with two parity blocks per stripe. |
| Minimum Host Requirement (vSAN) | Depends on FTT: 3 hosts for FTT=1 (two replicas plus a witness), more for higher FTT. | Minimum 4 hosts. | Minimum 6 hosts. |
| Failures Tolerated | Equal to FTT value (e.g., FTT=1 tolerates 1 failure). | Tolerates 1 failure (single parity). | Tolerates 2 failures (double parity). |
| Capacity Overhead | Highest: 2× for FTT=1, 3× for FTT=2. | Lowest: ~1.33× in the typical RAID‑5 layout. | Moderate: ~1.5× for RAID‑6. |
| Write Performance | Highest; no parity calculation, synchronous writes to mirrors. | Lower; requires parity computation (read‑modify‑write). | Lower still; two parity calculations required. |
| Read Performance | High; reads can be serviced from any mirror. | Good; reads use data stripes, parity only for reconstruction. | Good; similar to RAID‑5. |
| Best For | Latency‑sensitive, high‑performance workloads (databases, transactional systems). | Capacity‑sensitive workloads with moderate I/O (file services, general VMs). | Large clusters needing strong redundancy with good capacity efficiency (archive, analytics). |
| Drawbacks | Very high capacity consumption. | Higher write latency and CPU cost due to parity overhead. | Highest parity overhead; worst write performance of the three. |
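The capacity-overhead comparison translates directly into multipliers (values follow the standard vSAN stripe layouts cited above; the helper itself is illustrative, not a vSAN API):

```python
# Raw-to-usable capacity multipliers per protection layout.
OVERHEAD = {
    "raid1_ftt1": 2.0,   # two full copies
    "raid1_ftt2": 3.0,   # three full copies
    "raid5": 4 / 3,      # 3 data + 1 parity per stripe (~1.33x)
    "raid6": 1.5,        # 4 data + 2 parity per stripe
}

def raw_needed_tb(usable_tb: float, layout: str) -> float:
    """Raw capacity required to store `usable_tb` of data under a layout."""
    return usable_tb * OVERHEAD[layout]
```

At 100 TB of usable data, the gap between RAID-1 FTT=1 (200 TB raw) and RAID-5 (~133 TB raw) is typically what drives erasure-coding adoption on larger clusters.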

Site Disaster Tolerance (SDT)

Mirror:

vSAN Site Disaster Tolerance with mirroring (typically a Stretched Cluster configuration) works by creating a full replica of data across two geographically separated, active sites, enabling near-zero-downtime failover. Using RAID-1 (Mirroring) for site-level protection, a complete copy of the virtual machine’s data is maintained at both the “Preferred” and “Secondary” sites.

How it Works:

  • Active-Active Sites: Both sites are active. A write to a virtual machine in the Preferred site is synchronously written to the Secondary site across the inter-site link (ISL).
  • Witness Host: A third, independent site hosts a vSAN Witness Host, which does not store data but acts as a tie-breaker to prevent “split-brain” scenarios and maintain cluster quorum.
  • Site Mirroring (RAID-1): In a 2-site configuration with Site Mirroring, the data is mirrored between Site A and Site B.
  • Local Protection: In addition to site-level mirroring, local data redundancy (RAID-1 or RAID-5/6) can be configured within each site to handle local disk/host failures.
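The witness tie-breaker can be pictured with a toy quorum model (purely illustrative; vSAN’s real algorithm assigns votes per component, not per site, but the majority rule is the same):

```python
def object_accessible(reachable_sites: set[str]) -> bool:
    """An object stays accessible while strictly more than 50% of its
    votes are reachable. One vote per site in this simplified model:
    Preferred data copy, Secondary data copy, and the Witness."""
    votes = {"preferred": 1, "secondary": 1, "witness": 1}
    reachable = sum(v for site, v in votes.items() if site in reachable_sites)
    return reachable * 2 > sum(votes.values())
```

This is why an isolated site cannot keep running on its own: one vote out of three is not a majority, so a split-brain scenario (both sites accepting writes independently) is impossible.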

Erasure Coding:

In a stretched cluster, erasure coding (RAID-5/6) is applied within each site rather than across them: site-level protection is always a RAID-1 mirror between the two data sites, while RAID-5/6 handles local disk and host failures inside each site with better space efficiency than a local mirror would provide.

Key Details on Erasure Coding in Stretched Clusters:

  • Mechanism: Within each site, vSAN breaks data into fragments, calculates parity, and distributes the components across that site’s hosts; the two sites still hold mirrored copies of the object so that a full site failure can be tolerated.
  • Site Configuration: A minimum of 3 sites (Site A, Site B, and a Witness host) is required, as in any stretched cluster.
  • Space Efficiency: RAID-5 incurs ~1.33× local capacity overhead (compared to 2× or 3× for local mirroring) to tolerate one failure per site, while RAID-6 tolerates two failures per site at ~1.5× overhead but requires more hosts.
  • Requirements: At least 4 hosts in each site (fault domain) for RAID-5, and 6 hosts per site for RAID-6.
  • Performance vs. Capacity: Erasure coding is more space-efficient but adds CPU and I/O overhead for parity calculations, making it better suited to read-heavy or capacity-sensitive workloads than to latency-critical, write-heavy ones.
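As a rough capacity model (assuming data is mirrored between the two data sites, with the chosen protection layout applied locally within each site; the helper is illustrative, not a vSAN API):

```python
def stretched_raw_capacity(usable_tb: float, local_layout: str) -> float:
    """Raw capacity for a stretched cluster: site mirroring costs 2x,
    then each site's copy pays its own local protection overhead."""
    local_overhead = {
        "none": 1.0,    # no intra-site redundancy
        "raid1": 2.0,   # local mirror in each site
        "raid5": 4 / 3, # local single-parity erasure coding
        "raid6": 1.5,   # local double-parity erasure coding
    }
    return usable_tb * 2 * local_overhead[local_layout]
```

For 10 TB of usable data, site mirroring alone consumes 20 TB raw; adding local RAID-1 pushes that to 40 TB, while local RAID-5 keeps it to roughly 26.7 TB, which is the efficiency argument for erasure coding inside each site.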

Secondary Failures to Tolerate (SFTT)

When combining stretched‑cluster protection with intra‑site protection, SFTT defines the number of local failures that can be tolerated within each site.

Advantages

  • Enables extremely robust protection across and within sites.
  • Critical for mission‑critical workloads requiring multi‑layered resiliency.

Disadvantages

  • High capacity overhead.
  • Requires large host counts due to compound failure models (e.g., site + local host failures).

Comparison Table

| SDT + SFTT Scenario | Site Protection | Local Protection | Storage Overhead | Typical Use Case |
|---|---|---|---|---|
| SDT + SFTT=0 | Yes | No | Lowest | Basic DR setups |
| SDT + SFTT=1 | Yes | 1 failure (RAID‑1/5) | Medium | General workloads |
| SDT + SFTT=2 | Yes | 2 failures (RAID‑6) | High | Mission‑critical, large clusters |
| SDT + EC Hybrid | Yes | EC in each site | Moderate | Capacity‑efficient DR clusters |

Striping (Stripe Width)

Stripe width controls the number of capacity devices an object is distributed across, increasing potential parallelism (and the bandwidth a single object can consume). In the policy rules it appears as “Number of disk stripes per object.”

Advantages

  • Increases parallelism; potentially boosts performance.
  • Useful for very large or sequential workloads.

Disadvantages

  • Consumes additional components, which may stress component count limits.
  • No guarantee of improved performance unless workload is truly parallel.

In the ESA, stripe width has limited impact because the architecture’s log-structured object layer is already optimized for NVMe speed and parallelism.
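A simple sketch of how stripe width multiplies component count (a round-robin split; vSAN’s actual placement also weighs free capacity per device and per-component size limits):

```python
def component_sizes(object_gb: int, stripe_width: int) -> list[int]:
    """Split an object across `stripe_width` capacity devices as evenly
    as possible. Each stripe becomes a separate component, counted
    against the per-host component limit."""
    base, extra = divmod(object_gb, stripe_width)
    return [base + (1 if i < extra else 0) for i in range(stripe_width)]
```

A 100 GB object with stripe width 4 becomes four 25 GB components on four devices; note that replicas multiply this again, so FTT=1 mirroring of that object already costs eight components plus a witness.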

IOPS Limit for Object (QoS Controls)

Policies can enforce IOPS limits on a per‑object basis, allowing architects to throttle noisy neighbors or to create performance tiering within the vSAN infrastructure.

Advantages

  • Enforces fairness across VMs.
  • Prevents performance degradation in mixed‑tenant environments.

Disadvantages

  • Misconfigured limits can bottleneck critical workloads.
  • Requires monitoring to keep limits aligned with actual workload demand.
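Conceptually, an IOPS limit behaves like a token bucket refilled each second (a toy model; vSAN additionally normalizes I/O size, weighting each 32 KB as one I/O, which is not modeled here):

```python
class IopsLimiter:
    """Toy per-object IOPS limiter: admit up to `limit_per_sec` I/Os
    per one-second window, throttle the rest until the next refill."""

    def __init__(self, limit_per_sec: int):
        self.limit = limit_per_sec
        self.tokens = limit_per_sec

    def tick(self) -> None:
        """Called once per second: refill the bucket."""
        self.tokens = self.limit

    def try_io(self) -> bool:
        """True if the I/O is admitted, False if it must wait."""
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False
```

The failure mode described above is visible in the model: if the limit is set below a critical VM’s steady-state demand, every extra I/O is delayed into the next window, which shows up as added latency rather than an obvious error.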

Deduplication/Compression or None

vSAN all‑flash deployments using the OSA can enable deduplication and compression at the cluster level, while the ESA controls compression per storage policy. Either way, storage policies still influence how space efficiency interacts with protection methods and remote storage (e.g., via HCI Mesh).

Advantages

  • Major savings for compressible/dedup‑friendly workloads.
  • Great fit for VDI or homogeneous datasets.

Disadvantages

  • Not optimal for encrypted or highly unique workloads.
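The dedup intuition is easy to demonstrate: identical blocks collapse to a single stored copy, keyed by a content hash (a toy model; vSAN OSA deduplicates fixed-size blocks within a disk group, which is not replicated here):

```python
import hashlib

def dedup_savings(blocks: list[bytes]) -> float:
    """Fraction of capacity saved when identical blocks are stored once,
    identified by their SHA-256 digest."""
    unique = {hashlib.sha256(b).digest() for b in blocks}
    return 1 - len(unique) / len(blocks)
```

Three clones of the same OS-image block plus one unique block dedupe to 50% savings, while fully unique (or encrypted) data yields nothing, which is exactly the disadvantage noted above.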

DDC vs None Across Mirroring & Erasure Coding

| Feature | RAID‑1 + DDC | RAID‑1 + None | RAID‑5/6 + DDC | RAID‑5/6 + None |
|---|---|---|---|---|
| Performance | Medium | Best | Lowest | Medium‑Low |
| Capacity Efficiency | Medium | Poor | Best | Good |
| CPU Overhead | Medium‑High | Lowest | Highest | Medium |
| Ideal For | VDI, repetitive OS disks | Latency‑sensitive DBs | Big data, logs, large pools | General workloads on large clusters |
| Risk | DDC overhead | High capacity cost | Highest latency path | Parity overhead only |

Conclusion

vSAN storage policies are more than configuration items—they’re architectural contracts that describe the exact performance, resiliency, efficiency, and behavior expectations for every workload. SPBM abstracts away the limitations of traditional arrays and empowers architects to deliver bespoke storage outcomes per VM, per disk, and at any time without disruptive migrations.

When properly designed, vSAN policies become a powerful tool to balance capacity, performance, and SLAs across diverse workloads. The key is understanding the trade‑offs and selecting policy combinations that align precisely with application intent and cluster capabilities.
