Understanding vSAN Storage Policies, Their Strengths, and Their Trade‑offs

Storage Policy–Based Management (SPBM) is the backbone of how VMware vSAN delivers predictable, workload‑aligned outcomes. Instead of carving LUNs or managing fixed RAID groups the old-fashioned way, policies define the storage behavior of each VM and each VMDK—granular, dynamic, and automated. This approach simplifies operations, enables workload‑specific tuning, and eliminates the rigidity of traditional storage constructs.

Below, we’ll explore the key vSAN storage policy components, along with their advantages and disadvantages, from an architectural design perspective.

Primary Level of Failures to Tolerate (PFTT)

The Primary Level of Failures to Tolerate (PFTT, usually displayed simply as FTT) dictates how many concurrent host, disk, or fault-domain failures a VM object can survive. Values range from 0 to 3, depending on cluster size and hardware availability.

How it works

  • FTT=0: No redundancy; best for non‑critical or ephemeral workloads.
  • FTT=1: Survives one failure, via either a mirror copy or single-parity protection.
  • FTT=2 or 3: Increasingly higher tolerance, requiring more hosts and capacity.

Advantages

  • Granular protection that can be tailored per VM.
  • Easy to modify without storage migration—vSAN automatically reconfigures components.

Disadvantages

  • Higher FTT increases storage consumption significantly (e.g., FTT=1 mirroring effectively doubles capacity).
  • Higher FTT requires larger clusters and more fault domains.
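The mirroring math behind these trade-offs is simple enough to sketch in Python (the function names here are illustrative helpers, not a vSAN API):

```python
def raw_capacity_needed(usable_gb: float, ftt: int) -> float:
    """Raw capacity a RAID-1 (mirrored) object consumes: FTT + 1 full
    copies of the data. Small witness components are ignored here."""
    if not 0 <= ftt <= 3:
        raise ValueError("vSAN supports FTT values 0 through 3")
    return usable_gb * (ftt + 1)

def min_hosts_for_mirroring(ftt: int) -> int:
    """Minimum hosts for mirrored protection: 2*FTT + 1
    (replicas plus witness components to maintain quorum)."""
    return 2 * ftt + 1
```

For example, a 500 GB VMDK at FTT=2 consumes 1.5 TB of raw capacity before any space-efficiency features, and needs at least five hosts.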

Failure Tolerance Method (FTM): Mirroring vs. Erasure Coding

vSAN supports two primary protection mechanisms under FTM: Mirroring (RAID-1) and Erasure Coding (RAID-5/6). Here is how they differ:


Mirroring is a data‑protection technique where full, identical copies of data are stored across multiple hosts or storage devices. In vSAN, this is implemented through RAID‑1 (Mirroring), the performance‑optimized protection model. With FTT=1, vSAN stores two full copies across different hosts. With FTT=2, vSAN stores three full copies, and so on.


Erasure coding is a data‑protection technique that breaks data into chunks, adds parity information, and distributes those chunks across multiple hosts or storage devices. In vSAN, it is implemented through RAID‑5 (single parity) and RAID‑6 (double parity) policies, both of which provide storage efficiency by using parity instead of full mirroring.
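The parity principle is easy to demonstrate with a toy single-parity (RAID-5-style) sketch. Real erasure coding operates on fixed-size on-disk stripes distributed across hosts, but the XOR reconstruction idea is the same:

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(data_blocks: list[bytes]) -> bytes:
    """Single parity block: XOR of all data blocks in the stripe."""
    parity = data_blocks[0]
    for block in data_blocks[1:]:
        parity = xor_blocks(parity, block)
    return parity

def rebuild_lost_block(surviving_blocks: list[bytes], parity: bytes) -> bytes:
    """Reconstruct one lost data block by XOR-ing survivors with parity."""
    lost = parity
    for block in surviving_blocks:
        lost = xor_blocks(lost, block)
    return lost
```

Losing any one chunk is recoverable: XOR-ing the parity with the surviving chunks yields the missing data, which is why single parity tolerates exactly one failure.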

Comparison of the three options (mirroring and the two erasure-coding variants):

| Attribute | RAID‑1 (Mirroring) | RAID‑5 (Single‑Parity Erasure Coding) | RAID‑6 (Double‑Parity Erasure Coding) |
|---|---|---|---|
| Protection Method | Full data mirroring across replicas. | Data striping with one parity block per stripe. | Data striping with two parity blocks per stripe. |
| Minimum Host Requirement (vSAN) | Depends on FTT: 3 hosts for FTT=1 (two replicas plus a witness), more for higher FTT. | Minimum 4 hosts. | Minimum 6 hosts. |
| Failures Tolerated | Equal to FTT value (e.g., FTT=1 tolerates 1 failure). | Tolerates 1 failure (single parity). | Tolerates 2 failures (double parity). |
| Capacity Overhead | Highest: 2× for FTT=1, 3× for FTT=2. | Lowest: ~1.33× in the typical RAID‑5 layout. | Moderate: ~1.5× for RAID‑6. |
| Write Performance | Highest; no parity calculation, synchronous writes to mirrors. | Lower; requires parity computation (read‑modify‑write). | Lower still; two parity calculations required. |
| Read Performance | High; reads can be serviced from any mirror. | Good; reads use data stripes, parity only for reconstruction. | Good; similar to RAID‑5. |
| Best For | Latency‑sensitive, high‑performance workloads (databases, transactional systems). | Capacity‑sensitive workloads with moderate I/O (file services, general VMs). | Large clusters needing strong redundancy with good capacity efficiency (archive, analytics). |
| Drawbacks | Very high capacity consumption. | Higher write latency and CPU cost due to parity overhead. | Highest parity overhead; worst write performance of the three. |
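The capacity-overhead comparison translates directly into multipliers (values follow the standard vSAN stripe layouts cited above; the helper itself is illustrative, not a vSAN API):

```python
# Raw-to-usable capacity multipliers per protection layout.
OVERHEAD = {
    "raid1_ftt1": 2.0,   # two full copies
    "raid1_ftt2": 3.0,   # three full copies
    "raid5": 4 / 3,      # 3 data + 1 parity per stripe (~1.33x)
    "raid6": 1.5,        # 4 data + 2 parity per stripe
}

def raw_needed_tb(usable_tb: float, layout: str) -> float:
    """Raw capacity required to store `usable_tb` of data under a layout."""
    return usable_tb * OVERHEAD[layout]
```

At 100 TB of usable data, the gap between RAID-1 FTT=1 (200 TB raw) and RAID-5 (~133 TB raw) is typically what drives erasure-coding adoption on larger clusters.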

Site Disaster Tolerance (SDT)

Mirror:

vSAN Site Disaster Tolerance with mirroring (typically a Stretched Cluster configuration) works by creating a full replica of data across two geographically separated, active sites, enabling near-zero-downtime failover. Using RAID-1 (Mirroring) for site-level protection, a complete copy of the virtual machine’s data is maintained at both the “Preferred” and “Secondary” sites.

How it Works:

  • Active-Active Sites: Both sites are active. A write to a virtual machine in the Preferred site is synchronously written to the Secondary site across the inter-site link (ISL).
  • Witness Host: A third, independent site hosts a vSAN Witness Host, which does not store data but acts as a tie-breaker to prevent “split-brain” scenarios and maintain cluster quorum.
  • Site Mirroring (RAID-1): In a 2-site configuration with Site Mirroring, the data is mirrored between Site A and Site B.
  • Local Protection: In addition to site-level mirroring, local data redundancy (RAID-1 or RAID-5/6) can be configured within each site to handle local disk/host failures.
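The witness tie-breaker can be pictured with a toy quorum model (purely illustrative; vSAN’s real algorithm assigns votes per component, not per site, but the majority rule is the same):

```python
def object_accessible(reachable_sites: set[str]) -> bool:
    """An object stays accessible while strictly more than 50% of its
    votes are reachable. One vote per site in this simplified model:
    Preferred data copy, Secondary data copy, and the Witness."""
    votes = {"preferred": 1, "secondary": 1, "witness": 1}
    reachable = sum(v for site, v in votes.items() if site in reachable_sites)
    return reachable * 2 > sum(votes.values())
```

This is why an isolated site cannot keep running on its own: one vote out of three is not a majority, so a split-brain scenario (both sites accepting writes independently) is impossible.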

Erasure Coding:

In a stretched cluster, erasure coding (RAID-5/6) is applied within each site rather than across them: site-level protection is always a RAID-1 mirror between the two data sites, while RAID-5/6 handles local disk and host failures inside each site with better space efficiency than a local mirror would provide.

Key Details on Erasure Coding in Stretched Clusters:

  • Mechanism: Within each site, vSAN breaks data into fragments, calculates parity, and distributes the components across that site’s hosts; the two sites still hold mirrored copies of the object so that a full site failure can be tolerated.
  • Site Configuration: A minimum of 3 sites (Site A, Site B, and a Witness host) is required, as in any stretched cluster.
  • Space Efficiency: RAID-5 incurs ~1.33× local capacity overhead (compared to 2× or 3× for local mirroring) to tolerate one failure per site, while RAID-6 tolerates two failures per site at ~1.5× overhead but requires more hosts.
  • Requirements: At least 4 hosts in each site (fault domain) for RAID-5, and 6 hosts per site for RAID-6.
  • Performance vs. Capacity: Erasure coding is more space-efficient but adds CPU and I/O overhead for parity calculations, making it better suited to read-heavy or capacity-sensitive workloads than to latency-critical, write-heavy ones.
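As a rough capacity model (assuming data is mirrored between the two data sites, with the chosen protection layout applied locally within each site; the helper is illustrative, not a vSAN API):

```python
def stretched_raw_capacity(usable_tb: float, local_layout: str) -> float:
    """Raw capacity for a stretched cluster: site mirroring costs 2x,
    then each site's copy pays its own local protection overhead."""
    local_overhead = {
        "none": 1.0,    # no intra-site redundancy
        "raid1": 2.0,   # local mirror in each site
        "raid5": 4 / 3, # local single-parity erasure coding
        "raid6": 1.5,   # local double-parity erasure coding
    }
    return usable_tb * 2 * local_overhead[local_layout]
```

For 10 TB of usable data, site mirroring alone consumes 20 TB raw; adding local RAID-1 pushes that to 40 TB, while local RAID-5 keeps it to roughly 26.7 TB, which is the efficiency argument for erasure coding inside each site.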

Secondary Failures to Tolerate (SFTT)

When combining stretched‑cluster protection with intra‑site protection, SFTT defines the number of local failures that can be tolerated within each site.

Advantages

  • Enables extremely robust protection across and within sites.
  • Critical for mission‑critical workloads requiring multi‑layered resiliency.

Disadvantages

  • High capacity overhead.
  • Requires large host counts due to compound failure models (e.g., site + local host failures).

Comparison Table

| SDT + SFTT Scenario | Site Protection | Local Protection | Storage Overhead | Typical Use Case |
|---|---|---|---|---|
| SDT + SFTT=0 | Yes | No | Lowest | Basic DR setups |
| SDT + SFTT=1 | Yes | 1 failure (RAID‑1/5) | Medium | General workloads |
| SDT + SFTT=2 | Yes | 2 failures (RAID‑6) | High | Mission‑critical, large clusters |
| SDT + EC Hybrid | Yes | EC in each site | Moderate | Capacity‑efficient DR clusters |

Striping (Stripe Width)

Stripe width controls the number of capacity devices an object is distributed across, increasing potential parallelism (and the bandwidth a single object can consume). In the policy rules it appears as “Number of disk stripes per object.”

Advantages

  • Increases parallelism; potentially boosts performance.
  • Useful for very large or sequential workloads.

Disadvantages

  • Consumes additional components, which may stress component count limits.
  • No guarantee of improved performance unless workload is truly parallel.

In the ESA, stripe width has limited impact because the architecture’s log-structured object layer is already optimized for NVMe speed and parallelism.
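A simple sketch of how stripe width multiplies component count (a round-robin split; vSAN’s actual placement also weighs free capacity per device and per-component size limits):

```python
def component_sizes(object_gb: int, stripe_width: int) -> list[int]:
    """Split an object across `stripe_width` capacity devices as evenly
    as possible. Each stripe becomes a separate component, counted
    against the per-host component limit."""
    base, extra = divmod(object_gb, stripe_width)
    return [base + (1 if i < extra else 0) for i in range(stripe_width)]
```

A 100 GB object with stripe width 4 becomes four 25 GB components on four devices; note that replicas multiply this again, so FTT=1 mirroring of that object already costs eight components plus a witness.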

IOPS Limit for Object (QoS Controls)

Policies can enforce IOPS limits on a per‑object basis, allowing architects to throttle noisy neighbors or to create performance tiering within the vSAN infrastructure.

Advantages

  • Enforces fairness across VMs.
  • Prevents performance degradation in mixed‑tenant environments.

Disadvantages

  • Misconfigured limits can bottleneck critical workloads.
  • Requires monitoring to keep limits aligned with actual workload demand.
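Conceptually, an IOPS limit behaves like a token bucket refilled each second (a toy model; vSAN additionally normalizes I/O size, weighting each 32 KB as one I/O, which is not modeled here):

```python
class IopsLimiter:
    """Toy per-object IOPS limiter: admit up to `limit_per_sec` I/Os
    per one-second window, throttle the rest until the next refill."""

    def __init__(self, limit_per_sec: int):
        self.limit = limit_per_sec
        self.tokens = limit_per_sec

    def tick(self) -> None:
        """Called once per second: refill the bucket."""
        self.tokens = self.limit

    def try_io(self) -> bool:
        """True if the I/O is admitted, False if it must wait."""
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False
```

The failure mode described above is visible in the model: if the limit is set below a critical VM’s steady-state demand, every extra I/O is delayed into the next window, which shows up as added latency rather than an obvious error.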

Deduplication/Compression or None

vSAN all‑flash deployments using the OSA can enable deduplication and compression at the cluster level, while the ESA controls compression per storage policy. Either way, storage policies still influence how space efficiency interacts with protection methods and remote storage (e.g., via HCI Mesh).

Advantages

  • Major savings for compressible/dedup‑friendly workloads.
  • Great fit for VDI or homogeneous datasets.

Disadvantages

  • Not optimal for encrypted or highly unique workloads.
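The dedup intuition is easy to demonstrate: identical blocks collapse to a single stored copy, keyed by a content hash (a toy model; vSAN OSA deduplicates fixed-size blocks within a disk group, which is not replicated here):

```python
import hashlib

def dedup_savings(blocks: list[bytes]) -> float:
    """Fraction of capacity saved when identical blocks are stored once,
    identified by their SHA-256 digest."""
    unique = {hashlib.sha256(b).digest() for b in blocks}
    return 1 - len(unique) / len(blocks)
```

Three clones of the same OS-image block plus one unique block dedupe to 50% savings, while fully unique (or encrypted) data yields nothing, which is exactly the disadvantage noted above.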

DDC vs None Across Mirroring & Erasure Coding

| Feature | RAID‑1 + DDC | RAID‑1 + None | RAID‑5/6 + DDC | RAID‑5/6 + None |
|---|---|---|---|---|
| Performance | Medium | Best | Lowest | Medium‑Low |
| Capacity Efficiency | Medium | Poor | Best | Good |
| CPU Overhead | Medium‑High | Lowest | Highest | Medium |
| Ideal For | VDI, repetitive OS disks | Latency‑sensitive DBs | Big data, logs, large pools | General workloads on large clusters |
| Risk | DDC overhead | High capacity cost | Highest latency path | Parity overhead only |

Conclusion

vSAN storage policies are more than configuration items—they’re architectural contracts that describe the exact performance, resiliency, efficiency, and behavior expectations for every workload. SPBM abstracts away the limitations of traditional arrays and empowers architects to deliver bespoke storage outcomes per VM, per disk, and at any time without disruptive migrations.

When properly designed, vSAN policies become a powerful tool to balance capacity, performance, and SLAs across diverse workloads. The key is understanding the trade‑offs and selecting policy combinations that align precisely with application intent and cluster capabilities.
