This section elaborates on the vSAN concepts introduced so far, adding general information about vSAN's caching algorithms. The following paragraphs briefly describe how vSAN leverages flash, memory, and rotating disks, and illustrate the I/O paths between the guest OS and the persistent storage areas.
Each disk group contains an SSD used as a cache tier. On a hybrid system, 70 percent of that capacity is used by default as read cache (RC). The most active data is kept in the cache tier, which improves performance by minimizing the latency impact of reading from mechanical disks.
The RC is organized in cache lines, which represent the unit of data management in the RC; the current cache-line size is 1MB. Data is fetched into the RC and evicted at cache-line granularity. In addition to the SSD read cache, vSAN also maintains a small in-memory (RAM) read cache that holds the most recently accessed cache lines from the RC. This in-memory cache is dynamically sized based on the memory available in the system.
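To make these mechanics concrete, here is a minimal sketch of such a two-layer read cache; the class, its fields, and the fetch_from_capacity callback are hypothetical illustrations of the behavior described above, not vSAN's implementation:

```python
from collections import OrderedDict

CACHE_LINE = 1 << 20  # 1MB: the RC's unit of data management, as described above

class ReadCache:
    """Two-layer read cache: lines live on the cache SSD, and a small,
    dynamically sized RAM layer holds the most recently accessed lines."""

    def __init__(self, ram_capacity_lines: int):
        self.ssd = {}                     # line number -> 1MB of data (SSD RC)
        self.ram = OrderedDict()          # LRU of most recently accessed lines
        self.ram_capacity = ram_capacity_lines

    def read(self, offset: int, fetch_from_capacity):
        line = offset // CACHE_LINE       # fetch and evict at line granularity
        if line in self.ram:
            self.ram.move_to_end(line)    # RAM hit: cheapest path
        elif line in self.ssd:
            self._promote(line, self.ssd[line])   # SSD RC hit
        else:                             # miss: fetch the whole line
            data = fetch_from_capacity(line * CACHE_LINE, CACHE_LINE)
            self.ssd[line] = data
            self._promote(line, data)
        return self.ram[line]

    def _promote(self, line: int, data) -> None:
        self.ram[line] = data
        self.ram.move_to_end(line)
        if len(self.ram) > self.ram_capacity:
            self.ram.popitem(last=False)  # evict the least recently used line
```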
vSAN maintains in-memory metadata that tracks the state of the RC (both on SSD and in memory), including the logical addresses of cache lines, the valid and invalid regions within each cache line, aging information, and so on. These data structures are designed to be compact in memory without imposing a substantial CPU overhead on regular operations, and RC metadata never needs to be swapped in or out of persistent storage. (This is one area where VMware holds important IP.)
Read-cache contents are not tracked across host power cycles. If power is lost and restored, the RC is repopulated (warmed) from scratch. Essentially, the RC is a fast caching tier whose persistence is not required across power cycles. The rationale is to avoid the overhead that persisting RC metadata on every modification would add to the common data path, for example on cache-line fetches and evictions, or when a write operation invalidates a sub-region of a cache line.
Read operations follow a defined procedure. To illustrate, the VMDK in the example below has two replicas, on esxi1 and esxi3, as shown in the figure below.
Figure 47. Hybrid read
In all-flash configurations, the read operation is shown in the figure below.
Figure 48. All-flash read
The major difference is that read-cache misses do not cause serious performance degradation: reads from the flash capacity devices are almost as fast as reads from the cache SSD. Another significant difference is that there is no need to move a block from the capacity layer to the cache layer, as there is in hybrid configurations.
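The contrast between the two read paths can be sketched as follows; wb, rc, hdd, and capacity_ssd are hypothetical stand-ins for the tiers discussed above, illustrating the behavior rather than vSAN's actual code:

```python
def hybrid_read(offset, size, wb, rc, hdd):
    """Hybrid read path (sketch): a miss pays an HDD read and promotes
    the cache line from the capacity layer into the SSD read cache."""
    if wb.contains(offset):
        return wb.read(offset, size)          # freshest data still in the WB
    if rc.contains(offset):
        return rc.read(offset, size)          # warm path: served by cache SSD
    data = hdd.read(offset, size)             # cold path: high-latency HDD read
    rc.insert(offset, data)                   # promote into the RC
    return data

def all_flash_read(offset, size, wb, capacity_ssd):
    """All-flash read path (sketch): a 'miss' simply reads the capacity
    SSD, which is nearly as fast as the cache SSD, so no promotion."""
    if wb.contains(offset):
        return wb.read(offset, size)          # only recent writes are cached
    return capacity_ssd.read(offset, size)    # no move to the cache layer
```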
In hybrid configurations, write-back caching is done entirely for performance. The aggregate storage workload in a virtualized infrastructure is almost always random, because of the statistical multiplexing of the many VMs and applications that share the infrastructure.
Compared to SSDs, HDDs can perform only a small number of random I/O operations per second, and at high latency, so sending the random write portion of the workload directly to the spinning disks would degrade performance. On the other hand, magnetic disks exhibit decent performance for sequential workloads, and modern HDDs can deliver sequential-like behavior and performance even when the workload is not perfectly sequential: “proximal” I/O, where writes land near one another on the platter, suffices.
In hybrid disk groups, vSAN uses the write-buffer (WB) section of the SSD (by default, 30 percent of the device capacity) as a write-back buffer that stages all write operations. The key objective is to de-stage the written data (not the individual write operations) in a way that creates a benign, near-sequential (proximal) write workload for the HDDs that form the capacity tier.
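A minimal sketch of the proximal de-staging idea (not vSAN's actual de-staging algorithm) shows how buffering by logical block address collapses repeated overwrites and turns a random write stream into a near-sequential one:

```python
def destage(staged: dict, hdd_write) -> None:
    """De-stage the written data, not the individual write operations:
    repeated overwrites of the same LBA have already collapsed to a
    single entry in `staged`, and flushing in ascending LBA order
    presents the HDD with a near-sequential (proximal) workload."""
    for lba in sorted(staged):                # ascending logical block address
        hdd_write(lba, staged[lba])           # near-sequential HDD writes
    staged.clear()                            # staged WB space can be reused
```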
In all-flash disk groups, vSAN uses the tier-1 SSD entirely as a write-back buffer (100 percent of the device capacity, up to a maximum of 600GB). The purpose of the WB is quite different in this case: it absorbs the high rate of write operations on a high-endurance device and lets only a trickle of data be written to the capacity flash. This approach allows the capacity tier to use larger, lower-endurance SSDs.
At the same time, the capacity SSDs are themselves capable of serving very large numbers of read IOPS, so no read caching occurs, except when the most recent data referenced by a read operation still resides in the WB.
In both hybrid and all-flash configurations, every write operation is handled transactionally:
1. A record for the operation is persisted in the transaction log on the SSD.
2. The data (payload) of the operation is persisted in the WB.
3. The in-memory tables are updated to reflect the new data, its logical address space (for tracking), and its physical location in the capacity tier.
4. The write operation completes upstream after the transaction has committed successfully.
Under typical steady-state workloads, the log records of multiple write operations are coalesced before they are persisted in the log, which reduces the number of metadata writes to the SSD. By design, the log is a circular buffer, written and freed sequentially, so write amplification is avoided (good for device endurance). The WB region allocates blocks in a round-robin fashion, with wear leveling in mind.
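A minimal sketch of the four steps and the log coalescing, assuming hypothetical ssd_log and write_buffer objects (this is not vSAN's implementation, and ordering details and failure handling are simplified):

```python
class TransactionalWritePath:
    """Illustrative model of the four write steps and log coalescing
    described above."""

    def __init__(self, ssd_log, write_buffer):
        self.log = ssd_log        # circular transaction log on the SSD
        self.wb = write_buffer    # write-buffer region of the SSD
        self.tables = {}          # in-memory: logical address -> WB location

    def write_batch(self, ops, complete):
        """ops: list of (logical_addr, payload); complete: upstream callback."""
        records = []
        for addr, payload in ops:
            loc = self.wb.append(payload)   # persist each payload in the WB
            records.append((addr, loc))
        self.log.append(records)            # log records coalesced into one
                                            # sequential write to the log
        for addr, loc in records:
            self.tables[addr] = loc         # update the in-memory tables
        for addr, _ in ops:
            complete(addr)                  # acknowledge only after commit
```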
Even when a write operation overwrites existing WB data, vSAN never rewrites an existing SSD page in place. Instead, it allocates a new block and updates the metadata to mark the old blocks invalid. vSAN fills an entire SSD page before moving to the next one, and entire pages are eventually freed once all their data is invalid. (It is very rare that data must be re-buffered to allow SSD pages to be freed.)
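The page-fill and whole-page-free pattern can be sketched as follows; the page size and data structures are illustrative assumptions, not vSAN's on-device layout:

```python
PAGE_BLOCKS = 8   # assumed blocks per SSD page, purely illustrative

class AppendOnlyAllocator:
    """Sketch of the allocation pattern described above: no in-place
    rewrites, each page is filled completely before the cursor advances,
    and a page is freed only once every block on it is invalid."""

    def __init__(self, num_pages: int):
        self.used = [0] * num_pages    # blocks written to each page
        self.valid = [0] * num_pages   # blocks on each page still valid
        self.where = {}                # logical address -> page index
        self.page = 0                  # round-robin cursor (wear leveling)

    def write(self, addr: int) -> None:
        old = self.where.get(addr)
        if old is not None:
            self.valid[old] -= 1       # overwrite: old copy becomes invalid
            if self.valid[old] == 0 and self.used[old] == PAGE_BLOCKS:
                self.used[old] = 0     # whole page invalid: free it in one go
        self.where[addr] = self.page   # append the new copy, never in place
        self.used[self.page] += 1
        self.valid[self.page] += 1
        if self.used[self.page] == PAGE_BLOCKS:        # page full: advance
            self.page = (self.page + 1) % len(self.used)
            # (a real allocator would skip pages still holding valid data)
```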
Also, because the device firmware has no visibility into the invalidated data, it sees no “holes” in pages. In effect, internal write amplification (the firmware moving data around to fill holes in pages) is all but eliminated, which extends the overall endurance of the device. The vSAN design goes to great lengths to minimize unnecessary writes and maximize cache SSD endurance. As a result, the life expectancy of SSDs used with vSAN may exceed the manufacturers’ specifications, which are developed with more generic workloads in mind.
The figure below illustrates the operation.
Figure 49. Hybrid and flash write I/O
vSAN caching algorithms and data-locality techniques reflect a number of objectives and observations pertaining to distributed caching:
vSAN exploits temporal and spatial locality for caching.
vSAN implements a distributed, persistent cache on flash across the cluster. Caching is done in front of the disks where the data replicas live, not on the client side. A distributed-caching mechanism results in better overall flash-cache utilization.
Another benefit of distributed caching appears during VM migrations, which in some datacenters happen more than ten times a day. With DRS and vMotion, VMs move from host to host within a cluster; without a distributed cache, every migration would have to move large amounts of data to rewarm the cache on the destination host. As the figure below illustrates, vSAN prevents performance degradation after a VM migration.
Figure 50. vSAN prevents performance degradation after VM migration
The network introduces a small latency when accessing data on another host. Typical latencies in 10GbE networks range from 5 to 50 microseconds, while the typical latency of a flash drive accessed through a SCSI layer is near 1ms for small (4K) I/O blocks. So, for the majority of the I/O executed in the system, the network adds only a small fraction, roughly 0.5 to 5 percent, of the end-to-end latency.
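A back-of-the-envelope calculation with the quoted figures confirms the range:

```python
# Rough check of the network's contribution, using the figures quoted above.
flash_latency_us = 1000           # ~1ms for a small (4K) I/O via the SCSI layer
for network_us in (5, 50):        # typical 10GbE latency range
    print(f"{network_us}us / {flash_latency_us}us = "
          f"{network_us / flash_latency_us:.1%} added latency")
# prints 0.5% and 5.0%: a small fraction of the end-to-end latency
```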
A few workloads are genuinely cache friendly, meaning that a small cache (or a small increase in cache size) can significantly increase the rate of I/O served locally. These workloads can benefit from a local cache, and enabling the Client Cache is the right approach for them.
vSAN also works with the View Accelerator (a deduplicated, in-memory read cache), which is notably effective for VDI use cases.
vSAN's Client Cache feature leverages DRAM local to the virtual machine to accelerate read performance. The amount of memory allocated is 0.4 percent of each host's memory capacity, up to a maximum of 1GB per host.
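As a sketch of that sizing rule (not a vSAN API):

```python
def client_cache_bytes(host_memory_bytes: int) -> int:
    """Client Cache sizing as stated above: 0.4 percent of the host's
    memory, capped at 1GB per host."""
    return min(int(host_memory_bytes * 0.004), 1 << 30)

# Example: a 512GB host hits the 1GB cap; a 128GB host gets roughly 0.5GB.
```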