OneFS Endurant Cache
Tue, 22 Mar 2022 18:27:04 -0000
My earlier blog post on multi-threaded I/O generated several questions on synchronous writes in OneFS. So, this seemed like a useful topic to explore in a bit more detail.
OneFS natively provides a caching mechanism for synchronous writes – or writes that require a stable write acknowledgement to be returned to a client. This functionality is known as the Endurant Cache, or EC.
The EC operates in conjunction with the OneFS write cache, or coalescer, to ingest, protect, and aggregate small synchronous NFS writes. The incoming write blocks are staged to NVRAM, ensuring the integrity of the writes even in the unlikely event of a node losing power. EC also creates multiple mirrored copies of the data, further guaranteeing protection from a single node failure and, if desired, from multiple node failures.
EC improves the latency associated with synchronous writes by reducing the time to acknowledgement back to the client. This process removes the Read-Modify-Write (R-M-W) operations from the acknowledgement latency path, while also leveraging the coalescer to optimize writes to disk. EC is also tightly coupled with OneFS’ multi-threaded I/O (Multi-writer) process, to support concurrent writes from multiple client writer threads to the same file. And the design of EC ensures that the cached writes do not impact snapshot performance.
The endurant cache uses write logging to combine and protect small writes at random offsets into 8KB linear writes. To achieve this, the writes go to special mirrored files, or ‘Logstores’. The response to a stable write request can be sent once the data is committed to the logstore. Logstores can be written to by several threads from the same node and are highly optimized to enable low-latency concurrent writes.
Note that any write that uses EC must also use the coalescer: even if the coalescer is disabled on a file while EC is enabled, the coalescer remains active, with all data backed by EC.
So what exactly does an endurant cache write sequence look like?
Say an NFS client wishes to write a file to a PowerScale cluster over NFS with the O_SYNC flag set, requiring a confirmed or synchronous write acknowledgement. Here is the sequence of events that occurs to facilitate a stable write.
1. A client, connected to node 3, begins the write process by sending protocol-level blocks. 4KB is the optimal block size for the endurant cache.
2. The NFS client's writes are temporarily stored in the write coalescer portion of node 3's RAM. The write coalescer aggregates uncommitted blocks so that OneFS can, ideally, write out full protection groups where possible, reducing latency over protocols that allow "unstable" writes. Writing to RAM has far less latency than writing directly to disk.
3. Once in the write coalescer, the endurant cache log-writer process writes mirrored copies of the data blocks in parallel to the EC Log Files.
The protection level of the mirrored EC log files is the same as that of the data being written by the NFS client.
4. When the data copies are received into the EC Log Files, a stable write exists and a write acknowledgement (ACK) is returned to the NFS client confirming the stable write has occurred. The client assumes the write is completed and can close the write session.
5. The write coalescer then processes the file just like a non-EC write at this point. The write coalescer fills and is routinely flushed as required, performing an asynchronous write via the block allocation manager (BAM) and the BAM safe write (BSW) path.
6. The file is split into 128K data stripe units (DSUs), parity protection (FEC) is calculated, and FEC stripe units (FSUs) are created.
7. The layout and write plan are then determined, and the stripe units are written to their corresponding nodes' L2 cache and NVRAM. The EC Log Files are cleared from NVRAM at this point. OneFS uses a Fast Invalid Path process to de-allocate the EC Log Files from NVRAM.
8. Stripe units are then flushed to physical disk.
9. Once written to physical disk, the data stripe unit (DSU) and FEC stripe unit (FSU) copies created during the write are cleared from NVRAM but remain in L2 cache until flushed to make room for more recently accessed data.
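To see where EC engages, it can be useful to generate exactly this kind of small synchronous write stream from a Linux NFS client. Here is a minimal sketch, assuming the cluster export is mounted at the hypothetical path /mnt/cluster; dd's oflag=sync forces every 4KB write to be acknowledged as stable before the next one is issued:
# dd if=/dev/zero of=/mnt/cluster/ec_test.dat bs=4k count=1024 oflag=sync
Each of those 1,024 writes must be committed to the EC logstore before dd sees its acknowledgement, so the elapsed time dd reports directly reflects stable-write latency.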
As far as protection goes, the number of logfile mirrors created by EC is always one more than the on-disk protection level of the file. For example:
File Protection Level      Number of EC Mirrored Copies
+1n                        2x
+2n                        3x
+3n                        4x
+4n                        5x
The EC mirrors are only used if the initiator node is lost. In the unlikely event that this occurs, the participant nodes replay their EC journals and complete the writes.
If the write is an EC candidate, the data remains in the coalescer, an EC write is constructed, and the appropriate coalescer region is marked as EC. The EC write is a write into a logstore (hidden mirrored file) and the data is placed into the journal.
Assuming the journal is sufficiently empty, the write is held there (cached) and only flushed to disk when the journal is full, thereby saving additional disk activity.
An optimal workload for EC involves small-block synchronous, sequential writes – something like an audit or redo log, for example. In that case, the coalescer will accumulate a full protection group’s worth of data and be able to perform an efficient FEC write.
EC is also a good fit for a synchronous small-block load where the I/O rate is low and the client is latency-sensitive. In this case, latency is reduced and, if the I/O rate is low enough, the extra journal traffic creates no serious pressure.
The undesirable scenario is when the cluster is already spindle-bound and the workload is such that it generates a lot of journal pressure. In this case, EC is just going to aggravate things.
So how exactly do you configure the endurant cache?
Although EC is on by default, it can be explicitly enabled by setting the efs.bam.ec.mode sysctl to '1':
# isi_sysctl_cluster efs.bam.ec.mode=1
EC can also be enabled and disabled per directory:
# isi set -c [on|off|endurant_all|coal_only] <directory_name>
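For example, to force endurant cache behavior for everything written to a particular directory (the path here is purely illustrative):
# isi set -c endurant_all /ifs/data/logs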
To enable the coalescer but switch off EC, run:
# isi set -c coal_only <directory_name>
And to disable the endurant cache completely:
# isi_for_array -s isi_sysctl_cluster efs.bam.ec.mode=0
A return value of zero on each node from the following command will verify that EC is disabled across the cluster:
# isi_for_array -s sysctl efs.bam.ec.stats.write_blocks
efs.bam.ec.stats.write_blocks: 0
If the output of this command is incrementing, EC is delivering stable writes.
Be aware that if the Endurant Cache is disabled on a cluster, the sysctl efs.bam.ec.stats.write_blocks output on each node will not return to zero, because this sysctl is a counter, not a rate. These counters won’t reset until the node is rebooted.
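Since the counter only ever increases, the way to determine whether EC is currently active is to sample it twice and compare, for example:
# isi_for_array -s sysctl efs.bam.ec.stats.write_blocks
# sleep 10
# isi_for_array -s sysctl efs.bam.ec.stats.write_blocks
If the per-node values have grown between the two samples, EC is actively servicing stable writes.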
As mentioned previously, EC applies to stable writes, namely:
- Writes with O_SYNC and/or O_DIRECT flags set
- Files on synchronous NFS mounts
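For example, mounting an export with the Linux sync mount option makes every write on that mount synchronous, and therefore an EC candidate (the server name and paths below are illustrative):
# mount -t nfs -o vers=3,sync cluster:/ifs/data /mnt/cluster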
When it comes to analyzing any performance issues involving EC workloads, consider the following:
- What changed with the workload?
- If upgrading OneFS, did the prior version also have EC enabled?
If the workload has moved to new cluster hardware:
- Does the performance issue occur during periods of high CPU utilization?
- Which part of the workload is creating a deluge of stable writes?
- Was there a large change in spindle or node count?
- Has the OneFS protection level changed?
- Is the SSD strategy the same?
Disabling EC is typically done cluster-wide, which can adversely impact other workflow elements. If the EC load is localized to a subset of the files being written, a more targeted adjustment is to disable the coalescer buffers on the specific target directories, using the isi set -c off command.
One of the more likely causes of performance degradation is from applications aggressively flushing over-writes and, as a result, generating a flurry of ‘commit’ operations. This can generate heavy read/modify/write (r-m-w) cycles, inflating the average disk queue depth, and resulting in significantly slower random reads. The isi statistics protocol CLI command output will indicate whether the ‘commit’ rate is high.
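For example, filtering the statistics down to NFSv3 shows the per-operation breakdown, including commits; the exact flag syntax varies by OneFS release, so check the command's --help output:
# isi statistics protocol --protocols=nfs3
A high rate of 'commit' operations relative to 'write' operations is the signature of an application aggressively flushing.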
It's worth noting that synchronous writes do not require using the NFS 'sync' mount option. Any programmer who is concerned with write persistence can simply specify an O_FSYNC or O_DIRECT flag on the open() operation to force synchronous write semantics for that file handle. With Linux, writes using O_DIRECT will be separately accounted for in the Linux 'mountstats' output. Although it's almost exclusively associated with NFS, the EC code is actually protocol-agnostic. If writes are synchronous (write-through) and are either misaligned or smaller than 8KB, they have the potential to trigger EC, regardless of the protocol.
The endurant cache can provide a significant latency benefit for small (such as 4K), random synchronous writes – albeit at a cost of some additional work for the system.
However, it’s worth bearing the following caveats in mind:
- EC is not intended for more general purpose I/O.
- There is a finite amount of EC available. As load increases, EC can potentially ‘fall behind’ and end up being a bottleneck.
- Endurant Cache does not improve read performance, since it’s strictly part of the write process.
- EC will not increase performance of asynchronous writes – only synchronous writes.
Author: Nick Trimbee
Related Blog Posts
Boosting Storage Performance for Media and Entertainment with RDMA
Tue, 02 Nov 2021 20:20:27 -0000
We are in a new golden era of content creation. The explosion of streaming services has brought an unprecedented volume of new and amazing media. Production, post-production, visual effects, animation, finishing: everyone is booked solid with work. And the expectations for this content are higher than ever, with new technically challenging formats becoming the norm rather than the exception. Anyone who has had to work with this content knows that even in 2021, working natively with 8K video or high frame rate 4K video is no joke.
During post, storage and workstation performance can be huge bottlenecks. These bottlenecks can be particularly painful for “hero” seats that are tasked with working in real time with uncompressed media.
So, let’s look at a new PowerScale OneFS 9.2 feature that can improve storage and workstation performance simultaneously. That technology is Remote Direct Memory Access (RDMA), and specifically NFS over RDMA.
Why NFS? Linux is still the operating system of choice for the applications that media professionals use to work with the most challenging media. Even if those applications have Windows or macOS variants, the Linux version is what is used in the truly high-end. And the native way for a Linux computer to access network storage is NFS. In particular, NFS over TCP.
Already this article is going down a rabbit hole of acronyms! I imagine that most people reading are already familiar with NFS (and SMB) and TCP (and UDP) and on and on. For the benefit of those folks who are not, NFS stands for Network File System. NFS is how Linux systems talk to network storage (there are other ways, but mostly, it is NFS). NFS traffic sits on top of other lower-level network protocols, in particular TCP (or UDP, but mostly it is TCP). TCP does a great job of handling things like packet loss on congested networks, but that comes with performance implications. Back to RDMA.
As the name implies, RDMA is a protocol that allows for a client system to copy data from a storage server’s memory directly into that client’s own memory. And in doing so, the client system bypasses many of the buffering layers inherent in TCP. This direct communication improves storage throughput and reduces latency in moving data between server and client. It also reduces CPU load on both the client and storage server.
RDMA was developed in the 1990s to support high performance compute workloads running over InfiniBand networks. In the 2000s, two methods of running RDMA over Ethernet networks were developed: iWARP and RoCE. Without going into too much detail, iWARP uses TCP for RDMA communications and RoCE uses UDP. There are various benefits and drawbacks of these two approaches. iWARP’s reliance on TCP allows for greater flexibility in network design, but suffers from many of the same performance drawbacks of native TCP communications. RoCE has reduced CPU overhead as compared to iWARP, but requires a lossless network. Once again, without going into too much detail, RoCE is the clear winner given that we are looking for the maximum storage performance with the lowest CPU load. And that is exactly what PowerScale OneFS uses, RoCE (actually RoCEv2, also known as Routable RoCE or RRoCE).
So, put that all together, and you can run NFS traffic over RDMA leveraging RoCE! Yes, back into alphabet soup land. But what this means is that if your environment and PowerScale storage nodes support it, you can massively boost performance by mounting the network storage with a few mount options. And that is a neat trick. The performance gains of RDMA are impressive. In some cases, RDMA is twice as performant as TCP, all other things being equal (with a similar drop in workstation utilization).
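To give a concrete sense of what "a few mount options" means, here is a representative Linux client mount, assuming OneFS 9.2 or later with RDMA enabled and an RDMA-capable client NIC; the server name and paths are illustrative, and NFS over RDMA conventionally uses port 20049:
# mount -t nfs -o vers=3,proto=rdma,port=20049 powerscale:/ifs/media /mnt/media
Once mounted, nfsstat -m on the client should report proto=rdma for that mount.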
A good place to start learning if your PowerScale nodes support RDMA is my colleague Nick Trimbee’s excellent blog: Unstructured Data Tips.
Let's bring this back to media creation and look at some real-world examples that were tested for this article. The first example is playing an uncompressed 8K DPX image sequence in DaVinci Resolve. Uncompressed video puts less of a strain on the workstation (no real-time decompression), but the file sizes and bandwidth requirements are huge. As an image sequence, each frame of video is a separate file, and at 8K resolution, that meant that each file was approximately 190 MB. To sustain 24 frames per second playback requires around 4.5 GB/s! Long story short, the image sequence would not play with the storage mounted using TCP. Mounting the exact same storage using RDMA was a night and day difference: 8K video at 24 frames per second in Resolve over the network.
Now let's look at workstation performance. Because to be fair, uncompressed 8K video is unwieldy to store or work with. The number of facilities truly working in uncompressed 8K is small. Far more common is a format such as 6K PIZ compressed OpenEXR. OpenEXR is another image sequence format (file per frame) and PIZ compression is lossless, retaining full image fidelity. The PIZ compressed image sequence I used here had frames that were between 80 MB and 110 MB each. To sustain 24 frames per second requires around 2.7 GB/s. This bandwidth is less than uncompressed 8K but still substantial. However, the real challenge is that the workstation needs to decompress each frame of video as it is being read. Pulling the 6K image sequence into DaVinci Resolve again and attempting playback over the network storage mounted using TCP did not work. The combination of CPU cycles required for reading the files over network storage and decoding each 6K frame was too much. RDMA was the key for this kind of playback. And sure enough, remounting the storage using RDMA enabled smooth playback of this OpenEXR 6K PIZ image sequence over the network in Resolve.
Going a little deeper with workstation performance, let us look at some other common video formats: Sony XAVC and Apple ProRes 422HQ, both at full 4K DCI resolution and 59.94 frames per second. This time, Autodesk Flame 2022 is used as the playback application. Flame has a debug mode that shows video disk dropped frames, GPU dropped frames, and broadcast output dropped frames. With the file system mounted using TCP or RDMA, the video disk never dropped a frame.
The storage is plenty fast enough. However, with the file system mounted using TCP, the broadcast output dropped thousands of frames, and the workstation could not keep up. Playing back the material over RDMA was a different story: smooth broadcast output and essentially no dropped frames at all. In this case, it was all about the CPU cycles freed up by RDMA.
NFS over RDMA is a big deal for PowerScale OneFS environments supporting the highest end playback. The twin benefits of storage performance and workstation CPU savings change what is possible with network storage. For more specifics about the storage environment, the tests run, and how to leverage NFS over RDMA, see my detailed white paper PowerScale OneFS: NFS over RDMA for Media.
Author: Gregory Shiff, Principal Solutions Architect, Media & Entertainment LinkedIn
Accelerating your Network File System (NFS) Workloads with RDMA
Wed, 15 Sep 2021 13:19:26 -0000
The NFS protocol is widely used for NAS storage in today's datacenters. It was originally designed for storing and managing data centrally and sharing it across networks. As technologies have evolved, many organizations have come to rely on NFS for critical production workloads.
NFS usually transfers data over TCP. With the emergence of higher-speed Ethernet and heavier application workloads in datacenters, the speed at which an ever-increasing volume of data can be transferred is critical to organizations. The industry has therefore been pursuing new ways to improve NFS performance and adapt to emerging workloads, one of which is NFS over Remote Direct Memory Access (RDMA).
RDMA enables accessing memory data on a remote machine without passing the data through the CPUs on the system. RDMA therefore enables data to be transferred between storage and clients with higher throughput and lower CPU usage. NFS over RDMA, as defined in RFC8267, uses the advantages of RDMA. Starting with OneFS 9.2.0, OneFS supports NFSv3 over RDMA based on the RoCEv2 (also known as Routable RoCE or RRoCE) network protocol.
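On the client side, NFS over RDMA support comes from the Linux kernel's rpcrdma (xprtrdma) module, and the RDMA-capable NIC must be visible to the RDMA stack. A quick sanity check might look like this, assuming the rdma-core tools are installed:
# ibv_devices
# modprobe rpcrdma
After that, the export can be mounted with the proto=rdma and port=20049 mount options.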
To evaluate the improvements and advantages of NFSv3 over RDMA, as compared to NFSv3 over TCP, we ran some FIO sequential read tests, and observed the throughput and CPU usage under different thread counts. The following figure shows the test environment topology and resource configuration.
[Figure: test environment topology and resource configuration]
- Clients: Dell PowerEdge C4140, running CentOS Linux release 8.3.2011
- Network adapters: 2 * MT28800 Family [ConnectX-5 Ex] * 100GE and 2 * MT28908 Family [ConnectX-6] * 100GE
The following chart shows the throughput comparison for RDMA vs. TCP. We found that NFSv3 over RDMA delivers higher throughput than NFSv3 over TCP. (Note: Because 10 test clients cannot overload a 48-node F600 cluster, the throughput number is only used for RDMA and TCP comparison and does not represent the maximum cluster performance.)
The following chart shows the clients’ CPU usage comparison for RDMA vs. TCP. We found that clients consume fewer CPU resources when using NFSv3 over RDMA.
The NFSv3 over RDMA performance improvement does vary as the client thread number increases, as compared to NFSv3 over TCP. Overall, NFSv3 over RDMA delivers higher throughput while providing significant reduction of clients’ CPU overhead. Sequential workloads and CPU-intensive workloads can therefore benefit from using NFSv3 over RDMA on OneFS.