A Metadata-based Approach to Tiering in PowerScale OneFS
Wed, 02 Mar 2022 22:56:32 -0000
|Read Time: 0 minutes
OneFS SmartPools provides sophisticated tiering between storage node types. Rules based on file attributes such as last accessed time or creation date can be configured in OneFS to drive transparent motion of data between PowerScale node types. This kind of “set and forget” approach to data tiering is ideal for some industries but not workable for most content creation workflows.
A classic case of how this kind of tiering falls short for media is the real-time nature of video playback. For an extreme example, take an uncompressed 4K image sequence (or even 8K), that might require >1.5GB/s of throughput to play properly. If this media has been tiered down to low performing archive storage and it needs to be used, those files must be migrated back up before they will play. This problem causes delays and confusion all around and makes media storage administrators hesitant to archive anything.
The good news is that the PowerScale OneFS ecosystem has a better way of doing things!
The approach I have taken here is to pull metadata from elsewhere in the workflow and use it to drive on demand tiering in OneFS. How does that work? OneFS supports file extended attributes, which are <key/value> pairs (metadata!) that can be written to the files and directories stored in OneFS. File Policies can be configured in OneFS to move data based on those file extended attributes. And a SmartPoolsTree job can be run on only the path that needs to be moved. All this goodness can be controlled externally by combining the DataIQ API and the OneFS API.
Figure 1: API flow
Note that while I’m focused on combining the DataIQ and OneFS APIs in this post, other API driven tools with OneFS file system visibility could be substituted for DataIQ.
DataIQ
DataIQ is a data indexing and analysis tool. It runs as an external virtual machine and maintains an index of mounted file systems. DataIQ’s file system crawler is efficient, fast, and lightweight, meaning it can be kept up to date with little impact on the storage devices it is indexing.
DataIQ has a concept called “tagging”. Tags in DataIQ apply to directories and provide a mechanism for reporting sets of related data. A tag in DataIQ is an arbitrary <key>/<value> pair. Directories can be tagged in DataIQ in three different ways:
- Autotagging rules:
- Tags are automatically placed in the file system based on regular expressions defined in the Autotagging configuration menu.
- Use of .cntag files:
- Empty files named in the format <key>.<value>.cntag are placed in directories and will be recognized as tags by DataIQ.
- API-based tagging:
- The DataIQ API allows for external tagging of directories.
Tags can be placed throughout a file system and then reported on as a group. For instance, temporary render directories could contain a render.temp.cntag file. Similarly, an external tool could access the DataIQ API and place a <Project/Name> tag on the top-level directory of each project. DataIQ can generate reports on the storage capacity those tags are consuming.
File system extended attributes in OneFS
As I mentioned earlier, OneFS supports file extended attributes. Extended attributes are arbitrary metadata tags in the form of <key/value> pairs that can be applied to files and directories. Extended attributes are not visible in the graphical interface or when accessing files over a share or export. However, the attributes can be accessed using the OneFS CLI with the getexattr and setextattr commands.
Figure 2: File extended attributes
The SmartPools job engine will move data between node pools based on these file attributes. And it is that SmartPools functionality that uses this metadata to perform on demand data tiering.
Crucially, OneFS supports creation of file system extended attributes from an external script using the OneFS REST API. The OneFS API Reference Guide has great information about setting and reading back file system extended attributes.
Figure 3: File policy configuration
Tiering example with Autodesk Shotgrid, DataIQ, and OneFS
Autodesk ShotGrid (formerly Shotgun) is a production resource management tool common in the visual effects and animation industries. ShotGrid is a cloud-based tool that allows for coordination of large production teams. Although it isn’t a storage management tool, its business logic can be useful in deciding what tier of storage a particular set of files should live on. For instance, if a shot tracked in ShotGrid is complete and delivered, the files associated with that shot could be moved to archive.
DataIQ plug-in for Autodesk ShotGrid
The open-source DataIQ plug-in for ShotGrid is available on GitHub here:
Dell DataIQ Autodesk ShotGrid Plugin
This plug-in is proof of concept code to show how the ShotGrid and DataIQ APIs can be combined to tag data in DataIQ based on shot status in ShotGrid. The DataIQ tags are dynamically updated with the current shot status in ShotGrid.
Here is a “shot” in ShotGrid configured with various possible statuses:
Figure 4: ShotGrid status
The following figure of DataIQ shows where the shot status field from ShotGrid has been automatically applied as a tag in DataIQ.
Figure 5: DataIQ tags
Once metadata from ShotGrid has been pulled into DataIQ, that information can be used to drive OneFS SmartPools tiering:
- A user (or system) passes the DataIQ tag <key/values> to the DataIQ API. The DataIQ API returns a list of directories associated with that tag.
- A directory chosen from Step 1 above can be passed back to the DataIQ API to get a listing of all contents by way of the DataIQ file index.
- Those items are passed programmatically to the OneFS API. The <key/value> pair of the original DataIQ tag is written as an extended attribute directly to the targeted files and directories.
- And finally, the SmartPoolsTree job can be run on the parent path chosen in Step 2 above to begin tiering the data immediately.
Using business logic to drive storage tiering
DataIQ and OneFS provide the APIs necessary to drive storage tiering based on business logic. Striking efficiencies can be gained by taking advantage of the metadata that exists in many workflow tools. It is a matter of “connecting the dots”.
The example in this blog uses ShotGrid and DataIQ, however it is easy to imagine that similar metadata-based techniques could be developed using other file system index tools. In the media and entertainment ecosystem, media asset management and production asset management systems immediately come to mind as candidates for this kind of API level integration.
As data volumes increase exponentially, it is unrealistic to keep all files on the highest costing tiers of storage. Various automated storage tiering approaches have been around for years, but for many use cases this automated tiering approach falls short. Bringing together rich metadata and an API driven workflow bridges the gap.
To see the Python required to put this process together, refer to my white paper PowerScale OneFS: A Metadata Driven Approach to On Demand Tiering.
Author: Gregory Shiff, Principal Solutions Architect, Media & Entertainment LinkedIn
Related Blog Posts
OneFS In-line Dedupe
Thu, 12 May 2022 14:48:01 -0000
|Read Time: 0 minutes
Among the features and functionality delivered in the new OneFS 9.4 release is the promotion of in-line dedupe to enabled by default, further enhancing PowerScale’s dollar-per-TB economics, rack density and value.
Part of the OneFS data reduction suite, in-line dedupe initially debuted in OneFS 8.2.1. However, it was enabled manually, so many customers simply didn’t use it. But with this enhancement, new clusters running OneFS 9.4 now have in-line dedupe enabled by default.
Cluster configuration | In-line dedupe | In-line compression |
New cluster running OneFS 9.4 | Enabled | Enabled |
New cluster running OneFS 9.3 or earlier | Disabled | Enabled |
Cluster with in-line dedupe enabled that is upgraded to OneFS 9.4 | Enabled | Enabled |
Cluster with in-line dedupe disabled that is upgraded to OneFS 9.4 | Disabled | Enabled |
That said, any clusters that upgrade to 9.4 will not see any change to their current in-line dedupe config during upgrade. Also, there is also no change to the behavior for in-line compression, which remains enabled by default in all OneFS versions from 8.1.3 onwards.
But before we examine the-under-the-hood changes in OneFS 9.4, let’s have a quick dedupe refresher.
Currently, OneFS in-line data reduction, which encompasses compression, dedupe, and zero block removal, is supported on the F900, F600, and F200 all-flash nodes, plus the F810, H5600, H700/7000, and A300/3000 Gen6.x chassis.
Within the OneFS data reduction pipeline, zero block removal is performed first, followed by dedupe, and then compression. This order allows each phase to reduce the scope of work each subsequent phase.
Unlike SmartDedupe, which performs deduplication once data has been written to disk, or post-process, in-line dedupe acts in real time, deduplicating data as is ingested into the cluster. Storage efficiency is achieved by scanning the data for identical blocks as it is received and then eliminating the duplicates.
When in-line dedupe discovers a duplicate block, it moves a single copy of the block to a special set of files known as shadow stores. These are file-system containers that allow data to be stored in a sharable manner. As such, files stored under OneFS can contain both physical data and pointers, or references, to shared blocks in shadow stores.
Shadow stores are similar to regular files but are hidden from the file system namespace, so they cannot be accessed through a pathname. A shadow store typically grows to a maximum size of 2 GB, which is around 256 K blocks, and each block can be referenced by 32,000 files. If the reference count limit is reached, a new block is allocated, which may or may not be in the same shadow store. Also, shadow stores do not reference other shadow stores. And snapshots of shadow stores are not permitted because the data contained in shadow stores cannot be overwritten.
When a client writes a file to a node pool configured for in-line dedupe on a cluster, the write operation is divided up into whole 8 KB blocks. Each block is hashed, and its cryptographic ‘fingerprint’ is compared against an in-memory index for a match. At this point, one of the following will happen:
- If a match is discovered with an existing shadow store block, a byte-by-byte comparison is performed. If the comparison is successful, the data is removed from the current write operation and replaced with a shadow reference.
- When a match is found with another LIN, the data is written to a shadow store instead and is replaced with a shadow reference. Next, a work request is generated and queued that includes the location for the new shadow store block, the matching LIN and block, and the data hash. A byte-by-byte data comparison is performed to verify the match and the request is then processed.
- If no match is found, the data is written to the file natively and the hash for the block is added to the in-memory index.
For in-line dedupe to perform on a write operation, the following conditions need to be true:
- In-line dedupe must be globally enabled on the cluster.
- The current operation is writing data (not a truncate or write zero operation).
- The no_dedupe flag is not set on the file.
- The file is not a special file type, such as an alternate data stream (ADS) or an EC (endurant cache) file.
- Write data includes fully overwritten and aligned blocks.
- The write is not part of a rehydrate operation.
- The file has not been packed (containerized) by small file storage efficiency (SFSE).
OneFS in-line dedupe uses the 128-bit CityHash algorithm, which is both fast and cryptographically strong. This contrasts with the OneFS post-process SmartDedupe, which uses SHA-1 hashing.
Each node in a cluster with in-line dedupe enabled has its own in-memory hash index that it compares block fingerprints against. The index lives in system RAM and is allocated using physically contiguous pages and is accessed directly with physical addresses. This avoids the need to traverse virtual memory mappings and does not incur the cost of translation lookaside buffer (TLB) misses, minimizing dedupe performance impact.
The maximum size of the hash index is governed by a pair of sysctl settings, one of which caps the size at 16 GB, and the other which limits the maximum size to 10% of total RAM. The strictest of these two constraints applies. While these settings are configurable, the recommended best practice is to use the default configuration. Any changes to these settings should only be performed under the supervision of Dell support.
Since in-line dedupe and SmartDedupe use different hashing algorithms, the indexes for each are not shared directly. However, the work performed by each dedupe solution can be used by each other. For instance, if SmartDedupe writes data to a shadow store, when those blocks are read, the read-hashing component of in-line dedupe sees those blocks and indexes them.
When a match is found, in-line dedupe performs a byte-by-byte comparison of each block to be shared to avoid the potential for a hash collision. Data is prefetched before the byte-by-byte check and is compared against the L1 cache buffer directly, avoiding unnecessary data copies and adding minimal overhead. Once the matching blocks are compared and verified as identical, they are shared by writing the matching data to a common shadow store and creating references from the original files to this shadow store.
In-line dedupe samples every whole block that is written and handles each block independently, so it can aggressively locate block duplicity. If a contiguous run of matching blocks is detected, in-line dedupe merges the results into regions and processes them efficiently.
In-line dedupe also detects dedupe opportunities from the read path, and blocks are hashed as they are read into L1 cache and inserted into the index. If an existing entry exists for that hash, in-line dedupe knows there is a block-sharing opportunity between the block it just read and the one previously indexed. It combines that information and queues a request to an asynchronous dedupe worker thread. As such, it is possible to deduplicate a data set purely by reading it all. To help mitigate the performance impact, the hashing is performed out-of-band in the prefetch path, rather than in the latency-sensitive read path.
The original in-line dedupe control path design had its limitations, since it did not provide gconfig control settings for the default-disabled in-line dedupe. The previous control-path logic had no gconfig control settings for default-disabled in-line dedupe. But in OneFS 9.4, there are now two separate features that interact together to distinguish between a new cluster or an upgrade to an existing cluster configuration:
For the first feature, upon upgrade to 9.4 on an existing cluster, if there is no in-line dedupe config present, the upgrade explicitly sets it to disabled in gconfig. This has no effect on an existing cluster since it’s already disabled. Similarly, if the upgrading cluster already has an existing in-line dedupe setting in gconfig, OneFS takes no action.
For the other half of the functionality, when booting OneFS 9.4, a node looks in gconfig to see if there’s an in-line dedupe setting. If no config is present, OneFS enables it by default. Therefore, new OneFS 9.4 clusters automatically enable dedupe, and existing clusters retain their legacy setting upon upgrade.
Since the in-line dedupe configuration is binary (either on or off across a cluster), you can easily control it manually through the OneFS command line interface (CLI). As such, the isi dedupe inline settings modify CLI command can either enable or disable dedupe at will—before, during, or after the upgrade. It doesn’t matter.
For example, you can globally disable in-line dedupe and verify it using the following CLI command:
# isi dedupe inline settings viewMode: enabled# isi dedupe inline settings modify –-mode disabled # isi dedupe inline settings view Mode: disabled
Similarly, the following syntax enables in-line dedupe:
# isi dedupe inline settings view Mode: disabled # isi dedupe inline settings modify –-mode enabled # isi dedupe inline settings view Mode: enabled
While there are no visible userspace changes when files are deduplicated, if deduplication has occurred, both the disk usage and the physical blocks metrics reported by the isi get –DD CLI command are reduced. Also, at the bottom of the command’s output, the logical block statistics report the number of shadow blocks. For example:
Metatree logical blocks: zero=260814 shadow=362 ditto=0 prealloc=0 block=2 compressed=0
In-line dedupe can also be paused from the CLI:
# isi dedupe inline settings modify –-mode paused # isi dedupe inline settings view Mode: paused
However, it’s worth noting that this global setting states what you’d like to happen, after which each node attempts to enact the new configuration. However, it can’t guarantee the change, because not all node types support in-line dedupe. For example, the following output is from a heterogenous cluster with an F200 three-node pool supporting in-line dedupe, and an H400 four-node pool not supporting it.
Here, we can see that in-line dedupe is globally enabled on the cluster:
# isi dedupe inline settings view Mode: enabled
However, you can use the isi_for_array isi_inline_dedupe_status command to display the actual setting and state of each node:
# isi dedupe inline settings view Mode: enabled # isi_for_array -s isi_inline_dedupe_status 1: OK Node setting enabled is correct 2: OK Node setting enabled is correct 3: OK Node setting enabled is correct 4: OK Node does not support inline dedupe and current is disabled 5: OK Node does not support inline dedupe and current is disabled 6: OK Node does not support inline dedupe and current is disabled 7: OK Node does not support inline dedupe and current is disabled
Also, any changes to the dedupe configuration are also logged to /var/log/messages, you can find them by grepping for inline_dedupe.
In a nutshell, in-line compression has always been enabled by default since its introduction in OneFS 8.1.3. For new clusters running 9.4 and above, in-line dedupe is on by default. For clusters running 9.3 and earlier, in-line dedupe remains disabled by default. And existing clusters that upgrade to 9.4 will not see any change to their current in-line dedupe config during upgrade.
And here’s the OneFS in-line data reduction platform support matrix for good measure:
OneFS 9.5 Performance Enhancements for Video Editing
Wed, 19 Jul 2023 18:16:59 -0000
|Read Time: 0 minutes
Of the many changes in OneFS 9.5, the most exciting are the performance enhancements on the NVMe-based PowerScale nodes: F900 and F600. These performance increases are the result of some significant changes “under-the-hood” to OneFS. In the lead-up to the National Association of Broadcasters show last April, I wanted to qualify how much of a difference the extra performance would make for Adobe Premiere Pro video editing workflows. Adobe is one of Dell’s biggest media software partners, and Premiere Pro is crucial to all sorts of media production, from broadcast to cinema.
The awesome news is that the changes to OneFS make a big difference. I saw 40% more video streams with the software upgrade: up to 140 streams of UHD ProRes422 from a single F900 node!
Changes to OneFS
Broadly speaking, there were changes to three areas in OneFS that resulted in the performance boost in version 9.5. These areas are L2 cache, backend networking, and prefetch.
L2 cache -- Being smart about how and when to bypass L2 cache and read directly from NVMe is one part of the OneFS 9.5 performance story. PowerScale OneFS clusters maintain a globally accessible L2 cache for all nodes in the cluster. Manipulating L2 cache can be “expensive” computationally speaking. During a read, the cluster needs to determine what data is in cache, whether the read should be added to cache, and what data should be expired from cache. NVMe storage is so performant that bypassing the L2 cache and reading data directly from NVMe frees up cluster resources. Doing so results in even faster reads on nodes that support it.
Backend networking -- OneFS uses a private backend network for internode communication. With the massive performance of NVMe based storage and the introduction of 100 GbE, limits were getting reached on this private network. OneFS 9.5 gets around these limitations with a custom multichannel approach (similar in concept to nconnect from the NFS world for the Linux folks out there). In OneFS 9.5, the connection channels on the backend network are bonded in a carefully orchestrated way to parallelize some aspects, while still keeping a predictable message ordering.
Prefetch -- The last part of the performance boost for OneFS 9.5 comes from improved file prefetch. How OneFS prefetches file system metadata was reworked to more optimally read ahead at the different depths of the metadata tree. Efficiency was improved and “jitter” between file system processes minimized.
Our lab setup
First a little background on PowerScale and OneFS. PowerScale is the updated name for the Isilon product line. The new PowerScale nodes are based on Dell servers with compute, RAM, networking, and storage. PowerScale is a scale-out, clustered network-attached-storage (NAS) solution. To build a OneFS file system, PowerScale nodes are joined to create cluster. The cluster creates a single NAS file system with the aggregate resources of all the nodes in the cluster. Client systems connect using a DNS name, and OneFS SmartConnect balances client connections between the various nodes. No matter which node the client connects to, that client has the potential to access all the data on the entire cluster. Further, the client systems benefit from the all the nodes acting in concert.
Even before the performance enhancements in OneFS 9.5, the NVMe-based PowerScale nodes were speedy, so a robust lab environment was going to be needed to stress the system. For this particular set of tests, I had access to 16 workstations running the latest version of Adobe Premiere Pro 2023. Each workstation ran Windows 10 with Nvidia GPU, Intel processor, and 10 GbE networking. On the storage side, the tests were performed against a minimum sized 3-node F900 PowerScale cluster with 100 GbE networking.
Adobe Premiere Pro excels at compressed video editing. The trick with compressed video is that an individual client workstation will get overwhelmed long before the storage system. As such, it is critical to evaluate whether any dropped frames are the result of storage or an overwhelmed workstation. A simple test is to take a single workstation and start playing back parallel compressed video streams, such as ProRes 422. Keeping a close watch on the workstation performance monitors, at a certain point CPU and GPU usage will spike and frames will drop. This test will show the maximum number of streams that a single workstation can handle. Because this test is all about storage performance, keeping the number of streams per workstation to a healthy range takes individual workstation performance out of the equation.
I settled on 10x streams of ProRes 422 UHD video running at 30 frames per second per workstation. Each individual video stream was ~70 MBps (560mbps). Running ten of these streams meant each workstation was pulling around 700 MBps (though with Premiere Pro prefetching this number was closer to 800 MBps). With this number of video streams, the workstation wasn’t working too hard and it was well within what would fit down a 10 GbE network pipe.
Running some quick math here, 16 workstations each pulling 800-ish MBps works out to about 12.5 GBps of total throughput. This throughput is not enough throughput to overwhelm even a small 3-node F900 cluster. In order to stress the system, all 16 workstations were manually pointed to single 100 GbE port on a single F900 node. Due to the clustered nature of OneFS, the clients will get benefit from the entire cluster. But even with the rest of the cluster behind it, at a certain point, a single F900 node is going to get overwhelmed.
Figure 1. OneFS Lab configuration
Test methodology
The first step was to import test media for playback. Each workstation accessed its own unique set of 10x one-hour long UHD ProRes422 clips. Then a separate Premiere Pro project was created for each workstation with 10 simultaneous layers of video. The plan was to start playback one by one on each workstation and see where the tipping point was for that single PowerScale F900 node. The test was to be run first with OneFS 9.4 and then with OneFS 9.5.
Adobe Premiere Pro has a debug overlay called DogEars. In addition to showing dropped frames, DogEars provides some useful metrics about how “healthy” video playback is in Premiere Pro. Even before a system starts to drop frames, latency spikes and low prefetch buffers show when Premiere Pro is struggling to sustain playback.
The metrics in DogEars that I was focused on were the following:
Dropped frames: This metric is obvious, dropped frames are unacceptable. However, at times Premiere Pro will show single digit dropped frames at playback start.
FramePrefetchLatency: This metric only shows up during playback. The latency starts high while the prefetch frame buffer is filling. When that buffer gets up to slightly over 300 frames, the latency drops down to around 20 to 30 milliseconds. When the storage system was overwhelmed, this prefetch latency goes well above 30 milliseconds and stays there.
CompleteAheadOfPlay: This metric also only shows up during playback. The number of frames creeps up during playback and settles in at slightly over 300 prefetched frames. The FramePrefetchLatency above will be high (in the 100ms range or so) until the 300 frames are prefetched, at which point the latency will drop down to 30ms or lower. When the storage system is stressed, Premiere Pro is never able to fill this prefetch buffer, and it never gets up to the 300+ frames.
Figure 2. Premiere Pro with Dogears overlay
Test results
With the test environment configured and the individual projects loaded, it was time to see what the system could provide.
With the PowerScale cluster running OneFS 9.4, playback was initiated on each Adobe Premiere workstation. Keep in mind that all the workstations were artificially pointed to a single node in this 3-node F900 cluster. That single F900 node running OneFS 9.4 could handle 10x of the workstations, each playing back 10x UHD streams. That’s 100x streams of UHD ProRes 422 video from one node. Not too shabby.
At 110x streams (11 workstations), no frames were dropped, but the CompleteAheadOfPlay number on all the workstations started to go below 300. Also, the FramePreFetchLatency spiked to over 100 milliseconds. Clearly, the storage node was unable to provide more performance.
After reproducing these results several times to confirm accuracy, we unmounted the storage from each workstation and upgraded the F900 cluster to OneFS 9.5. Time to see how much of a difference the OneFS 9.5 performance boost would make for Premiere Pro.
As before, each workstation loaded a unique project with unique ProRes media. At 100x streams of video, playback chugged along fine. Time to load up additional streams and see where things break. 110, 120, 130, 140… playback from the single F900 node continued to chug along with no drops and acceptable latency. It was only at 150 streams of video that playback began to suffer. By this time, that single F900 node was pumping close to 10GBps out of that single 100 GbE NIC port. These 14x workstations were not entirely saturating the connection, but getting close. And the performance was a 40% bump from the OneFS 9.4 numbers. Impressive.
Figure 3. isi statistics output with 140 streams of video from a single node
These results exceeded my expectations going into the project. Getting a 40% performance boost with a code upgrade to existing hardware is impressive. This increase lined up with some of the benchmarking tools used by engineering. But performance from a benchmark tool vs. a real-world application are often two entirely different things. Benchmark tools are particularly inaccurate for video playback where small increases in latency can result in unacceptable results. Because Adobe Premiere is one of the most widely used applications with PowerScale storage, it made sense as a test platform to gauge these differences. For more information about PowerScale storage and media, check out https://Dell.to/media.
Click here to learn more about the author, Gregory Shiff