
Blogs

Short articles related to Dell PowerScale.


PowerScale OneFS object storage

Distributed Media Workflows with PowerScale OneFS and Superna Golden Copy

Gregory Shiff

Tue, 06 Sep 2022 20:46:32 -0000


Object is the new core

Content creation workflows are increasingly distributed between multiple sites and cloud providers. Data orchestration has long been a key component in these workflows. With the extra complexity (and functionality) of multiple on-premises and cloud infrastructures, automated data orchestration is more crucial than ever.

There has been a subtle but significant shift in how media companies store and manage data. In the old way, file storage formed the “core” and data was eventually archived off to tape or object storage for long-term retention. The new way of managing data flips this paradigm. Object storage has become the new “core” with performant file storage at edge locations used for data processing and manipulation.

Various factors have influenced this shift. These factors include the ever-increasing volume of data involved in modern productions, the expanding role of public cloud providers (for whom object storage is the default), and media application support.

  

Figure 1.  Global storage environment

With this shift in roles, new techniques for data orchestration become necessary. Data management vendors are responding to these requirements with data movement and global file system solutions.

However, many of these solutions require data to be ingested and accessed through dedicated proprietary gateways. Often this gateway approach means that the data is now inaccessible using the native S3 API.

PowerScale OneFS and Superna Golden Copy provide a way of orchestrating data between file and object that retains the best qualities of both types of storage. Data is available on both the performant edge (PowerScale) and the object core (ECS or public cloud), with no lock-in at either end.

Further, Superna Golden Copy is directly integrated with the PowerScale OneFS API. The OneFS snapshot change list is used for immediate incremental data moves. Filesystem metadata is preserved in S3 tags.
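To give a concrete (and simplified) picture of what that metadata preservation looks like, POSIX attributes can be expressed as S3 object tags with a standard S3 API call. The tag keys below are illustrative, not Golden Copy's actual schema:

# Hypothetical sketch: expressing POSIX attributes as S3 object tags.
# Tag keys (uid, gid, mode, mtime) are examples, not Golden Copy's schema.
aws s3api put-object-tagging \
  --bucket media-archive \
  --key projects/show01/shot010.exr \
  --tagging 'TagSet=[{Key=uid,Value=2001},{Key=gid,Value=2003},{Key=mode,Value=0644},{Key=mtime,Value=1662494192}]'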

Together, Golden Copy and OneFS form a solution built for seamless movement of data between locations and between file and object storage. File structure and metadata are preserved.

Right tool for the job

Data that originates on object storage needs to be accessible natively by systems that can speak object APIs. Also, some subset of data needs to be moved to file storage for further processing. Production data that originates on file storage similarly needs native access. That file data will need to be moved to object storage for long-term retention and to make it accessible to globally distributed resources.

Content creation workflows are spread across multiple teams working in many locations. Multisite productions require distributed storage ecosystems that can span geographies. This architecture is well suited to a core of object storage as the “central source of truth”. Pools of highly performant file storage serve teams in their various global locations.

The Golden Copy GraphQL API allows external systems to control, configure, and monitor Golden Copy jobs. This type of API-based data orchestration is essential to the complex global pipelines of content creators. Manually moving large amounts of data is untenable. Schedule-based movement of data aligns well with some content creation workflows; other workflows require more ad hoc data movement.
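As a rough sketch of what such API-driven orchestration can look like, an external tool could start a copy job with a single GraphQL call over HTTPS. The endpoint and field names here are hypothetical, not Golden Copy's published schema:

# Hypothetical sketch of API-driven orchestration; the endpoint and
# GraphQL field names are illustrative only.
curl -s -X POST https://goldencopy.example.com/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { startArchiveJob(folder: \"/ifs/projects/show01\") { jobId status } }"}'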

Figure 2.  Object Core with Golden Copy and PowerScale

A large ecosystem of production management tools, such as Autodesk Shotgrid, exist for managing global teams. These tools are excellent for managing projects, but do not typically include dedicated data movers. Data movement can be particularly challenging when large amounts of media need to be shifted between object and file.

Production asset management can trigger data moves with Golden Copy based on metadata changes to a production or scene. This kind of API and metadata driven data orchestration fits in the MovieLabs 2030 vision for software-defined workflows for content creation. This topic is covered in some detail for tiering within a OneFS file system in the paper: A Metadata Driven Approach to On Demand Tiering.

For more information about using PowerScale OneFS together with Superna Golden Copy, see my full white paper PowerScale OneFS: Distributed Media Workflows.

Author: Gregory Shiff

AI PowerScale NFS performance metrics

Artificial Intelligence for IT operations (AIOps) in PowerScale Performance Prediction

Vincent Shen

Tue, 06 Sep 2022 18:14:53 -0000


AI has been a hot topic in recent years. A common question from our customers is ‘How can AI help the day-to-day operation and management of PowerScale?’ It’s an interesting question, because although AI opens up many possibilities, there are still relatively few implementations of it in IT infrastructure.

But we finally have something very inspiring to share. Here is what we have achieved as a proof of concept in our lab, with the support of AI Dynamics, a professional AI platform company.

Challenges for IT operations and opportunities for AIOps

With the increase in complexity of IT infrastructure comes an increase in the amount of data produced by these systems. Real-time performance logs, usage reports, audits, and other metadata can add up to gigabytes or terabytes a day. It is a big challenge for IT departments to analyze this data and extract proactive predictions, such as identifying IT infrastructure performance issues and their bottlenecks.

AIOps is the methodology to address these challenges. The term ‘AIOps’ refers to the use of artificial intelligence (AI), specifically machine learning (ML) techniques, to ingest, analyze, and learn from large volumes of data from every corner of the IT environment. The goal of AIOps is to allow IT departments to manage their assets and tackle performance challenges proactively, in real-time (or better), before they become system-wide issues. 

PowerScale key performance prediction using AIOps

Overview

In this solution, we identify NFS latency as the PowerScale performance indicator that customers would like to see predictive reporting about. The goal of the AI model is to study historical system activity and predict the NFS latency at five-minute intervals for four hours in the future. A traditional software system can use these predictions to alert users of a potential performance bottleneck based on the user’s specified latency threshold level and spike duration. In the future, AI models can be built that help diagnose the source of these issues so that both an alert and a best-recommended solution can be reported to the user.

The whole training process involves the following three steps (I’ll explain the details in the following sections):

  • Data preparation – to get the raw data and extract the useful features as the input for training and validation
  • Training the model – to pick a proper AI architecture and tune its parameters for accuracy
  • Model validation – to validate the AI model using data withheld from training

Data preparation

The raw performance data is collected through Dell Secure Remote Services (SRS) from 12 different all-flash PowerScale clusters from an electronic design automation (EDA) customer each week. We identify and extract 26 performance key metrics from the raw data, most of which are logged and updated every five minutes. AI Dynamics NeoPulse is used to extract some additional fields (such as the day of the week and time of day from the UNIX timestamp fields) to allow the model to make better predictions. Each week new data was collected from the PowerScale cluster to increase the size of the training dataset and to improve the AI model. During every training run, we also withheld 10% of the data, which we used to test the AI model in the testing phase. This is separate from the 10% of training data withheld for validation.

Figure 1.  Data preparation process
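As a minimal illustration of the timestamp-derived features mentioned above (a sketch, not the NeoPulse implementation), the day of week and hour of day can be computed from a UNIX timestamp:

# Derive calendar features from a UNIX timestamp (GNU date):
ts=1662494192
date -u -d "@${ts}" '+%u'    # day of week (1 = Monday)
date -u -d "@${ts}" '+%H'    # hour of day (00-23)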

Training the model

Over a period of two months, more than 50 different AI models were trained using a variety of different time series architectures, varying model architecture parameters, hyperparameters, and data engineering techniques to maximize performance, without overfitting to existing data. When these training pipelines were created in NeoPulse, they could easily be reused as new data arrived from the client each week, to rerun training and testing in order to quantify the performance of the model.

At the end of the two-month period, we had built a model that could correctly predict, for 70% of the next 48 five-minute intervals (four hours total), whether this one performance metric (NFS3 latency) would be above a threshold of 10 ms.

Model validation

In the data preparation phase, we withheld 10% of the total data set for AI model validation and testing. With the current AI model, end users can easily configure the latency threshold as they want. In this case, we validated the model at 10 ms and 15 ms latency thresholds. The model can correctly identify over 70% of 10 ms latency spikes and 60% of 15 ms latency spikes over the entire ensuing four-hour period.

Figure 2.  Model Validation

Results

In this solution, we used NFS latency from PowerScale as the indicator to be predicted. The AI model uses the performance data from the previous four hours to predict the trends and spikes of NFS latency in the next four hours. If the software identifies a five-minute period when a >10ms latency spike would occur more than 70% of the time, it will trigger a configurable alert to the user.

The following diagram shows an example. At 8:55 a.m., the AI model predicts the NFS latency from 8:55 a.m. to 12:55 p.m., based on the input of performance data from 4:55 a.m. to 8:55 a.m. The AI model makes predictions for each five-minute period over the prediction duration. The model predicts a few isolated spikes in latency, with a large consecutive cluster of high latency between around 12 p.m. and 12:55 p.m. A software system can use this prediction to alert the user about the expected increase in latency, giving them over three hours to get ahead of the problem and reduce the server load. In the graph, the dotted line shows the AI model’s prediction, whereas the solid line shows actual performance.


Figure 3.  Dell PowerScale NFS Latency Forecasting

To sum up, the solution achieved the following:

  • By using the previous four hours of PowerScale performance data, the solution can forecast the next four hours of any specified metric.
  • For NFS3 latency, the solution was benchmarked as correctly identifying periods when latency would be above 10ms 70% of the time.
  • The data and model training pipelines created for this task can easily be adapted to predict other performance metrics, such as NFS throughput spikes, SMB latency spikes, and so on.
  • The AI learns to improve its predictions week by week as it adapts to each customer’s nuanced usage patterns, creating customized models for each customer’s idiosyncratic workload profiles.

Conclusion

AIOps introduces the intelligence needed to manage the complexity of modern IT environments. The NeoPulse platform from AI Dynamics makes AIOps easy to implement. In an all-flash configuration of Dell PowerScale clusters, performance is one of the key considerations. Hundreds of thousands of performance log entries are generated every day, and it is easy for AIOps to consume the existing logs and provide insight into potential performance bottlenecks. Dell servers with GPUs are great platforms for performing training and inference, not just for this model but for any other new AI challenge a company wishes to tackle, across dozens of problem types.

For additional details about our testing, see the white paper Key Performance Prediction using Artificial Intelligence for IT operations (AIOps).

Author: Vincent Shen

data storage CSI PowerScale

Network Design for PowerScale CSI

Sean Zhan Florian Coulombel

Tue, 23 Aug 2022 17:00:45 -0000


Network connectivity is an essential part of any infrastructure architecture. When it comes to how Kubernetes connects to PowerScale, there are several options to configure the Container Storage Interface (CSI). In this post, we will cover the concepts and configuration you can implement.

The story starts with CSI plugin architecture.

CSI plugins

Like all other Dell storage CSI drivers, PowerScale CSI follows the Kubernetes CSI standard by implementing its functions in two components:

  • CSI controller plugin
  • CSI node plugin 

The CSI controller plugin is deployed as a Kubernetes Deployment, typically with two or three replicas for high availability, with only one instance acting as the leader. The controller is responsible for communicating with PowerScale through the Platform API to manage volumes (on the PowerScale side, that means creating and deleting directories, NFS exports, and quotas), to update the NFS client list when a Pod moves, and so on.

A CSI node plugin is a Kubernetes DaemonSet, running on all nodes by default. It is responsible for mounting the NFS export from PowerScale and mapping the NFS mount path into a Pod as persistent storage, so that applications and users in the Pod can access the data on PowerScale.

Roles, privileges, and access zone

Because CSI needs to access both PAPI (PowerScale Platform API) and NFS data, a single user role typically isn’t secure enough: the role used for PAPI access needs more privileges than a normal user should have.

According to the PowerScale CSI manual, CSI requires a user that has the following privileges to perform all CSI functions:

Privilege                 Type
ISI_PRIV_LOGIN_PAPI       Read Only
ISI_PRIV_NFS              Read Write
ISI_PRIV_QUOTA            Read Write
ISI_PRIV_SNAPSHOT         Read Write
ISI_PRIV_IFS_RESTORE      Read Only
ISI_PRIV_NS_IFS_ACCESS    Read Only
ISI_PRIV_IFS_BACKUP       Read Only

Among these privileges, ISI_PRIV_SNAPSHOT and ISI_PRIV_QUOTA are only available in the System zone, and this complicates things a bit. To fully use CSI features such as volume snapshots, volume clones, and volume capacity management, you have to allow CSI to access the PowerScale System zone. If you enable CSM for replication, the user also needs the ISI_PRIV_SYNCIQ privilege, which is a System-zone privilege too.

By contrast, there isn’t any specific role requirement for applications and users in Kubernetes to access data: the data is shared over the standard NFS protocol. As long as they have the right ACLs to access the files, they are good. For this data access requirement, a non-System zone is suitable and recommended.

These two access zones are defined in different places in CSI configuration files:

  • The PAPI access zone name (FQDN) needs to be set in the secret yaml file as “endpoint”, for example “f200.isilon.com”.
  • The data access zone name (FQDN) needs to be set in the storageclass yaml file as “AzServiceIP”, for example “openshift-data.isilon.com”.
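As a rough sketch of where the two names land (abbreviated; see the CSI PowerScale driver documentation for the complete schemas, and note that the cluster name and credentials here are made up):

secret.yaml excerpt (PAPI access zone):

isilonClusters:
  - clusterName: "f200"
    username: "csi-user"
    password: "********"
    endpoint: "f200.isilon.com"

storageclass.yaml excerpt (data access zone):

parameters:
  AccessZone: "data"
  AzServiceIP: "openshift-data.isilon.com"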

If an admin really cannot expose their System zone to the Kubernetes cluster, they have to disable the snapshot and quota features in the CSI installation configuration file (values.yaml). In this way, the PAPI access zone can be a non-System access zone.

The following diagram shows how the Kubernetes cluster connects to PowerScale access zones.

Network

Normally a Kubernetes cluster comes with many networks: a pod inter-communication network, a cluster service network, and so on. Luckily, the PowerScale network doesn’t have to join any of them. The CSI pods can access a host’s network directly, without going through the Kubernetes internal network. This also has the advantage of providing a dedicated high-performance network for data transfer.

For example, on a Kubernetes host, there are two NICs: IP 192.168.1.x and 172.24.1.x. NIC 192.168.1.x is used for Kubernetes, and is aligned with its hostname. NIC 172.24.1.x isn’t managed by Kubernetes. In this case, we can use NIC 172.24.1.x for data transfer between Kubernetes hosts and PowerScale.

By default, the CSI driver uses the IP that is aligned with the hostname. To make CSI use the second NIC (172.24.1.x) instead, we have to explicitly set the IP range in “allowedNetworks” in the values.yaml file during CSI driver installation. For example:

allowedNetworks: [172.24.1.0/24]

Also, in this network configuration, it’s unlikely that the Kubernetes internal DNS can resolve the PowerScale FQDN. So, we also have to make sure the “dnsPolicy” has been set to “ClusterFirstWithHostNet” in the values.yaml file. With this dnsPolicy, the CSI pods will reach the DNS server in /etc/resolv.conf in the host OS, not the internal DNS server of Kubernetes.
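Putting both settings together, the relevant values.yaml fragment looks like this:

allowedNetworks: [172.24.1.0/24]
dnsPolicy: "ClusterFirstWithHostNet"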

The following diagram shows the configuration mentioned above:

Please note that the “allowedNetworks” setting only affects the data access zone, and not the PAPI access zone. In fact, CSI just uses this parameter to decide which host IP should be set as the NFS client IP on the PowerScale side.

Regarding network routing, CSI simply follows the OS route configuration. Because of that, if we want PAPI traffic to go through the primary NIC (192.168.1.x) and data traffic to go through the second NIC (172.24.1.x), we have to change the route configuration of the Kubernetes host, not this parameter.

Hopefully this blog helps you understand the network configuration for PowerScale CSI better. Stay tuned for more information on Dell Containers & Storage!

Authors: Sean Zhan, Florian Coulombel

security PowerScale OneFS

Disabling the WebUI and other Non-essential Services

Aqib Kazi

Mon, 25 Jul 2022 13:43:38 -0000


In today's security environment, organizations must adhere to governance security requirements, including disabling specific HTTP services.

OneFS release 9.4.0.0 introduces an option to disable non-essential cluster services selectively, rather than disabling all HTTP services. Selective disabling allows administrators to determine which services are necessary, while other essential services on the cluster continue to run. You can disable the following non-essential services:

  • PowerScaleUI (WebUI)
  • Platform-API-External
  • RESTful Access to Namespace (RAN)
  • RemoteService

Each of these services can be disabled independently and has no impact on other HTTP-based data services. The services can be disabled through the CLI or API with the ISI_PRIV_HTTP privilege. To manage the non-essential services from the CLI, use the isi http services list command to list the services. Use the isi http services view and isi http services modify commands to view and modify the services. The impact of disabling each of the services is listed in the following table.
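For example, to review the current services and inspect one before making changes:

isi http services list
isi http services view PowerScaleUI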

HTTP services impacts

  • PowerScaleUI: The WebUI is entirely disabled. Attempting to access the WebUI displays “Service Unavailable. Please contact Administrator.”

  • Platform-API-External: Disabling the Platform-API-External service does not impact the Platform-API-Internal service of the cluster. The Platform-API-Internal services continue to function, even if the Platform-API-External service is disabled. However, disabling the Platform-API-External service also disables the WebUI, because the WebUI uses the Platform-API-External service.

  • RAN (RESTful Access to Namespace): If RAN is disabled, use of the Remote File Browser UI component is restricted in the Remote File Browser and the File System Explorer.

  • RemoteService: If RemoteService is disabled, the remote support UI and the InProduct Activation UI components are restricted.

To disable the WebUI, use the following command:

isi http services modify --service-id=PowerScaleUI --enabled=false

Author: Aqib Kazi



VMware PowerScale cloud Google Cloud NAS

Dell PowerScale for Google Cloud New Release Available

Lieven Lin

Fri, 22 Jul 2022 15:18:57 -0000


PowerScale for Google Cloud provides the native-cloud experience of file services with high performance. It is a scalable file service that provides high-speed file access over multiple protocols, including SMB, NFS, and HDFS. PowerScale for Google Cloud enables customers to run their cloud workloads on the PowerScale scale-out NAS storage system. The following figure shows the architecture of PowerScale for Google Cloud. The three main parts are the Dell Technologies partner data center, the Dell Technologies Google Cloud organization (isiloncloud.com), and the customer’s Google Cloud organization (for example, customer-a.com and customer-b.com).

PowerScale for Google Cloud: a new release

We proudly released a new version of PowerScale for Google Cloud on July 8, 2022. It provides the following key features and enhancements:

More flexible configurations to choose from

In the previous version of PowerScale for Google Cloud, only a few pre-defined node tiers were available. With the latest version, you can purchase all PowerScale node types to fit your business needs and accelerate your native-cloud file service experience.

New location available in EMEA region

In the previous version, the supported regions included North America and APJ (Australia and Singapore). We are now adding the EMEA region, with locations in London, Frankfurt, Paris, and Warsaw.

Google Cloud VMware Engine (GCVE) Certification

PowerScale for Google Cloud is now certified to support GCVE. GCVE guest VMs can connect to PowerScale for Google Cloud file services to fully leverage PowerScale cluster storage. We’ll be taking a deeper look at the details in blog articles in the next few weeks.

Want to know more about this powerful cloud file service solution? See the PowerScale for Google Cloud resources.

Author: Lieven Lin


PowerScale OneFS NAS

PowerScale Delivers Better Efficiency and Higher Node Density with Gen2 QLC Drives

Cris Banson

Wed, 13 Jul 2022 14:50:00 -0000


Quad-level cell (QLC) flash memory 15TB and 30TB drives have just been made available for the PowerScale F900 and F600 all-flash models. These new QLC drives, supported by the currently shipping OneFS 9.4 release, offer our customers optimum economics for NAS workloads that require performance, reliability, and capacity – such as financial modeling, media and entertainment, artificial intelligence (AI), machine learning (ML), and deep learning (DL). See the preview of this technology that we provided in May at Dell Technologies World (DTW).

PowerScale F900/F600 QLC raw capacity

 

Model   Chassis design (per node)    QLC drive   Raw capacity per node   Raw capacity for maximum cluster configuration (252 nodes)
F900    2U with 24 NVMe SSD drives   30.72TB     737.28TB                185.79PB
F900    2U with 24 NVMe SSD drives   15.36TB     368.6TB                 92.89PB
F600    1U with 8 NVMe SSD drives    30.72TB     245.76TB                61.93PB
F600    1U with 8 NVMe SSD drives    15.36TB     122.88TB                30.96PB

QLC drives expand the data lake with up to 2x more capacity than previous generations in the same footprint, while delivering savings in consolidated rack space and power/cooling. From the edge to the core and to the cloud, PowerScale systems deliver simplicity, value, performance, flexibility, and choice.   

  • Dell PowerScale with QLC drives delivers better efficiency with half the power and half the rack space required per TB as compared with current highest capacity node.1
  • Dell PowerScale with QLC drives delivers up to 2x higher raw cluster capacity as compared with current all-flash drives.2
  • Dell PowerScale with QLC drives delivers up to 2x higher raw node density as compared with current all-flash drives.3
  • Dell PowerScale with QLC drives delivers up to 19% lower price per TB as compared with current all-flash drives.4 

PowerScale nodes built with QLC drives can deliver the same level of performance as nodes built with TLC drives, while requiring only half the power and half the rack space. They are also up to 19% lower in price per TB, thus delivering superior economics and value to our customers. QLC-enabled nodes performed at parity or slightly better than TLC-enabled nodes for throughput benchmarks and SPEC workloads.5

QLC drive-enabled nodes deliver the same performance as TLC while improving efficiency and doubling cluster capacity

These QLC drives become part of the overall lifecycle management system within OneFS, which gives PowerScale a major TCO advantage over the competition. Seamless integration of nodes with QLC drives into existing PowerScale clusters allows clusters to take on new workloads. To address the storage capacity, performance needs, and cost optimization requirements of today’s workloads (while being powerful enough to handle the unpredictable demands of tomorrow), PowerScale systems are designed to provide customers choice, scale, and flexibility.

“With PowerScale, we have the flexibility to deploy the right storage with the right performance and right capacity to meet our business needs of today and the future,” said Michael Loggins, Global Vice President, Information Technology, SMC Corporation of America.

For more information about the PowerScale F600 and F900 QLC drives, visit the PowerScale all-flash spec sheet.

-------------------------------------------------------

1Based on Dell internal analysis, June 2022. Actual results will vary.   

2Based on Dell internal analysis, June 2022.

3Based on Dell internal analysis, June 2022.

4Based on Dell internal pricing analysis, June 2022. Actual results will vary.   

5Based on Dell internal testing, April 2022. Actual results will vary. 

 

Author: Cris Banson


PowerScale OneFS data access

Data Access in OneFS - Part 2: Introduction to OneFS Access Tokens

Lieven Lin

Fri, 01 Jul 2022 14:15:16 -0000


Recap

In the previous blog, we introduced the OneFS file permission basics, including:

1. A OneFS file permission is always in one of the following states:

  • POSIX mode bits - authoritative with a synthetic ACL
  • OneFS ACL - authoritative with approximate POSIX mode bits

2. No matter the OneFS file permission state, the on-disk identity for a file is always a UID, a GID, or an SID. The name of a user or group is for display only.

3. When OneFS receives a user access request, it generates an access token for the user and compares the token to the file permissions based on UID/GID/SID.

In this blog, we will explain what UIDs, GIDs, and SIDs are, and what a OneFS access token is. Let’s start with UIDs, GIDs, and SIDs.

UID/GID and SID

In our daily life, we are usually familiar with a username or a group name, which stands for a user or a group. In a NAS system, we use UID, GID, and SID to identify a user or a group, then the NAS system will resolve the UID, GID, and SID into a related username or group name.

The UID/GID is usually used in a UNIX environment to identify users/groups, with a positive integer assigned. The UID/GID is usually provided by the local operating system or an LDAP server.

The SID is usually used in a Windows environment to identify users/groups. The SID is usually provided by the local operating system or Active Directory (AD). The SID is written in the format:

            (SID)-(revision level)-(identifier-authority)-(subauthority1)-(subauthority2)-(etc)

for example:

S-1-5-21-1004336348-1177238915-682003330-512

For more information about SIDs, see the Microsoft article: What Are Security Identifiers?.

OneFS access token

In OneFS, information about users and groups is managed and stored in different authentication providers, including UID/GID and SID information, and user group membership information. OneFS can add multiple types of authentication provider, including:

  • Active Directory (AD)
  • Lightweight Directory Access Protocol (LDAP) servers
  • NIS
  • File provider
  • Local provider

OneFS retrieves a user’s identity (UID/GID/SID) and group memberships from the above authentication providers. Assuming that we have a user named Joe, OneFS tries to resolve Joe’s UID/GID and group memberships from LDAP, NIS, file provider, and Local provider. Meanwhile, it also tries to resolve Joe’s SID and group memberships from AD, file provider, or local provider. 

  • If neither UID/GID nor SID can be found in any of the authentication providers, the user does not exist. User access may be denied or be mapped to the ‘nobody’ user, depending on your protocol. 
  • If only a UID/GID can be found or only a SID can be found, OneFS generates a fake UID or SID for the user.

It is not always the case that OneFS needs to resolve a user from username to UID/GID/SID. It is also possible that OneFS needs to resolve a user in reverse: that is, resolve a UID to its related username. This usually occurs when using NFSv3. When OneFS gets all UID/GID/SID information for a user, it will maintain the identity relationship in a local database, which records the UID <--> SID and GID <-->SID mapping, also known as the ID mapping function in OneFS.

Now, you should have an overall idea about how OneFS maintains the important UID/GID/SID information, and how to retrieve this information as needed.

Next, let’s see how OneFS can determine whether different usernames in different authentication providers actually belong to the same real user. For example, how can we tell whether Joe in AD and joe_f in LDAP are the same person? If they are, OneFS needs to ensure that they have the same access to the same files, even over different protocols.

That is the magic of the OneFS user mapping function. The default user mapping rule maps together users that have the same username in different authentication providers. For example, Joe in AD and Joe in LDAP will be considered the same user. You must create user mapping rules if a real user has different names in different authentication providers. A user mapping rule can use different operators to provide more flexible mapping between usernames in different authentication providers: Append, Insert, Replace, Remove Groups, and Join. See OneFS user mapping operators for more details. Just remember that user mapping is a function that determines whether the user information in an authentication provider should be used when generating an access token.
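As a sketch, the join rule used in the example later in this post could be added to an access zone from the CLI like this (the ‘&=’ join operator and the exact flag syntax should be verified against your OneFS release):

# Join an AD account with an LDAP account in the System zone (verify syntax):
isi zone zones modify System --add-user-mapping-rules "VLAB\John_AD &= John_LDAP"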

Although it is easy to confuse user mapping with ID mapping, user mapping is the process of identifying users across authentication providers for the purpose of token generation. After the token is generated, the mappings of SID<-->UID are placed in the ID mapping database.

Finally, OneFS must choose an authoritative identity (that is, an On-Disk Identity) from the collected/generated UID/GID/SID for the user, which will be stored on disk and is used when the file is created or when ownership of file changes, impacting the file permissions.

In a single protocol environment, determining the On-Disk Identity is simple because Windows uses SIDs and Linux uses UIDs. However, in a multi-protocol environment, only one identity is stored, and the challenge is determining which one is stored. By default, the policy configured for on-disk identity is Native mode. Native mode is the best option for most environments. OneFS selects the real value between the SID and UID/GID. If both the SID and UID/GID are real values, OneFS selects UID/GID. Please note that this blog series is based on the default policy setting.

Now you should have an overall understanding of user mapping, ID mapping, and on-disk identity. These are the key concepts when understanding user access tokens and doing troubleshooting. Finally, let’s see what an access token contains. 

You can view a user’s access token by using the command isi auth mapping token <username> in OneFS. Here is an example of Joe’s access token:

vonefs-aima-1# isi auth mapping token Joe
                   User
                       Name: Joe
                        UID: 2001
                        SID: S-1-5-21-1137111906-3057660394-507681705-1002
                    On Disk: 2001
                        ZID: 1
                       Zone: System
                 Privileges: -
              Primary Group
                       Name: Market
                        GID: 2003
                        SID: S-1-5-21-1137111906-3057660394-507681705-1006
                    On Disk: 2003
    Supplemental Identities
                       Name: Authenticated Users
                        SID: S-1-5-11

From the above output, we can see that an access token contains the following information:

  • User’s username, UID, SID, and final on-disk identity
  • Access zone ID and name
  • OneFS RBAC privileges
  • Primary group name, GID, SID, and final on-disk identity
  • Supplemental group names and their GIDs or SIDs

Remember the file created and owned by Joe in the previous blog? Here are its file permissions:

vonefs-aima-1# ls -le acl-file.txt
-rwxrwxr-x +   1 Joe  Market   69 May 28 01:08 acl-file.txt
 OWNER: user:Joe
 GROUP: group:Market
 0: user:Joe allow file_gen_all
 1: group:Market allow file_gen_read,file_gen_execute
 2: user:Bob allow file_gen_all
 3: everyone allow file_gen_read,file_gen_execute

The ls -le command here shows the user’s username only. And we already emphasized that the on-disk identity is always UID/GID or SID, so let’s use the ls -len command to show the on-disk identities. In the following output, we see that Joe’s on-disk identity is his UID 2001, and his GID 2003. When Joe wants to access the file, OneFS compares Joe’s access token with the file permissions below, finds that Joe’s UID is 2001 in his token, and grants him access to the file.

vonefs-aima-1# ls -len acl-file.txt
-rwxrwxr-x +   1 2001  2003   69 May 28 01:08 acl-file.txt
 OWNER: user:2001
 GROUP: group:2003
 0: user:2001 allow file_gen_all
 1: group:2003 allow file_gen_read,file_gen_execute
 2: user:2002 allow file_gen_all
 3: SID:S-1-1-0 allow file_gen_read,file_gen_execute

The above Joe is a OneFS local user from a local provider. Next, we will see what the access token looks like if a user’s SID is from AD and UID/GID is from LDAP.

Let’s assume that user John has an account named John_AD in AD, and also an account named John_LDAP in an LDAP server. OneFS has to ensure that the two accounts have consistent access permissions on a file. To achieve that, we need to create a user mapping rule to join them together, so that the final access token contains the SID information from AD and the UID/GID information from LDAP. The access token for John_AD looks like this:

vonefs-aima-1# isi auth mapping token vlab\\John_AD
                   User
                       Name: VLAB\john_ad
                        UID: 1000019
                        SID: S-1-5-21-2529895029-2434557131-462378659-1110
                    On Disk: S-1-5-21-2529895029-2434557131-462378659-1110
                        ZID: 1
                       Zone: System
                 Privileges: -
              Primary Group
                       Name: VLAB\domain users
                        GID: 1000041
                        SID: S-1-5-21-2529895029-2434557131-462378659-513
                    On Disk: S-1-5-21-2529895029-2434557131-462378659-513
    Supplemental Identities
                       Name: Users
                        GID: 1545
                        SID: S-1-5-32-545

                       Name: Authenticated Users
                        SID: S-1-5-11

                       Name: John_LDAP
                        UID: 19421
                        SID: S-1-22-1-19421

                       Name: ldap_users
                        GID: 32084
                        SID: S-1-22-2-32084

Assume that a file owned by, and only accessible to, John_LDAP has the file permissions shown in the following output. Because John_AD and John_LDAP are joined together by a user mapping rule, the John_LDAP identity (UID) is also in John_AD’s access token, so John_AD can also access the file.

vonefs-aima-1# ls -le john_ldap.txt
-rwx------     1 John_LDAP  ldap_users  19 Jun 15 07:36 john_ldap.txt
 OWNER: user:John_LDAP
 GROUP: group:ldap_users
 SYNTHETIC ACL
 0: user:John_LDAP allow file_gen_read,file_gen_write,file_gen_execute,std_write_dac
 1: group:ldap_users allow std_read_dac,std_synchronize,file_read_attr

You should now have an understanding of OneFS access tokens, and how they are used to determine a user’s authorized operation on data, through file permission checking.

In my next blog, we will see what happens when different protocols access OneFS data.


Author: Lieven Lin

 



data protection PowerScale OneFS

OneFS Smartfail

Nick Trimbee

Mon, 27 Jun 2022 21:03:17 -0000


OneFS protects data stored on failing nodes or drives in a cluster through a process called smartfail. During the process, OneFS places a device into quarantine and, depending on the severity of the issue, the data on it into a read-only state. While a device is quarantined, OneFS reprotects the data on the device by distributing the data to other devices.

After all data eviction or reconstruction is complete, OneFS logically removes the device from the cluster, and the node or drive can be physically replaced. OneFS only automatically smartfails devices as a last resort. Nodes and/or drives can also be manually smartfailed. However, it is strongly recommended to first consult Dell Technical Support.

Occasionally a device might fail before OneFS detects a problem. If a drive fails without being smartfailed, OneFS automatically starts rebuilding the data to available free space on the cluster. However, because a node might recover from a transient issue, if a node fails, OneFS does not start rebuilding data unless it is logically removed from the cluster.

A node that is unavailable and reported by isi status as ‘D’, or down, can be smartfailed. If the node is hard down, likely with a significant hardware issue, the smartfail process will take longer because data has to be recalculated from the FEC protection parity blocks. That said, it’s well worth attempting to bring the node up if at all possible – especially if the cluster, and/or node pools, is at the default +2D:1N protection. The concern here is that, with a node down, there is a risk of data loss if a drive or other component goes bad during the smartfail process.

If possible, and assuming the disk content is still intact, it can often be quicker to have the node hardware repaired. In this case, the entire node’s chassis (or compute module in the case of Gen 6 hardware) could be replaced and the old disks inserted with original content on them. This should only be performed at the recommendation and under the supervision of Dell Technical Support. If the node is down because of a journal inconsistency, it will have to be smartfailed out. In this case, engage Dell Technical Support to determine an appropriate action plan.

The recommended procedure for smartfailing a node is as follows. In this example, we’ll assume that node 4 is down:

From the CLI of any node except node 4, run the following command to smartfail out the node:

# isi devices node smartfail --node-lnn 4

Verify that the node is removed from the cluster.

# isi status -q

(A ‘-S-’ will appear in node 4’s ‘Health’ column to indicate it has been smartfailed).

Monitor the successful completion of the job engine’s MultiScan and FlexProtect/FlexProtectLIN jobs:

# isi job status

Un-cable and remove the node from the rack for disposal.

As mentioned previously, there are two primary Job Engine jobs that run as a result of a smartfail:

  • MultiScan
  • FlexProtect or FlexProtectLIN

MultiScan performs the work of both the AutoBalance and Collect jobs simultaneously, and it is triggered after every group change. The reason is that new file layouts and file deletions that happen during a disruption to the cluster might be imperfectly balanced or, in the case of deletions, simply lost.

The Collect job reclaims free space from previously unavailable nodes or drives. A mark and sweep garbage collector, it identifies everything valid on the filesystem in the first phase. In the second phase, the Collect job scans the drives, freeing anything that isn’t marked valid.

When node and drive usage across the cluster are out of balance, the AutoBalance job scans through all the drives looking for files to re-layout, to make use of the less filled devices.

The purpose of the FlexProtect job is to scan the file system after a device failure to ensure that all files remain protected. Incomplete protection levels are fixed, in addition to missing data or parity blocks caused by drive or node failures. This job is started automatically after smartfailing a drive or node. If a smartfailed device was the reason the job started, the device is marked gone (completely removed from the cluster) at the end of the job.

Although a new node can be added to a cluster at any time, it’s best to avoid major group changes during a smartfail operation. This helps to avoid any unnecessary interruptions of a critical job engine data reprotection job. However, because a node is down, there is a window of risk while the cluster is rebuilding the data from that node. Under pressing circumstances, the smartfail operation can be paused, the node added, and then smartfail resumed when the new node has happily joined the cluster.

Be aware that if the node you are adding is the same node that was smartfailed, the cluster maintains a record of that node and may prevent the re-introduction of that node until the smartfail is complete. To mitigate risk, Dell Technical Support should definitely be involved to ensure data integrity.

The time for a smartfail to complete is hard to predict with any accuracy, and depends on:

  • OneFS release: Determines the OneFS job engine version and how efficiently it operates.
  • System hardware: Drive types, CPU, RAM, and so on.
  • File system: Quantity and type of data (that is, small vs. large files), protection, tunables, and so on.
  • Cluster load: Processor and CPU utilization, capacity utilization, and so on.

Typical smartfail runtimes range from minutes (for fairly empty, idle nodes with SSD and SAS drives) to days (for nodes with large SATA drives and a high capacity utilization). The FlexProtect job already runs at the highest job engine priority (value=1) and medium impact by default. As such, there isn’t much that can be done to speed up this job, beyond reducing cluster load.

Smartfail is also a valuable tool for proactive cluster node replacement, such as during a hardware refresh. Provided that the cluster quorum is not broken, a smartfail can be initiated on multiple nodes concurrently, but never on more than n/2 - 1 nodes (rounded up)!

If replacing an entire node pool as part of a tech refresh, a SmartPools filepool policy can be crafted to migrate the data to another node pool across the backend network. When complete, the nodes can then be smartfailed out, which should progress swiftly because they are now empty.
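As a sketch, such an evacuation policy might look like the following (the path and target pool names are hypothetical; verify the flag syntax for your OneFS release):

# Migrate everything under /ifs/data to another node pool ahead of a smartfail:
isi filepool policies create evacuate-old-pool \
  --begin-filter --path=/ifs/data --end-filter \
  --data-storage-target=f600_30tb_pool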

If multiple nodes are smartfailed simultaneously, at the final stage of the process the node remove is serialized with roughly a 60 second pause between each. The smartfail job places the selected nodes in read-only mode while it copies the protection stripes to the cluster’s free space. Using SmartPools to evacuate data from a node or set of nodes, in preparation to remove them, is generally a good idea, and usually a relatively fast process.

SmartPools’ Virtual Hot Spare (VHS) functionality helps ensure that node pools maintain enough free space to successfully re-protect data in the event of a smartfail. Though configured globally, VHS actually operates at the node pool level so that nodes with different size drives reserve the appropriate VHS space. This helps ensure that while data may move from one disk pool to another during repair, it remains on the same class of storage. VHS reservations are cluster wide and configurable, as either a percentage of total storage (0-20%), or as a number of virtual drives (1-4), with the default being 10%.
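For example, the reservation can be inspected and adjusted from the CLI (a sketch; verify the flag names for your OneFS release):

# View current storage pool settings, including VHS, then set VHS to 10%:
isi storagepool settings view
isi storagepool settings modify --virtual-hot-spare-limit-percent=10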

Note: a smartfail is not guaranteed to remove all data on a node. Any pool in a cluster that’s flagged with the ‘System’ flag can store /ifs/.ifsvar data. A filepool policy to move the regular data won’t address this data. Also, because SmartPools ‘spillover’ may have occurred at some point, there is no guarantee that an ‘empty’ node is completely devoid of data. For this reason, OneFS still has to search the tree for files that may have blocks residing on the node.

Nodes can be easily smartfailed from the OneFS WebUI by navigating to Cluster Management > Hardware Configuration and selecting ‘Actions > More > Smartfail Node’ for the desired node(s):

Similarly, the following CLI commands first initiate and then halt the node smartfail process, respectively. First, the ‘isi devices node smartfail’ command kicks off the smartfail process on a node and removes it from the cluster.

# isi devices node smartfail -h
Syntax
# isi devices node smartfail
[--node-lnn <integer>]
[--force | -f]
[--verbose | -v]

If necessary, the ‘isi devices node stopfail’ command can be used to discontinue the smartfail process on a node.

# isi devices node stopfail -h
Syntax
isi devices node stopfail
[--node-lnn <integer>]
[--force | -f]
[--verbose | -v]

Similarly, individual drives within a node can be smartfailed with the ‘isi devices drive smartfail’ CLI command.

# isi devices drive smartfail { <bay> | --lnum <integer> | --sled <string> }
        [--node-lnn <integer>]
        [{--force | -f}]
        [{--verbose | -v}]
        [{--help | -h}]

Author: Nick Trimbee



PowerScale OneFS SmartPools

OneFS SmartPools and the FilePolicy Job

Nick Trimbee

Fri, 24 Jun 2022 18:22:15 -0000


Traditionally, OneFS has used the SmartPools jobs to apply its file pool policies. To accomplish this, the SmartPools job visits every file, and the SmartPoolsTree job visits a tree of files. However, the scanning portion of these jobs can result in significant random impact to the cluster and lengthy execution times, particularly in the case of the SmartPools job. To address this, OneFS also provides the FilePolicy job, which offers a faster, lower impact method for applying file pool policies than the full-blown SmartPools job.

But first, a quick Job Engine refresher…

As we know, the Job Engine is OneFS’ parallel task scheduling framework, and is responsible for the distribution, execution, and impact management of critical jobs and operations across the entire cluster.

The OneFS Job Engine schedules and manages all data protection and background cluster tasks: creating jobs for each task, prioritizing them, and ensuring that inter-node communication and cluster wide capacity utilization and performance are balanced and optimized. Job Engine ensures that core cluster functions have priority over less important work and gives applications integrated with OneFS – Isilon add-on software or applications integrating to OneFS by means of the OneFS API – the ability to control the priority of their various functions to ensure the best resource utilization.

Each job, such as the SmartPools job, has an “Impact Profile” comprising a configurable Impact Policy and an Impact Schedule that characterize how much of the system’s resources the job will take. The amount of work a job has to do is fixed, but the resources dedicated to that work can be tuned to minimize the impact to other cluster functions, like serving client data.

Here’s a list of the specific jobs that are directly associated with OneFS SmartPools:

  • SmartPools: Runs and moves data between the tiers of nodes within the same cluster. Also executes the CloudPools functionality if licensed and configured.
  • SmartPoolsTree: Enforces SmartPools file policies on a subtree.
  • FilePolicy: Efficient changelist-based SmartPools file pool policy job.
  • IndexUpdate: Creates and updates an efficient file system index for the FilePolicy job.
  • SetProtectPlus: Applies the default file policy. This job is disabled if SmartPools is activated on the cluster.

In conjunction with the IndexUpdate job, FilePolicy improves job scan performance by using a ‘file system index’, or changelist, to find files needing policy changes, rather than a full tree scan.

 

Avoiding a full treewalk dramatically decreases the amount of locking and metadata scanning work the job is required to perform, reducing impact on CPU and disk – albeit at the expense of not doing everything that SmartPools does. The FilePolicy job enforces just the SmartPools file pool policies, as opposed to the storage pool settings. For example, FilePolicy does not deal with changes to storage pools or storage pool settings, such as:

  • Restriping activity due to adding, removing, or reorganizing node pools
  • Changes to storage pool settings or defaults, including protection

However, most of the time, SmartPools and FilePolicy perform the same work. Disabled by default, FilePolicy supports the full range of file pool policy features, reports the same information, and provides the same configuration options as the SmartPools job. Because FilePolicy is a changelist-based job, it performs best when run frequently – once or multiple times a day, depending on the configured file pool policies, data size, and rate of change.

Job schedules can easily be configured from the OneFS WebUI by navigating to Cluster Management > Job Operations, highlighting the desired job, and selecting ‘View/Edit’. The following example illustrates configuring the IndexUpdate job to run every six hours at a LOW impact level with a priority value of 5:

When enabling and using the FilePolicy and IndexUpdate jobs, the recommendation is to continue running the SmartPools job as well, but at a reduced frequency (monthly).

In addition to running on a configured schedule, the FilePolicy job can also be executed manually.

FilePolicy requires access to a current index. If the IndexUpdate job has not yet been run, attempting to start the FilePolicy job fails with an error prompting you to run the IndexUpdate job first. When the index has been created, the FilePolicy job will run successfully. The IndexUpdate job can be run several times daily (for example, every six hours) to keep the index current and prevent the snapshots from getting large.
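For example, to build (or refresh) the index and then apply file pool policies on demand:

# Run the index update first, then the changelist-based policy job:
isi job jobs start IndexUpdate
isi job jobs start FilePolicy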

Consider using the FilePolicy job with the job schedules below for workflows and datasets with the following characteristics:

  • Data with long retention times
  • Large number of small files
  • Path-based File Pool filters configured
  • Where the FSAnalyze job is already running on the cluster (InsightIQ monitored clusters)
  • There is already a SnapshotIQ schedule configured
  • When the SmartPools job typically takes a day or more to run to completion at LOW impact

For clusters without the characteristics described above, the recommendation is to continue running the SmartPools job as usual and not to activate the FilePolicy job.

The following table provides a suggested job schedule when deploying FilePolicy:

Job           Schedule                      Impact   Priority
FilePolicy    Every day at 22:00            LOW      6
IndexUpdate   Every six hours, every day    LOW      5
SmartPools    Monthly (Sunday at 23:00)     LOW      6

Because no two clusters are the same, this suggested job schedule may require additional tuning to meet the needs of a specific environment.
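As a sketch, the schedules could be applied from the CLI like this (the schedule string syntax should be verified for your OneFS release):

# Apply the suggested schedules (verify schedule-string syntax):
isi job types modify FilePolicy --schedule "every day at 22:00"
isi job types modify IndexUpdate --schedule "every 6 hours"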

Note that when clusters running older OneFS versions and the FSAnalyze job are upgraded to OneFS 8.2.x or later, the legacy FSAnalyze index and snapshots are removed and replaced by new snapshots the first time that IndexUpdate is run. The new index stores considerably more file and snapshot attributes than the old FSA index. Until the IndexUpdate job effects this change, FSA keeps running on the old index and snapshots.

Author: Nick Trimbee

PowerScale OneFS CloudPools

Preparations for Upgrading a CloudPools Environment

Jason He

Thu, 23 Jun 2022 15:44:20 -0000


Introduction

CloudPools 2.0 brings many improvements and is released along with OneFS 8.2.0. It’s valuable to be able to upgrade OneFS from 8.x to 8.2.x or later and leverage the data management benefits of CloudPools 2.0.

This blog describes the preparations for upgrading a CloudPools environment. The purpose is to avoid potential issues when upgrading OneFS from 8.x to 8.2.x or later (that is, from CloudPools 1.0 to CloudPools 2.0).

For the recommended procedure for upgrading a CloudPools environment, see the document PowerScale CloudPools: Upgrading 8.x to 8.2.2.x or later.

For the best practices and considerations for CloudPools upgrades, see the white paper Dell PowerScale: CloudPools and ECS.

This blog covers the preparations both on cloud providers and on PowerScale clusters.

Cloud providers

CloudPools is a OneFS feature that allows customers to archive or tier data from a PowerScale cluster to cloud storage, including public cloud providers such as Amazon Web Services (AWS), Microsoft Azure, Google Cloud, Alibaba Cloud, or a private cloud based on Dell ECS.

Important: Run the isi cloud accounts list command to verify which cloud providers are used for CloudPools. Different cloud providers use different authentication methods with CloudPools, which might cause potential issues when upgrading a CloudPools environment.

AWS signature authentication is used for AWS, Dell ECS, and Google Cloud. In OneFS releases prior to 8.2, only AWS SigV2 is supported for CloudPools. Starting with OneFS 8.2, AWS SigV4 support is added, which provides an extra level of security for authentication with an enhanced algorithm. For more information about V4, see Authenticating Requests: AWS Signature V4. AWS SigV4 is used automatically for CloudPools in OneFS 8.2.x or later if the configurations (CloudPools and cloud providers) are correct. Please note that a different authentication method is used for Azure or Alibaba Cloud.

If public cloud providers are used in a customer’s environment, there should be no issues because all configurations are already created by public cloud providers.

If Dell ECS is used in a customer’s environment, the ECS configurations are implemented separately, and you need to make sure that the configurations are correct on ECS, including the load balancer and Domain Name System (DNS).

This section only covers the preparations for CloudPools and Dell ECS before upgrading OneFS from 8.x to 8.2.x or later.

Dell ECS

In general, CloudPools may already have archived a lot of data from a PowerScale (Isilon) cluster to ECS before OneFS is upgraded from 8.x to 8.2.x or later. That means that most of the configurations should already exist for CloudPools. For more information about CloudPools and ECS, see the white paper Dell PowerScale: CloudPools and ECS.

This section covers the following configurations for ECS before a OneFS upgrade from 8.x to 8.2.x or later.

  • Load balancer
  • DNS
  • Base URL

Load balancer

A load balancer balances traffic to the various ECS nodes from the PowerScale cluster, and can provide much better performance and throughput for CloudPools. A load balancer is strongly recommended for CloudPools 2.0 and ECS. Dell publishes white papers that describe how to implement a load balancer with ECS.

DNS

AWS always has a wildcard DNS record configured. See the document Virtual hosting of buckets, which introduces path-style access and virtual hosted-style access for a bucket. It also shows how to associate a hostname with an Amazon S3 bucket using CNAMEs for virtual hosted-style access.

Meanwhile, the path-style URL will be deprecated on September 23, 2022. Buckets created after that date must be referenced using the virtual-hosted model. For the reasons behind moving to the virtual-hosted model, see the document Amazon S3 Path Deprecation Plan – The Rest of the Story.

ECS supports Amazon S3 compatible applications that use virtual host-style and path-style addressing schemes. (For more information, see the document Bucket and namespace addressing.) To help ensure the proper DNS configuration for ECS, see the document DNS configuration.

The procedure for configuring DNS depends on your DNS server or DNS provider.

For example, suppose DNS is set up on a Windows server. The following two tables show the DNS entries created for this environment. Customers must create their own DNS entries to match their environments.

Name   Record Type   FQDN             IP Address     Comment
ecs    A             ecs.demo.local   192.168.1.40   The FQDN of the load balancer will be ecs.demo.local.

Name               Record Type   FQDN                          FQDN for target host   Comment
cloudpools_uri     CNAME         cloudpools_uri.demo.local     ecs.demo.local         If you create an SSL certificate for the ECS S3 service, include both the wildcard and non-wildcard names as Subject Alternative Names.
*.cloudpools_uri   CNAME         *.cloudpools_uri.demo.local   ecs.demo.local         Used for virtual host addressing for a bucket.

Base URL

In CloudPools 2.0 and ECS, a base URL must be created on ECS. For details about creating a Base URL on ECS, see the section Appendix A Base URL in the white paper Dell PowerScale: CloudPools and ECS.

When creating a new Base URL, keep the default setting (No) for Use with Namespace. Make sure that the Base URL is the FQDN alias of the load balancer virtual IP.

PowerScale clusters

If SyncIQ is configured for CloudPools, run the following commands on the source and target PowerScale clusters to check and record the CloudPools configurations, including CloudPools storage accounts, CloudPools, file pool policies, and SyncIQ policies.

# isi cloud accounts list -v
# isi cloud pools list -v
# isi filepool policies list -v
# isi sync policies list -v

For CloudPools and ECS, make sure that the URI is the FQDN alias of the load balancer virtual IP.

Important: It is strongly recommended that no job (such as for CloudPools/SmartPools, SyncIQ, and NDMP) be running before upgrading.  

In a SyncIQ environment, upgrade the SyncIQ target cluster before upgrading the source cluster. OneFS allows SyncIQ to send CP1.0 formatted SmartLink files to the target, where they will be converted into CP2.0 formatted SmartLink files. (If the source cluster is upgraded first, Sync operations will fail until both are upgraded; the only known resolution is to reconfigure the Sync policy to "Deep Copy".)

A customer may also have active (read and write) CloudPools accounts on both the source and target PowerScale clusters, replicating SmartLink files bidirectionally. In that case, each cluster is both a source and a target. You then need to reconfigure the SyncIQ policy to “Deep Copy” on one of the clusters; after that, the cluster that receives replicated SmartLink files should be upgraded first.

Summary

This blog covered what you need to check, on cloud providers and PowerScale clusters, before upgrading OneFS from 8.x to 8.2.x or later (that is, from CloudPools 1.0 to CloudPools 2.0). My hope is that it can help you avoid potential CloudPools issues when upgrading a CloudPools environment.

Author: Jason He, Principal Engineering Technologist

Read Full Blog
Isilon security PowerScale OneFS

PowerScale Now Supports Secure Boot Across More Platforms

Aqib Kazi

Tue, 21 Jun 2022 19:55:15 -0000

|

Read Time: 0 minutes

Dell PowerScale OneFS 9.3.0.0 first introduced support for Secure Boot on the Dell Isilon A2000 platform. Now, OneFS 9.4.0.0 expands that support across the PowerScale A300, A3000, B100, F200, F600, F900, H700, H7000, and P100 platforms.

Secure Boot was introduced by the Unified Extensible Firmware Interface (UEFI) Forum as part of the UEFI 2.3.1 specification. The goal of Secure Boot is to ensure device security in the preboot environment by allowing only authorized EFI binaries to be loaded during the boot process.

The operating system boot loaders are signed with a digital signature. PowerScale Secure Boot takes the UEFI framework further by also covering the OneFS kernel and modules. Within UEFI Secure Boot, the UEFI infrastructure is responsible for EFI signature validation and binary loading, while the FreeBSD veriexec function performs signature validation for the boot loader and kernel. The PowerScale Secure Boot feature runs only during the node’s bootup process, using public-key cryptography to verify the signed code and ensure that only trusted code is loaded on the node.
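
Conceptually, each verification step works like the following Python sketch, which uses the third-party cryptography package. This is purely an illustration of public-key signature checking under assumed RSA keys; it is not the OneFS, UEFI, or veriexec implementation:

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

def boot_binary_is_trusted(pubkey_pem: bytes, binary: bytes, signature: bytes) -> bool:
    # Returns True only if the signature over the binary verifies under
    # the trusted public key; tampered or unsigned code is rejected.
    public_key = serialization.load_pem_public_key(pubkey_pem)
    try:
        public_key.verify(signature, binary, padding.PKCS1v15(), hashes.SHA256())
        return True
    except InvalidSignature:
        return False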

Supported platforms

PowerScale Secure Boot is available on the following platforms:

Platform                                                            NFP version     OneFS release
Isilon A2000                                                        11.4 or later   9.3.0.0 or later
PowerScale A300, A3000, B100, F200, F600, F900, H700, H7000, P100   11.4 or later   9.4.0.0 or later

Considerations

Before configuring the PowerScale Secure Boot feature, consider the following:

  • Isilon and PowerScale nodes are not shipped with PowerScale Secure Boot enabled. However, you can enable the feature to meet site requirements.
  • A PowerScale cluster composed of a mix of PowerScale Secure Boot enabled nodes and disabled nodes is supported.
  • A license is not required for PowerScale Secure Boot because the feature is natively supported.
  • At any point, you can enable or disable the PowerScale Secure Boot feature.
  • Plan a maintenance window to enable or disable the PowerScale Secure Boot feature, because a node reboot is required during the process to toggle the feature.
  • The PowerScale Secure Boot feature does not impact cluster performance, because the feature is only run at bootup.

Configuration

For more information about configuring the PowerScale Secure Boot feature, see the document Dell PowerScale OneFS Secure Boot.


Author: Aqib Kazi


Read Full Blog
PowerScale OneFS

OneFS SnapRevert Job

Nick Trimbee

Tue, 21 Jun 2022 19:44:06 -0000

|

Read Time: 0 minutes

There have been a couple of recent inquiries from the field about the SnapRevert job.

For context, SnapRevert is one of three main methods for restoring data from a OneFS snapshot. The options are shown here: 

Method   Description
Copy     Copying specific files and directories directly from the snapshot
Clone    Cloning a file from the snapshot
Revert   Reverting the entire snapshot using the SnapRevert job

However, the most efficient of these approaches is the SnapRevert job, which automates the restoration of an entire snapshot to its top-level directory. This allows for quickly reverting to a previous, known-good recovery point (for example, if there is a virus outbreak). The SnapRevert job can be run from the Job Engine WebUI, and requires adding the desired snapshot ID.

 

There are two main components to SnapRevert:

  • The file system domain that the objects are put into.
  • The job that reverts everything back to what’s in a snapshot.

So, what exactly is a SnapRevert domain? At a high level, a domain defines a set of behaviors for a collection of files under a specified directory tree. The SnapRevert domain is described as a restricted writer domain, in OneFS parlance. Essentially, this is a piece of extra filesystem metadata and associated locking that prevents a domain’s files from being written to while restoring a last known good snapshot.

Because the SnapRevert domain is essentially just a metadata attribute placed onto a file/directory, a best practice is to create the domain before there is data. This avoids having to wait for DomainMark (the aptly named job that marks a domain’s files) to walk the entire tree, setting that attribute on every file and directory within it.

The SnapRevert job itself actually uses a local SyncIQ policy to copy data out of the snapshot, discarding any changes to the original directory. When the SnapRevert job completes, the original data is left in the directory tree. In other words, after the job completes, the file system (HEAD) is exactly as it was at the point in time that the snapshot was taken. The LINs for the files or directories do not change because what is there is not a copy.

To manually run SnapRevert, go to the OneFS WebUI > Cluster Management > Job Operations > Job Types > SnapRevert, and click the Start Job button.

Also, you can adjust the job’s impact policy and relative priority, if desired.

Before a snapshot is reverted, SnapshotIQ creates a point-in-time copy of the data that is being replaced. This enables the snapshot revert to be undone later, if necessary.

Also, individual files, rather than entire snapshots, can also be restored in place using the isi_file_revert command-line utility.

# isi_file_revert
usage:
isi_file_revert -l lin -s snapid
isi_file_revert -p path -s snapid
-d (debug output)
-f (force, no confirmation)

This can help drastically simplify virtual machine management and recovery, for example.

Before creating snapshots, it is worth considering that reverting a snapshot requires that a SnapRevert domain exist for the directory that is being restored. As such, we recommend that you create SnapRevert domains for those directories while the directories are empty. Creating a domain for an empty (or sparsely populated) directory takes considerably less time.

Files may belong to multiple domains. Each file stores a set of domain IDs indicating which domain they belong to in their inode’s extended attributes table. Files inherit this set of domain IDs from their parent directories when they are created or moved. The domain IDs refer to domain settings themselves, which are stored in a separate system B-tree. These B-tree entries describe the type of the domain (flags), and various other attributes.

As mentioned, a Restricted-Write domain prevents writes to any files except by threads that are granted permission to do so. A SnapRevert domain that does not currently enforce Restricted-Write shows up as (Writable) in the CLI domain listing.

Occasionally, a domain will be marked as (Incomplete). This means that the domain will not enforce its specified behavior. Domains created by the job engine are incomplete if not all files that are part of the domain are marked as being members of that domain. Since each file contains a list of domains of which it is a member, that list must be kept up to date for each file. The domain is incomplete until each file’s domain list is correct.

Besides SnapRevert, OneFS also uses domains for SyncIQ replication and SnapLock immutable archiving.

A SnapRevert domain must be created on a directory before it can be reverted to a particular point in time snapshot. As mentioned before, we recommend creating SnapRevert domains for a directory while the directory is empty.

The root path of the SnapRevert domain must be the same root path of the snapshot. For instance, a domain with a root path of /ifs/data/marketing cannot be used to revert a snapshot with a root path of /ifs/data/marketing/archive.

For example, for the snapshot DailyBackup_04-27-2021_12:00, which is rooted at /ifs/data/marketing, you would perform the following:

1. Set the SnapRevert domain by running the DomainMark job (which marks all files).

# isi job jobs start domainmark --root /ifs/data/marketing --dm-type SnapRevert

2. Verify that the domain has been created.

# isi_classic domain list -l

To restore a directory back to the state it was in at the point in time when a snapshot was taken, you need to:

  • Create a SnapRevert domain for the directory
  • Create a snapshot of a directory

 To accomplish this, do the following:

1. Identify the ID of the snapshot you want to revert by running the isi snapshot snapshots view command and picking your point in time (PIT).

For example:

# isi snapshot snapshots view DailyBackup_04-27-2021_12:00
ID: 38
Name: DailyBackup_04-27-2021_12:00
Path: /ifs/data/marketing
Has Locks: No
Schedule: daily
Alias: -
Created: 2021-04-27T12:00:05
Expires: 2021-08-26T12:00:00
Size: 0b
Shadow Bytes: 0b
% Reserve: 0.00%
% Filesystem: 0.00%
State: active

2. Revert to a snapshot by running the isi job jobs start command. The following command reverts to snapshot ID 38 named DailyBackup_04-27-2021_12:00.

# isi job jobs start snaprevert --snapid 38

You can also perform this action from the WebUI. Go to Cluster Management > Job Operations > Job Types > SnapRevert, and click the Start Job button.

OneFS automatically creates a snapshot before the SnapRevert process reverts the specified directory tree. The naming convention for these snapshots is of the form: <snapshot_name>.pre_revert.*

# isi snap snap list | grep pre_revert
39 DailyBackup_04-27-2021_12:00.pre_revert.1655328160 /ifs/data/marketing

This allows for an easy rollback of a SnapRevert if the desired results are not achieved.

If a domain is currently preventing the modification or deletion of a file, a protection domain cannot be created on a directory that contains that file. For example, if files under /ifs/data/smartlock are set to a WORM state by a SmartLock domain, OneFS will not allow a SnapRevert domain to be created on /ifs/data/.

If desired or required, SnapRevert domains can also be deleted using the job engine CLI. For example, to delete the SnapRevert domain at /ifs/data/marketing:

# isi job jobs start domainmark --root /ifs/data/marketing --dm-type SnapRevert --delete

 

Author: Nick Trimbee

Read Full Blog
PowerScale OneFS data access

Data Access in OneFS - Part 1: Introduction to OneFS File Permissions

Lieven Lin

Thu, 16 Jun 2022 20:29:24 -0000

|

Read Time: 0 minutes

About this blog series

Have you ever been confused about PowerScale OneFS file system multi-protocol data access? If so, this blog series will help you out as we try to demystify it. Different network attached storage vendors have different designs for implementing multi-protocol data access. With OneFS, you can access the same set of data consistently from different operating systems over different protocols.

To make it simple, the overall data access process in OneFS includes:

  1. When a client user tries to access OneFS cluster data by means of protocols (such as SMB, NFS, and S3), OneFS must first authenticate the client user.
  2. When the authentication succeeds, OneFS checks whether the user has permission on the file share (SMB share, NFS export, or S3 bucket, depending on the access protocol).
  3. Only when the user is authorized to have permission on the file shares will OneFS apply user mapping rules and generate an access token for the user in most cases. The access token contains the following information:
  • The user's Security Identifier (SID), User Identifier (UID), and Group Identifier (GID).
  • The user's supplemental groups
  • The user's role-based access control (RBAC) privileges

Finally, OneFS enforces the permissions on the target data for the user. This process evaluates the file permissions based on the user's access token and file share level permissions.

Does it sound simple, but are some details still confusing? For example: what exactly are UIDs, GIDs, and SIDs? What’s an access token? How does OneFS evaluate file permissions? Don’t worry if you are not familiar with these concepts. Keep reading and we’ll explain!

To make it easier, we will start with OneFS file permissions, and then introduce OneFS access tokens. Finally, we will see how data access depends on the protocol you use.

In this blog series, we’ll cover the following topics:

  • Data Access in OneFS - Part 1: Introduction to OneFS File Permissions
  • Data Access in OneFS - Part 2: Introduction to OneFS Access Tokens
  • Data Access in OneFS - Part 3: Why Use Different Protocols?
  • Data Access in OneFS - Part 4: Using NFSv3 and NFSv4.x
  • Data Access in OneFS - Part 5: Using SMB
  • Data Access in OneFS - Part 6: Using S3
  • More to add…

Now let's have a look at OneFS file permissions. In a multi-protocol environment, the OneFS operating system is designed to support basic POSIX mode bits and Access Control Lists (ACLs). Therefore, two file permission states are designated:

  • POSIX mode bits - authoritative with a synthetic ACL
  • OneFS ACL - authoritative with approximate POSIX mode bits

POSIX mode bits - authoritative with a synthetic ACL

POSIX mode bits only define three specific permissions: read(r), write(w), and execute(x). Meanwhile, there are three classes to which you can assign permissions: Owner, Group, and Others.

  • Owner: represents the owner of a file/directory.
  • Group: represents the group of a file/directory.
  • Others: represents the users who are not the owner, nor a member of the group.

The ls -le command displays a file’s permissions; the ls -led command displays a directory’s permissions. If a file has these permissions:

-rw-rw-r--

then:

  • the first triplet (rw-) means that the owner has read and write permissions
  • the second triplet (rw-) means that the group has read and write permissions
  • the third triplet (r--) means that all others have only read permissions
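
As a quick aside, Python’s standard stat module decodes a numeric mode into the same notation, which is handy for checking your reading of a mode string (a generic illustration, not a OneFS tool):

import stat

# 0o100000 marks a regular file; 0o664 is owner rw-, group rw-, others r--
print(stat.filemode(0o100664))   # prints: -rw-rw-r--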

In the following example for the file posix-file.txt, the file owner Joe has read and write access permissions, the file group Market has read and write access permissions, and all others only have read access permissions.

Also displayed here is the synthetic ACL (shown beneath the SYNTHETIC ACL flag), which indicates that the file is in the POSIX mode bits permission state. Three Access Control Entries (ACEs) are created for the synthetic ACL, which are simply another way of representing the file’s POSIX mode bits permissions.

vonefs-aima-1# ls -le posix-file.txt
-rw-rw-r--     1 Joe  Market   65 May 28 02:08 posix-file.txt
 OWNER: user:Joe
 GROUP: group:Market
 SYNTHETIC ACL
 0: user:Joe allow file_gen_read,file_gen_write,std_write_dac
 1: group:Market allow file_gen_read,file_gen_write
 2: everyone allow file_gen_read

When OneFS receives a user access request, it generates an access token for the user and compares the token to the file permissions – in this case, the POSIX mode bits.  

OneFS ACL authoritative with approximate POSIX mode bits

In contrast to POSIX mode bits, OneFS ACLs support more expressive permissions. (For the full list of available permissions, see Tables 1 through 3 in the document Access Control Lists on Dell EMC PowerScale OneFS.) A OneFS ACL consists of one or more Access Control Entries (ACEs). A OneFS ACE contains the following information:

  • ACE index: indicates the ACE’s order in the ACL
  • Identity type: the type of identity; supported types are user, group, everyone, creator_owner, creator_group, and owner_rights
  • Identity ID: the UID/GID/SID stored on disk (OneFS stores IDs rather than user or group names; the name of a user or group is for display only)
  • ACE type: whether the ACE allows or denies
  • ACE permissions and inheritance flags: a list of permissions and inheritance flags separated by commas

For example, the ACE "0: group:Engineer allow file_gen_read,file_gen_execute" indicates that its index is 0, and allows the group called Engineer to have file_gen_read and file_gen_execute access permissions.

The following example shows a full ACL for a file. Although there is no SYNTHETIC ACL flag, there is a "+" character following the POSIX mode bits that indicates that the file is in the OneFS real ACL state. The file’s OneFS ACL grants full permission to users Joe and Bob. It also grants file_gen_read and file_gen_execute permissions to the group Market and to everyone. In this case, the POSIX mode bits are for representation only: you cannot tell the accurate file permissions from the approximate POSIX mode bits. You should therefore always rely on the OneFS ACL to check file permissions.

vonefs-aima-1# ls -le acl-file.txt
-rwxrwxr-x +   1 Joe  Market   69 May 28 01:08 acl-file.txt
 OWNER: user:Joe
 GROUP: group:Market
 0: user:Joe allow file_gen_all
 1: group:Market allow file_gen_read,file_gen_execute
 2: user:Bob allow file_gen_all
 3: everyone allow file_gen_read,file_gen_execute

No matter the OneFS file permission state, the on-disk identity for a file is always a UID, a GID, or an SID. So, for the above two files, file permissions stored on disk are:

vonefs-aima-1# ls -len posix-file.txt
-rw-rw-r--     1 2001  2003   65 May 28 02:08 posix-file.txt
 OWNER: user:2001
 GROUP: group:2003
 SYNTHETIC ACL
 0: user:2001 allow file_gen_read,file_gen_write,std_write_dac
 1: group:2003 allow file_gen_read,file_gen_write
 2: SID:S-1-1-0 allow file_gen_read
 
vonefs-aima-1# ls -len acl-file.txt
-rwxrwxr-x +   1 2001  2003   69 May 28 01:08 acl-file.txt
 OWNER: user:2001
 GROUP: group:2003
 0: user:2001 allow file_gen_all
 1: group:2003 allow file_gen_read,file_gen_execute
 2: user:2002 allow file_gen_all
 3: SID:S-1-1-0 allow file_gen_read,file_gen_execute

When OneFS receives a user access request, it generates an access token for the user and compares the token to the file permissions. OneFS grants access when the file permissions include an ACE that allows the identity in the token to access the file, and does not include an ACE that denies the identity access.

When evaluating the file permissions against a user's access token, OneFS checks the ACEs one by one in index order and stops checking as soon as any of the following conditions is met:

  • All of the required permissions for the access request have been allowed by ACEs: the access request is authorized.
  • Any one of the required permissions is explicitly denied by an ACE: the access request is denied.
  • All ACEs have been checked, but not all required permissions have been allowed: the access request is denied.

Let’s say we have a file named acl-file01.txt with the file permissions shown below. When user Bob tries to read the file’s data, OneFS checks the ACEs from index 0 to index 3. At ACE index 1, Bob is explicitly denied read permission. Evaluation stops there, and read access is denied.

vonefs-aima-1# ls -le acl-file01.txt
-rwxrw-r-- +   1 Joe  Market   12 May 28 06:19 acl-file01.txt
 OWNER: user:Joe
 GROUP: group:Market
 0: user:Joe allow file_gen_all
 1: user:Bob deny file_gen_read
 2: user:Bob allow file_gen_read,file_gen_write
 3: everyone allow file_gen_read

Now let’s say that we still have the file named acl-file01.txt, but the file permissions are now slightly different, as shown below. When user Bob tries to read the file’s data, OneFS checks the ACEs from index 0 to index 3. At ACE index 1, Bob is explicitly allowed read permission, so evaluation ends there and read access is authorized. For this reason, it is recommended to put all “deny” ACEs ahead of “allow” ACEs if you want to explicitly deny specific permissions to specific users or groups. (The sketch after the following listing reproduces both evaluations.)

vonefs-aima-1# ls -le acl-file01.txt
-rwxrw-r-- +   1 Joe  Market   12 May 28 06:19 acl-file01.txt
 OWNER: user:Joe
 GROUP: group:Market
 0: user:Joe allow file_gen_all
 1: user:Bob allow file_gen_read,file_gen_write
 2: user:Bob deny file_gen_read
 3: everyone allow file_gen_read
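
The sketch below models both evaluations in Python. It is a simplified model of the first-match rules described above, not OneFS code; file_gen_all is expanded into individual permissions purely for illustration:

ALL = {"file_gen_read", "file_gen_write", "file_gen_execute"}

def evaluate(acl, user_identities, requested):
    needed = set(requested)
    for who, ace_type, perms in acl:              # ACEs are checked in index order
        if who not in user_identities:
            continue
        if ace_type == "deny" and needed & perms:
            return "denied"                       # explicit deny ends evaluation
        if ace_type == "allow":
            needed -= perms
            if not needed:
                return "allowed"                  # all required permissions allowed
    return "denied"                               # ACEs exhausted without full allowance

bob = {"user:Bob", "everyone"}

deny_first = [("user:Joe", "allow", ALL),
              ("user:Bob", "deny", {"file_gen_read"}),
              ("user:Bob", "allow", {"file_gen_read", "file_gen_write"}),
              ("everyone", "allow", {"file_gen_read"})]

allow_first = [("user:Joe", "allow", ALL),
               ("user:Bob", "allow", {"file_gen_read", "file_gen_write"}),
               ("user:Bob", "deny", {"file_gen_read"}),
               ("everyone", "allow", {"file_gen_read"})]

print(evaluate(deny_first, bob, {"file_gen_read"}))    # denied
print(evaluate(allow_first, bob, {"file_gen_read"}))   # allowed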

File permission state changes

As mentioned before, a file can only be in one permission state at a time. However, that state can be flipped. If a file is in the POSIX state, it can be flipped to the ACL state by modifying the permissions from an SMB or NFSv4 client, or by using the chmod command in OneFS. If a file is in the ACL state, it can be flipped to the POSIX state by using the OneFS CLI command chmod -b XXX <filename>, where XXX specifies the new POSIX permissions. For more examples, see File permission state changes.

Now, you should be able to check a file’s permission on OneFS with the command ls -len filename, and check a directory’s permissions on OneFS with the command ls -lend directory_name.

In my next blog, we will cover what an access token is and how to check a user’s access token!


Author: Lieven Lin

Read Full Blog
PowerScale data management OneFS data reduction

Understanding ‘Total inlined data savings’ When Using ’isi_cstats’

Yunlong Zhang

Thu, 12 May 2022 14:22:45 -0000

|

Read Time: 0 minutes

Recently, a customer contacted us because he thought there was an error in the output of the OneFS CLI command ‘isi_cstats’. Starting with OneFS 9.3, the ‘isi_cstats’ command includes the accounted number of inlined files within /ifs. It also contains a statistic called “Total inlined data savings”.

This customer expected the ‘Total inlined data savings’ number to simply be ‘Total inlined files’ multiplied by 8KB. He thought the reported number was wrong because it does not take the protection level into account.

In OneFS, for the 2d:1n protection level, each file smaller than 128KB is stored as 3X mirrors. Take the following example, in which ‘isi_cstats’ reports 379,948,336 total inlined files and 2,899GiB of total inlined data savings.

If we do the calculation:

379,948,336 * 8KiB = 3,039,586,688KiB = 2898.78GiB

we can see that the 2,899GiB from the command output is calculated as one 8KiB block per inlined file. So, in our example, the customer expected ‘Total inlined data savings’ to report 2898.78GiB * 3, because of the 2d:1n protection level.

Well, this statistic is not the actual savings; it is really the logical on-disk cost of all inlined files. We can't accurately report the physical savings because they depend on the non-inlined protection overhead, which can vary. For example:

  • If the protection level is 2d:1n, without the data inlining in 8KB inode feature, each of the inlined files would cost 8KB * 3.
  • If the protection level is 3d:1n1d, it will become 8KB * 4.
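
A quick Python sketch using the figures above makes the distinction concrete (the mirror multipliers come from the protection levels just listed; compression is ignored here, as the statistic itself ignores it):

INLINED_FILES = 379_948_336          # 'Total inlined files' from the output above
BLOCK = 8 * 1024                     # each inlined file would otherwise cost one 8KiB block
GiB = 1024 ** 3

logical_cost = INLINED_FILES * BLOCK          # what 'Total inlined data savings' reports
print(logical_cost / GiB)                     # ~2898.78 -> the 2,899GiB figure

# Hypothetical physical savings, which depend on the protection level:
print(logical_cost * 3 / GiB)                 # 2d:1n   (3x mirroring of small files)
print(logical_cost * 4 / GiB)                 # 3d:1n1d (4x mirroring)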

One more thing to consider: if a file is smaller than 8KB after compression, it is inlined into an inode as well. This statistic therefore doesn't represent logical savings either, because it doesn't take compression into account. To report the logical savings, the total logical size of all inlined files would have to be tracked.

To avoid any confusion, we plan to rename this statistic to “Total inline data” in the next version of OneFS. We also plan to show more useful information about total logical data of inlined files, in addition to “Total inline data”.

For more information about the reporting of data reduction features, see the white paper PowerScale OneFS: Data Reduction and Storage Efficiency on the Info Hub.

Author: Yunlong Zhang, Principal Engineering Technologist

Read Full Blog
PowerScale data management OneFS

OneFS Data Reduction and Efficiency Reporting

Nick Trimbee

Wed, 04 May 2022 14:36:26 -0000

|

Read Time: 0 minutes

Among the objectives of OneFS reduction and efficiency reporting is to provide ‘industry standard’ statistics, allowing easier comprehension of cluster efficiency. It’s an ongoing process, and prior to OneFS 9.2 there was limited tracking of certain filesystem statistics – particularly application physical and filesystem logical – which meant that data reduction and storage efficiency ratios had to be estimated. This is no longer the case, and OneFS 9.2 and later provides accurate data reduction and efficiency metrics at a per-file, quota, and cluster-wide granularity.

The following list describes the various OneFS reporting metrics, while also attempting to rationalize their naming conventions with other general industry terminology:

  • Protected logical (also known as application logical): Data size including sparse data, zero block eliminated data, and CloudPools data stubbed to a cloud tier.
  • Logical data (also known as effective or filesystem logical): Data size excluding protection overhead and sparse data, and including data efficiency savings (compression and deduplication).
  • Zero-removal saved: Capacity savings from zero removal.
  • Dedupe saved: Capacity savings from deduplication.
  • Compression saved: Capacity savings from in-line compression.
  • Preprotected physical (also known as usable or application physical): Data size excluding protection overhead and including storage efficiency savings.
  • Protection overhead: Size of the erasure coding used to protect data.
  • Protected physical (also known as raw or filesystem physical): Total footprint of data including protection overhead (FEC erasure coding) and excluding data efficiency savings (compression and deduplication).
  • Dedupe ratio: Deduplication ratio; displayed as 1.0:1 if there are no deduplicated blocks on the cluster.
  • Compression ratio: Usable reduction ratio from compression, calculated by dividing ‘logical data’ by ‘preprotected physical’ and expressed as x:1.
  • Inlined data ratio: Efficiency ratio from storing small files’ data within their inodes, thereby not requiring any data or protection blocks for their storage.
  • Data reduction ratio (effective to usable): Usable efficiency ratio from compression and deduplication; displays the same value as the compression ratio if there is no deduplication on the cluster.
  • Efficiency ratio (effective to raw): Overall raw efficiency ratio, expressed as x:1.

So let’s take these metrics and look at what they represent and how they’re calculated.

  • Application logical, or protected logical, is the application data that can be written to the cluster, irrespective of where it’s stored.
  • Removing the sparse data from application logical results in filesystem logical, also known simply as logical data or effective. This can be data that was always sparse, was zero block eliminated, or data that has been tiered off-cluster by means of CloudPools, and so on.

  (Note that filesystem logical was not accurately tracked in releases prior to OneFS 9.2, so metrics prior to this were somewhat estimated.)

  • Next, data reduction techniques such as compression and deduplication further reduce filesystem logical to application physical, or pre-protected physical. This is the physical size of the application data residing on the filesystem drives, and does not include metadata, protection overhead, or data moved to the cloud.

  • Filesystem physical is application physical with data protection overhead added – including inode, mirroring, and FEC blocks. Filesystem physical is also referred to as protected physical.

  • The data reduction ratio is the amount that’s been reduced from the filesystem logical down to the application physical.

  • Finally, the storage efficiency ratio is the filesystem logical divided by the filesystem physical.
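
The chain of metrics can be summarized with a toy Python calculation; the numbers are invented purely to show how each metric derives from the previous one:

# Illustrative figures only (TiB):
application_logical = 100.0                   # protected logical
sparse_data = 10.0
filesystem_logical = application_logical - sparse_data        # 'logical data' (effective)

reduction_savings = 40.0                      # compression + deduplication
application_physical = filesystem_logical - reduction_savings    # preprotected physical (usable)

protection_overhead = 20.0                    # inode, mirroring, and FEC blocks
filesystem_physical = application_physical + protection_overhead # protected physical (raw)

data_reduction_ratio = filesystem_logical / application_physical # 90 / 50 = 1.8 : 1
efficiency_ratio = filesystem_logical / filesystem_physical      # 90 / 70 ≈ 1.29 : 1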

With the enhanced data reduction reporting in OneFS 9.2 and later, the actual statistics themselves are largely the same, just calculated more accurately.

The storage efficiency data was available in releases prior to OneFS 9.2, albeit somewhat estimated, but the data reduction metrics were introduced with OneFS 9.2.

The following tools are available to query these reduction and efficiency metrics at file, quota, and cluster-wide granularity:

Realm

OneFS Command

OneFS Platform API

File

isi get -D


Quota

isi quota list -v

12/quota/quotas

Cluster-wide

isi statistics data-reduction

1/statistics/current?key=cluster.data.reduce.*

Detailed Cluster-wide

isi_cstats

1/statistics/current?key=cluster.cstats.*

Note that the ‘isi_cstats’ CLI command provides some additional, behind-the-scenes details. The interface goes through platform API to fetch these stats.

The ‘isi statistics data-reduction’ CLI command is the most comprehensive of the data reduction reporting CLI utilities. For example:

# isi statistics data-reduction
                      Recent Writes Cluster Data Reduction
                           (5 mins)
--------------------- ------------- ----------------------
Logical data                  6.18M                  6.02T
Zero-removal saved                0                      -
Deduplication saved          56.00k                  3.65T
Compression saved             4.16M                  1.96G
Preprotected physical         1.96M                  2.37T
Protection overhead           5.86M                910.76G
Protected physical            7.82M                  3.40T
Zero removal ratio         1.00 : 1                      -
Deduplication ratio        1.01 : 1               2.54 : 1
Compression ratio          3.12 : 1               1.02 : 1
Data reduction ratio       3.15 : 1               2.54 : 1
Inlined data ratio         1.04 : 1               1.00 : 1
Efficiency ratio           0.79 : 1               1.77 : 1

The ‘recent writes’ data in the first column provides precise statistics for the five-minute period prior to running the command. By contrast, the ‘cluster data reduction’ metrics in the second column are slightly less real-time but reflect the overall data and efficiencies across the cluster. Be aware that, in OneFS 9.1 and earlier, the right-hand column metrics are designated by the ‘Est’ prefix, denoting an estimated value. However, in OneFS 9.2 and later, the ‘logical data’ and ‘preprotected physical’ metrics are tracked and reported accurately, rather than estimated.

The ratio data in each column is calculated from the values above it. For instance, to calculate the data reduction ratio, the ‘logical data’ (effective) value is divided by the ‘preprotected physical’ (usable) value. From the output above, this would be:

6.02 / 2.37 = 2.54              Or a data reduction ratio of 2.54:1

Similarly, the ‘efficiency ratio’ is calculated by dividing the ‘logical data’ (effective) by the ‘protected physical’ (raw) value. From the output above, this yields:

6.02 / 3.40 = 1.77              Or an efficiency ratio of 1.77:1

OneFS SmartQuotas reports the capacity saving from in-line data reduction as a storage efficiency ratio. SmartQuotas reports efficiency as a ratio across the desired data set as specified in the quota path field. The efficiency ratio is for the full quota directory and its contents, including any overhead, and reflects the net efficiency of compression and deduplication. On a cluster with licensed and configured SmartQuotas, this efficiency ratio can be easily viewed from the WebUI by navigating to File System > SmartQuotas > Quotas and Usage. In OneFS 9.2 and later, in addition to the storage efficiency ratio, the data reduction ratio is also displayed. 

Similarly, the same data can be accessed from the OneFS command line by using the ‘isi quota quotas list’ CLI command. For example:

# isi quota quotas list
Type    AppliesTo   Path  Snap  Hard   Soft  Adv  Used   Reduction  Efficiency
----------------------------------------------------------------------------
directory DEFAULT    /ifs   No    -     -      -    6.02T 2.54 : 1   1.77 : 1
----------------------------------------------------------------------------

Total: 1

More detail, including both the physical (raw) and logical (effective) data capacities, is also available by using the ‘isi quota quotas view <path> <type>’ CLI command. For example:

# isi quota quotas view /ifs directory
                        Path: /ifs
                        Type: directory
                   Snapshots: No
                    Enforced: No
                   Container: No
                      Linked: No
                       Usage
                           Files: 5759676
         Physical(With Overhead): 6.93T
        FSPhysical(Deduplicated): 3.41T
         FSLogical(W/O Overhead): 6.02T
        AppLogical(ApparentSize): 6.01T
                   ShadowLogical: -
                    PhysicalData: 2.01T
                      Protection: 781.34G
     Reduction(Logical/Data): 2.54 : 1
Efficiency(Logical/Physical): 1.77 : 1

To configure SmartQuotas for in-line data efficiency reporting, create a directory quota at the top-level file system directory of interest, for example /ifs. Creating and configuring a directory quota is a simple procedure and can be performed from the WebUI by navigating to File System > SmartQuotas > Quotas and Usage and selecting Create a Quota. In the Create a quota dialog, set the Quota type to ‘Directory quota’, add the preferred top-level path to report on, select ’Application logical size’ for Quota Accounting, and set the Quota Limits to ‘Track storage without specifying a storage limit’. Finally, click the ‘Create Quota’ button to confirm the configuration and activate the new directory quota.

The efficiency ratio is a single, point-in-time efficiency metric that is calculated per quota directory and includes the sum of in-line compression, zero block removal, in-line dedupe, and SmartDedupe savings. This is in contrast to a history of statistics over time, as reported in the ‘isi statistics data-reduction’ CLI command output described above. As such, the efficiency ratio for the entire quota directory reflects what is actually there.

Author: Nick Trimbee

Read Full Blog
data storage PowerScale OneFS

OneFS In-line Dedupe

Nick Trimbee

Mon, 02 May 2022 18:43:40 -0000

|

Read Time: 0 minutes

Among the features and functionality delivered in the new OneFS 9.4 release is the promotion of in-line dedupe to enabled by default, further enhancing PowerScale’s dollar-per-TB economics, rack density and value.

Part of the OneFS data reduction suite, in-line dedupe initially debuted in OneFS 8.2.1. However, it was enabled manually, so many customers simply didn’t use it. But with this enhancement, new clusters running OneFS 9.4 now have in-line dedupe enabled by default.

Cluster configuration                                         In-line dedupe   In-line compression
New cluster running OneFS 9.4                                 Enabled          Enabled
New cluster running OneFS 9.3 or earlier                      Disabled         Enabled
Cluster with in-line dedupe enabled, upgraded to OneFS 9.4    Enabled          Enabled
Cluster with in-line dedupe disabled, upgraded to OneFS 9.4   Disabled         Enabled

That said, any clusters that upgrade to 9.4 will not see any change to their current in-line dedupe config during upgrade. There is also no change to the behavior of in-line compression, which remains enabled by default in all OneFS versions from 8.1.3 onwards.

But before we examine the under-the-hood changes in OneFS 9.4, let’s have a quick dedupe refresher.

Currently, OneFS in-line data reduction, which encompasses compression, dedupe, and zero block removal, is supported on the F900, F600, and F200 all-flash nodes, plus the F810, H5600, H700/7000, and A300/3000 Gen6.x chassis.

Within the OneFS data reduction pipeline, zero block removal is performed first, followed by dedupe, and then compression. This order allows each phase to reduce the scope of work for each subsequent phase.

Unlike SmartDedupe, which performs deduplication post-process, once data has been written to disk, in-line dedupe acts in real time, deduplicating data as it is ingested into the cluster. Storage efficiency is achieved by scanning the data for identical blocks as it is received and then eliminating the duplicates.

When in-line dedupe discovers a duplicate block, it moves a single copy of the block to a special set of files known as shadow stores. These are file-system containers that allow data to be stored in a sharable manner. As such, files stored under OneFS can contain both physical data and pointers, or references, to shared blocks in shadow stores.

Shadow stores are similar to regular files but are hidden from the file system namespace, so they cannot be accessed through a pathname. A shadow store typically grows to a maximum size of 2 GB, which is around 256 K blocks, and each block can be referenced by 32,000 files. If the reference count limit is reached, a new block is allocated, which may or may not be in the same shadow store. Also, shadow stores do not reference other shadow stores. And snapshots of shadow stores are not permitted because the data contained in shadow stores cannot be overwritten.

When a client writes a file to a node pool configured for in-line dedupe on a cluster, the write operation is divided up into whole 8 KB blocks. Each block is hashed, and its cryptographic ‘fingerprint’ is compared against an in-memory index for a match. At this point, one of the following will happen:

  1. If a match is discovered with an existing shadow store block, a byte-by-byte comparison is performed. If the comparison is successful, the data is removed from the current write operation and replaced with a shadow reference.
  2. When a match is found with another LIN, the data is written to a shadow store instead and is replaced with a shadow reference. Next, a work request is generated and queued that includes the location for the new shadow store block, the matching LIN and block, and the data hash. A byte-by-byte data comparison is performed to verify the match and the request is then processed.
  3. If no match is found, the data is written to the file natively and the hash for the block is added to the in-memory index.
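
As a rough model of that decision flow, consider the following Python sketch. OneFS uses CityHash fingerprints, shadow-store block references, and asynchronous verification work requests; SHA-256 and plain dictionaries stand in here purely for illustration:

import hashlib

index = {}          # per-node in-memory fingerprint index: fingerprint -> block
shadow_store = {}   # single shared copies of deduplicated blocks

def ingest_8k_block(block: bytes):
    fp = hashlib.sha256(block).digest()       # fingerprint the whole 8KiB block
    if fp in shadow_store and shadow_store[fp] == block:   # byte-by-byte verify
        return ("shadow_reference", fp)       # case 1: replace data with a reference
    if fp in index and index[fp] == block:    # match against another file's block
        shadow_store[fp] = block              # case 2: promote the block to a shadow store
        return ("shadow_reference", fp)
    index[fp] = block                         # case 3: no match; write natively and index
    return ("native_write", block)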

For in-line dedupe to perform on a write operation, the following conditions need to be true:

  • In-line dedupe must be globally enabled on the cluster.
  • The current operation is writing data (not a truncate or write zero operation).
  • The no_dedupe flag is not set on the file.
  • The file is not a special file type, such as an alternate data stream (ADS) or an EC (endurant cache) file.
  • Write data includes fully overwritten and aligned blocks.
  • The write is not part of a rehydrate operation.
  • The file has not been packed (containerized) by small file storage efficiency (SFSE).

OneFS in-line dedupe uses the 128-bit CityHash algorithm, which is fast but not cryptographically strong, which is why matching blocks are verified byte by byte before being shared (as described below). This contrasts with the OneFS post-process SmartDedupe, which uses SHA-1 hashing.

Each node in a cluster with in-line dedupe enabled has its own in-memory hash index that it compares block fingerprints against. The index lives in system RAM and is allocated using physically contiguous pages and is accessed directly with physical addresses. This avoids the need to traverse virtual memory mappings and does not incur the cost of translation lookaside buffer (TLB) misses, minimizing dedupe performance impact.

The maximum size of the hash index is governed by a pair of sysctl settings, one of which caps the size at 16 GB, and the other which limits the maximum size to 10% of total RAM. The strictest of these two constraints applies. While these settings are configurable, the recommended best practice is to use the default configuration. Any changes to these settings should only be performed under the supervision of Dell support.
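
In other words, the effective ceiling behaves like this small calculation (a sketch of the documented defaults; the actual sysctl names are not shown here):

GiB = 1024 ** 3

def dedupe_index_cap(total_ram: int) -> int:
    # The stricter of the two constraints applies: the 16GiB hard cap,
    # or 10% of the node's total RAM.
    return min(16 * GiB, total_ram // 10)

print(dedupe_index_cap(64 * GiB) / GiB)    # 64GiB node  -> 6.4GiB index cap
print(dedupe_index_cap(256 * GiB) / GiB)   # 256GiB node -> 16.0GiB (hard cap wins)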

Since in-line dedupe and SmartDedupe use different hashing algorithms, the indexes for each are not shared directly. However, the work performed by each dedupe solution can be used by each other. For instance, if SmartDedupe writes data to a shadow store, when those blocks are read, the read-hashing component of in-line dedupe sees those blocks and indexes them.

When a match is found, in-line dedupe performs a byte-by-byte comparison of each block to be shared to avoid the potential for a hash collision. Data is prefetched before the byte-by-byte check and is compared against the L1 cache buffer directly, avoiding unnecessary data copies and adding minimal overhead. Once the matching blocks are compared and verified as identical, they are shared by writing the matching data to a common shadow store and creating references from the original files to this shadow store.

In-line dedupe samples every whole block that is written and handles each block independently, so it can aggressively locate block duplicity. If a contiguous run of matching blocks is detected, in-line dedupe merges the results into regions and processes them efficiently.

In-line dedupe also detects dedupe opportunities from the read path, and blocks are hashed as they are read into L1 cache and inserted into the index. If an existing entry exists for that hash, in-line dedupe knows there is a block-sharing opportunity between the block it just read and the one previously indexed. It combines that information and queues a request to an asynchronous dedupe worker thread. As such, it is possible to deduplicate a data set purely by reading it all. To help mitigate the performance impact, the hashing is performed out-of-band in the prefetch path, rather than in the latency-sensitive read path.

The original in-line dedupe control path design had its limitations: it provided no gconfig control settings for the default-disabled in-line dedupe. In OneFS 9.4, there are now two separate features that interact to distinguish between a new cluster and an upgrade of an existing cluster configuration.

For the first feature, upon upgrade to 9.4 on an existing cluster, if there is no in-line dedupe config present, the upgrade explicitly sets it to disabled in gconfig. This has no effect on an existing cluster since it’s already disabled. Similarly, if the upgrading cluster already has an existing in-line dedupe setting in gconfig, OneFS takes no action.

For the other half of the functionality, when booting OneFS 9.4, a node looks in gconfig to see if there’s an in-line dedupe setting. If no config is present, OneFS enables it by default. Therefore, new OneFS 9.4 clusters automatically enable dedupe, and existing clusters retain their legacy setting upon upgrade.

Since the in-line dedupe configuration is binary (either on or off across a cluster), you can easily control it manually through the OneFS command line interface (CLI). As such, the isi dedupe inline settings modify CLI command can either enable or disable dedupe at will—before, during, or after the upgrade. It doesn’t matter.

For example, you can globally disable in-line dedupe and verify it using the following CLI command:

# isi dedupe inline settings view
Mode: enabled
# isi dedupe inline settings modify --mode disabled
# isi dedupe inline settings view
Mode: disabled

Similarly, the following syntax enables in-line dedupe:

# isi dedupe inline settings view
Mode: disabled
# isi dedupe inline settings modify --mode enabled
# isi dedupe inline settings view
Mode: enabled

While there are no visible userspace changes when files are deduplicated, if deduplication has occurred, both the disk usage and the physical blocks metrics reported by the isi get -DD CLI command are reduced. Also, at the bottom of the command’s output, the logical block statistics report the number of shadow blocks. For example:

Metatree logical blocks:    zero=260814 shadow=362 ditto=0 prealloc=0 block=2 compressed=0

In-line dedupe can also be paused from the CLI:

# isi dedupe inline settings modify --mode paused
# isi dedupe inline settings view
Mode: paused

It’s worth noting that this global setting states what you’d like to happen, after which each node attempts to enact the new configuration. It can’t guarantee the change, however, because not all node types support in-line dedupe. For example, the following output is from a heterogeneous cluster with an F200 three-node pool that supports in-line dedupe and an H400 four-node pool that does not.

Here, we can see that in-line dedupe is globally enabled on the cluster:

# isi dedupe inline settings view
Mode: enabled

However, you can use the isi_for_array isi_inline_dedupe_status command to display the actual setting and state of each node:

# isi dedupe inline settings view
Mode: enabled
# isi_for_array -s isi_inline_dedupe_status
1: OK Node setting enabled is correct
2: OK Node setting enabled is correct
3: OK Node setting enabled is correct
4: OK Node does not support inline dedupe and current is disabled
5: OK Node does not support inline dedupe and current is disabled
6: OK Node does not support inline dedupe and current is disabled
7: OK Node does not support inline dedupe and current is disabled

Any changes to the dedupe configuration are also logged to /var/log/messages; you can find them by grepping for inline_dedupe.

In a nutshell, in-line compression has always been enabled by default since its introduction in OneFS 8.1.3. For new clusters running 9.4 and above, in-line dedupe is on by default. For clusters running 9.3 and earlier, in-line dedupe remains disabled by default. And existing clusters that upgrade to 9.4 will not see any change to their current in-line dedupe config during upgrade.


Read Full Blog
PowerScale OneFS performance metrics

PowerScale Update: QLC Support, Incredible Performance and TCO

David Noy

Mon, 02 May 2022 13:52:08 -0000

|

Read Time: 0 minutes

Dell PowerScale is known for its exceptional feature set, which offers scalability, flexibility, and simplicity. Our customers frequently start with one workload, such as file share consolidation or mixed media storage, and then scale out OneFS to support all types of workloads, leveraging the simple, cloud-like single pool storage architecture.

To provide our customers with even more flexibility and choice, this summer we will introduce new quad-level cell (QLC) flash memory 15TB and 30TB drives for our PowerScale F900 and F600 all-flash models. And we are seeing up to 25% or more better performance for streaming reads, depending on workload, with all-flash nodes in the subsequent PowerScale OneFS release.1

Delivering latest-generation, Gen 2 QLC Support

With the many important and needed improvements in reliability and performance delivered by Gen 2 QLC technology, we’ve reached the optimal point in the development of QLC technology to deliver QLC flash drives for the PowerScale F900 and F600 all-flash models. These new QLC drives, supported by the currently shipping OneFS 9.4 release, will offer our customers incredible economics for fast NAS workloads that need both performance and capacity – such as financial modeling, media and entertainment, artificial intelligence (AI), machine learning (ML), and deep learning (DL). With 30TB QLC drive support, we are able to increase the raw density per node to 720TB for PowerScale F900 and 240TB for PowerScale F600 – and lower the cost of flash for our customers.  

OneFS.next Performance Boost 

Another emerging PowerScale feature of interest, targeted for an upcoming OneFS software release, is a major performance enhancement that will unlock streaming read throughput gains of up to 25% or more, depending on workload, for our flagship all-flash PowerScale F-series NVMe platforms.1 This significant performance boost will be of particular benefit to customers with high throughput, streaming read-heavy workloads, such as media and entertainment hi-res playout, ADAS for the automotive industry, and financial services high frequency, complex trading queries. Pairing nicely with the aforementioned performance boost is PowerScale’s support for NFS over RDMA (NFSoRDMA), which can further accelerate high throughput performance, especially for single connection and read intensive workloads such as machine learning – while also dramatically reducing both cluster and client CPU utilization. 

All Together Now

Further, these drives become part of the overall life cycle management system within OneFS. This gives PowerScale a major TCO advantage over the competition. In harmony with this forthcoming streaming performance enhancement, OneFS’s non-disruptive upgrade framework will enable existing PowerScale environments to seamlessly and non-disruptively up-rev their cluster software and enjoy this major performance boost on PowerScale F900 and F600 pools – free from any hardware addition, modification, reconfiguration, intervention, or downtime. 

These are just a few of the exciting things we have in the works for PowerScale, the world’s most flexible scale-out NAS solution.2

If you are attending Dell Technologies World, check out these sessions for more about our PowerScale innovations.  

  • Discover the latest Enhancements to PowerScale for Unstructured Storage Solutions
    • May 3 at 12 p.m. in Lando 4205
  • Improve Threat Detection, Isolation and Data Recovery with PowerScale Cyber Protection
    • May 3 or May 4 at 3 p.m. in Lando 4205
  • Top 10 Tips to Get More out of Your PowerScale Investment
    • May 3 at 12 p.m. in Palazzo I
  • Ask the Experts: Harness the Power of Your Unstructured Data
    • May 4 at 3 p.m. in Zeno 4601

_________________

1 Based on Dell internal testing, April 2022. Actual results will vary.

2 Based on internal Dell analysis of publicly available information, August 2021.

Author: David Noy, Vice President of Product Management, Unstructured Data Solutions and Data Protection Solutions, Dell Technologies



Read Full Blog
PowerScale OneFS

Announcing PowerScale OneFS 9.4!

Nick Trimbee

Mon, 04 Apr 2022 16:09:12 -0000

|

Read Time: 0 minutes

Arriving in time for Dell Technologies World 2022, the new PowerScale OneFS 9.4 release shipped on Monday 4th April 2022. 

OneFS 9.4 brings with it a wide array of new features and functionality, including:

  • SmartSync Data Mover: Introduction of a new OneFS SmartSync data mover, allowing flexible data movement and copying, incremental resyncs, push and pull data transfer, and one-time file to object copy. Complementary to SyncIQ, SmartSync provides an additional option for data transfer, including to object storage targets such as ECS, AWS, and Azure.
  • IB to Ethernet Backend Migration: Non-disruptive rolling InfiniBand to Ethernet back-end network migration for legacy Gen6 clusters.
  • Secure Boot: Secure boot support is extended to include the F900, F600, F200, H700/7000, and A700/7000 platforms.
  • Smarter SmartConnect Diagnostics: Identifies non-resolvable nodes and provides their detailed status, allowing the root cause to be easily pinpointed.
  • In-line Dedupe: In-line deduplication will be enabled by default on new OneFS 9.4 clusters. Clusters upgraded to OneFS 9.4 will maintain their current dedupe configuration.
  • Healthcheck Auto-updates: Automatic monitoring, download, and installation of new healthcheck packages as they are released.
  • CloudIQ Protocol Statistics: New protocol statistics ‘count’ keys are added, allowing performance to be measured over a specified time window and providing point-in-time protocol information.
  • SRS Alerts and CELOG Event Limiting: Prevents CELOG from sending unnecessary event types to Dell SRS and restricts CELOG alerts from customer-created channels.
  • CloudPools Statistics: Automated statistics gathering on CloudPools accounts and policies, providing insights for planning and troubleshooting CloudPools-related activities.

We’ll be taking a deeper look at some of these new features in blog articles over the course of the next few weeks. 

Meanwhile, the new OneFS 9.4 code is available for download on the Dell Online Support site, in both upgrade and reimage file formats. 

For upgrading existing clusters, the recommendation is to open a Service Request with Dell Online Support to schedule an upgrade. To provide a consistent and positive upgrade experience, Dell is offering assisted upgrades to OneFS 9.4 at no cost to customers with a valid support contract. Please refer to Knowledge Base article KB544296 for additional information on how to initiate the upgrade process.  

Enjoy your OneFS 9.4 experience!

Author: Nick Trimbee

Read Full Blog
PowerScale OneFS

OneFS Caching Hierarchy

Nick Trimbee

Tue, 22 Mar 2022 20:05:56 -0000

|

Read Time: 0 minutes

Caching occurs in OneFS at multiple levels, and for a variety of types of data. For this discussion we’ll concentrate on the caching of file system structures in main memory and on SSD.

OneFS’ caching infrastructure design is based on aggregating each individual node’s cache into one cluster wide, globally accessible pool of memory. This is done by using an efficient messaging system, which allows all the nodes’ memory caches to be available to each and every node in the cluster.

For remote memory access, OneFS uses the Sockets Direct Protocol (SDP) over an Ethernet or Infiniband (IB) backend interconnect on the cluster. SDP provides an efficient, socket-like interface between nodes which, by using a switched star topology, ensures that remote memory addresses are only ever one hop away. While not as fast as local memory, remote memory access is still very fast due to the low latency of the backend network.

OneFS uses up to three levels of read cache, plus an NVRAM-backed write cache, or write coalescer. The first two types of read cache, level 1 (L1) and level 2 (L2), are memory (RAM) based, and analogous to the cache used in CPUs. These two cache layers are present in all PowerScale storage nodes. An optional third tier of read cache, called SmartFlash, or Level 3 cache (L3), is also configurable on nodes that contain solid state drives (SSDs). L3 cache is an eviction cache that is populated by L2 cache blocks as they are aged out from memory.

The OneFS caching subsystem is coherent across the cluster. This means that if the same content exists in the private caches of multiple nodes, this cached data is consistent across all instances. For example, consider the following scenario:

  1. Node 2 and Node 4 each have a copy of data located at an address in shared cache.
  2. Node 4, in response to a write request, invalidates node 2’s copy.
  3. Node 4 then updates the value.
  4. Node 2 must re-read the data from shared cache to get the updated value.

OneFS uses the MESI Protocol to maintain cache coherency, implementing an “invalidate-on-write” policy to ensure that all data is consistent across the entire shared cache. The various states that in-cache data can take are:

M – Modified: The data exists only in local cache, and has been changed from the value in shared cache. Modified data is referred to as ‘dirty’.

E – Exclusive: The data exists only in local cache, but matches what is in shared cache. This data is referred to as ‘clean’.

S – Shared: The data in local cache may also be in other local caches in the cluster.

I – Invalid: A lock (exclusive or shared) has been lost on the data.
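As a toy illustration of this invalidate-on-write policy, here is a short Python model of the node 2 / node 4 scenario above (purely illustrative; the node numbers, block name, and data structures are invented for the example):

caches = {2: {"blk": "S"}, 4: {"blk": "S"}}   # one block, shared by nodes 2 and 4

def write(node, blk):
    # The writer's copy becomes Modified; every other cached copy is invalidated.
    for n, cache in caches.items():
        if n != node and blk in cache:
            cache[blk] = "I"
    caches[node][blk] = "M"

write(4, "blk")
print(caches)   # {2: {'blk': 'I'}, 4: {'blk': 'M'}}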

L1 cache, or front-end cache, is memory that is nearest to the protocol layers (such as NFS, SMB, and so on) used by clients, or initiators, connected to that node. The main task of L1 is to prefetch data from remote nodes. Data is pre-fetched per file, and this is optimized to reduce the latency associated with the nodes’ IB back-end network. Because the IB interconnect latency is relatively small, the size of L1 cache, and the typical amount of data stored per request, is less than L2 cache.

L1 is also known as remote cache because it contains data retrieved from other nodes in the cluster. It is coherent across the cluster, but is used only by the node on which it resides, and is not accessible by other nodes. Data in L1 cache on storage nodes is aggressively discarded after it is used. L1 cache uses file-based addressing, in which data is accessed by means of an offset into a file object. The L1 cache refers to memory on the same node as the initiator. It is only accessible to the local node, and typically the cache is not the primary copy of the data. This is analogous to the L1 cache on a CPU core, which may be invalidated as other cores write to main memory. L1 cache coherency is managed by a MESI-like protocol with distributed locks, as described above.

L2, or back-end cache, refers to local memory on the node on which a particular block of data is stored. L2 reduces the latency of a read operation by not requiring a seek directly from the disk drives. As such, the amount of data prefetched into L2 cache for use by remote nodes is much greater than that in L1 cache.

L2 is also known as local cache because it contains data retrieved from disk drives located on that node and then made available for requests from remote nodes. Data in L2 cache is evicted according to a Least Recently Used (LRU) algorithm. Data in L2 cache is addressed by the local node using an offset into a disk drive which is local to that node. Because the node knows where the data requested by the remote nodes is located on disk, this is a very fast way of retrieving data destined for remote nodes. A remote node accesses L2 cache by doing a lookup of the block address for a particular file object. As described above, there is no MESI invalidation necessary here and the cache is updated automatically during writes and kept coherent by the transaction system and NVRAM.

L3 cache is a subsystem that caches evicted L2 blocks on a node. Unlike L1 and L2, not all nodes or clusters have an L3 cache, because it requires solid state drives (SSDs) to be present and exclusively reserved and configured for caching use. L3 serves as a large, cost-effective way of extending a node’s read cache from gigabytes to terabytes. This allows clients to retain a larger working set of data in cache, before being forced to retrieve data from higher latency spinning disk. The L3 cache is populated with “interesting” L2 blocks dropped from memory by L2’s least recently used cache eviction algorithm. Unlike RAM based caches, because L3 is based on persistent flash storage, once the cache is populated, or warmed, it’s highly durable and persists across node reboots, and so on. L3 uses a custom log-based file system with an index of cached blocks. The SSDs provide very good random read access characteristics, such that a hit in L3 cache is not that much slower than a hit in L2.

To use multiple SSDs for cache effectively and automatically, L3 uses a consistent hashing approach to associate an L2 block address with one L3 SSD. In the event of an L3 drive failure, a portion of the cache will obviously disappear, but the remaining cache entries on other drives will still be valid. Before a new L3 drive can be added to the hash, some cache entries must be invalidated.
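To see how this behaves, here is a minimal Python sketch of consistent hashing across cache SSDs (an illustration of the general technique, not the OneFS implementation; the drive names, hash function, and point count are arbitrary):

import bisect
import hashlib

def h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, ssds, points=64):
        # Each SSD gets several points on a hash ring; a block maps to the
        # next point on the ring, so removing one drive only invalidates
        # that drive's share of the cache entries.
        self.ring = sorted((h("%s:%d" % (ssd, i)), ssd)
                           for ssd in ssds for i in range(points))
        self.keys = [k for k, _ in self.ring]

    def ssd_for(self, block_addr):
        idx = bisect.bisect(self.keys, h(block_addr)) % len(self.ring)
        return self.ring[idx][1]

ring = Ring(["ssd0", "ssd1", "ssd2"])
print(ring.ssd_for("lbn:123456"))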

OneFS also uses a dedicated inode cache in which recently requested inodes are kept. The inode cache frequently has a large impact on performance, because clients often cache data, and many network I/O activities are primarily requests for file attributes and metadata, which can be quickly returned from the cached inode.

OneFS provides tools to accurately assess the performance of the various levels of cache at a point in time. These cache statistics can be viewed from the OneFS CLI using the isi_cache_stats command. Statistics for L1, L2, and L3 cache are displayed for both data and metadata. For example:

# isi_cache_stats
Totals
l1_data: a 224G 100% r 226G 100% p 318M 77%, l1_encoded: a 0.0B 0% r 0.0B 0% p 0.0B 0%, l1_meta: r 4.5T 99% p 152K 48%,
l2_data: r 1.2G 95% p 115M 79%, l2_meta: r 27G 72% p 28M 3%,
l3_data: r 0.0B 0% p 0.0B 0%, l3_meta: r 8G 99% p 0.0B 0%

For more detailed and formatted output, run the command with the verbose flag: ‘isi_cache_stats -v’.

It’s worth noting that for L3 cache, the prefetch statistics will always read zero, since it’s a pure eviction cache and does not use data or metadata prefetch.

Due to balanced data distribution, automatic rebalancing, and distributed processing, OneFS is able to leverage additional CPUs, network ports, and memory as the system grows. This also allows the caching subsystem (and, by extension, throughput and IOPS) to scale linearly with the cluster size.

Author: Nick Trimbee

Read Full Blog
PowerScale OneFS NFS

OneFS Endurant Cache

Nick Trimbee

Tue, 22 Mar 2022 18:27:04 -0000

|

Read Time: 0 minutes

My earlier blog post on multi-threaded I/O generated several questions on synchronous writes in OneFS. So, this seemed like a useful topic to explore in a bit more detail.

OneFS natively provides a caching mechanism for synchronous writes – or writes that require a stable write acknowledgement to be returned to a client. This functionality is known as the Endurant Cache, or EC.

The EC operates in conjunction with the OneFS write cache, or coalescer, to ingest, protect, and aggregate small synchronous NFS writes. The incoming write blocks are staged to NVRAM, ensuring the integrity of the write, even during the unlikely event of a node’s power loss.  Furthermore, EC also creates multiple mirrored copies of the data, further guaranteeing protection from single node and, if desired, multiple node failures.

EC improves the latency associated with synchronous writes by reducing the time to acknowledgement back to the client. This process removes the Read-Modify-Write (R-M-W) operations from the acknowledgement latency path, while also leveraging the coalescer to optimize writes to disk. EC is also tightly coupled with OneFS’ multi-threaded I/O (Multi-writer) process, to support concurrent writes from multiple client writer threads to the same file. And the design of EC ensures that the cached writes do not impact snapshot performance.

The endurant cache uses write logging to combine and protect small writes at random offsets into 8KB linear writes. To achieve this, the writes go to special mirrored files, or ‘Logstores’. The response to a stable write request can be sent once the data is committed to the logstore. Logstores can be written to by several threads from the same node and are highly optimized to enable low-latency concurrent writes.

Note that if a write uses the EC, the coalescer must also be used. If the coalescer is disabled on a file, but EC is enabled, the coalescer will still be active with all data backed by the EC.

So what exactly does an endurant cache write sequence look like?

Say an NFS client wishes to write a file to a PowerScale cluster over NFS with the O_SYNC flag set, requiring a confirmed or synchronous write acknowledgement. Here is the sequence of events that occurs to facilitate a stable write.

1. A client, connected to node 3, begins the write process sending protocol level blocks. 4K is the optimal block size for the endurant cache.


2. The NFS client’s writes are temporarily stored in the write coalescer portion of node 3’s RAM. The write coalescer aggregates uncommitted blocks so that OneFS can, ideally, write out full protection groups where possible, reducing latency over protocols that allow “unstable” writes. Writing to RAM has far less latency than writing directly to disk.

3. Once in the write coalescer, the endurant cache log-writer process writes mirrored copies of the data blocks in parallel to the EC Log Files.

The protection level of the mirrored EC log files is the same as that of the data being written by the NFS client.

4. When the data copies are received into the EC Log Files, a stable write exists and a write acknowledgement (ACK) is returned to the NFS client confirming the stable write has occurred. The client assumes the write is completed and can close the write session.

5. The write coalescer then processes the file just like a non-EC write at this point. The write coalescer fills and is routinely flushed as required, writing asynchronously via the block allocation manager (BAM) and the BAM safe write (BSW) path.

6. The file is split into 128K data stripe units (DSUs), parity protection (FEC) is calculated, and FEC stripe units (FSUs) are created.

7. The layout and write plan are then determined, and the stripe units are written to their corresponding nodes’ L2 cache and NVRAM. The EC logfiles are cleared from NVRAM at this point. OneFS uses a Fast Invalid Path process to de-allocate the EC log files from NVRAM.

8. Stripe Units are then flushed to physical disk.

9. Once written to physical disk, the data stripe Unit (DSU) and FEC stripe unit (FSU) copies created during the write are cleared from NVRAM but remain in L2 cache until flushed to make room for more recently accessed data.

As far as protection goes, the number of logfile mirrors created by EC is always one more than the on-disk protection level of the file. For example:

File Protection Level      Number of EC Mirrored Copies
+1n                        2
2x                         3
+2n                        3
+2d:1n                     3
+3n                        4
+3d:1n                     4
+4n                        5

The EC mirrors are only used if the initiator node is lost. In the unlikely event that this occurs, the participant nodes replay their EC journals and complete the writes.

If the write is an EC candidate, the data remains in the coalescer, an EC write is constructed, and the appropriate coalescer region is marked as EC. The EC write is a write into a logstore (hidden mirrored file) and the data is placed into the journal.

Assuming the journal is sufficiently empty, the write is held there (cached) and only flushed to disk when the journal is full, thereby saving additional disk activity.

An optimal workload for EC involves small-block synchronous, sequential writes – something like an audit or redo log, for example. In that case, the coalescer will accumulate a full protection group’s worth of data and be able to perform an efficient FEC write.

The happy medium is a synchronous, small-block workload, particularly where the I/O rate is low and the client is latency-sensitive. In this case, the latency will be reduced and, if the I/O rate is low enough, it won’t create serious pressure.

The undesirable scenario is when the cluster is already spindle-bound and the workload is such that it generates a lot of journal pressure. In this case, EC is just going to aggravate things.

So how exactly do you configure the endurant cache?

The endurant cache is enabled by default. If it has been switched off, setting the efs.bam.ec.mode sysctl to value ‘1’ will re-enable it:

# isi_sysctl_cluster efs.bam.ec.mode=1

EC can also be enabled and disabled per directory:

# isi set -c [on|off|endurant_all|coal_only] <directory_name>

To enable the coalescer but switch off EC, run:

# isi set -c coal_only <directory_name>

And to disable the endurant cache completely:

# isi_for_array -s isi_sysctl_cluster efs.bam.ec.mode=0

Output that remains at zero, or is not incrementing, on each node from the following command will verify that EC is disabled across the cluster:

# isi_for_array -s sysctl efs.bam.ec.stats.write_blocks
efs.bam.ec.stats.write_blocks: 0

If the output to this command is incrementing, EC is delivering stable writes.

Be aware that if the Endurant Cache is disabled on a cluster, the sysctl efs.bam.ec.stats.write_blocks output on each node will not return to zero, because this sysctl is a counter, not a rate. These counters won’t reset until the node is rebooted.

As mentioned previously, EC applies to stable writes, namely:

  • Writes with O_SYNC and/or O_DIRECT flags set
  • Files on synchronous NFS mounts
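For example, a Linux client can force every write on a mount to be stable using the NFS ‘sync’ mount option (the hostname and paths here are hypothetical):

# mount -t nfs -o vers=3,sync cluster.example.com:/ifs/data /mnt/data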

When it comes to analyzing any performance issues involving EC workloads, consider the following:

  • What changed with the workload?
  • If upgrading OneFS, did the prior version also have EC enabled? 

If the workload has moved to new cluster hardware:

  • Does the performance issue occur during periods of high CPU utilization?
  • Which part of the workload is creating a deluge of stable writes?
  • Was there a large change in spindle or node count?
  • Has the OneFS protection level changed?
  • Is the SSD strategy the same?

Disabling EC is typically done cluster-wide and this can adversely impact certain workflow elements. If the EC load is localized to a subset of the files being written, an alternative way to reduce the EC heat might be to disable the coalescer buffers for some particular target directories, which would be a more targeted adjustment. This can be configured using the isi set -c off command.

One of the more likely causes of performance degradation is from applications aggressively flushing over-writes and, as a result, generating a flurry of ‘commit’ operations. This can generate heavy read/modify/write (r-m-w) cycles, inflating the average disk queue depth, and resulting in significantly slower random reads. The isi statistics protocol CLI command output will indicate whether the ‘commit’ rate is high.

It’s worth noting that synchronous writes do not require using the NFS ‘sync’ mount option. Any programmer who is concerned with write persistence can simply specify an O_FSYNC or O_DIRECT flag on the open() operation to force synchronous write semantics for that file handle. With Linux, writes using O_DIRECT will be separately accounted for in the Linux ‘mountstats’ output. Although it’s almost exclusively associated with NFS, the EC code is actually protocol-agnostic. If writes are synchronous (write-through) and are either misaligned or smaller than 8k, they have the potential to trigger EC, regardless of the protocol.
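To illustrate, here is a minimal Python sketch of an application forcing stable 4K writes on a file over an NFS mount, without any ‘sync’ mount option (the path is hypothetical):

import os

# O_SYNC forces write-through semantics for this file handle only.
fd = os.open("/mnt/cluster/redo.log", os.O_WRONLY | os.O_CREAT | os.O_SYNC)
try:
    os.write(fd, b"x" * 4096)   # 4K blocks are optimal for the endurant cache
finally:
    os.close(fd)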

The endurant cache can provide a significant latency benefit for small (such as 4K), random synchronous writes – albeit at a cost of some additional work for the system.

However, it’s worth bearing the following caveats in mind:

  • EC is not intended for more general purpose I/O.
  • There is a finite amount of EC available. As load increases, EC can potentially ‘fall behind’ and end up being a bottleneck.
  • Endurant Cache does not improve read performance, since it’s strictly part of the write process.
  • EC will not increase performance of asynchronous writes – only synchronous writes.

Author: Nick Trimbee

Read Full Blog
PowerScale OneFS

OneFS Writes

Nick Trimbee

Mon, 14 Mar 2022 23:13:12 -0000

|

Read Time: 0 minutes

OneFS runs equally across all the nodes in a cluster such that no one node controls the cluster and all nodes are true peers. Looking at the components within each node from a high level, the I/O stack is split into a top layer, or initiator, and a bottom layer, or participant. This division is used as a logical model for the analysis of OneFS’ read and write paths.

At a physical level, CPUs and memory cache in the nodes are simultaneously handling initiator and participant tasks for I/O taking place throughout the cluster. There are also caches and a distributed lock manager involved, which are excluded here for simplicity’s sake.

When a client connects to a node to write a file, it is connecting to the top half or initiator of that node. Files are broken into smaller logical chunks called stripes before being written to the bottom half or participant of a node (disk). Failure-safe buffering using a write coalescer is used to ensure that writes are efficient and read-modify-write operations are avoided. The size of each file chunk is referred to as the stripe unit size. OneFS stripes data across all nodes and protects the files, directories, and associated metadata via software erasure coding or mirroring.

OneFS determines the appropriate data layout to optimize for storage efficiency and performance. When a client connects to a node, that node’s initiator acts as the ‘captain’ for the write data layout of that file. Data, erasure code (FEC) protection, metadata, and inodes are all distributed on multiple nodes, and spread across multiple drives within nodes. The following figure shows a file write occurring across all nodes in a three node cluster.

OneFS uses a cluster’s Ethernet or Infiniband back-end network to allocate and automatically stripe data across all nodes. As data is written, it’s also protected at the specified level.

When writes take place, OneFS divides data out into atomic units called protection groups. Redundancy is built into protection groups, such that if every protection group is safe, then the entire file is safe. For files protected by FEC, a protection group consists of a series of data blocks as well as a set of parity blocks for reconstruction of the data blocks in the event of drive or node failure. For mirrored files, a protection group consists of all of the mirrors of a set of blocks.

OneFS is capable of switching the type of protection group used in a file dynamically, as it is writing. This allows the cluster to continue without blocking in situations when temporary node failure prevents the desired level of parity protection from being applied. In this case, mirroring can be used temporarily to allow writes to continue. When nodes are restored to the cluster, these mirrored protection groups are automatically converted back to FEC protection.

During a write, data is broken into stripe units and these are spread across multiple nodes as a protection group. As data is being laid out across the cluster, erasure codes or mirrors, as required, are distributed within each protection group to ensure that files are protected at all times.

One of the key functions of the OneFS AutoBalance job is to reallocate and balance data and, where possible, make storage space more usable and efficient. In most cases, the stripe width of larger files can be increased to take advantage of new free space, as nodes are added, and to make the on-disk layout more efficient.

The initiator top half of the ‘captain’ node uses a modified two-phase commit (2PC) transaction to safely distribute writes across the cluster, as follows:

Every node that owns blocks in a particular write operation is involved in a two-phase commit mechanism, which relies on NVRAM for journaling all the transactions that are occurring across every node in the storage cluster. Using multiple nodes’ NVRAM in parallel allows for high-throughput writes, while maintaining data safety against all manner of failure conditions, including power failures. If a node should fail mid-transaction, the transaction is restarted instantly without that node involved. When the node returns, it simply replays its journal from NVRAM.
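For readers unfamiliar with the pattern, here is a minimal Python sketch of a generic two-phase commit, showing the prepare/commit flow described above. It is purely illustrative and does not model OneFS’ actual modified 2PC, its NVRAM journals, or failure handling:

class Participant:
    def __init__(self, name):
        self.name = name
        self.journal = []                        # stands in for NVRAM journaling

    def prepare(self, write):
        self.journal.append(("prepare", write))  # journal before voting yes
        return True

    def commit(self, write):
        self.journal.append(("commit", write))

def two_phase_commit(participants, write):
    # Phase 1: every node owning blocks in the write journals it and votes.
    if not all(p.prepare(write) for p in participants):
        return False                             # abort: journals are rolled back
    # Phase 2: all participants commit; a node that fails mid-transaction
    # would replay its journal when it returns.
    for p in participants:
        p.commit(write)
    return True

nodes = [Participant("node%d" % i) for i in (1, 2, 3)]
print(two_phase_commit(nodes, b"stripe-unit-0"))   # True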

In a write operation, the initiator also orchestrates the layout of data and metadata, the creation of erasure codes, and lock management and permissions control. OneFS can also optimize layout decisions to better suit the workflow. These access patterns, which can be configured at a per-file or directory level, include:

Concurrency: Optimizes for current load on the cluster, featuring many simultaneous clients.

Streaming: Optimizes for high-speed streaming of a single file, for example to enable very fast reading with a single client.

Random: Optimizes for unpredictable access to the file, by adjusting striping and disabling the use of prefetch.


Author: Nick Trimbee

Read Full Blog
PowerScale OneFS

OneFS File Locking and Concurrent Access

Nick Trimbee

Mon, 14 Mar 2022 23:03:37 -0000

|

Read Time: 0 minutes

There has been a bevy of recent questions around how OneFS allows various clients attached to different nodes of a cluster to simultaneously read from and write to the same file. So it seemed like a good time for a quick refresher on some of the concepts and mechanics behind OneFS concurrency and distributed locking.


File locking is the mechanism that allows multiple users or processes to access data concurrently and safely. For reading data, this is a fairly straightforward process involving shared locks. With writes, however, things become more complex and require exclusive locking, because data must be kept consistent.

OneFS has a fully distributed lock manager that marshals locks on data across all the nodes in a storage cluster. This locking manager is highly extensible and allows for multiple lock types to support both file system locks, as well as cluster-coherent protocol-level locks, such as SMB share mode locks or NFS advisory-mode locks. OneFS supports delegated locks such as SMB oplocks and NFSv4 delegations.

Every node in a cluster can act as coordinator for locking resources, and a coordinator is assigned to lockable resources based upon a hashing algorithm. This selection algorithm is designed so that the coordinator almost always ends up on a different node than the initiator of the request. When a lock is requested for a file, it can either be a shared lock or an exclusive lock. A shared lock is primarily used for reads and allows multiple users to share the lock simultaneously. An exclusive lock, on the other hand, allows only one user access to the resource at any given moment, and is typically used for writes. Exclusive lock types include:

Mark Lock: An exclusive lock resource used to synchronize the marking and sweeping processes for the Collect job engine job.

Snapshot Lock: As the name suggests, the exclusive snapshot lock that synchronizes the process of creating and deleting snapshots.

Write Lock: An exclusive lock that’s used to quiesce writes for particular operations, including snapshot creates, non-empty directory renames, and marks.

The OneFS locking infrastructure has its own terminology, and includes the following definitions:

Domain: Refers to the specific lock attributes (recursion, deadlock detection, memory use limits, and so on) and context for a particular lock application. There is one definition of owner, resource, and lock types, and only locks within a particular domain might conflict.

Lock Type: Determines the contention among lockers. A shared or read lock does not contend with other types of shared or read locks, while an exclusive or write lock contends with all other types. Lock types include:

  • Advisory
  • Anti-virus
  • Data
  • Delete
  • LIN
  • Mark
  • Oplocks
  • Quota
  • Read
  • Share Mode
  • SMB byte-range
  • Snapshot
  • Write

Locker: Identifies the entity that acquires a lock.

Owner: A locker that has successfully acquired a particular lock. A locker may own multiple locks of the same or different type as a result of recursive locking.

Resource: Identifies a particular lock. Lock acquisition only contends on the same resource. The resource ID is typically a LIN to associate locks with files.

Waiter: Has requested a lock but has not yet been granted or acquired it.

Here’s an example of how threads from different nodes could request a lock from the coordinator:

  1. Node 2 is selected as the lock coordinator of these resources.
  2. Thread 1 from Node 4 and thread 2 from Node 3 request a shared lock on a file from Node 2 at the same time.
  3. Node 2 checks if an exclusive lock exists for the requested file.
  4. If no exclusive locks exist, Node 2 grants thread 1 from Node 4 and thread 2 from Node 3 shared locks on the requested file.
  5. Node 3 and Node 4 are now performing a read on the requested file.
  6. Thread 3 from Node 1 requests an exclusive lock for the same file as being read by Node 3 and Node 4.
  7. Node 2 checks with Node 3 and Node 4 if the shared locks can be reclaimed.
  8. Node 3 and Node 4 are still reading so Node 2 asks thread 3 from Node 1 to wait for a brief instant.
  9. Thread 3 from Node 1 blocks until the exclusive lock is granted by Node 2 and then completes the write operation.
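Here is a short Python sketch of two of the ideas at work, hashing a resource ID to select a stable coordinator, and checking shared/exclusive compatibility (illustrative only; the node list, LIN string, and data structures are invented for the example):

import hashlib

NODES = [1, 2, 3, 4]

def coordinator(resource_id):
    # Hashing the resource gives a coordinator that is stable cluster-wide
    # and usually lands on a different node than the initiator.
    digest = int(hashlib.sha1(resource_id.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

class LockState:
    def __init__(self):
        self.holders = []                     # list of (locker, lock_type)

    def compatible(self, lock_type):
        # Shared locks coexist; an exclusive lock conflicts with everything.
        if not self.holders:
            return True
        return lock_type == "shared" and all(t == "shared" for _, t in self.holders)

    def acquire(self, locker, lock_type):
        if self.compatible(lock_type):
            self.holders.append((locker, lock_type))
            return True
        return False                          # the caller becomes a waiter

lin = "1:0040:0012"                           # hypothetical LIN as resource ID
print("coordinator:", coordinator(lin))
lock = LockState()
print(lock.acquire("node4-thread1", "shared"))     # True
print(lock.acquire("node3-thread2", "shared"))     # True
print(lock.acquire("node1-thread3", "exclusive"))  # False: must wait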

Author: Nick Trimbee


Read Full Blog

OneFS Time Synchronization and NTP

Nick Trimbee

Fri, 11 Mar 2022 16:08:05 -0000

|

Read Time: 0 minutes

OneFS provides a network time protocol (NTP) service to ensure that all nodes in a cluster can easily be synchronized to the same time source. This service automatically adjusts a cluster’s date and time settings to that of one or more external NTP servers.

You can perform NTP configuration on a cluster using the isi ntp command line (CLI) utility, rather than modifying the nodes’ /etc/ntp.conf files manually. The syntax for this command is divided into two parts: servers and settings. For example:

# isi ntp settings
Description:
    View and modify cluster NTP configuration.
Required Privileges:
    ISI_PRIV_NTP
Usage:
    isi ntp settings <action>
        [--timeout <integer>]
        [{--help | -h}]
Actions:
    modify    Modify cluster NTP configuration.
    view      View cluster NTP configuration.
Options:
  Display Options:
    --timeout <integer>
        Number of seconds for a command timeout (specified as 'isi --timeout NNN
        <command>').
    --help | -h
        Display help for this command.

There is also an isi_ntp_config CLI command available in OneFS that provides a richer configuration set and combines the server and settings functionality:

Usage: isi_ntp_config COMMAND [ARGUMENTS ...]
Commands:
    help
      Print this help and exit.
    list
      List all configured info.
    add server SERVER [OPTION]
      Add SERVER to ntp.conf.  If ntp.conf is already
      configured for SERVER, the configuration will be replaced.
      You can specify any server option. See NTP.CONF(5)
 
    delete server SERVER
      Remove server configuration for SERVER if it exists.
   
    add exclude NODE [NODE...]
      Add NODE (or space separated nodes) to NTP excluded entry.
      Excluded nodes are not used for NTP communication with external
      NTP servers.
 
    delete exclude NODE [NODE...]
      Delete NODE (or space separated Nodes) from NTP excluded entry.
 
    keyfile KEYFILE_PATH
      Specify keyfile path for NTP auth. Specify "" to clear value.
      KEYFILE_PATH has to be a path under /ifs.
 
    chimers [COUNT | "default"]
      Display or modify the number of chimers NTP uses.
      Specify "default" to clear the value.

By default, if the cluster has more than three nodes, three of the nodes are selected as chimers. Chimers are nodes that can contact the external NTP servers. If the cluster consists of three or fewer nodes, only one node is selected as a chimer. If no external NTP server is set, the chimers use the local clock instead. The other, non-chimer nodes use the chimer nodes as their NTP servers. The chimer nodes are selected by the lowest node number that is not excluded from chimer duty.

If a node is configured as a chimer, its /etc/ntp.conf entry will resemble:
# This node is one of the 3 chimer nodes that can contact external NTP
# servers. The non-chimer nodes will use this node as well as the other
# chimers as their NTP servers.
server time.isilon.com
# The other chimer nodes on this cluster:
server 192.168.10.150 iburst
server 192.168.10.151 iburst
# If none or bad connection to external servers this node may become
# the time server for this cluster. The system clock will be a time
# source and run at a high stratum

Besides managing NTP servers and authentication, you can exclude individual nodes from communicating with external NTP servers.
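For example, using the isi_ntp_config syntax shown above, the following would exclude (hypothetical) nodes 5 and 6 from external NTP communication:

# isi_ntp_config add exclude 5 6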

The local clock of the node is set as an NTP server at a high stratum level. In NTP, a server with a lower stratum number is preferred, so if an external NTP server is configured, the system prefers it over the local clock. The stratum level for a chimer is determined by the chimer number: the first chimer is set to stratum 9, the second to stratum 11, and the others continue to increment the stratum number by 2, so the nth chimer runs at stratum 9 + 2(n-1) and a cluster’s three chimers run at strata 9, 11, and 13. This is so the non-chimer nodes prefer to get the time from the first chimer if available.

For a non-chimer node, its /etc/ntp.conf entry will resemble:

# This node is _not_ one of the 3 chimer nodes that can contact external
# NTP servers. These are the cluster's chimer nodes:
server 192.168.10.149 iburst true
server 192.168.10.150 iburst true
server 192.168.10.151 iburst true

When configuring NTP on a cluster, you can specify more than one NTP server to synchronize the system time from. This ability allows for full redundancy of sync targets. The cluster periodically contacts the server or servers and adjusts the time, date, or both as necessary, based on the information it receives.

You can use the isi_ntp_config CLI command to configure which NTP servers a cluster will reference. For example, the following syntax adds the server time.isilon.com:

# isi_ntp_config add server time.isilon.com

Alternatively, you can manage the NTP configuration from the WebUI by going to Cluster Management > General Settings > NTP.

NTP also provides basic authentication-based security using symmetrical keys, if preferred.

If no NTP servers are available, Windows Active Directory (AD) can synchronize domain members to a primary clock running on the domain controller or controllers. If there are no external NTP servers configured and the cluster is joined to AD, OneFS uses the Windows domain controller as the NTP time server. If the cluster and domain time become out of sync by more than four minutes, OneFS generates an event notification.

Be aware that if the cluster and Active Directory drift out of time sync by more than five minutes, AD authentication will cease to function.

If both NTP server and domain controller are not available, you can manually set the cluster’s time, date and time zone using the isi config CLI command. For example:

1. Run the isi config command. The command-line prompt changes to indicate that you are in the isi config subsystem:

# isi config
Welcome to the Isilon IQ configuration console.
Copyright (c) 2001-2017 EMC Corporation. All Rights Reserved.
Enter 'help' to see list of available commands.
Enter 'help <command>' to see help for a specific command.
Enter 'quit' at any prompt to discard changes and exit.
        Node build: Isilon OneFS v8.2.2 B_8_2_2(RELEASE)
        Node serial number: JWXER170300301
>>> 

2. Specify the current date and time by running the date command. For example, the following command sets the cluster time to 9:20 AM on April 23, 2020:

>>> date 2020/04/23 09:20:00
Date is set to 2020/04/23 09:20:00

3. The help timezone command lists the available timezones. For example:

>>> help timezone
 
timezone [<timezone identifier>]
 
Sets the time zone on the cluster to the specified time zone.
Valid time zone identifiers are:
        Greenwich Mean Time
        Eastern Time Zone
        Central Time Zone
        Mountain Time Zone
        Pacific Time Zone
        Arizona
        Alaska
        Hawaii
        Japan
        Advanced

4. To verify the currently configured time zone, run the timezone command. For example:

>>> timezone
The current time zone is: Greenwich Mean Time

5. To change the time zone, enter the timezone command followed by one of the displayed options. For example, the following command changes the time zone to Alaska:

>>> timezone Alaska
Time zone is set to Alaska

A message confirming the new time zone setting displays. If your preferred time zone did not display when you ran the help timezone command, enter timezone Advanced. After a warning screen displays, you will see a list of regions. When you select a region, a list of specific time zones for that region appears. Select the preferred time zone (you may need to scroll), and enter OK or Cancel until you return to the isi config prompt.

6. When done, run the commit command to save your changes and exit isi config.

>>> commit
Commit succeeded.

Alternatively, you can manage these time and date parameters from the WebUI by going to Cluster Management > General Settings > Date and Time.


Author: Nick Trimbee

Read Full Blog
PowerScale OneFS

OneFS Multi-writer

Nick Trimbee

Fri, 04 Mar 2022 21:02:14 -0000

|

Read Time: 0 minutes

In one of my other blog articles, we looked at write locking and shared access in OneFS. Next, we’ll delve another layer deeper into OneFS concurrent file access.

The OneFS locking hierarchy also provides a mechanism called Multi-writer, which allows a cluster to support concurrent writes from multiple client writer threads to the same file. This granular write locking is achieved by sub-dividing the file into separate regions and granting exclusive data write locks to these individual ranges, as opposed to the entire file. This process allows multiple clients, or write threads, attached to a node to simultaneously write to different regions of the same file.

Concurrent writes to a single file need more than just supporting data locks for ranges. Each writer also needs to update a file’s metadata attributes such as timestamps or block count. A mechanism for managing inode consistency is also needed, since OneFS is based on the concept of a single inode lock per file.

In addition to the standard shared read and exclusive write locks, OneFS also provides the following locking primitives, through journal deltas, to allow multiple threads to simultaneously read or write a file’s metadata attributes:

OneFS Lock Types include:

Exclusive: A thread can read or modify any field in the inode. When the transaction is committed, the entire inode block is written to disk, along with any extended attribute blocks.

Shared: A thread can read, but not modify, any inode field.

DeltaWrite: A thread can modify any inode fields which support delta-writes. These operations are sent to the journal as a set of deltas when the transaction is committed.

DeltaRead: A thread can read any field which cannot be modified by inode deltas.

These locks allow separate threads to have a Shared lock on the same LIN, or for different threads to have a DeltaWrite lock on the same LIN. However, it is not possible for one thread to have a Shared lock and another to have a DeltaWrite. This is because the Shared thread cannot perform a coherent read of a field which is in the process of being modified by the DeltaWrite thread.

The DeltaRead lock is compatible with both the Shared and DeltaWrite lock. Typically the file system will attempt to take a DeltaRead lock for a read operation, and a DeltaWrite lock for a write, since this allows maximum concurrency, as all these locks are compatible.

Here’s what the write lock compatibilities look like, as implied by the rules above:

Requested \ Held   Exclusive   Shared   DeltaWrite   DeltaRead
Exclusive          No          No       No           No
Shared             No          Yes      No           Yes
DeltaWrite         No          No       Yes          Yes
DeltaRead          No          Yes      Yes          Yes

OneFS protects data by writing file blocks (restriping) across multiple drives on different nodes. The Job Engine defines a restripe set comprising jobs which involve file-system management, protection and on-disk layout. The restripe set contains the following jobs:

  • AutoBalance & AutoBalanceLin
  • FlexProtect & FlexProtectLin
  • MediaScan
  • MultiScan
  • SetProtectPlus
  • SmartPools
  • Upgrade

Multi-writer for restripe allows multiple restripe worker threads to operate on a single file concurrently. This, in turn, improves read/write performance during file re-protection operations, plus helps reduce the window of risk (MTTDL) during drive Smartfails or other failures. This is particularly true for workflows consisting of large files, while one of the above restripe jobs is running. Typically, the larger the files on the cluster, the more benefit multi-writer for restripe will offer.

With multi-writer for restripe, an exclusive lock is no longer required on the LIN during the actual restripe of data. Instead, OneFS tries to use a delta write lock to update the cursors used to track which parts of the file need restriping. This means that a client application or program should be able to continue to write to the file while the restripe operation is underway.

An exclusive lock is only required for a very short period of time while a file is set up to be restriped. A file will have fixed widths for each restripe lock, and the number of range locks will depend on the quantity of threads and nodes which are actively restriping a single file.

Author: Nick Trimbee

Read Full Blog
PowerScale OneFS SmartPools

OneFS FilePolicy Job

Nick Trimbee

Fri, 04 Mar 2022 15:25:02 -0000

|

Read Time: 0 minutes

Traditionally, OneFS has used the SmartPools jobs to apply its file pool policies. To accomplish this, the SmartPools job visits every file, and the SmartPoolsTree job visits a tree of files. However, the scanning portion of these jobs can result in significant random impact to the cluster and lengthy execution times, particularly in the case of the SmartPools job.

To address this, OneFS also provides the FilePolicy job, which offers a faster, lower impact method for applying file pool policies than the full-blown SmartPools job.

But first, a quick Job Engine refresher…

As we know, the Job Engine is OneFS’ parallel task scheduling framework, and is responsible for the distribution, execution, and impact management of critical jobs and operations across the entire cluster.

The OneFS Job Engine schedules and manages all the data protection and background cluster tasks: creating jobs for each task, prioritizing them, and ensuring that inter-node communication and cluster wide capacity utilization and performance are balanced and optimized. Job Engine ensures that core cluster functions have priority over less important work and gives applications integrated with OneFS – Isilon add-on software or applications integrating to OneFS via the OneFS API – the ability to control the priority of their various functions to ensure the best resource utilization.

Each job (such as the SmartPools job) has an “Impact Profile” comprising a configurable impact policy and an impact schedule, which together characterize how much of the system’s resources the job will take. The amount of work a job has to do is fixed, but the resources dedicated to that work can be tuned to minimize the impact to other cluster functions, like serving client data.

Here’s a list of the specific jobs that are directly associated with OneFS SmartPools:

Job              Description
SmartPools       Job that runs and moves data between the tiers of nodes within the same cluster. Also executes the CloudPools functionality if licensed and configured.
SmartPoolsTree   Enforces SmartPools file policies on a subtree.
FilePolicy       Efficient changelist-based SmartPools file pool policy job.
IndexUpdate      Creates and updates an efficient file system index for the FilePolicy job.
SetProtectPlus   Applies the default file policy. This job is disabled if SmartPools is activated on the cluster.

In conjunction with the IndexUpdate job, FilePolicy improves job scan performance by using a ‘file system index’, or changelist, to find files needing policy changes, rather than a full tree scan.

Avoiding a full treewalk dramatically decreases the amount of locking and metadata scanning work the job is required to perform, reducing impact on CPU and disk – albeit at the expense of not doing everything that SmartPools does. The FilePolicy job enforces just the SmartPools file pool policies, as opposed to the storage pool settings. For example, FilePolicy does not deal with changes to storage pools or storage pool settings, such as:

  • Restriping activity due to adding, removing, or reorganizing node pools
  • Changes to storage pool settings or defaults, including protection

However, the majority of the time SmartPools and FilePolicy perform the same work. Disabled by default, FilePolicy supports the full range of file pool policy features, reports the same information, and provides the same configuration options as the SmartPools job. Because FilePolicy is a changelist-based job, it performs best when run frequently – once or multiple times a day, depending on the configured file pool policies, data size, and rate of change.

Job schedules can easily be configured from the OneFS WebUI by navigating to Cluster Management > Job Operations, highlighting the desired job, and selecting ‘View/Edit’. For example, the IndexUpdate job might be configured to run every six hours at a LOW impact level with a priority value of 5.

When enabling and using the FilePolicy and IndexUpdate jobs, the recommendation is to continue running the SmartPools job as well, but at a reduced frequency (monthly).

In addition to running on a configured schedule, the FilePolicy job can also be executed manually.
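For example, the jobs can be started from the CLI, running IndexUpdate first so that FilePolicy has a current index to work from:

# isi job jobs start IndexUpdate
# isi job jobs start FilePolicy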

FilePolicy requires access to a current index. If the IndexUpdate job has not yet been run, attempting to start the FilePolicy job will fail with an error message prompting you to run the IndexUpdate job first. When the index has been created, the FilePolicy job will run successfully. The IndexUpdate job can be run several times daily (that is, every six hours) to keep the index current and prevent the snapshots from getting large.

Consider using the FilePolicy job with the job schedule shown in the table below for workflows and datasets that have these characteristics:

  • Data with long retention times
  • Large number of small files
  • Path-based File Pool filters configured
  • Where FSAnalyze job is already running on the cluster (InsightIQ monitored clusters)
  • There is already a SnapshotIQ schedule configured
  • When the SmartPools job typically takes a day or more to run to completion at LOW impact

For clusters without these characteristics, the recommendation is to continue running the SmartPools job as usual and not to activate the FilePolicy job.

The following table provides a suggested job schedule when deploying FilePolicy:

Job           Schedule                     Impact   Priority
FilePolicy    Every day at 22:00           LOW      6
IndexUpdate   Every six hours, every day   LOW      5
SmartPools    Monthly – Sunday at 23:00    LOW      6

Because no two clusters are the same, this suggested job schedule may require additional tuning to meet the needs of a specific environment.

Note that when clusters running older OneFS versions and the FSAnalyze job are upgraded to OneFS 8.2.x or later, the legacy FSAnalyze index and snapshots are removed and replaced by new snapshots the first time that IndexUpdate is run. The new index stores considerably more file and snapshot attributes than the old FSA index. Until the IndexUpdate job effects this change, FSA keeps running on the old index and snapshots.

Author: Nick Trimbee



Read Full Blog
data storage data tiering PowerScale API OneFS

A Metadata-based Approach to Tiering in PowerScale OneFS

Gregory Shiff

Wed, 02 Mar 2022 22:56:32 -0000

|

Read Time: 0 minutes

OneFS SmartPools provides sophisticated tiering between storage node types. Rules based on file attributes such as last accessed time or creation date can be configured in OneFS to drive transparent motion of data between PowerScale node types. This kind of “set and forget” approach to data tiering is ideal for some industries but not workable for most content creation workflows.

A classic case of how this kind of tiering falls short for media is the real-time nature of video playback. For an extreme example, take an uncompressed 4K (or even 8K) image sequence, which might require more than 1.5GB/s of throughput to play properly. If this media has been tiered down to low-performing archive storage and it needs to be used, those files must be migrated back up before they will play. This problem causes delays and confusion all around and makes media storage administrators hesitant to archive anything.

The good news is that the PowerScale OneFS ecosystem has a better way of doing things!

The approach I have taken here is to pull metadata from elsewhere in the workflow and use it to drive on demand tiering in OneFS. How does that work? OneFS supports file extended attributes, which are <key/value> pairs (metadata!) that can be written to the files and directories stored in OneFS. File Policies can be configured in OneFS to move data based on those file extended attributes. And a SmartPoolsTree job can be run on only the path that needs to be moved. All this goodness can be controlled externally by combining the DataIQ API and the OneFS API.

Figure 1: API flow

Note that while I’m focused on combining the DataIQ and OneFS APIs in this post, other API driven tools with OneFS file system visibility could be substituted for DataIQ.

DataIQ

DataIQ is a data indexing and analysis tool. It runs as an external virtual machine and maintains an index of mounted file systems. DataIQ’s file system crawler is efficient, fast, and lightweight, meaning it can be kept up to date with little impact on the storage devices it is indexing.

DataIQ has a concept called “tagging”. Tags in DataIQ apply to directories and provide a mechanism for reporting sets of related data. A tag in DataIQ is an arbitrary <key>/<value> pair. Directories can be tagged in DataIQ in three different ways:

  • Autotagging rules: Tags are automatically placed in the file system based on regular expressions defined in the Autotagging configuration menu.
  • Use of .cntag files: Empty files named in the format <key>.<value>.cntag are placed in directories and are recognized as tags by DataIQ.
  • API-based tagging: The DataIQ API allows for external tagging of directories.

Tags can be placed throughout a file system and then reported on as a group. For instance, temporary render directories could contain a render.temp.cntag file. Similarly, an external tool could access the DataIQ API and place a <Project/Name> tag on the top-level directory of each project. DataIQ can generate reports on the storage capacity those tags are consuming.

File system extended attributes in OneFS

As I mentioned earlier, OneFS supports file extended attributes. Extended attributes are arbitrary metadata tags in the form of <key/value> pairs that can be applied to files and directories. Extended attributes are not visible in the graphical interface or when accessing files over a share or export. However, the attributes can be accessed using the OneFS CLI with the getextattr and setextattr commands.
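For example, from the OneFS CLI (the attribute name, value, and path here are purely illustrative):

# setextattr user ShotStatus delivered /ifs/projects/shot_010
# getextattr user ShotStatus /ifs/projects/shot_010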

Figure 2: File extended attributes

The SmartPools job engine will move data between node pools based on these file attributes. And it is that SmartPools functionality that uses this metadata to perform on demand data tiering.

Crucially, OneFS supports creation of file system extended attributes from an external script using the OneFS REST API. The OneFS API Reference Guide has great information about setting and reading back file system extended attributes.

Figure 3: File policy configuration

Tiering example with Autodesk ShotGrid, DataIQ, and OneFS

Autodesk ShotGrid (formerly Shotgun) is a production resource management tool common in the visual effects and animation industries. ShotGrid is a cloud-based tool that allows for coordination of large production teams. Although it isn’t a storage management tool, its business logic can be useful in deciding what tier of storage a particular set of files should live on. For instance, if a shot tracked in ShotGrid is complete and delivered, the files associated with that shot could be moved to archive.

DataIQ plug-in for Autodesk ShotGrid

The open-source DataIQ plug-in for ShotGrid is available on GitHub here:

Dell DataIQ Autodesk ShotGrid Plugin

This plug-in is proof of concept code to show how the ShotGrid and DataIQ APIs can be combined to tag data in DataIQ based on shot status in ShotGrid. The DataIQ tags are dynamically updated with the current shot status in ShotGrid.

Here is a “shot” in ShotGrid configured with various possible statuses:

Figure 4: ShotGrid status

The following figure of DataIQ shows where the shot status field from ShotGrid has been automatically applied as a tag in DataIQ.

Figure 5: DataIQ tags

Once metadata from ShotGrid has been pulled into DataIQ, that information can be used to drive OneFS SmartPools tiering:

  1. A user (or system) passes the DataIQ tag <key/values> to the DataIQ API. The DataIQ API returns a list of directories associated with that tag.
  2. A directory chosen from Step 1 above can be passed back to the DataIQ API to get a listing of all contents by way of the DataIQ file index.
  3. Those items are passed programmatically to the OneFS API. The <key/value> pair of the original DataIQ tag is written as an extended attribute directly to the targeted files and directories.  
  4. And finally, the SmartPoolsTree job can be run on the parent path chosen in Step 2 above to begin tiering the data immediately. 
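Here is a compact Python sketch of those four steps. The DataIQ endpoints, the tag, and the request bodies below are illustrative assumptions rather than documented API contracts; see the white paper referenced at the end of this article for a working implementation:

import requests

DATAIQ = "https://dataiq.example.com"        # hypothetical hosts
ONEFS = "https://cluster.example.com:8080"
TAG = {"key": "shot_status", "value": "delivered"}

session = requests.Session()
session.verify = False                       # lab shortcut; use real certificates

# 1. Ask DataIQ which directories carry the tag (hypothetical endpoint).
dirs = session.get(f"{DATAIQ}/api/v1/tags/{TAG['key']}/{TAG['value']}/paths").json()

for path in dirs:
    # 2. List the directory's contents from the DataIQ index (hypothetical endpoint).
    items = session.get(f"{DATAIQ}/api/v1/files", params={"path": path}).json()

    # 3. Write the tag back to OneFS as a user extended attribute on each item
    #    through the namespace metadata call.
    for item in items:
        session.put(f"{ONEFS}/namespace{item}?metadata",
                    json={"action": "update",
                          "attrs": [{"name": TAG["key"], "value": TAG["value"],
                                     "op": "update", "namespace": "user"}]})

    # 4. Start a SmartPoolsTree job rooted at the tagged path
    #    (the request body is an assumption).
    session.post(f"{ONEFS}/platform/3/job/jobs",
                 json={"type": "SmartPoolsTree", "paths": [path]})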

Using business logic to drive storage tiering

DataIQ and OneFS provide the APIs necessary to drive storage tiering based on business logic. Striking efficiencies can be gained by taking advantage of the metadata that exists in many workflow tools. It is a matter of “connecting the dots”.

The example in this blog uses ShotGrid and DataIQ, however it is easy to imagine that similar metadata-based techniques could be developed using other file system index tools. In the media and entertainment ecosystem, media asset management and production asset management systems immediately come to mind as candidates for this kind of API level integration.

As data volumes increase exponentially, it is unrealistic to keep all files on the highest-cost tiers of storage. Various automated storage tiering approaches have been around for years, but for many use cases this automated tiering approach falls short. Bringing together rich metadata and an API-driven workflow bridges the gap.

To see the Python required to put this process together, refer to my white paper PowerScale OneFS: A Metadata Driven Approach to On Demand Tiering.

Author: Gregory Shiff, Principal Solutions Architect, Media & Entertainment    LinkedIn


Read Full Blog
PowerScale OneFS syslog protocol

Understanding the Protocol Syslog Format in PowerScale OneFS

Vincent Shen

Tue, 22 Feb 2022 17:04:48 -0000

|

Read Time: 0 minutes

Recently I’ve received several queries on the format of the audit protocol syslog in PowerScale. It is a little bit complicated for the following reasons:

  1. For different protocol operations (such as OPEN and CLOSE), various fields have been defined to meet auditing goals.
  2. Some fields are easy to parse and some are more difficult.
  3. It is not currently documented.

Syslog format

The following table shows, for each audited protocol operation, the fields that appear in the PowerScale protocol syslog payload, in order:

Operation      Fields (in order)
LOGON          userSID, zoneName, ZoneID, clientIPAddr, protocol, operation, ntStatus, username
LOGOFF         userSID, zoneName, ZoneID, clientIPAddr, protocol, operation, ntStatus, username
TREE-CONNECT   userSID, zoneName, ZoneID, clientIPAddr, protocol, operation, ntStatus
READ           userSID, userID, zoneName, ZoneID, clientIPAddr, protocol, operation, ntStatus, isDirectory, inode/lin, filename
WRITE          userSID, userID, zoneName, ZoneID, clientIPAddr, protocol, operation, ntStatus, isDirectory, inode/lin, filename
CLOSE          userSID, userID, zoneName, ZoneID, clientIPAddr, protocol, operation, ntStatus, isDirectory, bytesRead, bytesWrite, inode/lin, filename
DELETE         userSID, userID, zoneName, ZoneID, clientIPAddr, protocol, operation, ntStatus, isDirectory, inode/lin, filename
GET_SECURITY   userSID, userID, zoneName, ZoneID, clientIPAddr, protocol, operation, ntStatus, isDirectory, inode/lin, filename
SET_SECURITY   userSID, userID, zoneName, ZoneID, clientIPAddr, protocol, operation, ntStatus, isDirectory, inode/lin, filename
OPEN           userSID, userID, zoneName, ZoneID, clientIPAddr, protocol, operation, ntStatus, desiredAccess, isDirectory, createResult, inode/lin, filename
RENAME         userSID, userID, zoneName, ZoneID, clientIPAddr, protocol, operation, ntStatus, isDirectory, inode/lin, filename, newFileName

Some notes:

  1. Starting with OneFS 9.2.0.0, the protocol syslog output follows the RFC 5425 standard.
  2. userSID: A SID is a unique identifier for an object in Active Directory or NT4 domains. On a native Windows file server (as well as some other CIFS server implementations), this SID is used directly to determine a user’s identity, and is generally stored on every file or folder in the file system that the user has rights to. SIDs commonly start with the letter ‘S’ and include a series of numbers and dashes.
  3. userID: On most UNIX-based systems, file and folder permissions are assigned to UIDs and GIDs (most commonly found in /etc/passwd and /etc/group).
  4. protocol: One of SMB, NFS, or HDFS. SMB is also returned for the LOGON, LOGOFF, and TREE-CONNECT operations.
  5. ntStatus:
    1. If the ntStatus field is 0, it will return “SUCCESS”.
    2. If the ntStatus field is non-zero, it will return “FAILED: <NT Status Code>”.
    3. If the ntStatus field is not in the payload, it will return “ERROR”.
    4. Refer to the Microsoft Open Specifications (https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55) for the values and descriptions of the NT Status Codes.
  6. isDirectory:
    1. If the object is a file, it will return “FILE”.
    2. If the object is a directory, it will return “DIR”.

Example

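In lieu of the original screenshot, here is a small Python sketch showing how the field order in the table above could be used to split a forwarded entry. The comma delimiter and the demo payload are illustrative assumptions, not a documented wire format:

FIELDS = {
    "LOGON": ["userSID", "zoneName", "ZoneID", "clientIPAddr",
              "protocol", "operation", "ntStatus", "username"],
    "READ": ["userSID", "userID", "zoneName", "ZoneID", "clientIPAddr",
             "protocol", "operation", "ntStatus", "isDirectory",
             "inode/lin", "filename"],
}

def parse(payload):
    values = [v.strip() for v in payload.split(",")]
    operation = next(v for v in values if v in FIELDS)
    return dict(zip(FIELDS[operation], values))

demo = "S-1-5-21-1234-5678-9012-500, System, 1, 192.168.10.25, SMB, LOGON, SUCCESS, EXAMPLE\\jdoe"
print(parse(demo))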

Conclusion

I hope you have found this helpful.

Thanks for reading!

Author: Vincent Shen





Read Full Blog
PowerScale OneFS

OneFS and Long Filenames

Nick Trimbee

Fri, 28 Jan 2022 21:24:39 -0000

|

Read Time: 0 minutes

Another feature debuting in OneFS 9.3 is support for long filenames. Until now, the OneFS filename limit has been capped at 255 bytes. However, depending on the encoding type, this could potentially be an impediment for certain languages such as Chinese, Hebrew, Japanese, Korean, and Thai, and can create issues for customers who work with international languages that use multi-byte UTF-8 characters.

Since some international languages use up to 4 bytes per character, a file name of 255 bytes could be limited to as few as 63 characters when using certain languages on a cluster.

To address this, the new long filenames feature provides support for names up to 255 Unicode characters, by increasing the maximum file name length from 255 bytes to 1024 bytes. In conjunction with this, the OneFS maximum path length is also increased from 1024 bytes to 4096 bytes.

Before creating a name length configuration, the cluster must be running OneFS 9.3. However, the long filename feature is not activated or enabled by default: you have to opt in by creating a “name length” configuration. That said, the recommendation is to only enable long filename support if you are actually planning on using it. This is because, once enabled, OneFS does not track if, when, or where a long file name or path is created.

The following procedure can be used to configure a PowerScale cluster for long filename support:

Step 1: Ensure cluster is running OneFS 9.3 or later

The ‘uname’ CLI command output will display a cluster’s current OneFS version.

For example:

# uname -sr
Isilon OneFS v9.3.0.0

The current OneFS version information is also displayed at the upper right of any of the OneFS WebUI pages. If the output from Step 1 shows the cluster running an earlier release, an upgrade to OneFS 9.3 will be required. This can be accomplished either using the ‘isi upgrade cluster’ CLI command or from the OneFS WebUI, by going to Cluster Management > upgrade.

Once the upgrade has completed it will need to be committed, either by following the WebUI prompts, or using the ‘isi upgrade cluster commit’ CLI command.

Step 2. Verify cluster’s long filename support configuration: Viewing a cluster’s long filename support settings

The ‘isi namelength list’ CLI command output will verify a cluster’s long filename support status. For example, the following cluster already has long filename support enabled on the /ifs/tst path:

# isi namelength list
Path     Policy     Max Bytes   Max Chars
-----------------------------------------
/ifs/tst restricted 255         255
-----------------------------------------
Total: 1

Step 3. Configure long filename support

The ‘isi namelength create <path>’ CLI command can be run on the cluster to enable long filename support.

# mkdir /ifs/lfn
# isi namelength create --max-bytes 1024 --max-chars 1024 /ifs/lfn

If these options are omitted, a namelength configuration is created with default maximum values of 255 bytes and 255 characters.

Step 4: Confirm long filename support is configured

The ‘isi namelength list’ CLI command output will confirm that the cluster’s /ifs/lfn directory path is now configured to support long filenames:

# isi namelength list
Path     Policy     Max Bytes   Max Chars
-----------------------------------------
/ifs/lfn custom      1024       1024
/ifs/tst restricted 255         255
-----------------------------------------
Total: 2

Name length configuration is set up per directory and can be nested. Plus, cluster-wide configuration can be applied by configuring at the root /ifs level.

Filename length configurations have two preset defaults:

  • “Full”: 1024 bytes, 255 characters.
  • “Restricted”: 255 bytes, 255 characters; this is the default if no additional filename length configuration is specified.

Note that removing the long name configuration for a directory will not affect its contents, including any previously created files and directories with long names. However, it will prevent any new long-named files or subdirectories from being created under that directory.
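
To remove a directory's name length configuration, the corresponding delete subcommand can be used. This is a sketch; verify the exact syntax with ‘isi namelength --help’ on your release:

# isi namelength delete /ifs/lfn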

If a filename is too long for a particular protocol, OneFS will automatically truncate the name to around 249 bytes with a ‘hash’ appended to it, which can be used to consistently identify and access the file. This shortening process is referred to as ‘name mangling’. If, for example, a filename longer than 255 bytes is returned in a directory listing over NFSv3, the file’s mangled name will be presented. Any subsequent lookups of this mangled name will resolve to the same file with the original long name. Be aware that filename extensions will be lost when a name is mangled, which can have ramifications for Windows applications, and so on.

If long filename support is enabled on a cluster with active SyncIQ policies, all source and target clusters must have OneFS 9.3 or later installed and committed, and long filename support enabled.

However, the long name configuration does not need to be identical between the source and target clusters -- it only needs to be enabled. This can be done via the following sysctl command:

# sysctl efs.bam.long_file_name_enabled=1

When the target cluster for a SyncIQ policy does not support long filenames and the source domain has long filenames enabled, the replication job will fail, and the subsequent SyncIQ job report will indicate the error.

Note that the OneFS checks are unable to identify a cascaded replication target running an earlier OneFS version and/or without long filenames configured.

So there are a couple of things to bear in mind when using long filenames:

  • Restoring data from a OneFS 9.3 NDMP backup containing long filenames to a cluster running an earlier OneFS version will fail with an ‘ENAMETOOLONG’ error for each long-named file. However, all the files with regular length names will be successfully restored from the backup stream.
  • OneFS ICAP does not support long filenames. However, CAVA, ICAP’s replacement, is compatible.
  • The ‘isi_vol_copy’ migration utility does not support long filenames.
  • Neither does the OneFS WebDAV protocol implementation.
  • Symbolic links created via SMB are limited to 1024 bytes due to the size limit on extended attributes.
  • Any pathnames specified in long filename pAPI operations are limited to 4068 bytes.
  • And finally, while an increase in long named files and directories could potentially reduce the number of names the OneFS metadata structures can hold, the overall performance impact of creating files with longer names is negligible.

Author: Nick Trimbee




PowerScale OneFS

OneFS Virtual Hot Spare

Nick Trimbee

Fri, 28 Jan 2022 21:12:37 -0000

|

Read Time: 0 minutes

There have been several recent questions from the field around how a cluster manages space reservation and pre-allocation of capacity for data repair and drive rebuilds.

OneFS provides a mechanism called Virtual Hot Spare (VHS), which helps ensure that node pools maintain enough free space to successfully re-protect data in the event of drive failure.

Although globally configured, Virtual Hot Spare actually operates at the node pool level so that nodes with different size drives reserve the appropriate VHS space. This helps ensure that, while data may move from one disk pool to another during repair, it remains on the same class of storage. VHS reservations are cluster wide and configurable as either a percentage of total storage (0-20%) or as a number of virtual drives (1-4). To achieve this, the reservation mechanism allocates a fraction of the node pool’s VHS space in each of its constituent disk pools.

No space is reserved for VHS on SSDs unless the entire node pool consists of SSDs. This means that a failed SSD may have data moved to HDDs during repair, but without adding additional configuration settings. This avoids reserving an unreasonable percentage of the SSD space in a node pool.

The default for new clusters is for Virtual Hot Spare to have both “subtract the space reserved for the virtual hot spare…” and “deny new data writes…” enabled with one virtual drive. On upgrade, existing settings are maintained.

It is strongly encouraged to keep Virtual Hot Spare enabled on a cluster, and a best practice is to configure 10% of total storage for VHS. If VHS is disabled and you upgrade OneFS, VHS will remain disabled. If VHS is disabled on your cluster, first check to ensure the cluster has sufficient free space to safely enable VHS, and then enable it.
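
For example, the following CLI sketch (using the storagepool settings options shown in the view output below; confirm the flag names on your release) would reserve one virtual drive and deny new writes to the reserved space:

# isi storagepool settings modify --virtual-hot-spare-limit-drives 1 --virtual-hot-spare-deny-writes true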

VHS can be configured via the OneFS WebUI or CLI, and is always available, regardless of whether SmartPools has been licensed on a cluster.

From the CLI, the cluster’s VHS configuration is part of the storage pool settings, and can be viewed with the following syntax:

# isi storagepool settings view
     Automatically Manage Protection: files_at_default
Automatically Manage Io Optimization: files_at_default
Protect Directories One Level Higher: Yes
       Global Namespace Acceleration: disabled
       Virtual Hot Spare Deny Writes: Yes
        Virtual Hot Spare Hide Spare: Yes
      Virtual Hot Spare Limit Drives: 1
     Virtual Hot Spare Limit Percent: 10
             Global Spillover Target: anywhere
                    Spillover Enabled: Yes
        SSD L3 Cache Default Enabled: Yes
                     SSD Qab Mirrors: one
            SSD System Btree Mirrors: one
            SSD System Delta Mirrors: one

Similarly, the following command will set the cluster’s VHS space reservation to 10%.

# isi storagepool settings modify --virtual-hot-spare-limit-percent 10

Bear in mind that reservations for virtual hot sparing will affect spillover. For example, if VHS is configured to reserve 10% of a pool’s capacity, spillover will occur at 90% full.

Spillover allows data that is being sent to a full pool to be diverted to an alternate pool. Spillover is enabled by default on clusters that have more than one pool. If you have a SmartPools license on the cluster, you can disable Spillover. However, it is recommended that you keep Spillover enabled. If a pool is full and Spillover is disabled, you might get a “no space available” error but still have a large amount of space left on the cluster.

If the cluster is inadvertently configured to allow data writes to the reserved VHS space, an informational warning is displayed in the SmartPools WebUI.

There is also no requirement for reserved space for snapshots in OneFS. Snapshots can use as much or as little of the available file system space as desirable and necessary.

A snapshot reserve can be configured if preferred, although this is an accounting reservation rather than a hard limit, and is not a recommended best practice. If desired, the snapshot reserve can be set via the OneFS command line interface (CLI) by running the ‘isi snapshot settings modify --reserve’ command.

For example, the following command will set the snapshot reserve to 10%:

# isi snapshot settings modify --reserve 10

It’s worth noting that the snapshot reserve does not constrain the amount of space that snapshots can use on the cluster. Snapshots can consume a greater percentage of storage capacity than is specified by the snapshot reserve.

Additionally, when using SmartPools, snapshots can be stored on a different node pool or tier than the one the original data resides on.

For example, snapshots taken on a performance-aligned tier can be physically housed on a more cost-effective archive tier.

Author: Nick Trimbee

security PowerScale OneFS MFA

Configure SSH Multi-Factor Authentication on OneFS Using Duo

Lieven Lin

Thu, 27 Jan 2022 21:03:07 -0000

|

Read Time: 0 minutes

Duo Security, part of Cisco, is a vendor of cloud-based multi-factor authentication (MFA) services. MFA prevents an attacker from masquerading as an authenticated user. Duo allows an administrator to require multiple options for secondary authentication. With multi-factor authentication, even if an attacker steals a username and password, they cannot easily authenticate to a network service without the user's device.

SSH Multi-Factor Authentication (MFA) with Duo is a feature introduced in OneFS 8.2. Currently, OneFS supports SSH MFA with the Duo service through SMS (short message service), phone callback, and push notification via the Duo Mobile app. This blog describes how to integrate OneFS SSH MFA with the Duo service.

Duo supports many kinds of applications, such as Microsoft Azure Active Directory, Cisco Webex, and Amazon Web Services. For a OneFS cluster, it appears as a "Unix Application" entry. To integrate OneFS with the Duo service, you must configure both the Duo service and the OneFS cluster. Before configuring OneFS with Duo, you need to have a Duo account. In this blog, we used a trial account for demonstration purposes.

Failmode

By default, the SSH failmode for Duo in OneFS is “safe”, which allows normal authentication if the Duo service is not available. The “secure” mode denies SSH access if the Duo service is not available, including for bypass users, because bypass users are defined and validated in the Duo service. To configure the failmode in OneFS, specify the --failmode option using the following command:

# isi auth duo modify --failmode <safe | secure>

Exclusion group

By default, all groups are required to use Duo unless a group is configured to bypass Duo authentication. The groups option allows you to exclude dedicated user groups from Duo service authentication, or to specify which groups must use it. This provides a way to configure users who can still SSH into the cluster even when the Duo service is not available and the failmode is set to “secure”. Otherwise, all users may be locked out of the cluster in this situation.

To configure the exclusion group option, add an exclamation mark “!” before the group name, preceded by an asterisk entry to ensure that all other groups use the Duo service. For example:

# isi auth duo modify --groups=”*,!groupname”

Note: zsh shell requires the “!” to be escaped. In this case, the example above should be changed to:

# isi auth duo modify --groups=”*,\!groupname”

Prepare the Duo service for OneFS

1. Use your new Duo account to log into the Duo Admin Panel. Select the Application item from the left menu, then click Protect an Application, as shown in Figure 1.

Figure 1  Protect an Application

2.  Type “Unix Application” in the search bar. Click Protect this Application to create a new UNIX Application entry.

Figure 2  Search for UNIX Application

3. Scroll down the creation page to find the Settings section. Type a name for the new UNIX Application; try to use a name that identifies your OneFS cluster, as shown in Figure 3. In the Settings section, you can also find Duo's username normalization setting.

By default, Duo username normalization is not AD aware. This means that it will alter incoming usernames before trying to match them to a user account. For example, "DOMAIN\username", "username@domain.com", and "username" are treated as the same user. For other normalization options, refer to the Duo documentation.

Figure 3  UNIX Application Name

4. Check the required information for OneFS under the Details section, including API hostname, integration key, and secret key, as shown in Figure 4.

Figure 4  Required Information for OneFS

5. Manually enroll a user. In this example, we are creating a user named admin, which is the default OneFS administrator user. Switch the menu item to Users and click the Add User button, as shown in Figure 5. For details about user enrollment in the Duo service, refer to the Duo documentation Enrolling Users.

Figure 5  User Enrollment

6. Type the user name, as shown in Figure 6.

Figure 6  Manual User Enrollment

7. Find the Phones settings in the user page and click the Add Phone button to add a device for the user. See Figure 7.

Figure 7  Add Phone for User

8. Type your phone number.

Figure 8  Add New Phone

9. (Optional) If you want to use the Duo Push authentication method, install the Duo Mobile app on the phone and activate it. As highlighted in Figure 9, click the link to activate the Duo Mobile app.

Figure 9  Activate the Duo Mobile app

The Duo service is now prepared for OneFS. Now let's go on to configure OneFS.

Configuring and verifying OneFS

1. By default, the authentication setting template is set to “any”. To use OneFS with the Duo service, the authentication setting template must not be set to “any” or “custom”. It should be set to “password”, “publickey”, or “both”. In the following example, we configure the setting to “password”, which uses the user's password plus Duo for SSH MFA.

# isi ssh modify --auth-settings-template=password

2. To confirm the authentication method, use the following command:

# isi ssh settings view | grep "Auth Settings Template"
      Auth Settings Template: password

3. Configure the required Duo service information and enable it for SSH MFA, as shown here. Use the same information as when we set up the UNIX Application in Duo, including API hostname, integration key, and secret key.

# isi auth duo modify --enabled=true --failmode=safe --host=api-13b1ee8c.duosecurity.com --ikey=DIRHW4IRSC7Q4R1YQ3CQ --set-skey
Enter skey:
Confirm:

4. Verify SSH MFA using the user “admin”. An SMS passcode and the user’s password are used for authentication in this example, as shown in Figure 10.

Figure 10 SSH MFA Verification

You have now completed the configuration on both your Duo service portal and your OneFS cluster. SSH users now have to authenticate with Duo, so with MFA enabled you can further strengthen your OneFS cluster security.

Author: Lieven Lin

 



backup PowerScale CPU OneFS SAN

Introducing the Accelerator Nodes – the Latest Additions to the Dell PowerScale Family

Cris Banson

Thu, 20 Jan 2022 14:45:39 -0000

|

Read Time: 0 minutes

The Dell PowerScale family has recently been expanded with the release of accelerator nodes. Accelerator nodes contribute additional CPU, memory, and network bandwidth to a cluster that already has adequate storage resources.

The PowerScale accelerator nodes include the PowerScale P100 performance accelerator and the PowerScale B100 backup accelerator. Both the P100 and B100 are based on 1U PowerEdge R640 servers and can be part of a PowerScale cluster that is powered by OneFS 9.3 or later. The accelerator nodes contain boot media only and are optimized for CPU/memory configurations. A single P100 or B100 node can be added to a cluster. Expansion is through single node increments.

PowerScale all-flash and all-NVMe storage deliver the necessary performance to meet demanding workloads. If additional capabilities are required, new nodes can be non-disruptively added to the cluster, to provide both performance and capacity. There may be specialized compute-bound workloads that require extra performance but don’t need any additional capacity. These types of workloads may benefit by adding the PowerScale P100 performance accelerator node to the cluster. The accelerator node contributes CPU, memory, and network bandwidth capabilities to the cluster. This accelerated storage solution delivers incremental performance at a lower cost. Let’s look at each in detail.  

A PowerScale P100 Performance Accelerator node adds performance to the workflows on a PowerScale cluster that is composed of CPU-bound nodes. The P100 provides a dedicated cache, separate from the cluster. Adding CPU to the cluster will improve performance where there are read/re-read intensive workloads. The P100 also provides additional network bandwidth to a cluster through the additional front-end ports.

With rapid data growth, organizations are challenged by shrinking backup windows that impact business productivity and the ability to meet IT requirements for tape backup, and compliance archiving. In such an environment, providing fast, efficient, and reliable data protection is essential. Given the 24x7 nature of the business, a high-performance backup solution delivers the performance and scale to address the SLAs of the business. Adding one or more PowerScale B100 backup accelerator nodes to a PowerScale cluster can reduce risk while addressing backup protection needs. 

A PowerScale B100 Backup Accelerator enables backing up a PowerScale cluster using a two-way NDMP protocol. The B100 is delivered in a cost-effective form factor to address the SLA targets and tape backup needs of a wide variety of workloads. Each node includes Fibre Channel ports that can connect directly to a tape subsystem or a Storage Area Network (SAN). The B100 can benefit backup operations as it reduces overhead on the cluster, by going through the Fibre Channel ports directly, thereby separating front-end and NDMP traffic.

The PowerScale P100 and B100 nodes can be monitored using the same tools available today, including the OneFS web administration interface, the OneFS command-line interface, Dell DataIQ, and InsightIQ.

In a world where unstructured data is growing rapidly and taking over the data center, organizations need an enterprise storage solution that provides the flexibility to address the additional performance needs of certain workloads, and that meets the organization’s overall data protection requirements. 

Technical specifications and best practice design considerations for the PowerScale accelerator nodes are covered in the PowerScale product documentation.

Author: Cris Banson


data storage Isilon PowerScale

OneFS & Files Per Directory

Nick Trimbee

Thu, 13 Jan 2022 15:00:46 -0000

|

Read Time: 0 minutes

There have been several recent inquiries from the field asking about low-impact methods to count the number of files in large directories (containing hundreds of thousands to millions of files).

Unfortunately, there’s no ‘silver bullet’ command or data source available that will provide that count instantaneously: Something will have to perform a treewalk to gather these stats.  That said, there are a couple of approaches to this, each with its pros and cons:

  • If the customer has a SmartQuotas license, they can configure an advisory directory quota on the directories they want to check. As mentioned, the first job run will require walking the directory tree, but fast, low-impact reports are available from then on.
  • Another approach is using traditional UNIX commands, either from the OneFS CLI or, less desirably, from a UNIX client. The two following commands will both take time to run:
# ls -f /path/to/directory | wc -l
# find /path/to/directory -type f | wc -l

It’s worth noting that when counting files with ls, you’ll probably get faster results by omitting the ‘-l’ flag and using ‘-f’ flag instead. This is because ‘-l’ resolves UID & GIDs to display users/groups, which creates more work thereby slowing the listing. In contrast, ‘-f’ allows the ‘ls’ command to avoid sorting the output. This should be faster and reduce memory consumption when listing extremely large numbers of files.
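
To see the difference on a given directory, both variants can be timed directly from the OneFS CLI, for example:

# time ls -l /path/to/directory > /dev/null
# time ls -f /path/to/directory > /dev/null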

Ultimately, there really is no quick way to walk a file system and count the files – especially since both ls and find are single threaded commands.  Running either of these in the background with output redirected to a file is probably the best approach.

Depending on your arguments for the ls or find command, you can gather a comprehensive set of context info and metadata on a single pass.

# find /path/to/scan -ls > output.file

It will take quite a while for the command to complete, but once you have the output stashed in a file you can pull all sorts of useful data from it.
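
For example, assuming the default ‘-ls’ output layout (where the seventh column is the file size in bytes), simple one-liners can summarize the results:

# wc -l output.file
# awk '{sum += $7} END {printf "total bytes: %d\n", sum}' output.file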

Assuming a latency of 10ms per file it would take 33 minutes for 200,000 files. While this estimate may be conservative, there are typically multiple protocol ops that need to be done to each file, and they do add up. Plus, as mentioned before, ‘ls’ is a single threaded command.

  • If possible, ensure the directories of interest are stored on a file pool that has at least one of the metadata mirrors on SSD (metadata-read).
  • Windows Explorer can also enumerate the files in a directory tree surprisingly quickly. All you get is a file count, but it can work pretty well.
  • If the directory you wish to know the file count for just happens to be /ifs, you can run the LinCount job, which will tell you how many LINs there are in the file system.

Lincount (relatively) quickly scans the filesystem and returns the total count of LINs (logical inodes). The LIN count is essentially equivalent to the total file and directory count on a cluster. The job itself runs by default at the LOW priority and is the fastest method of determining object count on OneFS, assuming no other job has run to completion.

The following syntax can be used to kick off the Lincount job from the OneFS CLI:

# isi job start lincount

The output from this will be along the lines of “Added job [52]”.

Note: The number in square brackets is the job ID.

To view results, run the following command from the CLI:

# isi job reports view [job ID]

For example:

# isi job reports view 52
LinCount[52] phase 1 (2021-07-06T09:33:33)
------------------------------------------
Elapsed time 1 seconds
Errors 0
Job mode LinCount
LINs traversed 1722
SINs traversed 0

The "LINs traversed" metric indicates that 1722 files and directories were found.

Note: The Lincount job will also include snapshot revisions of LINs in its count.

Alternatively, if another treewalk job has run against the directory you wish to know the count for, you might be in luck.

At any rate, hundreds of thousands of files is a large number to store in one directory. To reduce the directory enumeration time, where possible divide the files up into multiple subdirectories.

When it comes to NFS, the behavior partially depends on whether the client is doing READDIRPLUS operations vs READDIR. READDIRPLUS is useful if the client is going to need the metadata. However, if all you’re trying to do is list the filenames, it actually makes that operation much slower.

If you only read the filenames in the directory, and you don’t attempt to stat any associated metadata, then this requires a relatively small amount of I/O to pull the names from the meta-tree and should be fairly fast.

If this has already been done recently, some or all of the blocks are likely to already be in L2 cache. As such, a subsequent operation won’t need to read from hard disk and will be substantially faster.

NFS is more complicated regarding what it will and won’t cache on the client side, particularly with the attribute cache and the timeouts that are associated with it.

Here are some options from fastest to slowest:

  • If NFS is using READDIR, as opposed to READDIRPLUS, and the ‘ls’ command is invoked with the appropriate arguments to prevent it polling metadata or sorting the output, execution will be relatively swift.
  • If ‘ls’ polls the metadata (or if NFS uses READDIRPLUS) but doesn’t sort the results, output will begin almost immediately, but the listing will take longer to complete overall.
  • If ‘ls’ sorts the output, nothing will be displayed until ls has read everything and sorted it, then you’ll get the output in a deluge at the end.

  

Author: Nick Trimbee

 

data storage Isilon PowerScale

OneFS NFS Netgroups

Nick Trimbee

Thu, 13 Jan 2022 15:17:23 -0000

|

Read Time: 0 minutes

A OneFS network group, or netgroup, defines a network-wide group of hosts and users. As such, netgroups can be used to restrict access to shared NFS filesystems, etc. Network groups are stored in a network information service, such as LDAP, NIS, or NIS+, rather than in a local file. Netgroups help to simplify the identification and management of people and machines for access control.

The isi_netgroup_d service provides netgroup lookups and caching for consumers of the ‘isi_nfs’ library.  Only mountd and the ‘isi nfs’ command-line interface use this service.  The isi_netgroup_d daemon maintains a fast, persistent cluster-coherent cache containing netgroups and netgroup members.  isi_netgroup_d enforces netgroup TTLs and netgroup retries.  A persistent cache database (SQLite) exists to store and recover cache data across reboots.  Communication with isi_netgroup_d is via RPC and it will register its service and port with the local rpcbind.

Within OneFS, the netgroup cache possesses the following gconfig configuration parameters:

# isi_gconfig -t nfs-config | grep cache
shared_config.bypass_netgroup_cache_daemon (bool) = false
netcache_config.nc_ng_expiration (uint32) = 3600000
netcache_config.nc_ng_lifetime (uint32) = 604800
netcache_config.nc_ng_retry_wait (uint32) = 30000
netcache_config.ncdb_busy_timeout (uint32) = 900000
netcache_config.ncdb_write (uint32) = 43200000
netcache_config.nc_max_hosts (uint32) = 200000

Similarly, the following files are used by the isi_netgroup_d daemon:

File                                   Purpose
/var/run/isi_netgroup_d.pid            The pid of the currently running isi_netgroup_d
/ifs/.ifs/modules/nfs/nfs_config.gc    Server configuration file
/ifs/.ifs/modules/nfs/netcache.db      Persistent cache database
/var/log/isi_netgroup_d.log            Log output file

 In general, using IP addresses works better than hostnames for netgroups. This is because hostnames require a DNS lookup and resolution from FQDN to IP address. Using IP addresses directly saves this overhead.

Resolving a large set of hosts in the allow/deny list is significantly faster when using netgroups. Entering a large host list in the NFS export means OneFS has to look up the hosts for each individual NFS export. With netgroups, once a host is looked up it is cached, so it doesn’t have to be looked up again if there is overlap between exports. It is also better to use an LDAP (or NIS) server for netgroups rather than the flat file. If you have a large list of hosts in the netgroups file, it can take a while to resolve, since flat file resolution is single threaded and sequential, whereas LDAP/NIS provider-based netgroup lookups are parallelized.

The OneFS netgroup cache has a default limit in gconfig of 200,000 host entries.

# isi_gconfig -t nfs-config | grep max
netcache_config.nc_max_hosts (uint32) = 200000

So, what is the waiting period between when /etc/netgroup is updated and when the NFS export realizes the change? OneFS uses a netgroup cache, and both its expiration and lifetime are tunable. The netgroup expiration and lifetime can be configured with the following CLI command:

# isi nfs netgroup modify
    --expiration or -e <duration>
        Set the netgroup expiration time.
    --lifetime or -l <duration>
        Set the netgroup lifetime.
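
As a sketch (the accepted duration syntax may vary by release), the following would set a four-hour expiration and a two-day lifetime:

# isi nfs netgroup modify --expiration 4H --lifetime 2D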

OneFS also provides the ‘isi nfs netgroup flush’ CLI command, which can be used to force a reload of the file.

# isi nfs netgroup flush
        [--host <string>]
        [{--verbose | -v}]
        [{--help | -h}]

Options:
    --host <string>
        IP address of the node to flush. Default is all nodes.

Display Options:
    --verbose | -v
        Display more detailed information.
    --help | -h
        Display help for this command.

However, it is not recommended to flush the cache as a part of normal cluster operation. Refresh will walk the file and update the cache as needed.

Another area of caution is applying a netgroup with unresolved hostname(s). This will also slow down resolution of the hosts in the file when a refresh or node startup happens. The best practice is to ensure that each host in the netgroups file is resolvable in DNS, or to just use IP addresses rather than names in the netgroup.

When it comes to switching to a netgroup for clients already on an export, the netgroup can be added and the individual clients removed in one step (for example, on export #1: --add-clients <netgroup> --remove-clients host1,host2,host3). The cluster allows a mix of netgroup and host entries, so duplicates are tolerated. However, it’s worth noting that if there are unresolvable hosts in both areas, the startup resolution time will take that much longer.

 

 

Author: Nick Trimbee

data storage Isilon PowerScale

OneFS Protocol Auditing

Nick Trimbee

Thu, 13 Jan 2022 15:38:26 -0000

|

Read Time: 0 minutes

Auditing can detect potential sources of data loss, fraud, inappropriate entitlements, access attempts that should not occur, and a range of other anomalies that are indicators of risk. This can be especially useful when the audit associates data access with specific user identities.

In the interests of data security, OneFS provides ‘chain of custody’ auditing by logging specific activity on the cluster. This includes OneFS configuration changes plus NFS, SMB, and HDFS client protocol activity, which are required for organizational IT security compliance, as mandated by regulatory bodies like HIPAA, SOX, FISMA, MPAA, etc.

OneFS auditing uses Dell EMC’s Common Event Enabler (CEE) to provide compatibility with external audit applications. A cluster can write audit events across up to five CEE servers per node in a parallel, load-balanced configuration. This allows OneFS to deliver an end-to-end, enterprise-grade audit solution that efficiently integrates with third-party solutions like Varonis DatAdvantage.

OneFS auditing provides control over exactly what protocol activity is audited. For example:

  • Stops collection of unneeded audit events that third-party applications do not register for
  • Reduces the number of audit events collected to only what is needed; fewer unneeded events are stored on /ifs and sent off cluster.

OneFS protocol auditing events are configurable at CEE granularity, with each OneFS event mapping directly to a CEE event. This allows customers to configure protocol auditing to collect only what their auditing application requests, reducing both the number of events discarded by CEE and stored on /ifs.

The ‘isi audit settings’ command syntax and corresponding platform API are used to specify the desired events for the audit filter to collect.

A ‘detail_type’ field within OneFS internal protocol audit events allows a direct mapping to CEE audit events. For example:

"protocol":"SMB2",
"zoneID":1,
"zoneName":"System",
"eventType":"rename",
"detailType":"rename-directory",
"isDirectory":true,
"clientIPAddr":"10.32.xxx.xxx",
"fileName":"\\ifs\\test\\New folder",
"newFileName":"\\ifs\\test\\ABC",
"userSID":"S-1-22-1-0",
"userID":0,
Old audit events are processed and mapped to the same CEE audit events as in previous releases. Backwards compatibility is maintained with previous audit events such that old versions ignore the new field. There are no changes to external audit events sent to CEE or syslog.

  • New default audit events apply when creating an access zone.

Here are the protocol audit events:

New OneFS Audit Event      Pre-8.2 Audit Event
create_file                create
create_directory           create
open_file_write            create
open_file_read             create
open_file_noaccess         create
open_directory             create
close_file_unmodified      close
close_file_modified        close
close_directory            close
delete_file                delete
delete_directory           delete
rename_file                rename
rename_directory           rename
set_security_file          set_security
set_security_directory     set_security
get_security_file          get_security
get_security_directory     get_security
write_file                 write
read_file                  read

Audit Event
logon
logoff
tree_connect

The ‘isi audit settings’ CLI command syntax is as follows:

Usage:
    isi audit <subcommand>

Subcommands:
    settings    Manage settings related to audit configuration.
    topics      Manage audit topics.
    logs        Delete out of date audit logs manually & monitor process.
    progress    Get the audit event time.

All options that take <events> use the protocol audit events:

# isi audit settings view --zone=<zone>
# isi audit settings modify --audit-success=<events> --zone=<zone>
# isi audit settings modify --audit-failure=<events> --zone=<zone>
# isi audit settings modify --syslog-audit-events=<events> --zone=<zone>
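
For example, using event names from the table above, a hypothetical invocation that collects only successful file creates, deletes, and renames in the System zone might look like this:

# isi audit settings modify --audit-success=create_file,delete_file,rename_file --zone=System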

When it comes to troubleshooting audit on a cluster, the ‘isi_audit_viewer’ utility can be used to list protocol audit events collected.

# isi_audit_viewer -h
Usage: isi_audit_viewer [ -n <nodeid> | -t <topic> | -s <starttime>|
         -e <endtime> | -v ]
         -n <nodeid> : Specify node id to browse (default: local node)
         -t <topic>  : Choose topic to browse.
            Topics are "config" and "protocol" (default: "config")
         -s <start>  : Browse audit logs starting at <starttime>
         -e <end>    : Browse audit logs ending at <endtime>
         -v verbose  : Prints out start / end time range before printing records

The new audit event type is in the ‘detail_type’ field. Any errors that are encountered while processing audit events, and while delivering them to an external CEE server, are written to the log file ‘/var/log/isi_audit_cee.log’. Additionally, the protocol-specific logs will contain any issues the audit filter encounters while collecting audit events.

These protocol log files are:

Protocol    Log file
HDFS        /var/log/hdfs.log
NFS         /var/log/nfs.log
SMB         /var/log/lwiod.log
S3          /var/log/s3.log

 

Author: Nick Trimbee

data storage Isilon PowerScale

OneFS Hardware Fault Tolerance

Nick Trimbee

Thu, 13 Jan 2022 15:42:03 -0000

|

Read Time: 0 minutes

There have been several inquiries recently around PowerScale clusters and hardware fault tolerance, above and beyond file level data protection via erasure coding. It seemed like a useful topic for a blog article, so here are some of the techniques which OneFS employs to help protect data against the threat of hardware errors:

File system journal

Every PowerScale node is equipped with a battery backed NVRAM file system journal. Each journal is used by OneFS as stable storage, and guards write transactions against sudden power loss or other catastrophic events. The journal protects the consistency of the file system and the battery charge lasts up to three days. Since each member node of a cluster contains an NVRAM controller, the entire OneFS file system is therefore fully journaled.

Proactive device failure

OneFS will proactively remove, or SmartFail, any drive that reaches a particular threshold of detected Error Correction Code (ECC) errors, and automatically reconstruct the data from that drive and locate it elsewhere on the cluster. Both SmartFail and the subsequent repair process are fully automated and hence require no administrator intervention.

Data integrity

ISI Data Integrity (IDI) is the OneFS process that protects file system structures against corruption via 32-bit CRC checksums. All OneFS blocks, both for file and metadata, utilize checksum verification. Metadata checksums are housed in the metadata blocks themselves, whereas file data checksums are stored as metadata, thereby providing referential integrity. All checksums are recomputed by the initiator, the node servicing a particular read, on every request.

In the event that the recomputed checksum does not match the stored checksum, OneFS will generate a system alert, log the event, retrieve and return the corresponding error correcting code (ECC) block to the client and attempt to repair the suspect data block.

Protocol checksums

In addition to blocks and metadata, OneFS also provides checksum verification for Remote Block Management (RBM) protocol data. The RBM is a unicast, RPC-based protocol used over the back-end cluster interconnect. Checksums on the RBM protocol are in addition to the InfiniBand hardware checksums provided at the network layer, and are used to detect and isolate machines with certain faulty hardware components or exhibiting other failure states.

Dynamic sector repair

OneFS includes a Dynamic Sector Repair (DSR) feature whereby bad disk sectors can be forced by the file system to be rewritten elsewhere. When OneFS fails to read a block during normal operation, DSR is invoked to reconstruct the missing data and write it to either a different location on the drive or to another drive on the node. This is done to ensure that subsequent reads of the block do not fail. DSR is fully automated and completely transparent to the end-user. Disk sector errors and Cyclic Redundancy Check (CRC) mismatches use almost the same mechanism as the drive rebuild process.

MediaScan

MediaScan’s role within OneFS is to check disk sectors and deploy the above DSR mechanism in order to force disk drives to fix any sector ECC errors they may encounter. Implemented as one of the phases of the OneFS job engine, MediaScan is run automatically based on a predefined schedule. Designed as a low-impact, background process, MediaScan is fully distributed and can thereby leverage the benefits of a cluster’s parallel architecture.

IntegrityScan

IntegrityScan, another component of the OneFS job engine, is responsible for examining the entire file system for inconsistencies. It does this by systematically reading every block and verifying its associated checksum. Unlike traditional ‘fsck’ style file system integrity checking tools, IntegrityScan is designed to run while the cluster is fully operational, thereby removing the need for any downtime. In the event that IntegrityScan detects a checksum mismatch, a system alert is generated and written to the syslog and OneFS automatically attempts to repair the suspect block.

The IntegrityScan phase is run manually if the integrity of the file system is ever in doubt. Although this process may take several days to complete, the file system is online and completely available during this time. Additionally, like all phases of the OneFS job engine, IntegrityScan can be prioritized, paused or stopped, depending on the impact to cluster operations and other jobs.

Fault isolation

Because OneFS protects its data at the file-level, any inconsistencies or data loss is isolated to the unavailable or failing device—the rest of the file system remains intact and available.

For example, a ten node, S210 cluster, protected at +2d:1n, sustains three simultaneous drive failures—one in each of three nodes. Even in this degraded state, I/O errors would only occur on the very small subset of data housed on all three of these drives. The remainder of the data striped across the other two hundred and thirty-seven drives would be totally unaffected. Contrast this behavior with a traditional RAID6 system, where losing more than two drives in a RAID-set will render it unusable and necessitate a full restore from backups.

Similarly, in the unlikely event that a portion of the file system does become corrupt (whether as a result of a software or firmware bug, etc.) or a media error occurs where a section of the disk has failed, only the portion of the file system associated with this area on disk will be affected. All healthy areas will still be available and protected.

As mentioned above, referential checksums of both data and meta-data are used to catch silent data corruption (data corruption not associated with hardware failures). The checksums for file data blocks are stored as metadata, outside the actual blocks they reference, and thus provide referential integrity.

Accelerated drive rebuilds

The time that it takes a storage system to rebuild data from a failed disk drive is crucial to the data reliability of that system. With the advent of four terabyte drives, and the creation of increasingly larger single volumes and file systems, typical recovery times for multi-terabyte drive failures are becoming multiple days or even weeks. During this MTTDL period, storage systems are vulnerable to additional drive failures and the resulting data loss and downtime.

Since OneFS is built upon a highly distributed architecture, it’s able to leverage the CPU, memory and spindles from multiple nodes to reconstruct data from failed drives in a highly parallel and efficient manner. Because a PowerScale cluster is not bound by the speed of any particular drive, OneFS is able to recover from drive failures extremely quickly and this efficiency grows relative to cluster size. As such, a failed drive within a cluster will be rebuilt an order of magnitude faster than hardware RAID-based storage devices. Additionally, OneFS has no requirement for dedicated ‘hot-spare’ drives.

Automatic drive firmware updates

Clusters support automatic drive firmware updates for new and replacement drives, as part of the non-disruptive firmware update process. Firmware updates are delivered via drive support packages, which both simplify and streamline the management of existing and new drives across the cluster. This ensures that drive firmware is up to date and mitigates the likelihood of failures due to known drive issues. As such, automatic drive firmware updates are an important component of OneFS’ high availability and non-disruptive operations strategy.

 

 

Author: Nick Trimbee

data storage Isilon PowerScale

OneFS and SMB Encryption

Nick Trimbee

Thu, 13 Jan 2022 15:49:36 -0000

|

Read Time: 0 minutes

Received a couple of recent questions around SMB encryption, which is supported in addition to the other components of the SMB3 protocol dialect that OneFS supports, including multi-channel, continuous availability (CA), and witness.

OneFS allows encryption for SMB3 clients to be configured on a per share, zone, or cluster-wide basis. When configuring encryption at the cluster-wide level, OneFS provides the option to also allow unencrypted connections for older, non-SMB3 clients.

The following CLI command will indicate whether SMB3 encryption has already been configured globally on the cluster:

# isi smb settings global view | grep -i encryption
     Support Smb3 Encryption: No

The following table lists what behavior a variety of Microsoft Windows and Apple Mac OS versions will support with respect to SMB3 encryption:

Operating System             Description
Windows Vista/Server 2008    Can only access non-encrypted shares if cluster is configured to allow non-encrypted connections
Windows 7/Server 2008 R2     Can only access non-encrypted shares if cluster is configured to allow non-encrypted connections
Windows 8/Server 2012        Can access encrypted shares (and non-encrypted shares if cluster is configured to allow non-encrypted connections)
Windows 8.1/Server 2012 R2   Can access encrypted shares (and non-encrypted shares if cluster is configured to allow non-encrypted connections)
Windows 10/Server 2016       Can access encrypted shares (and non-encrypted shares if cluster is configured to allow non-encrypted connections)
OSX 10.12                    Can access encrypted shares (and non-encrypted shares if cluster is configured to allow non-encrypted connections)

 Note that only operating systems which support SMB3 encryption can work with encrypted shares. These operating systems can also work with unencrypted shares, but only if the cluster is configured to allow non-encrypted connections. Other operating systems can access non-encrypted shares only if the cluster is configured to allow non-encrypted connections.

If encryption is enabled for an existing share or zone, and if the cluster is set to only allow encrypted connections, only Windows 8/Server 2012 and later and OSX 10.12 will be able to access that share or zone. Encryption cannot be turned on or off at the client level.

The following CLI procedures will configure SMB3 encryption on a specific share, rather than globally across the cluster:

As a prerequisite, ensure that the cluster and clients are bound and connected to the desired Active Directory domain (for example in this case, ad1.com).

To create a share with SMB3 encryption enabled from the CLI:

# mkdir -p /ifs/smb/data_encrypt
# chmod +a group "AD1\\Domain Users" allow generic_all /ifs/smb/data_encrypt
# isi smb shares create DataEncrypt /ifs/smb/data_encrypt --smb3-encryption-enabled true
# isi smb shares permission modify DataEncrypt --wellknown Everyone -d allow -p full
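
To confirm the share’s encryption setting from the CLI (the exact field name may vary by release), something along these lines can be used:

# isi smb shares view DataEncrypt | grep -i encrypt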

To verify that an SMB3 client session is actually being encrypted, launch a remote desktop protocol (RDP) session to the Windows client, log in as administrator, and perform the following:

  1. Ensure a packet capture and analysis tool such as Wireshark is installed.
  2. Start a Wireshark capture using the capture filter “port 445”.
  3. Map the DataEncrypt share from the second node in the cluster.
  4. Create a file on the desktop of the client (e.g., README-W10.txt).
  5. Copy the README-W10.txt file from the desktop on the client to the DataEncrypt share using Windows explorer.exe.
  6. Stop the Wireshark capture.
  7. Set the Wireshark display filter to “smb2” and the IP address of node 1:
    1. Examine the SMB2_NEGOTIATE packet exchange to verify the capabilities, negotiated contexts, and protocol dialect (3.1.1).
    2. Examine the SMB2_TREE_CONNECT to verify that encryption support has not been enabled for this share.
    3. Examine the SMB2_WRITE requests to ensure that the file contents are readable.
  8. Set the Wireshark display filter to “smb2” and the IP address of node 2:
    1. Examine the SMB2_NEGOTIATE packet exchange to verify the capabilities, negotiated contexts, and protocol dialect (3.1.1).
    2. Examine the SMB2_TREE_CONNECT to verify that encryption support has been enabled for this share.
    3. Examine the communication following the successful SMB2_TREE_CONNECT response to verify that the packets are encrypted.
  9. Save the Wireshark capture to the DataEncrypt share using the name Win10-SMB3EncryptionDemo.pcap.

SMB3 encryption can also be applied globally to a cluster. This will mean that all the SMB communication with the cluster will be encrypted, not just with individual shares. SMB clients that don’t support SMB3 encryption will only be able to connect to the cluster so long as it is configured to allow non-encrypted connections. The following table presents the available global SMB3 encryption config options:

  • Disabled: Encryption for SMBv3 clients is not enabled on this cluster.
  • Enable SMB3 encryption: Permits encrypted SMBv3 client connections to Isilon clusters but does not make encryption mandatory. Unencrypted SMBv3 clients can still connect to the cluster when this option is enabled. Note that this setting does not actively enable SMBv3 encryption: to encrypt SMBv3 client connections to the cluster, you must first select this option and then activate encryption on the client side. This setting applies to all shares in the cluster.
  • Reject unencrypted SMB3 client connections: Makes encryption mandatory for all SMBv3 client connections to the cluster. When this setting is active, only encrypted SMBv3 clients can connect to the cluster; SMBv3 clients that do not have encryption enabled are denied access. This setting applies to all shares in the cluster.

The following CLI syntax will configure global SMB3 encryption:

# isi smb settings global modify --support-smb3-encryption=yes

Verify the global encryption settings on a cluster by running:

# isi smb settings global view | grep -i encrypt
Reject Unencrypted Access: Yes
     Support Smb3 Encryption: Yes

Global SMB3 encryption can also be enabled from the WebUI, by browsing to Protocols > Windows Sharing (SMB) > SMB Server Settings.

 Author: Nick Trimbee

data storage Isilon PowerScale

OneFS File Pool Policies

Nick Trimbee

Thu, 13 Jan 2022 15:56:39 -0000

|

Read Time: 0 minutes

A OneFS file pool policy can be easily generated from either the CLI or WebUI. For example, the following CLI syntax creates a policy which archives older files to a lower storage tier.

# isi filepool policies create ARCHIVE_OLD --description "Move older files to archive storage" --data-storage-target TIER_A --data-ssd-strategy metadata-write --begin-filter --file-type=file --and --birth-time=2021-01-01 --operator=lt --and --accessed-time=2021-09-01 --operator=lt --end-filter

After a file matches a File Pool policy, the SmartPools job uses the settings in the matching policy to store and protect the file. However, a matching policy might not specify all settings for the matched file. In this case, the default policy is used for those settings not specified in the custom policy. For each file stored on a cluster, the system needs to determine the following:

  • Requested protection level
  • Data storage target for local data cache
  • SSD strategy for metadata and data
  • Protection level for local data cache
  • Configuration for snapshots
  • SmartCache setting
  • L3 cache setting
  • Data access pattern
  • CloudPools actions (if any)

 If no File Pool policy matches a file, the default policy specifies all storage settings for the file. The default policy, in effect, matches all files not matched by any other SmartPools policy. For this reason, the default policy is the last in the file pool policy list, and, as such, always the last policy that SmartPools applies.

Next, SmartPools checks the file’s current settings against those the policy would assign to identify those which do not match.  Once SmartPools has the complete list of settings that it needs to apply to that file, it sets them all simultaneously, and moves to restripe that file to reflect any and all changes to Node Pool, protection, SmartCache use, layout, etc.

Custom File Attributes, or user attributes, can be used when more granular control is needed than can be achieved using the standard file attributes options (File Name, Path, File Type, File Size, Modified Time, Create Time, Metadata Change Time, Access Time).  User Attributes use key value pairs to tag files with additional identifying criteria which SmartPools can then use to apply File Pool policies. While SmartPools has no utility to set file attributes, this can be done easily by using the ‘setextattr’ command.

Custom File Attributes are generally used to designate ownership or create project affinities. Once set, they are leveraged by SmartPools just as File Name, File Type or any other file attribute to specify location, protection and performance access for a matching group of files.

For example, the following CLI commands can be used to set and verify the existence of the attribute ‘key1’ with value ‘val1’ on a file ‘attrib.txt’:

# setextattr user key1 val1 attrib.txt
# getextattr user key1 attrib.txt
 file    val1
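
To tag a whole directory tree, ‘setextattr’ can be combined with ‘find’. For example, this sketch (using a hypothetical project path) applies the same key/value pair to every file under the directory:

# find /ifs/data/projectA -type f -exec setextattr user key1 val1 {} +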

A File Pool policy can be crafted to match and act upon a specific custom attribute and/or value.

For example, a file pool policy created via the OneFS WebUI can match files with the custom attribute ‘key1=val1’ and move them to the ‘Archive_1’ tier.


Once a subset of a cluster’s files have been marked with a custom attribute, either manually or as part of a custom application or workflow, they will then be moved to the Archive_1 tier upon the next successful run of the SmartPools job.

The file system explorer (and the ‘isi get -D’ CLI command) provides a detailed view of where SmartPools-managed data is at any time, by both the actual Node Pool location and the File Pool policy-dictated location (i.e. where that file will move after the next successful completion of the SmartPools job).

When data is written to the cluster, SmartPools writes it to a single Node Pool only.  This means that, in almost all cases, a file exists in its entirety within a Node Pool, and not across Node Pools.  SmartPools determines which pool to write to based on one of two situations:

  • If a file matches a file pool policy based on directory path, that file will be written into the Node Pool dictated by the File Pool policy immediately.
  • If a file matches a file pool policy which is based on any other criteria besides path name, SmartPools will write that file to the Node Pool with the most available capacity.

If the file matches a file pool policy that places it on a different Node Pool than the highest capacity Node Pool, it will be moved when the next scheduled SmartPools job runs.

For performance, charge back, ownership or security purposes it is sometimes important to know exactly where a specific file or group of files is on disk at any given time.  While any file in a SmartPools environment typically exists entirely in one Storage Pool, there are exceptions when a single file may be split (usually only on a temporary basis) across two or more Node Pools at one time.

SmartPools generally only allows a file to reside in one Node Pool. A file may temporarily span several Node Pools in some situations.  When a file Pool policy dictates a file move from one Node Pool to another, that file will exist partially on the source Node Pool and partially on the Destination Node Pool until the move is complete.  If the Node Pool configuration is changed (for example, when splitting a Node Pool into two Node Pools) a file may be split across those two new pools until the next scheduled SmartPools job runs.  If a Node Pool fills up and data spills over to another Node Pool so the cluster can continue accepting writes, a file may be split over the intended Node Pool and the default Spillover Node Pool.  The last circumstance under which a file may span more than One Node Pool is for typical restriping activities like cross-Node Pool rebalances or rebuilds.


Author: Nick Trimbee

 

data storage Isilon PowerScale

OneFS Path-based File Pool Policies

Nick Trimbee

Thu, 13 Jan 2022 16:30:42 -0000

|

Read Time: 0 minutes

As we saw in a previous article, when data is written to the cluster, SmartPools determines which pool to write to based on either path or on any other criteria.

If a file matches a file pool policy which is based on any other criteria besides path name, SmartPools will write that file to the Node Pool with the most available capacity.

However, if a file matches a file pool policy based on directory path, that file will be written into the Node Pool dictated by the File Pool policy immediately.

If the file matches a file pool policy that places it on a different Node Pool than the highest capacity Node Pool, it will be moved when the next scheduled SmartPools job runs.

If a filepool policy applies to a directory, any new files written to it will automatically inherit the settings from the parent directory. Typically, there is not much variance between the directory and the new file. So, assuming the settings are correct, the file is written straight to the desired pool or tier, with the appropriate protection, etc. This applies to access protocols like NFS and SMB, as well as copy commands like ‘cp’ issued directly from the OneFS command line interface (CLI). However, if the file settings differ from the parent directory, the SmartPools job will correct them and restripe the file. This will happen when the job next runs, rather than at the time of file creation.

However, a file that is simply moved into the directory (via UNIX CLI commands such as ‘mv’) will not be re-tiered until a SmartPools, SetProtectPlus, MultiScan, or AutoBalance job runs to completion. Since these jobs can each perform a re-layout of data, this is when the files will be re-assigned to the desired pool. The file movement can be verified by running the following command from the OneFS CLI:

# isi get -dD <dir>

So the key is whether you’re doing a copy (that is, a new write) or not. As long as you’re doing writes and the parent directory of the destination has the appropriate file pool policy applied, you should get the behavior you want.

One thing to note: If the actual operation that is desired is really a move rather than a copy, it may be faster to change the file pool policy and then run a recursive ‘isi filepool apply --recurse’ on the affected files.

There’s negligible difference between using an NFS or SMB client versus performing the copy on-cluster via the OneFS CLI. As mentioned above, using isi filepool apply will be slightly quicker than a straight copy and delete, since the copy is parallelized above the filesystem layer.
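For example, a move workflow along these lines (a minimal sketch, using hypothetical paths) first relocates the data and then applies the destination directory’s policy to it:

# mv /ifs/projects/project1 /ifs/path1/project1
# isi filepool apply --recurse /ifs/path1/project1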

A file pool policy may be crafted which dictates that anything written to path /ifs/path1 is automatically placed directly on a particular tier or node pool. This can easily be configured from the OneFS WebUI by navigating to File System > Storage Pools > File Pool Policies:

 

In the example above, a path based policy is created such that data written to /ifs/path1 will automatically be placed on the cluster’s F600 node pool.

For File Pool policies that dictate placement of data based on its path, data typically lands on the correct node pool or tier without a SmartPools job running.  File Pool policies that dictate placement of data based on other attributes besides path name write that data to the Node Pool with the most available capacity, and then move it, if necessary, to match the policy when the next SmartPools job runs.  This ensures that write performance is not sacrificed for initial data placement.

Any data not covered by a File Pool policy is moved to a default tier, which can be selected for exactly this purpose.  If no default has been selected, SmartPools will use the Node Pool with the most available capacity.

Be aware that, when reconfiguring an existing path-based file pool policy to target a different node pool or tier, the change will not immediately take effect for new incoming data. The directory where new files will be created must be updated first, and there are several options available to address this:

  • Running the SmartPools job will achieve this. However, this can take a significant amount of time, as the job may entail restriping or migrating a large quantity of file data.
  • Invoking the ‘isi filepool apply <path>’ command on the single directory in question will update it very rapidly. This option is ideal for a single, or small number of, ‘incoming’ data directories.
  • To update all directories in a given subtree, but not affect the files’ actual data layouts, use:
# isi filepool apply --dont-restripe --recurse /ifs/path1


  • OneFS also contains the SmartPoolsTree job engine job specifically for this purpose. This can be invoked as follows:
# isi job start SmartPoolsTree --directory-only  --path /ifs/path

For example, a cluster has both an F600 pool and an A2000 pool. A directory (/ifs/path1) is created and a file (file1.txt) written to it:

# mkdir /ifs/path1
# cd !$; touch file1.txt

As we can see, this file is written to the default A2000 pool:

# isi get -DD /ifs/path1/file1.txt | grep -i pool
*  Disk pools:         policy any pool group ID -> data target a2000_200tb_800gb-ssd_16gb:97(97), metadata target a2000_200tb_800gb-ssd_16gb:97(97)

Next, a path-based file pool policy named ‘Path1’ is created, such that files written to /ifs/path1 are automatically directed to the cluster’s F600 tier:

# isi filepool policies create Path1 --begin-filter --path=/ifs/path1 --and --data-storage-target f600_30tb-ssd_192gb --end-filter
# isi filepool policies list
Name  Description  CloudPools State
------------------------------------
Path1              No access
------------------------------------    
Total: 1
# isi filepool policies view Path1
Name: Path1
Description:
                   CloudPools State: No access
                CloudPools Details: Policy has no CloudPools actions
                       Apply Order: 1
             File Matching Pattern: Path == path1 (begins with)
          Set Requested Protection: -
               Data Access Pattern: -
                  Enable Coalescer: -
                    Enable Packing: -
               Data Storage Target: f600_30tb-ssd_192gb
                 Data SSD Strategy: metadata
           Snapshot Storage Target: -
             Snapshot SSD Strategy: -
                        Cloud Pool: -
         Cloud Compression Enabled: -
          Cloud Encryption Enabled: -
              Cloud Data Retention: -
Cloud Incremental Backup Retention: -
       Cloud Full Backup Retention: -
               Cloud Accessibility: -
                  Cloud Read Ahead: -
            Cloud Cache Expiration: -
         Cloud Writeback Frequency: -
                                ID: Path1

The ‘isi filepool apply’ command is run on /ifs/path1 in order to activate the path-based file policy:

# isi filepool apply /ifs/path1

A file (file-new1.txt) is then created under /ifs/path1:

# touch /ifs/path1/file-new1.txt

An inspection shows that this file is written to the F600 pool, as expected per the Path1 file pool policy:

# isi get -DD /ifs/path1/file-new1.txt | grep -i pool
*  Disk pools:         policy f600_30tb-ssd_192gb(9) -> data target f600_30tb-ssd_192gb:10(10), metadata target f600_30tb-ssd_192gb:10(10)
 

The legacy file (/ifs/path1/file1.txt) is still on the A2000 pool, despite the path-based policy. However, this policy can be enacted on pre-existing data by running the following:

# isi filepool apply --dont-restripe --recurse /ifs/path1

Now the legacy file’s policy also targets the F600 pool. Note that, because the ‘--dont-restripe’ flag was used, its existing data blocks remain on the A2000s until they are restriped, while any new writes to the /ifs/path1 directory will land directly on the F600s:

# isi get -DD file1.txt | grep -i pool
*  Disk pools:         policy f600_30tb-ssd_192gb(9) -> data target a2000_200tb_800gb-ssd_16gb:97(97), metadata target a2000_200tb_800gb-ssd_16gb:97(97)
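Since ‘--dont-restripe’ updates only the policy attributes, the legacy data blocks are migrated to the F600s when the next SmartPools job performs a restripe. If desired, this can be kicked off manually, for example:

# isi job jobs start SmartPools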

 


Author: Nick Trimbee

Read Full Blog
data storage Isilon PowerScale

PowerScale Gen6 Chassis Hardware Resilience

Nick Trimbee

Thu, 13 Jan 2022 16:48:24 -0000

|

Read Time: 0 minutes

In this article, we’ll take a quick look at the OneFS journal and boot drive mirroring functionality in PowerScale chassis-based hardware:

PowerScale Gen6 platforms, such as the new H700/7000 and A300/3000, store the local filesystem journal and its mirror in the DRAM of the battery-backed compute node blade.  Each 4RU Gen6 chassis houses four nodes, each comprising a ‘compute node blade’ (CPU, memory, NICs) plus drive containers, or sleds.

A node’s file system journal is protected against sudden power loss or hardware failure by OneFS journal vault functionality – otherwise known as ‘powerfail memory persistence’ (PMP). PMP automatically stores both the local journal and the journal mirror on a separate flash drive across both nodes in a node pair:

This journal de-staging process is known as ‘vaulting’, during which the journal is protected by a dedicated battery in each node until it’s safely written from DRAM to SSD on both nodes in a node-pair. With PMP, constant power isn’t required to protect the journal in a degraded state since the journal is saved to M.2 flash and mirrored on the partner node.

So, the mirrored journal is comprised of both hardware and software components, including the following constituent parts:

Journal Hardware Components

  • System DRAM
  • M.2 Vault Flash
  • Battery Backup Unit (BBU)
  • Non-Transparent Bridge (NTB) PCIe link to partner node
  • Clean copy on disk

Journal Software Components

  • Power-fail Memory Persistence (PMP)
  • Mirrored Non-volatile Interface (MNVI)
  • IFS Journal + Node State Block (NSB)
  • Utilities

Asynchronous DRAM Refresh (ADR) preserves RAM contents when the operating system is not running. ADR is important for preserving RAM journal contents across reboots, and it does not require any software coordination to do so.

The journal vault feature encompasses the hardware, firmware, and operating system support that ensure the journal’s contents are preserved across power failure. The mechanism is similar to the NVRAM controller on previous generation nodes but does not use a dedicated PCI card.

On power failure, the PMP vaulting functionality is responsible for copying both the local journal and the local copy of the partner node’s journal to persistent flash. On restoration of power, PMP is responsible for restoring the contents of both journals from flash to RAM and notifying the operating system.

A single dedicated flash device is attached via M.2 slot on the motherboard of the node’s compute module, residing under the battery backup unit (BBU) pack. To be serviced, the entire compute module must be removed.

If the M.2 flash needs to be replaced for any reason, it will be properly partitioned and the PMP structure will be created as part of arming the node for vaulting.

The battery backup unit (BBU), when fully charged, provides enough power to vault both the local and partner journal during a power failure event.

A single battery is utilized in the BBU, which also supports back-to-back vaulting.

On the software side, the journal’s Power-fail Memory Persistence (PMP) provides an equivalent to the NVRAM controller’s vault/restore capabilities to preserve the journal. The PMP partition on the M.2 flash drive provides an interface between the OS and firmware.

If a node boots and its primary journal is found to be invalid for whatever reason, it has three paths for recourse:

  • Recover journal from its M.2 vault.
  • Recover journal from its disk backup copy.
  • Recover journal from its partner node’s mirrored copy.


The mirrored journal must guard against rolling back to a stale copy of the journal on reboot. This necessitates storing information about the state of journal copies outside the journal. As such, the Node State Block (NSB) is a persistent disk block that stores local and remote journal status (clean/dirty, valid/invalid, etc), as well as other non-journal information. NSB stores this node status outside the journal itself and ensures that a node does not revert to a stale copy of the journal upon reboot.

Here’s the detail of an individual node’s compute module:

Of particular note is the ‘journal active’ LED, which is displayed as a white hand icon.

When this white hand icon is illuminated, it indicates that the mirrored journal is actively vaulting, and it is not safe to remove the node!

There is also a blue ‘power’ LED, and a yellow ‘fault’ LED per node. If the blue LED is off, the node may still be in standby mode, in which case it may still be possible to pull debug information from the baseboard management controller (BMC).

The flashing yellow ‘fault’ LED has several state indication frequencies:

Blink Speed     Blink Frequency     Indicator
Fast blink      ¼ Hz                BIOS
Medium blink    1 Hz                Extended POST
Slow blink      4 Hz                Booting OS
Off             Off                 OS running

The mirrored non-volatile interface (MNVI), which sits below /ifs and above RAM and the NTB, provides the abstraction of a reliable memory device to the /ifs journal. MNVI is responsible for synchronizing journal contents to peer node RAM, at the direction of the journal, and persisting writes to both systems while in a paired state. It upcalls into the journal on NTB link events and notifies the journal of operation completion (mirror sync, block IO, etc.).

For example, when rebooting after a power outage, a node automatically loads the MNVI. It then establishes a link with its partner node and synchronizes its journal mirror across the PCIe Non-Transparent Bridge (NTB).

Prior to mounting /ifs, OneFS locates a valid copy of the journal from one of the following locations in order of preference:

Order   Journal Location   Description
1st     Local disk         A local copy that has been backed up to disk
2nd     Local vault        A local copy of the journal restored from vault into DRAM
3rd     Partner node       A mirror copy of the journal from the partner node

If the node was shut down properly, it will boot using a local disk copy of the journal.  The journal will be restored into DRAM and /ifs will mount. On the other hand, if the node suffered a power disruption the journal will be restored into DRAM from the M.2 vault flash instead (the PMP copies the journal into the M.2 vault during a power failure).

In the event that OneFS is unable to locate a valid journal on either the hard drives or M.2 flash on a node, it will retrieve a mirrored copy of the journal from its partner node over the NTB.  This is referred to as ‘Sync-back’.

Note: Sync-back state only occurs when attempting to mount /ifs.

On booting, if a node detects that its journal mirror on the partner node is out of sync (invalid), but the local journal is clean, /ifs will continue to mount.  Subsequent writes are then copied to the remote journal in a process known as ‘sync-forward’.

Here’s a list of the primary journal states:

Journal State   Description
Sync-forward    State in which writes to a journal are mirrored to the partner node.
Sync-back       Journal is copied back from the partner node. Only occurs when attempting to mount /ifs.
Vaulting        Storing a copy of the journal on M.2 flash during power failure. Vaulting is performed by PMP.

During normal operation, writes to the primary journal and its mirror are managed by the MNVI device module, which writes through local memory to the partner node’s journal via the NTB. If the NTB is unavailable for an extended period, write operations can still be completed successfully on each node. For example, if the NTB link goes down in the middle of a write operation, the local journal write operation will complete. Read operations are processed from local memory.

Additional journal protection for Gen6 nodes is provided by OneFS powerfail memory persistence (PMP) functionality, which guards against PCI bus errors that can cause the NTB to fail.  If an error is detected, the CPU requests a ‘persistent reset’, during which the memory state is protected and the node is rebooted. When it is back up again, the journal is marked as intact and no further repair action is needed.

If a node loses power, the hardware notifies the BMC, initiating a memory persistent shutdown.  At this point the node is running on battery power. The node is forced to reboot and load the PMP module, which preserves its local journal and its partner’s mirrored journal by storing them on M.2 flash.  The PMP module then disables the battery and powers itself off.

Once power is back on and the node restarted, the PMP module first restores the journal before attempting to mount /ifs.  Once done, the node then continues through system boot, validating the journal, setting sync-forward or sync-back states, etc.

During boot, isi_checkjournal and isi_testjournal will invoke isi_pmp. If the M.2 vault devices are unformatted, isi_pmp will format the devices.

On clean shutdown, isi_save_journal stashes a backup copy of the /dev/mnv0 device on the root filesystem, just as it does for the NVRAM journals in previous generations of hardware.

If a mirrored journal issue is suspected, or notified via cluster alerts, the best place to start troubleshooting is to take a look at the node’s log events. The journal logs to /var/log/messages, with entries tagged as ‘journal_mirror’.
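For example, a simple way to isolate these entries (assuming the default log location mentioned above):

# grep journal_mirror /var/log/messages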

The following new CELOG events have also been added in OneFS 8.1 for cluster alerting about mirrored journal issues:

CELOG Event                       Description
HW_GEN6_NTB_LINK_OUTAGE           Non-transparent bridge (NTB) PCIe link is unavailable
FILESYS_JOURNAL_VERIFY_FAILURE    No valid journal copy found on node

Another reliability optimization for the Gen6 platform is boot mirroring. Unlike previous generation nodes, Gen6 does not use dedicated bootflash devices. Instead, OneFS boot and other OS partitions are stored on a node’s data drives. These OS partitions are always mirrored (except for crash dump partitions). The two mirrors protect against disk sled removal. Since each drive in a disk sled belongs to a separate disk pool, both elements of a mirror cannot live on the same sled.

The boot and other OS partitions are 8GB and reserved at the beginning of each data drive for boot mirrors. OneFS automatically rebalances these mirrors in anticipation of, and in response to, service events. Mirror rebalancing is triggered by drive events such as suspend, softfail and hard loss.

The following command will confirm that boot mirroring is working as intended:

# isi_mirrorctl verify

When it comes to smartfailing nodes, here are a couple of other things to be aware of with the mirrored journal and the Gen6 platform:

  • When you smartfail a node in a node pair, you do not have to smartfail its partner node.
  • A node will still run indefinitely with its partner missing. However, this significantly increases the window of risk since there’s no journal mirror to rely on (in addition to lack of redundant power supply, etc).
  • If you do smartfail a single node in a pair, the journal is still protected by the vault and powerfail memory persistence.

 

Author: Nick Trimbee

Read Full Blog
PowerScale OneFS

PowerScale Platform Update

Nick Trimbee

Fri, 07 Jan 2022 13:49:28 -0000

|

Read Time: 0 minutes

In this blog, we’ll take a quick peek at the new PowerScale Hybrid H700/7000 and Archive A300/3000 hardware platforms that were released in September 2021. So the current PowerScale platform family hierarchy is as follows:

Here’s the lowdown on the new additions to the hardware portfolio:

Model   Tier             Drives per Chassis         Max Chassis Capacity (16TB HDD)   CPU per Node   Memory per Node   Network
H700    Hybrid/Utility   Standard: 60 x 3.5” HDD    960TB                             2.9GHz, 16c    384GB             FE: 100GbE; BE: 100GbE or IB
H7000   Hybrid/Utility   Deep: 80 x 3.5” HDD        1280TB                            2.9GHz, 16c    384GB             FE: 100GbE; BE: 100GbE or IB
A300    Archive          Standard: 60 x 3.5” HDD    960TB                             1.9GHz, 16c    96GB              FE: 25GbE; BE: 25GbE or IB
A3000   Archive          Deep: 80 x 3.5” HDD        1280TB                            1.9GHz, 16c    96GB              FE: 25GbE; BE: 25GbE or IB

The PowerScale H700 provides performance and value to support demanding file workloads. With up to 960 TB of HDD per chassis, the H700 also includes inline compression and deduplication capabilities to further extend the usable capacity.

The PowerScale H7000 is a versatile, high performance, high capacity hybrid platform with up to 1280 TB per chassis. The deep chassis based H7000 can consolidate a range of file workloads on a single platform. The H7000 includes inline compression and deduplication capabilities.

On the active archive side, the PowerScale A300 combines performance, near-primary accessibility, value, and ease of use. The A300 provides from 120 TB to 960 TB per chassis and scales to 60 PB in a single cluster. The A300 includes inline compression and deduplication capabilities.

The PowerScale A3000 is an ideal solution for high performance, high density, deep archive storage that safeguards data efficiently for long-term retention. The A3000 stores up to 1280 TB per chassis and scales to north of 80 PB in a single cluster. The A3000 also includes inline compression and deduplication.

These new H700/7000 and A300/3000 nodes require OneFS 9.2.1, and can be seamlessly added to an existing cluster, offering the full complement of OneFS data services including snapshots, replication, quotas, analytics, data reduction, load balancing, and local and cloud tiering. In addition to the storage HDDs, all also contain a small quantity of SSD for L3 cache or metadata acceleration.
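Once the new nodes have joined, their pool membership can be confirmed from the OneFS CLI. For example (a quick sanity check; node pool names will vary by cluster):

# isi storagepool nodepools list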

Unlike the all-flash PowerScale F900, F600, and F200 stand-alone nodes, which require a minimum of three nodes to form a cluster, these chassis-based platforms require a single chassis of four nodes to create a cluster, with support for both InfiniBand and Ethernet backend network connectivity.

Each H700/7000 and A300/3000 chassis contains four compute modules (one per node), and five drive containers, or sleds, per node. These sleds occupy bays in the front of each chassis, with a node’s drive sleds stacked vertically:

The drive sled is a tray that slides into the front of the chassis, and contains between three and four 3.5 inch drives in an H700/7000 or A300/3000, depending on the drive size and configuration of the particular node. Both regular hard drives and self-encrypting drives (SEDs) are available in 2, 4, 8, 12, and 16TB capacities.

Each drive sled has a white ‘not safe to remove’ LED on its front top left, as well as a blue power/activity LED, and an amber fault LED.

The compute modules for each node are housed in the rear of the chassis, and contain CPU, memory, networking, and SSDs, as well as power supplies. Nodes 1 and 2 are a node pair, as are nodes 3 and 4. Each node pair shares a mirrored journal and two power supplies:

Here’s the detail of an individual compute module, which contains a multi-core Cascade Lake CPU, memory, M.2 flash journal, up to two SSDs for L3 cache, six DIMM channels, front-end 40/100 or 10/25 Gb Ethernet, back-end 40/100 or 10/25 Gb Ethernet or InfiniBand, an Ethernet management interface, and power supply and cooling fans:

On the front of each chassis is an LCD front panel control with back-lit buttons and four LED Light Bar Segments – one per node. These LEDs typically display blue for normal operation or yellow to indicate a node fault. This LCD display is hinged so it can be swung clear of the drive sleds for non-disruptive HDD replacement:

So, in summary, the new Gen6 hardware delivers:

  • More Power
    • More cores, more memory, and more cache
    • A300/3000 up to 2x faster than previous generation (A200/2000)
  • More Choice
    • 100GbE, 25GbE, and InfiniBand options for cluster interconnect
    • Node compatibility for all hybrid and archive nodes
    • 30 TB to 320 TB per rack unit
  • More Value
    • Inline data reduction across the PowerScale family
    • Lowest $/GB and most density among comparable solutions

Author: Nick Trimbee

Read Full Blog
PowerScale OneFS node exclusion

OneFS Job Execution and Node Exclusion

Nick Trimbee

Thu, 06 Jan 2022 23:26:13 -0000

|

Read Time: 0 minutes

Up through OneFS 9.2, a job engine job was an all or nothing entity. Whenever a job ran, it involved the entire cluster – regardless of individual node type, load, or condition. As such, any nodes that were overloaded or in a degraded state could still impact the execution ability of the job at large.

To address this, OneFS 9.3 provides the capability to exclude one or more nodes from participating in running a job. This allows the temporary removal of any nodes with high load, or other issues, from the job execution pool so that jobs do not become stuck.

The majority of the OneFS job engine’s jobs have no default schedule and are typically manually started by a cluster administrator or process. Other jobs such as FSAnalyze, MediaScan, ShadowStoreDelete, and SmartPools, are normally started via a schedule. The job engine can also initiate certain jobs on its own. For example, if the SnapshotIQ process detects that a snapshot has been marked for deletion, it will automatically queue a SnapshotDelete job.

The Job Engine will also execute jobs in response to certain system event triggers. In the case of a cluster group change, for example the addition or subtraction of a node or drive, OneFS automatically informs the job engine. The coordinator notices that the group change includes a newly smartfailed device and initiates a FlexProtect job in response.

Job administration and execution can be controlled via the WebUI, CLI, or platform API. A job can be started, stopped, paused and resumed, and this is managed via the job engines’ check-pointing system. For each of these control methods, additional administrative security can be configured using roles-based access control (RBAC).

The job engine’s impact control and work throttling mechanism can limit the rate at which individual jobs can run. Throttling is employed at a per-manager process level, so job impact can be managed both granularly and gracefully.

 

Every twenty seconds, the coordinator process gathers cluster CPU and individual disk I/O load data from all the nodes across the cluster. The coordinator uses this information, in combination with the job impact configuration, to decide how many threads can run on each cluster node to service each running job. This can be a fractional number, and fractional thread counts are achieved by having a thread sleep for a given percentage of each second.

Using this CPU and disk I/O load data, every sixty seconds the coordinator evaluates how busy the various nodes are and makes a job throttling decision, instructing the various job engine processes as to the action they need to take. This enables throttling to be sensitive to workloads in which CPU and disk I/O load metrics yield different results. There are also separate load thresholds tailored to the different classes of drives used in OneFS powered clusters, from capacity optimized SATA disks to flash-based SSDs.

Configuration is via the OneFS CLI and gconfig, and is global, such that it applies to all jobs on startup. However, the exclusion configuration is not dynamic: once a job is started with the final node set, no further reconfiguration is permitted. So if a participant node is excluded, it will remain excluded until the job has completed. Similarly, to exclude a participant from a job that is already running, the current job will have to be cancelled and a new job started. Any node can be excluded, including the node running the job engine’s coordinator process. The coordinator will still monitor the job, it just won’t spawn a manager for the job.

The list of participating nodes for a job are computed in three phases:

  1. Query the cluster’s GMP group.
  2. Call to job.get_participating_nodes to get a subset from the gmp group.
  3. Remove the nodes listed in core.excluded_participants from the subset.

The CLI syntax for configuring an excluded nodes list on a cluster is as follows (in this example, excluding nodes one through three):

# isi_gconfig -t job-config core.excluded_participants="{1,2,3}"

The ‘excluded_participants’ are entered as a comma-separated devid value list with no spaces, specified within braces and double quotes. All excluded nodes must be specified in full, since there’s no aggregation. Note that, while the excluded participant configuration will be displayed via gconfig, it is not reported as part of the ‘sysctl efs.gmp.group’ output.
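The current exclusion list can be read back by querying the same gconfig parameter, for example:

# isi_gconfig -t job-config core.excluded_participants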

A job engine node exclusion configuration can be easily reset to avoid excluding any nodes by assigning the "{}" value:

# isi_gconfig -t job-config core.excluded_participants="{}"

A ‘core.excluded_participant_percent_warn’ parameter defines the maximum percentage of removed nodes:

# isi_gconfig -t job-config core.excluded_participant_percent_warn
core.excluded_participant_percent_warn (uint) = 10

This parameter defaults to 10%, above which a CELOG event warning is generated.

As many nodes as desired can be removed from the job group, and a CELOG informational event will report the removed nodes. If too many nodes have been removed (exceeding the threshold set by the gconfig parameter above), CELOG will fire a warning event. If nodes that are not part of the GMP group are excluded, a different warning event will trigger.

If all nodes are removed, a CLI/pAPI error will be returned, the job will fail, and a CELOG warning will fire. For example:

# isi job jobs start LinCount

Job operation failed: The job had no participants left. Check core.excluded_participants setting and make sure there is at least one node to run the job:  Invalid argument

# isi job status

10   LinCount         Failed    2021-10-24T20:45:23

------------------------------------------------------------------

Total: 9

Note, however, that the following core system maintenance jobs will continue to run across all nodes in a cluster even if a node exclusion has been configured:

  • AutoBalance
  • Collect
  • FlexProtect
  • MediaScan
  • MultiScan

Author: Nick Trimbee

Read Full Blog
PowerScale OneFS Google Cloud Dell EMC PowerScale

Setting Up PowerScale for Google Cloud SmartConnect

Lieven Lin

Wed, 29 Dec 2021 17:47:01 -0000

|

Read Time: 0 minutes

In the Dell EMC PowerScale for Google Cloud solution, OneFS uses the cluster service FQDN as its SmartConnect Zone name with a round-robin client-connection balancing policy. The round-robin policy is a default setting and is recommended for most cases in OneFS. (For more details about the OneFS SmartConnect load-balancing policy, see the Load Balancing section of the white paper Dell EMC PowerScale: Network Design Considerations.)

After the cluster is deployed, you must find the OneFS SmartConnect service IP in the clusters page within Google Cloud Console. Then, configure your DNS server to delegate the cluster service FQDN zone to the OneFS Service IP. You need to configure a forwarding rule in Google Cloud DNS which forwards the cluster service FQDN query to the DNS server, and set up a zone delegation on the DNS server that points to the cluster service IP. The following figure shows the DNS query flow by leveraging Google Cloud DNS along with a DNS server in the project.

  1. VM clients send a DNS request for Cluster service FQDN to the Google Cloud DNS service.
  2. Google Cloud DNS forwards the request to the DNS server.
  3. The DNS server forwards the request to the cluster service IP. The service IP is responsible for translating the cluster service FQDN into an available node IP.
  4. SmartConnect returns a node IP to the client. The client can now access cluster data.

Because Google Cloud DNS cannot communicate with the OneFS cluster directly, we use a DNS server that is located in the authorized VPC network to forward the SmartConnect DNS request to the cluster. You can use either a Windows server or a Linux server. In this blog we use a Windows server to show the detailed steps.

Obtain required cluster information

The following information is required before setting up SmartConnect:

  • Cluster service FQDN -- This is the OneFS SmartConnect zone name used by clients.
  • Service IP -- This is the OneFS SmartConnect service IP that is responsible for resolving the client DNS request and returning an available node IP to clients.
  • Authorized network -- By default, only the machines on an authorized VPC network can access a PowerScale cluster.

To obtain this required information, do the following:

  1. In the Google Cloud Console navigation menu, click PowerScale and then click Clusters.
  2. Find your cluster row, where you can see the cluster service FQDN and service IP:

3. To find the authorized network information, click the name of the cluster. From the PowerScale Cluster Details page, find the authorized network from the Network information, highlighted here:

Set up a DNS server

If you already have an available DNS server that is connected to the cluster authorized network, you can use this existing DNS server and skip Step 1 and Step 2 below.

  1. In the Google Cloud Console navigation menu, click Compute Engine and then click VM instances. In this example, we are creating a Windows VM instance as a DNS server. Make sure your DNS server is connected to the cluster authorized network.
  2. Log into the DNS server and install DNS Server Role in the Windows machine. (If you are using a Linux machine, you can use Bind software instead.)
  3. Create a new DNS zone in the DNS server:

4. Create an (A) record for the cluster service IP. (See the section DNS delegation best practices of the white paper Dell EMC PowerScale: Network Design Considerations for more details.)

5. Create a new delegation for your cluster service FQDN (sc-demo.tme.local in this example) and point the delegation server to your cluster service IP (A) record created above (sip-demo.tme.local in this example), as shown here:

Configure Cloud DNS and firewall rules

  1. In the Google Cloud Console navigation menu, click Network services and then click Cloud DNS.
  2. Click the CREATE ZONE button.
  3. Choose the Private zone type and enter your Cluster Service FQDN in the DNS name field. Choose Forward queries to another server and your cluster authorized network, as shown here:

4. Obtain the DNS server IP address that you configured in the ‘Set up a DNS server’ step.

5. Point the destination DNS server to your own DNS server IP address, then click the Create button.

6. Add firewall rules to allow ingress DNS traffic to your DNS server from Cloud DNS. In the Google Cloud Console navigation menu, click VPC network and then click Firewall.

7. Click the CREATE FIREWALL RULE button.

8. Create a new Firewall rule and include the following options:

  • In the Network field, make sure the cluster authorized network is selected.
  • Source filter: IPv4 ranges
  • Source IPv4 ranges: 35.199.192.0/19. This is the IP range Cloud DNS requests will originate from. See Cloud DNS zones overview for more details.
  • Protocols and ports: TCP 53 and UDP 53.

See the following example:

9. The resulting firewall rule in Google Cloud will appear as follows:

Verify your SmartConnect

  1. Log into a VM instance that is connected to an authorized network. (This example uses a Linux machine.)
  2. Resolve the cluster service FQDN via nslookup and mount a file share via NFS.
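For example, from a Linux VM on the authorized network, a minimal verification might look like the following (assuming the example FQDN ‘sc-demo.tme.local’ from earlier and a hypothetical NFS export ‘/ifs/data’):

# nslookup sc-demo.tme.local
# mount -t nfs sc-demo.tme.local:/ifs/data /mnt/demo

Repeating the nslookup should return a different node IP on successive queries, reflecting the round-robin connection balancing policy.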

Conclusion

A PowerScale cluster is a distributed file system composed of multiple nodes. We always recommend using the SmartConnect feature to balance client connections across all cluster nodes. This way, you can maximize PowerScale cluster performance and provide maximum value to your business. Try it now in your Dell EMC PowerScale for Google Cloud solution.

Author: Lieven Lin


Read Full Blog
PowerScale OneFS Google Cloud Gallery SIENNA

Live Broadcast Recording Using OneFS for Google Cloud, Gallery SIENNA ND, and Adobe Premiere Pro

Andy Copeland

Fri, 17 Dec 2021 22:19:09 -0000

|

Read Time: 0 minutes

Here at Dell Technologies, we tested a cloud native real-time NDI ISO feed ingest workflow based on Gallery SIENNA, OneFS, and Adobe Premiere Pro, all running natively in Google Cloud.

TL; DR... it's awesome!

Mark Gilbert (CTO at Gallery SIENNA) had noticed there was a growing demand in the market for highly scalable, enterprise-grade file storage in the public cloud for ISO recording. So, we were excited to test this much-needed solution.

Sure, we could have just spun up a cloud-compute instance, created some SMB shares or NFS exports on it, and off you go. But then you quickly find that your ability to scale becomes an issue.

Arguably, the most critical part of any live broadcast is the bulk recording of ISO feeds, and as camera technology improves, recorded data is growing at an ever-increasing pace. Resolutions are increasing, frame rates are faster and internet connection pipes are getting larger.

This is where OneFS for Google Cloud steps in.

Remote production is now a must rather than a nice-to-have for every studio. The production world has had to adopt it, embrace it and buckle in for the ride. There are some great products out there to help businesses enable remote-production workflows. Gallery SIENNA is one of these products. It enables NDI-from-anywhere workflows that help to reduce utilization on over-contended connections.

You can purchase OneFS for Google Cloud through the Google Cloud Marketplace, attach it to a Gallery SIENNA Processing Engine via NFS export and start recording at the click of a button. In our testing, as soon as the recorders began writing, we were able to open and manipulate the files in Adobe Premiere Pro, which we connected to via SMB to prove out that multi-protocol worked too. This was all while the files were being recorded, and we could expand them in real-time in the timeline as they grow.

Infrastructure components (provisioned in Google Cloud):

  • 1 x OneFS for Google Cloud
  • 1 x Ubuntu VM
    • Running Gallery SIENNA ND Processing Engine
  • 1 x Windows 10 VM
    • NDI Tools
    • Adobe Premiere Pro

We used a SIENNA ND Processing Engine to generate six real-time record feeds: three 3840p60 NDI, and three 1080p30 DNxHD 145.

One of the great benefits of using Gallery SIENNA ND on Google Cloud is that our ingest could have come from anywhere. We could have used any internet-connected device that can reach the Google Cloud instance, be that a static connection in a purpose-built facility or a 4G/5G cell phone camera on the street with the NDI tools on it.

High-level workflow:

  1. Added a Signal Generator node (3840p60) into our SIENNA ND Processing Engine instance
  2. Used the SIENNA ND node-based architecture to add on a timecode burn and frame sync
  3. Added 3 x <NDI Recorder>
  4. Configured the recorders to write out to an NFS export on our OneFS for Google Cloud instance
  5. Added a StreamLink Test node (1080p30) into the same SIENNA ND Processing Engine instance
  6. Added timecode burn and frame sync nodes again
  7. Added 3 x <DNxHD 145 Recorder>
  8. Configured the recorders to write out to the same NFS export on our OneFS for Google Cloud instance (see the mount sketch after this list)
  9. Hit record on all recorders
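For steps 4 and 8, the OneFS export just needs to be mounted on the Ubuntu VM ahead of time. A minimal sketch, assuming a hypothetical cluster FQDN and export path:

# mount -t nfs onefs-demo.example.com:/ifs/records /mnt/records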

Once the record was running, we added a "Media Picker" node and selected one of the files that we were recording. Then, we connected this growing file and one of the frame-sync outputs to a "multiviewer" node. We then watched both the live feed and chase play from disk as it was being laid down.

To cap it off, we also mounted one of the output paths using SMB from a Google Cloud hosted Windows 10 instance running Adobe Premiere Pro, and we were able to import, scrub and expand the files as they grew in real-time, allowing us to chase edit.

To find out more about the Dell Technologies offers for Media and Entertainment, feel free to get in touch by DM, or click here to find one of our experts in your time zone.

See the following links for more information about OneFS for Google Cloud and Gallery SIENNA.

Author: Andy Copeland

 



Read Full Blog
PowerScale OneFS Dell EMC PowerScale data inlining

OneFS Data Inlining – Performance and Monitoring

Nick Trimbee

Tue, 16 Nov 2021 19:57:36 -0000

|

Read Time: 0 minutes

In the second of this series of articles on data inlining, we’ll shift the focus to monitoring and performance.

The storage efficiency potential of inode inlining can be significant for data sets comprising large numbers of small files, which would have required a separate inode and data blocks for housing these files prior to OneFS 9.3.

Latency-wise, the write performance for inlined files is typically comparable to, or slightly better than, that of regular files, because OneFS does not have to allocate extra blocks and protect them. This is also true for reads, because OneFS doesn’t have to search for and retrieve any blocks beyond the inode itself. This also frees up space in the OneFS read caching layers, as well as on disk, in addition to requiring fewer CPU cycles.

The following figure illustrates the levels of indirection a file access request takes to get to its data. Unlike a standard file, an inline file skips the later stages of the path, which involve the inode metatree redirection to the remote data blocks.

Access starts with the Superblock, which is located at multiple fixed block addresses on each drive in the cluster. The Superblock contains the address locations of the LIN Master block, which contains the root of the LIN B+ Tree (LIN table).  The LIN B+ Tree maps logical inode numbers to the actual inode addresses on disk; in the case of an inlined file, the inode also contains the file’s data. This saves the overhead of finding the address locations of the file’s data blocks and retrieving data from them.

For hybrid nodes with sufficient SSD capacity, using the metadata-write SSD strategy automatically places inlined small files on flash media. However, because the SSDs on hybrid nodes default to 512 byte formatting, when using metadata read/write strategies, you must set the ‘--force-8k-inodes’ flag for these SSD metadata pools in order for files to be inlined. This can be a useful performance configuration for small file HPC workloads, such as EDA, for data that is not residing on an all-flash tier. But keep in mind that forcing 8KB inodes on a hybrid pool’s SSDs will considerably reduce the available inode capacity, compared with the default 512 byte inode configuration.

You can use the OneFS ‘isi_drivenum’ CLI command to verify the drive block sizes in a node. For example, the following output for a PowerScale Gen6 H-series node shows drive Bay 1 containing an SSD with 4KB physical formatting and 512byte logical sizes, and Bays A to E comprising hard disks (HDDs) with both 4KB logical and physical formatting.

# isi_drivenum -bz
Bay 1  Physical Block Size: 4096     Logical Block Size:   512
Bay 2  Physical Block Size: N/A     Logical Block Size:   N/A
Bay A0 Physical Block Size: 4096     Logical Block Size:   4096
Bay A1 Physical Block Size: 4096     Logical Block Size:   4096
Bay A2 Physical Block Size: 4096     Logical Block Size:   4096
Bay B0 Physical Block Size: 4096     Logical Block Size:   4096
Bay B1 Physical Block Size: 4096     Logical Block Size:   4096
Bay B2 Physical Block Size: 4096     Logical Block Size:   4096
Bay C0 Physical Block Size: 4096     Logical Block Size:   4096
Bay C1 Physical Block Size: 4096     Logical Block Size:   4096
Bay C2 Physical Block Size: 4096     Logical Block Size:   4096
Bay D0 Physical Block Size: 4096     Logical Block Size:   4096
Bay D1 Physical Block Size: 4096     Logical Block Size:   4096
Bay D2 Physical Block Size: 4096     Logical Block Size:   4096
Bay E0 Physical Block Size: 4096     Logical Block Size:   4096
Bay E1 Physical Block Size: 4096     Logical Block Size:   4096
Bay E2 Physical Block Size: 4096     Logical Block Size:   4096

Note that the SSD disk pools used in PowerScale hybrid nodes that are configured for meta-read or meta-write SSD strategies use 512 byte inodes by default. This can significantly save space on these pools, because they often have limited capacity, but it will prevent data inlining from occurring. By contrast, PowerScale all-flash nodepools are configured by default for 8KB inodes.

The OneFS ‘isi get’ CLI command provides a convenient method to verify which size inodes are in use in a given node pool. The command’s output includes both the inode mirrors size and the inline status of a file.

When it comes to efficiency reporting, OneFS 9.3 provides three improved CLI tools for validating and reporting the presence and benefits of data inlining, namely:

  1. The ‘isi statistics data-reduction’ CLI command has been enhanced to report inlined data metrics, including both a capacity saved and an inlined data efficiency ratio:
# isi statistics data-reduction
                      Recent Writes Cluster Data Reduction
                           (5 mins)
--------------------- ------------- ----------------------
Logical data                 90.16G                 18.05T
Zero-removal saved                0                      -
Deduplication saved           5.25G                624.51G
Compression saved             2.08G                303.46G
Inlined data saved            1.35G                  2.83T
Preprotected physical        82.83G                 14.32T
Protection overhead          13.92G                  2.13T
Protected physical           96.74G                 26.28T
Zero removal ratio         1.00 : 1                      -
Deduplication ratio        1.06 : 1               1.03 : 1
Compression ratio          1.03 : 1               1.02 : 1
Data reduction ratio       1.09 : 1               1.05 : 1
Inlined data ratio         1.02 : 1               1.20 : 1
Efficiency ratio           0.93 : 1               0.69 : 1

Be aware that the effect of data inlining is not included in the data reduction ratio because it is not actually reducing the data in any way – just relocating it and protecting it more efficiently. However, data inlining is included in the overall storage efficiency ratio.

The ‘inline data saved’ value represents the count of files which have been inlined, multiplied by 8KB (inode size).  This value is required to make the compression ratio and data reduction ratio correct.

  2. The ‘isi_cstats’ CLI command now includes the accounted number of inlined files within /ifs, which is displayed by default in its console output.
# isi_cstats
Total files                 : 397234451
Total inlined files         : 379948336
Total directories           : 32380092
Total logical data          : 18471 GB
Total shadowed data         : 624 GB
Total physical data         : 26890 GB
Total reduced data          : 14645 GB
Total protection data       : 2181 GB
Total inode data            : 9748 GB
Current logical data        : 18471 GB
Current shadowed data       : 624 GB
Current physical data       : 26878 GB
Snapshot logical data       : 0 B
Snapshot shadowed data      : 0 B
Snapshot physical data      : 32768 B
Total inlined data savings  : 2899 GB
Total inlined data ratio    : 1.1979 : 1
Total compression savings   : 303 GB
Total compression ratio     : 1.0173 : 1
Total deduplication savings : 624 GB
Total deduplication ratio   : 1.0350 : 1
Total containerized data    : 0 B
Total container efficiency  : 1.0000 : 1
Total data reduction ratio  : 1.0529 : 1
Total storage efficiency    : 0.6869 : 1
Raw counts
{ type=bsin files=3889 lsize=314023936 pblk=1596633 refs=81840315 data=18449 prot=25474 ibyte=23381504 fsize=8351563907072 iblocks=0 }
{ type=csin files=0 lsize=0 pblk=0 refs=0 data=0 prot=0 ibyte=0 fsize=0 iblocks=0 }
{ type=hdir files=32380091 lsize=0 pblk=35537884 refs=0 data=0 prot=0 ibyte=1020737587200 fsize=0 iblocks=0 }
{ type=hfile files=397230562 lsize=19832702476288 pblk=2209730024 refs=81801976 data=1919481750 prot=285828971 ibyte=9446188553728 fsize=17202141701528 iblocks=379948336 }
{ type=sdir files=1 lsize=0 pblk=0 refs=0 data=0 prot=0 ibyte=32768 fsize=0 iblocks=0 }
{ type=sfile files=0 lsize=0 pblk=0 refs=0 data=0 prot=0 ibyte=0 fsize=0 iblocks=0 }

  3. The ‘isi get’ CLI command can be used to determine whether a file has been inlined. The output reports a file’s logical ‘size’, but indicates that it consumes zero physical, data, and protection blocks. There is now also an ‘inlined data’ attribute further down in the output that confirms that the file is inlined.
# isi get -DD file1
* Size:              2
* Physical Blocks:  0
* Phys. Data Blocks: 0
* Protection Blocks: 0
* Logical Size:      8192
PROTECTION GROUPS
* Dynamic Attributes (6 bytes):
*
ATTRIBUTE           OFFSET SIZE
Policy Domains      0      6
INLINED DATA
0,0,0:8192[DIRTY]#1

So, in summary, some considerations and recommended practices for data inlining in OneFS 9.3 include the following:

  • Data inlining is opportunistic and is only supported on node pools with 8KB inodes.
  • No additional software, hardware, or licenses are required for data inlining.
  • There are no CLI or WebUI management controls for data inlining.
  • Data inlining is automatically enabled on applicable nodepools after an upgrade to OneFS 9.3 is committed.
  • However, data inlining occurs only for new writes: OneFS 9.3 does not perform any inlining during the upgrade process, and any applicable small files are instead inlined upon their first subsequent write.
  • Since inode inlining is automatically enabled globally on clusters running OneFS 9.3, OneFS recognizes any diskpools with 512 byte inodes and transparently avoids inlining data on them.
  • In OneFS 9.3, data inlining is not performed on regular files during tiering, truncation, upgrade, and so on.
  • CloudPools SmartLink stubs, sparse files, and writable snapshot files are also not candidates for data inlining in OneFS 9.3.
  • OneFS shadow stores do not use data inlining. As such:
    • Small file packing is disabled for inlined data files.
    • Cloning works as expected with inlined data files.
    • Inlined data files are not deduplicated, and files that have already been deduplicated will not be inlined afterwards.
  • Certain operations may cause data inlining to be reversed, such as moving files from an 8KB diskpool to a 512 byte diskpool, forcefully allocating blocks on a file, sparse punching, and so on.

The new OneFS 9.3 data inlining feature delivers on the promise of small file storage efficiency at scale, providing significant storage cost savings, without sacrificing performance, ease of use, or data protection.

Author: Nick Trimbee

Read Full Blog
PowerScale OneFS Dell EMC PowerScale data inlining

OneFS Small File Data Inlining

Nick Trimbee

Tue, 16 Nov 2021 19:41:09 -0000

|

Read Time: 0 minutes

OneFS 9.3 introduces a new filesystem storage efficiency feature that stores a small file’s data within the inode, rather than allocating additional storage space. The principal benefits of data inlining in OneFS include:

  • Reduced storage capacity utilization for small file datasets, generating an improved cost per TB ratio
  • Dramatically improved SSD wear life
  • Potential read and write performance improvement for small files
  • Zero configuration, adaptive operation, and full transparency at the OneFS file system level
  • Broad compatibility with other OneFS data services, including compression and deduplication

Data inlining explicitly avoids allocation during write operations because small files do not require any data or protection blocks for their storage. Instead, the file content is stored directly in unused space within the file’s inode. This approach is also highly flash media friendly because it significantly reduces the quantity of writes to SSD drives.

OneFS inodes, or index nodes, are a special class of data structure that store file attributes and pointers to file data locations on disk.  They serve a similar purpose to traditional UNIX file system inodes, but also have some additional unique properties. Each file system object, whether it be a file, directory, symbolic link, alternate data stream container, or shadow store, is represented by an inode.

Within OneFS, SSD node pools in F-series all-flash nodes always use 8KB inodes. For hybrid and archive platforms, inodes on the HDD node pools are either 512 bytes or 8KB in size, as determined by the physical and logical block size of the hard drives or SSDs in a node pool.

There are three different styles of drive formatting used in OneFS nodes, depending on the manufacturer’s specifications:

Drive Formatting      Characteristics
Native 4Kn (native)   A native 4Kn drive has both a physical and logical block size of 4096B.
512n (native)         A drive that has both a physical and logical block size of 512B is a native 512B drive.
512e (emulated)       A 512e (512 byte-emulated) drive has a physical block size of 4096B, but a logical block size of 512B.

If the drives in a cluster’s nodepool are native 4Kn formatted, by default the inodes on this nodepool will be 8KB in size.  Alternatively, if the drives are 512e formatted, then inodes by default will be 512B in size. However, they can also be reconfigured to 8KB in size if the ‘force-8k-inodes’ setting is set to true.
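To check how the drives in a node are actually formatted, the ‘isi_drivenum -bz’ CLI command (also shown in the companion monitoring article above) reports each drive’s physical and logical block sizes:

# isi_drivenum -bz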

A OneFS inode is composed of several sections. These include:

  • A static area, which is typically 134 bytes in size and contains fixed-width, commonly used attributes like POSIX mode bits, owner, and file size. 
  • Next, the regular inode contains a metatree cache, which is used to translate a file operation directly into the appropriate protection group. However, for inline inodes, the metatree is no longer required, so data is stored directly in this area instead. 
  • Following this is a preallocated dynamic inode area where the primary attributes, such as OneFS ACLs, protection policies, embedded B+ Tree roots, timestamps, and so on, are cached. 
  • And lastly a sector where the IDI checksum code is stored.

When a file write coming from the writeback cache, or coalescer, is determined to be a candidate for data inlining, it goes through a fast write path in BSW (BAM safe write - the standard OneFS write path). Compression will be applied, if appropriate, before the inline data is written to storage.

The read path for inlined files is similar to that for regular files. However, if the file data is not already available in the caching layers, it is read directly from the inode, rather than from separate disk blocks as with regular files.

Protection for inlined data operates the same way as for other inodes and involves mirroring. OneFS uses mirroring as protection for all metadata because it is simple and does not require the additional processing overhead of erasure coding. The number of inode mirrors is determined by the nodepool’s achieved protection policy, according to the following table:

OneFS Protection Level   Number of Inode Mirrors
+1n                      2 inodes per file
+2d:1n                   3 inodes per file
+2n                      3 inodes per file
+3d:1n                   4 inodes per file
+3d:1n1d                 4 inodes per file
+3n                      4 inodes per file
+4d:1n                   5 inodes per file
+4d:2n                   5 inodes per file
+4n                      5 inodes per file

Unlike file inodes above, directory inodes, which comprise the OneFS single namespace, are mirrored at one level higher than the achieved protection policy. The root of the LIN Tree is the most critical metadata type and is always mirrored at 8x.

Data inlining is automatically enabled by default on all 8KB formatted nodepools for clusters running OneFS 9.3, and does not require any additional software, hardware, or product licenses in order to operate. Its operation is fully transparent and, as such, there are no OneFS CLI or WebUI controls to configure or manage inlining.

In order to upgrade to OneFS 9.3 and benefit from data inlining, the cluster must be running OneFS 8.2.1 or later. A full upgrade commit to OneFS 9.3 is required before inlining becomes operational.

Be aware that data inlining in OneFS 9.3 does have some notable caveats. Specifically, data inlining will not be performed in the following instances:

  • When upgrading to OneFS 9.3 from an earlier release which does not support inlining
  • During restriping operations, such as SmartPools tiering, when data is moved from a 512 byte diskpool to an 8KB diskpool
  • Writing CloudPools SmartLink stub files
  • On file truncation down to non-zero size
  • Sparse files (for example, NDMP sparse punch files) where allocated blocks are replaced with sparse blocks at various file offsets
  • For files within a writable snapshot

Similarly, in OneFS 9.3 the following operations may cause data inlining to be undone, or ‘spilled’:

  • Restriping from an 8KB diskpool to a 512 byte diskpool
  • Forcefully allocating blocks on a file (for example, using the POSIX ‘madvise’ system call)
  • Sparse punching a file
  • Enabling CloudPools BCM (BAM cache manager) on a file

These caveats will be addressed in a future release.

Author: Nick Trimbee


Read Full Blog
Isilon PowerScale OneFS NFS RDMA Dell EMC PowerScale Media and Entertainment 8K

Boosting Storage Performance for Media and Entertainment with RDMA

Gregory Shiff

Tue, 02 Nov 2021 16:40:24 -0000

|

Read Time: 0 minutes

We are in a new golden era of content creation. The explosion of streaming services has brought an unprecedented volume of new and amazing media. Production, post-production, visual effects, animation, finishing: everyone is booked solid with work. And the expectations for this content are higher than ever, with new technically challenging formats becoming the norm rather than the exception. Anyone who has had to work with this content knows that even in 2021, working natively with 8K video or high frame rate 4K video is no joke.  

During post, storage and workstation performance can be huge bottlenecks. These bottlenecks can be particularly painful for “hero” seats that are tasked with working in real time with uncompressed media.

So, let’s look at a new PowerScale OneFS 9.2 feature that can improve storage and workstation performance simultaneously. That technology is Remote Direct Memory Access (RDMA), and specifically NFS over RDMA.

Why NFS? Linux is still the operating system of choice for the applications that media professionals use to work with the most challenging media. Even if those applications have Windows or macOS variants, the Linux version is what is used in the truly high-end. And the native way for a Linux computer to access network storage is NFS. In particular, NFS over TCP.

Already this article is going down a rabbit hole of acronyms! I imagine that most people reading are already familiar with NFS (and SMB) and TCP (and UDP) and on and on. For the benefit of those folks who are not, NFS stands for Network File System. NFS is how Linux systems talk to network storage (there are other ways, but mostly, it is NFS). NFS traffic sits on top of other lower-level network protocols, in particular TCP (or UDP, but mostly it is TCP). TCP does a great job of handling things like packet loss on congested networks, but that comes with performance implications. Back to RDMA.

As the name implies, RDMA is a protocol that allows for a client system to copy data from a storage server’s memory directly into that client’s own memory. And in doing so, the client system bypasses many of the buffering layers inherent in TCP. This direct communication improves storage throughput and reduces latency in moving data between server and client. It also reduces CPU load on both the client and storage server.

RDMA was developed in the 1990s to support high performance compute workloads running over InfiniBand networks. In the 2000s, two methods of running RDMA over Ethernet networks were developed: iWARP and RoCE. Without going into too much detail, iWARP uses TCP for RDMA communications and RoCE uses UDP. There are various benefits and drawbacks of these two approaches. iWARP’s reliance on TCP allows for greater flexibility in network design, but suffers from many of the same performance drawbacks of native TCP communications. RoCE has reduced CPU overhead as compared to iWARP, but requires a lossless network. Once again, without going into too much detail, RoCE is the clear winner given that we are looking for the maximum storage performance with the lowest CPU load. And that is exactly what PowerScale OneFS uses, RoCE (actually RoCEv2, also known as Routable RoCE or RRoCE).

So, put that all together, and you can run NFS traffic over RDMA leveraging RoCE! Yes, back into alphabet soup land. But what this means is that if your environment and PowerScale storage nodes support it, you can massively boost performance by mounting the network storage with a few mount options. And that is a neat trick. The performance gains of RDMA are impressive. In some cases, RDMA is twice as performant as TCP, all other things being equal (with a similar drop in workstation utilization).

A good place to start learning if your PowerScale nodes support RDMA is my colleague Nick Trimbee’s excellent blog: Unstructured Data Tips.

Let’s bring this back to media creation and look at some real-world examples that were tested for this article. The first example is playing an uncompressed 8K DPX image sequence in DaVinci Resolve. Uncompressed video puts less of a strain on the workstation (no real-time decompression), but the file sizes and bandwidth requirements are huge. As an image sequence, each frame of video is a separate file, and at 8K resolution each file was approximately 190 MB. Sustaining 24 frames per second playback requires about 4.5 GB per second! Long story short, the image sequence would not play with the storage mounted using TCP. Mounting the exact same storage using RDMA was a night and day difference: 8K video at 24 frames per second in Resolve over the network.

Now let’s look at workstation performance, because to be fair, uncompressed 8K video is unwieldy to store or work with. The number of facilities truly working in uncompressed 8K is small. Far more common is a format such as 6K PIZ-compressed OpenEXR. OpenEXR is another image sequence format (file per frame), and PIZ compression is lossless, retaining full image fidelity. The PIZ-compressed image sequence used here had frames between 80 MB and 110 MB each, so sustaining 24 frames per second requires around 2.7 GB per second. This bandwidth is less than uncompressed 8K but still substantial. However, the real challenge is that the workstation needs to decompress each frame of video as it is being read. Pulling the 6K image sequence into DaVinci Resolve and attempting playback over the network storage mounted using TCP did not work: the combination of CPU cycles required for reading the files over network storage and decoding each 6K frame was too much. RDMA was the key for this kind of playback. Sure enough, remounting the storage using RDMA enabled smooth playback of this OpenEXR 6K PIZ image sequence over the network in Resolve.
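The bandwidth arithmetic behind these examples is simple: required throughput is frame size multiplied by frame rate. Using the frame sizes quoted above:

# Required playback bandwidth = frame size x frame rate
echo "8K DPX: $(( 190 * 24 )) MB/s"   # 4560 MB/s, roughly 4.5 GB/s
echo "6K EXR: $(( 110 * 24 )) MB/s"   # 2640 MB/s at the largest frame size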

Going a little deeper with workstation performance, let us look at some other common video formats: Sony XAVC and Apple ProRes 422 HQ, both at full 4K DCI resolution and 59.94 frames per second. This time, Autodesk Flame 2022 is used as the playback application. Flame has a debug mode that shows video disk dropped frames, GPU dropped frames, and broadcast output dropped frames. With the file system mounted using TCP or RDMA, the video disk never dropped a frame.

The storage is plenty fast enough. However, with the file system mounted using TCP, the broadcast output dropped thousands of frames, and the workstation could not keep up. Playing back the material over RDMA was a different story: smooth broadcast output and essentially no dropped frames at all. In this case, it was all about the CPU cycles freed up by RDMA.

NFS over RDMA is a big deal for PowerScale OneFS environments supporting the highest end playback. The twin benefits of storage performance and workstation CPU savings change what is possible with network storage. For more specifics about the storage environment, the tests run, and how to leverage NFS over RDMA, see my detailed white paper PowerScale OneFS: NFS over RDMA for Media.

Author: Gregory Shiff, Principal Solutions Architect, Media & Entertainment    LinkedIn

Read Full Blog
Isilon data protection security PowerScale OneFS Dell EMC PowerScale

PowerScale OneFS Release 9.3 now supports Secure Boot

Aqib Kazi

Fri, 22 Oct 2021 20:50:20 -0000

|

Read Time: 0 minutes

Many organizations are looking for ways to further secure systems and processes in today's complex security environments. The grim reality is that a device is typically most susceptible to loading malware during its boot sequence.

With the introduction of OneFS 9.3, the UEFI Secure Boot feature is now supported on Isilon A2000 nodes. Not only does the release support the UEFI Secure Boot feature, but OneFS goes a step further by adding FreeBSD’s signature validation. Combining UEFI Secure Boot and FreeBSD’s signature validation helps protect the boot process from potential malware attacks.

The Unified Extensible Firmware Interface (UEFI) Forum standardizes and secures the boot sequence across devices with the UEFI specification. UEFI Secure Boot was introduced in UEFI 2.3.1, allowing only authorized EFI binaries to load.

FreeBSD’s veriexec function is used to perform signature validation for the boot loader and kernel. In addition, the PowerScale Secure Boot feature runs during the node’s bootup process only, using public-key cryptography to verify the signed code, to ensure that only trusted code is loaded on the node.

The Secure Boot feature does not impact cluster performance because the feature is only executed at bootup.

Prerequisites

The OneFS Secure Boot feature is only supported on Isilon A2000 nodes at this time. The cluster must be upgraded and committed to OneFS 9.3. After the release is committed, proceed with upgrading the Node Firmware Package to 11.3 or higher.

Considerations

PowerScale nodes are not shipped with the Secure Boot feature enabled. The feature must be enabled manually on each node in a cluster. A mixed cluster is supported, where some nodes have the Secure Boot feature enabled and others have it disabled.

A license is not required for the PowerScale Secure Boot feature. The Secure Boot feature can be enabled and disabled at any point, but it requires a maintenance window to reboot the node.

Configuration

You can use IPMI or the BIOS to enable the PowerScale Secure Boot feature, but disabling the feature requires using the BIOS.

For more information about the PowerScale Secure Boot feature, and detailed configuration steps, see the Dell EMC PowerScale OneFS Secure Boot white paper.

For more great information about PowerScale, see the PowerScale Info Hub at: https://infohub.delltechnologies.com/t/powerscale-isilon-1/.

 

Author: Aqib Kazi 

Read Full Blog
automotive PowerScale Dell EMC PowerScale

Dell EMC PowerScale for Developing Autonomous Driving Vehicles

Frances Weiyi Hu

Fri, 22 Oct 2021 16:40:39 -0000

|

Read Time: 0 minutes

Competition in the era of autonomy

The automotive industry is in a highly competitive transitional period where success is not about winning, it’s about survival. Once an industry of pure hardware and adrenaline, automotive design is increasingly dependent upon, and differentiated by, software. This is especially true for Advanced Driver Assistance Systems (ADAS) development, which is introducing disruptive requirements on engineering IT Infrastructure – particularly storage, where even entry level capacities are measured in petabytes. 

SAE International, formerly known as the Society of Automotive Engineers, has defined different levels of automation. Most modern cars today have features that are at level 2-3. Today’s SAE level 3 ADAS projects have already outstripped legacy storage solutions, and with level 4 and 5 projects around the corner, the need for storage solutions optimized for high performance, high concurrency, and massive scalability has never been greater.

The value of data

As ADAS solutions evolve from simple collision-avoidance systems to fully autonomous vehicles, these systems will combine complex sensing, processing, and algorithmic technologies. This vehicle-generated data is a critical component to improving ADAS systems, feeding into integrated test and development cycles (or development tool chains) for these systems. 

In addition to vehicular data, ADAS Test and Development systems in the next three to five years will also rely on inputs from infrastructure to support the growing scale of data movement, computing, and storing required between the vehicles, across the edge, through the cloud, and within on-premises environments. Such data will support ongoing modifications to current simulation, SiL (Software in-the-Loop), and HiL (Hardware-in-the-Loop) testing to improve the reliability of services after deployment.

The following figure illustrates the typical ADAS development life cycle for automotive OEMs and suppliers leveraging the Dell EMC PowerScale scale-out NAS as the central data lake with our Data Management Systems (DMS) for ADAS:

Scaling and evolving ADAS systems will require a seamless data management process and IT infrastructure that is flexible enough to handle challenges such as:

  • Future-proofing ADAS simulation and architecture, to adapt to changes in vehicle sensors and other environmental data points.
  • Managing data storage to comply with regulatory and privacy requirements, while addressing performance, security, and accessibility needs. 
  • Analyzing massive volumes of unstructured data sets, to support analytical modelling and querying of ADAS data. This requires costly and time-consuming data preparation steps, such as labeling data for analysis.
  • Retaining most sensor data for decades with the ability to restore it quickly, which requires long-term archives. 

Ideally architected for ADAS development and certification, Dell EMC PowerScale provides the scalability, performance, parallelism, and easy-to-use management tools to help OEMs and Tier-1 suppliers accelerate their ADAS projects. PowerScale supports simultaneous ingest from thousands of concurrent streams from around the globe, provides simultaneous access for Model-in-the-Loop (MiL), Hardware-in-the-Loop (HiL), and Software-in-the-Loop (SiL) testing and deep learning/AI, and includes archive options to meet regulatory resimulation SLAs.

Accelerate and scale your ADAS/AD development success

The data management and computational demands underpinning the ADAS/AD (autonomous driving) test and dev environment are substantial and require solutions that can scale to accommodate complex, exponentially growing ADAS/AD data sets. Essential to helping ADAS/AD development teams unlock the data and create value is a high-performance, high-capacity platform that can provide the following:

  • A consistent, high throughput solution to ingest data from test vehicles while simultaneously delivering the test data into hundreds to thousands of concurrent streams to MiL/SiL/HiL servers, test stands, and even deep learning training. It must also scale performance near-linearly, so performance isn’t degraded as capacity is added—critical for ADAS development where sensor data ingest rates of 2 PB+ per week are becoming common.
  • A high performance and predictable storage solution that will scale and manage ADAS/AD data sets and workloads as they grow centrally and regionally. Essential elements of the platform include an expandable single namespace, eliminating data silos by consolidating all globally collected ADAS/AD data; automated plug and play hardware detection and expansion that won’t disrupt ongoing projects; automated policy-based tiering to reduce file server sprawl and performance bottlenecks; and file-object orchestration and encryption that will allow data movement between high performance network-attached storage and lower-cost private and public cloud options.
  • Distributed deep learning frameworks are core to unlocking data capital and foundational to ADAS and AD development. Because deep learning models are very complex and large, developers can benefit from using a deep learning framework — an interface, library, or tool that allows them to leverage deep learning easily and quickly, without requiring in-depth understanding of the underlying algorithms. These frameworks provide a clear and concise way for defining models using a collection of pre-built and pre-optimized components. Essential characteristics of well-designed deep learning frameworks, such as TensorFlow, Keras, PyTorch, and Caffe, include optimization for GPU performance, easy-to-understand code, extensive community support, process parallelization to reduce computation cycles, and automatically computed gradients.
  • An optimized and scalable accelerator-based platform that has the capacity and ability to run AI in place as well as deep learning (training) and MiL/HiL/SiL workloads. Engineers and data scientists continuously manage massive data sets and compute-intensive workloads to run their ADAS/AD test and dev operations across departments and around the globe. A large capacity and distributed GPU-based compute and storage infrastructure gives development teams the ability to rapidly build, train, and deploy test cases and AI models, predictive analytics, simulations, and re-simulations.

Top reasons to choose Dell EMC PowerScale for ADAS/AD

Small footprint, big performance for the edge 

PowerScale F200 and F600 are new small-scale all-flash nodes offering high throughput for small deployment scenarios, such as on-prem data caching (required when streaming data from public cloud for Hardware-in-the-Loop (HiL) testing) or regional sensor data ingest stations. These low-cost nodes can be added to existing PowerScale/Isilon clusters, making it simple to expand with high performance.

Massive scalability for the data center

AD/ADAS datasets are growing exponentially, with requirements ranging from petabytes to exabytes of data. Dell EMC PowerScale scales as your needs grow so you can invest in infrastructure that fits your current ADAS storage requirements without overbuying performance or capacity. Scalable to tens of petabytes (PB) in a single cluster, PowerScale offers truly scalable performance and an ever-expanding single namespace that eliminates data silos by consolidating all globally collected ADAS/AD data. Tools like CloudPools take this scalability into the exabyte (EB) range, allowing data to be moved between the high-performance NAS and multiple lower-cost storage options like Dell EMC ECS object storage.

Throughput to accelerate ADAS/AD time-to-market 

PowerScale delivers the consistent, high throughput required to concurrently deliver test data into hundreds to thousands of MiL/SiL/HiL servers, test stands, and Deep Learning networks simultaneously. Multiple node types can be deployed within a single cluster, so you can deploy the storage infrastructure that meets your exact needs from high performance all-flash for AI to low-cost SATA for long term archiving. PowerScale also scales performance linearly as additional capacity is added to the cluster – critical for ADAS development where sensor data ingest rates of 2PB+ per week are becoming common.

Maintain sensor data compliance 

Most ADAS projects face strict requirements for data compliance and retention, including data privacy, physical media security, and even service level agreements (SLAs) that dictate retention of petabytes to exabytes of data for decades, with as little as a few days’ notice for full data restoration. Policy-based SmartPools and CloudPools alleviate SLA challenges by automatically tiering data to lower cost storage for long-term retention, and to higher-performance storage for revalidation. Keeping sensor test and verification data within easy reach avoids the “mad-dash” to restore large data sets from archive in the case of a defect, safety recall, or audit. The necessary data remains directly accessible within the PowerScale and ECS storage infrastructure. To protect sensitive sensor data, CloudPools fully encrypts data before offloading it to the target, which can include your own on-premises Dell EMC ECS object storage and third-party providers.

Debug designs faster 

The PowerScale OneFS operating system includes native multi-protocol support so workflows can quickly access data stored on a single cluster, eliminating the need for additional data movement. OneFS offers simultaneous access to all PowerScale nodes for a mix of AD/ADAS workloads from data ingest, MiL, SiL, and HiL testing, to Deep Learning using TensorFlow. OneFS also supports data enrichment with access to on-line databases for weather, GPS location queries, road surface types, and so on. In-place analytics for sensor data and simulation results eliminates the time and expense of moving large data sets between file and other storage solutions typically required. Multi-protocol support includes NFS, SMB, HDFS, SWIFT, HTTP, REST, and others. OneFS also supports S3, an essential protocol for cloud native applications. PowerScale easily integrates with the Dell EMC streaming data platform, offering insights on real-time and historical sensor data.
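As a small illustration of that multi-protocol access, sensor data written over NFS can be listed with any S3 client (the bucket name is hypothetical, 9021 is the default OneFS HTTPS port for S3, and S3 access keys are assumed to have been configured on the cluster):

aws s3 ls s3://adas-sensor-data/ --endpoint-url https://cluster.example.com:9021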

Dell complete solutions: a complete autonomous driving data lake reference architecture

Our new Dell Autonomous Drive ecosystem supports the most important steps in the ADAS/AD data process. Developed in conjunction with leading industry and technology partners, Dell Autonomous Drive combines Dell Technologies and partner infrastructure, software, and services to offer a complete end-to-end toolchain.


Author: Frances Weiyi Hu  LinkedIn


Read Full Blog
PowerScale SMB OneFS

Announcing Drain-based Nondisruptive Upgrades (NDUs)

Vincent Shen

Thu, 16 Sep 2021 17:43:39 -0000

|

Read Time: 0 minutes

During an NDU workflow, nodes are rebooted or protocol services must be stopped temporarily. Until now, this has meant a disruption for the clients connected to the rebooting node.

A drain-based NDU provides a mechanism by which nodes are prevented from rebooting or restarting protocol services until all SMB clients have disconnected from the node. Because a single SMB client that never disconnects could delay the upgrade indefinitely, the user is given options to reboot the node even while clients remain connected.

A drain-based upgrade supports the following scenarios and is available for WebUI, CLI, and PAPI:

  • SMB protocol
  • OneFS upgrades
  • Firmware upgrades
  • Cluster reboots
  • Combined upgrades (OneFS and firmware)

A drain-based upgrade is built upon a parallel upgrade workflow, introduced in OneFS 8.2.2.0, that offers parallel node upgrade and reboot activity across node neighborhoods. It upgrades at most one node per neighborhood at any time. By doing that, it can shorten upgrade time and ensure that the end-user can continue to have access to their data. The more node neighborhoods within a cluster, the more parallel activity can occur.

Figure 1 shows how this works. In this example, there are two neighborhoods in a 6-node PowerScale cluster: Nodes 1 through 3 belong to Neighborhood 1, and Nodes 4 through 6 belong to Neighborhood 2.

Figure 1: An example of Drain based NDU

You can use the following command to identify the correlation between your PowerScale nodes and neighborhoods (failure domains):

# sysctl efs.lin.lock.initiator.coordinator_weights

Once the drain-based upgrade starts, at most one node from each neighborhood gets the reservation that allows it to upgrade, so those nodes can upgrade simultaneously. OneFS will not reboot a reserved node until its SMB client count reaches “0”. In this example, Node 3 and Node 4 get the reservation for upgrading at the same time. However, Node 3 has one SMB connection and Node 4 has two; neither can reboot until its SMB connection count drops to “0”. At this stage, there are three options:

  • Wait - Wait until the number of SMB connections reaches “0” or it hits the drain timeout value. The drain timeout value is the configurable parameter for each upgrade process. It is the maximum waiting period. If drain timeout is set to “0”, it means wait forever.
  • Delay drain - Add the node to the delay list to delay client draining. The upgrade process will continue on another node in this neighborhood. After all the non-delayed nodes are upgraded, OneFS will return to the node in the delay list.
  • Skip drain - Stop waiting for clients to migrate away from the draining node and reboot immediately.

To run the drain-based NDU, follow these steps:

1. In the OneFS CLI, run the following command to perform the drain-based upgrade. In this example, the drain timeout is set to 60 minutes and the alert timeout to 45 minutes, meaning that if connections remain after 45 minutes, a CELOG alert is sent to the administrator.

# isi upgrade start --parallel --skip-optional --install-image-path=/ifs/data/<installation-file-name> --drain-timeout=60m --alert-timeout=45m

When the draining service detects an active SMB connection between a client and the node, it waits for further action (wait, delay, or skip) from the end user.

2. In the OneFS WebUI, navigate to Upgrade under Cluster management. In this window, you will see the node waiting for clients to drain. You can either specify Skip or Delay. In this case, Skip is selected, as shown in Figure 2. In the prompt window, click the Skip button to skip draining.

Figure 2. Skip the draining clients

Conclusion

Drain-based NDU can minimize the business impact during the OneFS upgrade process by allowing you to control how and when clients disconnect from the PowerScale cluster. This new feature can significantly improve the user experience and business continuity.

Author: Vincent Shen



Read Full Blog
PowerScale OneFS NFS RDMA

Accelerating your Network File System (NFS) Workloads with RDMA

Lieven Lin

Tue, 14 Sep 2021 15:50:44 -0000

|

Read Time: 0 minutes

The NFS protocol is widely used for NAS storage in today's datacenters. It was originally designed for storing and managing data centrally and sharing it across networks. As technologies have evolved, many organizations have come to rely on NFS for critical production workloads.

NFS is usually implemented over TCP. With the emergence of higher-speed Ethernet and heavier application workloads in datacenters, transferring an ever-increasing volume of data quickly is critical to organizations. The industry has been pursuing new ways to improve NFS protocol performance and adapt to emerging workloads, one of which is running NFS over Remote Direct Memory Access (RDMA).

RDMA enables accessing memory data on a remote machine without passing the data through the CPUs on the system. RDMA therefore enables data to be transferred between storage and clients with higher throughput and lower CPU usage. NFS over RDMA, as defined in RFC8267, uses the advantages of RDMA. Starting with OneFS 9.2.0, OneFS supports NFSv3 over RDMA based on the ROCEv2 (also known as Routable RoCE or RRoCE) network protocol.

To evaluate the improvements and advantages of NFSv3 over RDMA, as compared to NFSv3 over TCP, we ran some FIO sequential read tests, and observed the throughput and CPU usage under different thread counts. The following figure shows the test environment topology and resource configuration.

 

                  Cluster nodes                                Clients
Quantity          48-node cluster                              10
OS Version        OneFS 9.2.1.0                                CentOS Linux release 8.3.2011
Model             F600                                         Dell PowerEdge C4140
Network device    2 * MT28800 Family [ConnectX-5 Ex] * 100GE   2 * MT28908 Family [ConnectX-6] * 100GE
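The exact FIO job parameters are not listed here, but a representative sequential-read job against an NFS mount (paths, sizes, and thread counts are illustrative) would look something like:

fio --name=seqread --rw=read --bs=512k --size=16g --numjobs=8 \
    --directory=/mnt/ifs_rdma --direct=1 --ioengine=libaio --group_reporting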

The following chart shows the throughput comparison for RDMA vs. TCP. We found that NFSv3 over RDMA delivers higher throughput than NFSv3 over TCP. (Note: Because 10 test clients cannot overload a 48-node F600 cluster, the throughput number is only used for RDMA and TCP comparison and does not represent the maximum cluster performance.)

The following chart shows the clients’ CPU usage comparison for RDMA vs. TCP. We found that clients consume fewer CPU resources when using NFSv3 over RDMA.

Conclusion

The performance improvement of NFSv3 over RDMA, as compared to NFSv3 over TCP, varies with the client thread count. Overall, NFSv3 over RDMA delivers higher throughput while significantly reducing client CPU overhead. Sequential workloads and CPU-intensive workloads can therefore benefit from using NFSv3 over RDMA on OneFS.

Author: Lieven Lin, LinkedIn



Read Full Blog
PowerScale OneFS protection levels

Unstructured Data Quick Tips - OneFS Protection Overhead

Nick Trimbee

Wed, 08 Sep 2021 20:40:29 -0000

|

Read Time: 0 minutes

There have been several questions from the field recently about how to calculate the OneFS storage protection overhead for different cluster sizes and protection levels. But first, a quick overview of the fundamentals…

OneFS supports several protection schemes. These include the ubiquitous +2d:1n, which protects against two drive failures or one node failure. The best practice is to use the recommended protection level for a particular cluster configuration. This recommended level of protection is clearly marked as ‘suggested’ in the OneFS WebUI storage pools configuration pages and is typically configured by default. For all current Gen6 hardware configurations, the recommended protection level is “+2d:1n”.

The hybrid protection schemes are particularly useful for Gen6 chassis high-density node configurations, where the probability of multiple drives failing far surpasses that of an entire node failure. In the unlikely event that multiple devices have simultaneously failed, such that the file is “beyond its protection level”, OneFS will re-protect everything possible and report errors on the individual files affected to the cluster’s logs.

OneFS also provides a variety of mirroring options ranging from 2x to 8x, allowing from two to eight mirrors of the specified content. Metadata, for example, is mirrored at one level above FEC (forward error correction) by default. For example, if a file is protected at +2n, its associated metadata object will be 3x mirrored.

The full range of OneFS protection levels is as follows:

Protection Level   Description
+1n                Tolerate failure of 1 drive OR 1 node
+2d:1n             Tolerate failure of 2 drives OR 1 node
+2n                Tolerate failure of 2 drives OR 2 nodes
+3d:1n             Tolerate failure of 3 drives OR 1 node
+3d:1n1d           Tolerate failure of 3 drives OR 1 node AND 1 drive
+3n                Tolerate failure of 3 drives OR 3 nodes
+4d:1n             Tolerate failure of 4 drives OR 1 node
+4d:2n             Tolerate failure of 4 drives OR 2 nodes
+4n                Tolerate failure of 4 nodes
2x to 8x           Mirrored over 2 to 8 nodes, depending on configuration

The charts below show the ‘ideal’ protection overhead across the range of node counts and OneFS protection levels (noted within brackets). For each field in this chart, the overhead percentage is calculated by dividing the number on the right (the FEC units) by the sum of the two numbers:

x+y => y/(x+y)

So, for a 5-node cluster protected at +2d:1n, OneFS uses an 8+2 layout, hence an ‘ideal’ overhead of 20%:

8+2 => 2/(8+2) = 20%
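To make the formula concrete, here is a throwaway shell helper (hypothetical, not an OneFS command) that computes the overhead for any x+y layout:

# Overhead of an x+y layout: y FEC units out of x+y total stripe units
overhead() { echo "scale=1; $2 * 100 / ($1 + $2)" | bc; }

overhead 8 2     # +2d:1n on a 5-node cluster -> 20.0 (%)
overhead 16 4    # a 16+4 wide stripe         -> 20.0 (%)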

Number of nodes  [+1n]        [+2d:1n]      [+2n]        [+3d:1n]     [+3d:1n1d]   [+3n]        [+4d:1n]     [+4d:2n]     [+4n]
3                2+1 (33%)    4+2 (33%)     -            6+3 (33%)    3+3 (50%)    -            8+4 (33%)    -            -
4                3+1 (25%)    6+2 (25%)     -            9+3 (25%)    5+3 (38%)    -            12+4 (25%)   4+4 (50%)    -
5                4+1 (20%)    8+2 (20%)     3+2 (40%)    12+3 (20%)   7+3 (30%)    -            16+4 (20%)   6+4 (40%)    -
6                5+1 (17%)    10+2 (17%)    4+2 (33%)    15+3 (17%)   9+3 (25%)    -            16+4 (20%)   8+4 (33%)    -

(A dash indicates that the protection level is not available at that node count.)
The ‘x+y’ numbers in each field in the table also represent how files are striped across a cluster for each node count and protection level.

For example, with +2n protection on a 6-node cluster, OneFS writes a stripe across all 6 nodes, using two of the stripe units for parity/ECC and four for data.

In general, for FEC-protected data, the OneFS protection overhead looks like the following.

Note that the protection overhead % (in brackets) is a very rough guide and will vary across different datasets, depending on quantities of small files, and so on.

Number of nodes  [+1n]        [+2d:1n]      [+2n]        [+3d:1n]     [+3d:1n1d]   [+3n]        [+4d:1n]     [+4d:2n]     [+4n]
3                2+1 (33%)    4+2 (33%)     -            6+3 (33%)    3+3 (50%)    -            8+4 (33%)    -            -
4                3+1 (25%)    6+2 (25%)     -            9+3 (25%)    5+3 (38%)    -            12+4 (25%)   4+4 (50%)    -
5                4+1 (20%)    8+2 (20%)     3+2 (40%)    12+3 (20%)   7+3 (30%)    -            16+4 (20%)   6+4 (40%)    -
6                5+1 (17%)    10+2 (17%)    4+2 (33%)    15+3 (17%)   9+3 (25%)    -            16+4 (20%)   8+4 (33%)    -
7                6+1 (14%)    12+2 (14%)    5+2 (29%)    15+3 (17%)   11+3 (21%)   4+3 (43%)    16+4 (20%)   10+4 (29%)   -
8                7+1 (13%)    14+2 (12.5%)  6+2 (25%)    15+3 (17%)   13+3 (19%)   5+3 (38%)    16+4 (20%)   12+4 (25%)   -
9                8+1 (11%)    16+2 (11%)    7+2 (22%)    15+3 (17%)   15+3 (17%)   6+3 (33%)    16+4 (20%)   14+4 (22%)   5+4 (44%)
10               9+1 (10%)    16+2 (11%)    8+2 (20%)    15+3 (17%)   15+3 (17%)   7+3 (30%)    16+4 (20%)   16+4 (20%)   6+4 (40%)
12               11+1 (8%)    16+2 (11%)    10+2 (17%)   15+3 (17%)   15+3 (17%)   9+3 (25%)    16+4 (20%)   16+4 (20%)   6+4 (40%)
14               13+1 (7%)    16+2 (11%)    12+2 (14%)   15+3 (17%)   15+3 (17%)   11+3 (21%)   16+4 (20%)   16+4 (20%)   10+4 (29%)
16               15+1 (6%)    16+2 (11%)    14+2 (13%)   15+3 (17%)   15+3 (17%)   13+3 (19%)   16+4 (20%)   16+4 (20%)   12+4 (25%)
18               16+1 (6%)    16+2 (11%)    16+2 (11%)   15+3 (17%)   15+3 (17%)   15+3 (17%)   16+4 (20%)   16+4 (20%)   14+4 (22%)
20               16+1 (6%)    16+2 (11%)    16+2 (11%)   16+3 (16%)   16+3 (16%)   16+3 (16%)   16+4 (20%)   16+4 (20%)   14+4 (22%)
30               16+1 (6%)    16+2 (11%)    16+2 (11%)   16+3 (16%)   16+3 (16%)   16+3 (16%)   16+4 (20%)   16+4 (20%)   14+4 (22%)

The protection level of a file determines how the system lays out that file. A file may have multiple protection levels temporarily (because the file is being restriped) or permanently (because of a heterogeneous cluster). The protection level is specified as “n+m/b@r” in its full form. Where b, r, or both equal 1, it may be shortened to “n+m/b”, “n+m@r”, or “n+m”.

Layout Attribute   Description
n                  Number of data drives in a stripe.
+m                 Number of FEC drives in a stripe.
/b                 Number of drives per stripe allowed on one node.
@r                 Number of drives per node to include in a file.

The OneFS protection definition in terms of node and/or drive failures has the advantage of configuration simplicity. However, it does mask some of the subtlety of the interaction between stripe width and drive spread, as represented by the n+m/b notation displayed by the ‘isi get’ CLI command. For example:

# isi get README.txt
POLICY    LEVEL PERFORMANCE COAL  FILE
default   6+2/2 concurrency on    README.txt

In particular, both +3/3 and +3/2 allow for a single node failure or three drive failures, and appear the same in the WebUI terminology. Despite this, they do in fact have different characteristics. +3/2 allows for the failure of any one node in combination with the failure of a single drive on any other node, which +3/3 does not. +3/3, on the other hand, allows for potentially better space efficiency and performance, because up to three drives per node can be used, rather than the two allowed under +3/2.

Another factor to keep in mind is OneFS neighborhoods. A neighborhood is a fault domain within a node pool. The purpose of neighborhoods is to improve reliability in general – and guard against data unavailability from the accidental removal of Gen6 drive sleds. For self-contained nodes like the PowerScale F200, OneFS has an ideal size of 20 nodes per node pool, and a maximum size of 39 nodes. On the addition of the 40th node, the nodes split into two neighborhoods of 20 nodes.

With the Gen6 platform, the ideal size of a neighborhood changes from 20 to 10 nodes. This means that a Gen6 node pool will never reach the large stripe widths (for example 16+3), since the pool will have already split.

This 10-node ideal neighborhood size helps protect the Gen6 architecture against simultaneous node-pair journal failures and full chassis failures. Partner nodes are nodes whose journals are mirrored. Rather than each node storing its journal in NVRAM as in the PowerScale platforms, the Gen6 nodes’ journals are stored on SSDs – and every journal has a mirror copy on another node. The node that contains the mirrored journal is referred to as the partner node. 

There are several reliability benefits gained from the changes to the journal. For example, SSDs are more persistent and reliable than NVRAM, which requires a charged battery to retain state. Also, with the mirrored journal, both journal drives have to die before a journal is considered lost. As such, unless both of the mirrored journal drives fail, both of the partner nodes can function as normal.

With partner node protection, where possible, nodes will be placed in different neighborhoods – and hence different failure domains. Partner node protection is possible once the cluster reaches five full chassis (20 nodes) when, after the first neighborhood split, OneFS places partner nodes in different neighborhoods:

Partner node protection increases reliability because the two partner nodes sit in different failure domains: even if both go down, each failure domain suffers the loss of only a single node.

With chassis protection, when possible, each of the four nodes within a chassis is placed in a separate neighborhood. Chassis protection becomes possible at 40 nodes, since the neighborhood split at 40 nodes enables every node in a chassis to be placed in a different neighborhood. As such, when a 38-node Gen6 cluster is expanded to 40 nodes, the two existing neighborhoods are split into four 10-node neighborhoods:

Chassis protection ensures that if an entire chassis failed, each failure domain would only lose one node.

 

Author: Nick Trimbee 

Read Full Blog