OneFS Caching Hierarchy
Tue, 22 Mar 2022 20:05:56 -0000
Caching occurs in OneFS at multiple levels, and for a variety of types of data. For this discussion we’ll concentrate on the caching of file system structures in main memory and on SSD.
OneFS’ caching infrastructure design is based on aggregating each individual node’s cache into one cluster wide, globally accessible pool of memory. This is done by using an efficient messaging system, which allows all the nodes’ memory caches to be available to each and every node in the cluster.
For remote memory access, OneFS uses the Sockets Direct Protocol (SDP) over an Ethernet or Infiniband (IB) backend interconnect on the cluster. SDP provides an efficient, socket-like interface between nodes which, by using a switched star topology, ensures that remote memory addresses are only ever one hop away. While not as fast as local memory, remote memory access is still very fast due to the low latency of the backend network.
OneFS uses up to three levels of read cache, plus an NVRAM-backed write cache, or write coalescer. The first two types of read cache, level 1 (L1) and level 2 (L2), are memory (RAM) based, and analogous to the cache used in CPUs. These two cache layers are present in all PowerScale storage nodes. An optional third tier of read cache, called SmartFlash, or Level 3 cache (L3), is also configurable on nodes that contain solid state drives (SSDs). L3 cache is an eviction cache that is populated by L2 cache blocks as they are aged out from memory.
The OneFS caching subsystem is coherent across the cluster. This means that if the same content exists in the private caches of multiple nodes, this cached data is consistent across all instances. For example, consider the following scenario:
- Node 2 and Node 4 each have a copy of data located at an address in shared cache.
- Node 4, in response to a write request, invalidates node 2’s copy.
- Node 4 then updates the value.
- Node 2 must re-read the data from shared cache to get the updated value.
OneFS uses the MESI protocol to maintain cache coherency, implementing an “invalidate-on-write” policy to ensure that all data is consistent across the entire shared cache. The states that in-cache data can take are listed below, followed by a simplified illustrative sketch:
M – Modified: The data exists only in local cache, and has been changed from the value in shared cache. Modified data is referred to as ‘dirty’.
E – Exclusive: The data exists only in local cache, but matches what is in shared cache. This data is referred to as ‘clean’.
S – Shared: The data in local cache may also be in other local caches in the cluster.
I – Invalid: A lock (exclusive or shared) has been lost on the data.
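To make the invalidate-on-write flow concrete, here is a minimal, illustrative Python sketch of a MESI-style cache entry and the node 2 / node 4 scenario described above. It is not OneFS code: the class names, the dictionary standing in for the shared cache, and the simplified state transitions are hypothetical, and real OneFS coherency involves distributed locks, cluster messaging, and NVRAM journaling.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"    # dirty: differs from the shared copy
    EXCLUSIVE = "E"   # clean: only this cache holds it
    SHARED = "S"      # clean: other caches may also hold it
    INVALID = "I"     # lock lost: must be re-read before use

class CacheLine:
    """Toy model of one block held in a node's private cache."""
    def __init__(self, value):
        self.value = value
        self.state = State.SHARED

    def read(self, shared_cache, addr):
        # An invalid entry must be re-read from the shared cache.
        if self.state is State.INVALID:
            self.value = shared_cache[addr]
            self.state = State.SHARED
        return self.value

    def write(self, new_value, peers):
        # Invalidate-on-write: every other node's copy becomes invalid.
        for peer in peers:
            peer.state = State.INVALID
        self.value = new_value
        self.state = State.MODIFIED    # dirty until written back

# The node 2 / node 4 scenario from the list above:
shared_cache = {0x10: "old"}
node2, node4 = CacheLine("old"), CacheLine("old")

node4.write("new", peers=[node2])      # node 4 invalidates node 2's copy
shared_cache[0x10] = node4.value       # node 4's write-back updates the shared copy
print(node2.read(shared_cache, 0x10))  # node 2 re-reads and sees "new"
```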
L1 cache, or front-end cache, is memory that is nearest to the protocol layers (such as NFS, SMB, and so on) used by clients, or initiators, connected to that node. The main task of L1 is to prefetch data from remote nodes. Data is prefetched per file, and this is optimized to reduce the latency associated with the nodes’ IB back-end network. Because the IB interconnect latency is relatively small, the size of L1 cache, and the typical amount of data stored per request, is smaller than for L2 cache.
L1 is also known as remote cache because it contains data retrieved from other nodes in the cluster. It is coherent across the cluster, but is used only by the node on which it resides, and is not accessible by other nodes. Data in L1 cache on storage nodes is aggressively discarded after it is used. L1 cache uses file-based addressing, in which data is accessed by means of an offset into a file object. The L1 cache refers to memory on the same node as the initiator. It is only accessible to the local node, and typically the cache is not the primary copy of the data. This is analogous to the L1 cache on a CPU core, which may be invalidated as other cores write to main memory. L1 cache coherency is managed with a MESI-like protocol that uses distributed locks, as described above.
L2, or back-end cache, refers to local memory on the node on which a particular block of data is stored. L2 reduces the latency of a read operation by not requiring a seek directly from the disk drives. As such, the amount of data prefetched into L2 cache for use by remote nodes is much greater than that in L1 cache.
L2 is also known as local cache because it contains data retrieved from disk drives located on that node and then made available for requests from remote nodes. Data in L2 cache is evicted according to a Least Recently Used (LRU) algorithm. Data in L2 cache is addressed by the local node using an offset into a disk drive which is local to that node. Because the node knows where the data requested by the remote nodes is located on disk, this is a very fast way of retrieving data destined for remote nodes. A remote node accesses L2 cache by doing a lookup of the block address for a particular file object. As described above, there is no MESI invalidation necessary here and the cache is updated automatically during writes and kept coherent by the transaction system and NVRAM.
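As a rough sketch of the least recently used behavior described above, the following Python example implements a tiny LRU block cache. It is purely conceptual and not the OneFS L2 implementation; the (drive, offset) keys, the capacity, and the on_evict hook (the point at which L3, described next, would pick up evicted blocks) are illustrative assumptions.

```python
from collections import OrderedDict

class LRUBlockCache:
    """Minimal LRU cache keyed by (drive, offset) block addresses."""
    def __init__(self, capacity, on_evict=None):
        self.capacity = capacity
        self.on_evict = on_evict          # e.g. hand evicted blocks to L3
        self._blocks = OrderedDict()

    def get(self, addr):
        if addr not in self._blocks:
            return None                   # cache miss: caller reads from disk
        self._blocks.move_to_end(addr)    # mark as most recently used
        return self._blocks[addr]

    def put(self, addr, data):
        self._blocks[addr] = data
        self._blocks.move_to_end(addr)
        if len(self._blocks) > self.capacity:
            # Drop the least recently used block.
            old_addr, old_data = self._blocks.popitem(last=False)
            if self.on_evict:
                self.on_evict(old_addr, old_data)

# Usage: a two-block cache evicting the coldest entry.
l2 = LRUBlockCache(capacity=2, on_evict=lambda a, d: print("evicted", a))
l2.put(("da1", 0), b"block0")
l2.put(("da1", 8192), b"block1")
l2.get(("da1", 0))                        # touch block0 so block1 is now coldest
l2.put(("da1", 16384), b"block2")         # prints: evicted ('da1', 8192)
```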
L3 cache is a subsystem that caches evicted L2 blocks on a node. Unlike L1 and L2, not all nodes or clusters have an L3 cache, because it requires solid state drives (SSDs) to be present and exclusively reserved and configured for caching use. L3 serves as a large, cost-effective way of extending a node’s read cache from gigabytes to terabytes. This allows clients to retain a larger working set of data in cache before being forced to retrieve data from higher latency spinning disk. The L3 cache is populated with “interesting” L2 blocks dropped from memory by L2’s least recently used cache eviction algorithm. Unlike the RAM-based caches, L3 is built on persistent flash storage, so once the cache is populated, or warmed, it is highly durable and persists across node reboots. L3 uses a custom log-based file system with an index of cached blocks. The SSDs provide very good random read access characteristics, such that a hit in L3 cache is not much slower than a hit in L2.
To use multiple SSDs for cache effectively and automatically, L3 uses a consistent hashing approach to associate an L2 block address with one L3 SSD. In the event of an L3 drive failure, a portion of the cache will obviously disappear, but the remaining cache entries on other drives will still be valid. Before a new L3 drive can be added to the hash, some cache entries must be invalidated.
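The following Python sketch illustrates the general consistent hashing technique: block addresses hash onto a ring of virtual points, each owned by an SSD, so losing a drive invalidates only the entries that mapped to it. This is a generic illustration under assumed drive names and ring parameters, not the actual OneFS L3 hashing code.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Maps block addresses to SSDs; only keys on a removed drive are lost."""
    def __init__(self, drives, points_per_drive=64):
        self._ring = []                            # sorted (hash, drive) pairs
        for drive in drives:
            for i in range(points_per_drive):
                self._ring.append((_hash(f"{drive}:{i}"), drive))
        self._ring.sort()

    def drive_for(self, block_addr: str) -> str:
        h = _hash(block_addr)
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]

    def remove_drive(self, drive):
        # Cache entries that hashed to this drive are simply lost;
        # entries on the surviving drives remain valid.
        self._ring = [(h, d) for h, d in self._ring if d != drive]

ring = ConsistentHashRing(["ssd1", "ssd2", "ssd3"])
addr = "da3:0x7f3a000"
print(ring.drive_for(addr))     # e.g. 'ssd2'
ring.remove_drive("ssd2")       # simulate an L3 drive failure
print(ring.drive_for(addr))     # the block now maps to a surviving drive
```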
OneFS also uses a dedicated inode cache in which recently requested inodes are kept. The inode cache frequently has a large impact on performance, because clients often cache data, and many network I/O activities are primarily requests for file attributes and metadata, which can be quickly returned from the cached inode.
OneFS provides tools to accurately assess the performance of the various levels of cache at a point in time. These cache statistics can be viewed from the OneFS CLI using the isi_cache_stats command. Statistics for L1, L2, and L3 cache are displayed for both data and metadata. For example:
# isi_cache_stats
Totals
l1_data: a 224G 100% r 226G 100% p 318M 77%,
l1_encoded: a 0.0B 0% r 0.0B 0% p 0.0B 0%,
l1_meta: r 4.5T 99% p 152K 48%,
l2_data: r 1.2G 95% p 115M 79%,
l2_meta: r 27G 72% p 28M 3%,
l3_data: r 0.0B 0% p 0.0B 0%,
l3_meta: r 8G 99% p 0.0B 0%
More detailed and formatted output is available using the ‘isi_cache_stats -v’ option.
It’s worth noting that for L3 cache, the prefetch statistics will always read zero, since it’s a pure eviction cache and does not use data or metadata prefetch.
Due to balanced data distribution, automatic rebalancing, and distributed processing, OneFS is able to leverage additional CPUs, network ports, and memory as the system grows. This also allows the caching subsystem (and, by extension, throughput and IOPS) to scale linearly with the cluster size.
Author: Nick Trimbee
Related Blog Posts
PowerScale OneFS 9.7
Wed, 13 Dec 2023 13:55:00 -0000
Dell PowerScale is already powering up the holiday season with the launch of the innovative OneFS 9.7 release, which shipped today (13th December 2023). This new 9.7 release is an all-rounder, introducing PowerScale innovations in Cloud, Performance, Security, and ease of use.
After the debut of APEX File Storage for AWS earlier this year, OneFS 9.7 extends and simplifies the PowerScale offering in the public cloud, delivering more features on more instance types across more regions.
In addition to providing the same customer-managed OneFS software platform both on-prem and in the cloud, APEX File Storage for AWS in OneFS 9.7 delivers a 60% capacity increase, with linear capacity and performance scaling up to six SSD nodes, 1.6 PiB per namespace/cluster, and up to 10GB/s reads and 4GB/s writes per cluster. This makes it a solid fit for traditional file shares and home directories; vertical workloads such as M&E, healthcare, life sciences, and finserv; and next-gen AI, ML, and analytics applications.
Enhancements to APEX File Storage for AWS
PowerScale’s scale-out architecture can be deployed on customer-managed AWS EC2 and EBS infrastructure, providing the scale and performance needed to run a variety of unstructured workflows in the public cloud. Plus, OneFS 9.7 provides an ‘easy button’ for streamlined AWS infrastructure provisioning and deployment.
Once in the cloud, you can further leverage existing PowerScale investments by accessing and orchestrating your data through the platform's multi-protocol access and APIs.
This includes the common OneFS control plane (CLI, WebUI, and platform API), and the same enterprise features: Multi-protocol, SnapshotIQ, SmartQuotas, Identity management, and so on.
With OneFS 9.7, APEX File Storage for AWS also adds support for the HDFS and FTP protocols, alongside the existing NFS, SMB, and S3. Granular performance prioritization and throttling are enabled with SmartQoS, allowing admins to configure limits on the maximum number of protocol operations that NFS, S3, SMB, or mixed-protocol workloads can consume on an APEX File Storage for AWS cluster.
Security
With data integrity and protection being top of mind in this era of unprecedented cyber threats, OneFS 9.7 brings a bevy of new features and functionality to keep your unstructured data and workloads more secure than ever. These new OneFS 9.7 security enhancements help address US Federal and DoD mandates, such as FIPS 140-2 and DISA STIGs, in addition to general enterprise data security requirements. Included in the new OneFS 9.7 release are a simple cluster configuration backup and restore utility, address space layout randomization, and single sign-on (SSO) lookup enhancements.
Data mobility
On the data replication front, SmartSync sees the introduction of GCP as an object storage target in OneFS 9.7, in addition to ECS, AWS and Azure. The SmartSync data mover allows flexible data movement and copying, incremental resyncs, push and pull data transfer, and one-time file to object copy.
Performance improvements
Building on the streaming read performance delivered in a prior release, OneFS 9.7 also unlocks dramatic write performance enhancements, particularly for the all-flash NVMe platforms - plus infrastructure support for future node hardware platform generations. A sizable boost in throughput to a single client helps deliver performance for the most demanding GenAI workloads, particularly for the model training and inferencing phases. Additionally, the scale-out cluster architecture enables performance to scale linearly as GPUs are increased, allowing PowerScale to easily support AI workflows from small to large.
Cluster support for InsightIQ 5.0
The new InsightIQ 5.0 software expands PowerScale monitoring capabilities, including a new user interface, automated email alerts, and added security. InsightIQ 5.0 is available today for all existing and new PowerScale customers at no additional charge. These innovations are designed to simplify management, expand scale and security, and automate operations for PowerScale performance monitoring for AI, GenAI, and all other workloads.
In summary, OneFS 9.7 brings new features and functionality to the Dell PowerScale ecosystem across cloud (APEX File Storage for AWS), security, data mobility, performance, and monitoring.
We’ll be taking a deeper look at these new features and functionality in blog articles over the course of the next few weeks.
Meanwhile, the new OneFS 9.7 code is available on the Dell Support site, as both an upgrade and a reimage file, allowing either upgrade to or fresh installation of this new release.
OneFS SSL Certificate Renewal – Part 1
Thu, 16 Nov 2023 04:57:00 -0000
When using either the OneFS WebUI or platform API (pAPI), all communication sessions are encrypted using SSL (Secure Sockets Layer), or more precisely its successor, Transport Layer Security (TLS). In this series, we will look at how to replace or renew the SSL certificate for the OneFS WebUI.
SSL requires a certificate that serves two principal functions: it enables encrypted communication using Public Key Infrastructure (PKI), and it authenticates the identity of the certificate’s holder.
Architecturally, SSL consists of four fundamental components:
| SSL Component | Description |
|---|---|
| Alert | Reports issues. |
| Change cipher spec | Implements negotiated crypto parameters. |
| Handshake | Negotiates crypto parameters for SSL session. Can be used for many SSL/TCP connections. |
| Record | Provides encryption and MAC. |
These sit in the stack with the Handshake, Change cipher spec, and Alert protocols running above the Record protocol, which in turn runs over TCP.
The basic handshake process begins with a client requesting an HTTPS WebUI session to the cluster. OneFS then returns the SSL certificate and public key. The client creates a session key, encrypted with the public key it received from OneFS. At this point, only the client knows the session key. The client now sends its encrypted session key to the cluster, which decrypts it with the private key. Now, both the client and OneFS know the session key. Finally, the session, encrypted using a symmetric session key, can be established. OneFS automatically defaults to the best supported version of SSL, based on the client request.
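As a quick way to observe this handshake from a client, the short Python sketch below opens a TLS connection to the cluster’s WebUI and prints the negotiated protocol version and cipher. The hostname and port are placeholders (8080 is the usual OneFS WebUI HTTPS port), and certificate verification is disabled only because a default cluster presents a self-signed certificate; this is an illustrative sketch, not a recommended production setting.

```python
import socket
import ssl

CLUSTER = "cluster.example.com"   # placeholder: your cluster's name or IP
PORT = 8080                       # default OneFS WebUI HTTPS port

# Skip verification only because the cluster may present a self-signed cert.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

with socket.create_connection((CLUSTER, PORT), timeout=10) as raw:
    with ctx.wrap_socket(raw, server_hostname=CLUSTER) as tls:
        # The handshake (certificate exchange and key agreement) completes
        # inside wrap_socket().
        print("Negotiated protocol:", tls.version())   # e.g. 'TLSv1.3'
        print("Cipher suite:", tls.cipher()[0])
        der_cert = tls.getpeercert(binary_form=True)   # raw server certificate
        print("Certificate length:", len(der_cert), "bytes")
```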
A PowerScale cluster initially contains a self-signed certificate, which can be used as-is or replaced with a certificate issued by a third-party certificate authority (CA). If the self-signed certificate is used, upon expiry it must be replaced with either a third-party (public or private) CA-issued certificate or another self-signed certificate generated on the cluster. The following are the default locations for the server.crt and server.key files:
| File | Location |
|---|---|
| SSL certificate | /usr/local/apache2/conf/ssl.crt/server.crt |
| SSL certificate key | /usr/local/apache2/conf/ssl.key/server.key |
The ‘isi certificate settings view’ CLI command displays all of the certificate-related configuration options. For example:
# isi certificate settings view
Certificate Monitor Enabled: Yes
Certificate Pre Expiration Threshold: 4W2D
Default HTTPS Certificate
  ID: default
  Subject: C=US, ST=Washington, L=Seattle, O="Isilon", OU=Isilon, CN=Dell, emailAddress=tme@isilon.com
  Status: valid
The above ‘certificate monitor enabled’ and ‘certificate pre expiration threshold’ configuration options govern a nightly cron job, which monitors the expiration of each managed certificate and fires a CELOG alert if a certificate is set to expire within the configured threshold. Note that the default threshold is 30 days (4W2D, which represents 4 weeks plus 2 days). The ‘ID: default’ configuration option indicates that this certificate is the default TLS certificate.
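To illustrate what the nightly expiration check is doing, here is a hedged Python sketch that fetches the WebUI certificate and compares its remaining lifetime against a 30-day threshold. It assumes the third-party ‘cryptography’ package is installed and uses placeholder hostname and port values; it is a client-side approximation of the idea, not the actual OneFS cron job.

```python
import ssl
from datetime import datetime, timezone

from cryptography import x509    # third-party: pip install cryptography

CLUSTER = "cluster.example.com"  # placeholder cluster name or IP
PORT = 8080                      # default OneFS WebUI HTTPS port
THRESHOLD_DAYS = 30              # mirrors the 4W2D default threshold

# Fetch the server certificate in PEM form (no chain verification is performed).
pem = ssl.get_server_certificate((CLUSTER, PORT))
cert = x509.load_pem_x509_certificate(pem.encode())

# not_valid_after_utc exists in newer 'cryptography' releases; fall back otherwise.
not_after = getattr(cert, "not_valid_after_utc",
                    cert.not_valid_after.replace(tzinfo=timezone.utc))
remaining = not_after - datetime.now(timezone.utc)

print(f"Certificate expires in {remaining.days} days")
if remaining.days <= THRESHOLD_DAYS:
    print("Within the pre-expiration threshold: a renewal alert would fire")
```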
The basic certificate renewal or creation flow is as follows: back up the existing certificate, renew or create a new certificate (self-signed or CA-signed), add the new certificate to the cluster, and verify it.
The steps below include options to complete a self-signed certificate replacement or renewal, or to request an SSL replacement or renewal from a Certificate Authority (CA).
Backing up the existing SSL certificate
The first task is to obtain the list of certificates by running the following CLI command, and identify the appropriate one to renew:
# isi certificate server list
ID       Name     Status  Expires
-------------------------------------------
eb0703b  default  valid   2025-10-11T10:45:52
-------------------------------------------
It’s always a prudent practice to save a backup of the original certificate and key. This can be easily accomplished using the following CLI commands, which, in this case, create the ‘/ifs/data/ssl_bkup’ directory, set its permissions to root-only access, and copy the original key and certificate to it:
# mkdir -p /ifs/data/ssl_bkup
# chmod 700 /ifs/data/ssl_bkup
# cp /usr/local/apache24/conf/ssl.crt/server.crt /ifs/data/ssl_bkup
# cp /usr/local/apache24/conf/ssl.key/server.key /ifs/data/ssl_bkup
# cd !$
cd /ifs/data/ssl_bkup
# ls
server.crt      server.key
Renewing or creating a certificate
The next step in the process involves either the renewal of an existing certificate or creation of a certificate from scratch. In either case, first, create a temporary directory, for example /ifs/tmp:
# mkdir /ifs/tmp; cd /ifs/tmp
a) Renew an existing self-signed Certificate.
The following syntax creates a renewal certificate based on the existing ssl.key. The value of the ‘-days’ parameter can be adjusted to generate a certificate with the desired expiration date. For example, the following command will create a one-year certificate.
# cp /usr/local/apache2/conf/ssl.key/server.key ./ ; openssl req -new -days 365 -nodes -x509 -key server.key -out server.crt
Answer the system prompts to complete the self-signed SSL certificate generation process, entering the pertinent location and contact information. For example:
Country Name (2 letter code) [AU]:US
State or Province Name (full name) [Some-State]:Washington
Locality Name (eg, city) []:Seattle
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Isilon
Organizational Unit Name (eg, section) []:TME
Common Name (e.g. server FQDN or YOUR name) []:isilon.com
Email Address []:tme@isilon.com
When all the information has been successfully entered, the server.crt and server.key files will be present under the /ifs/tmp directory.
Optionally, the attributes and integrity of the certificate can be verified with the following syntax:
# openssl x509 -text -noout -in server.crt
Next, proceed directly to the ‘Add the certificate to the cluster’ steps in section 4 of this article.
b) Alternatively, a certificate and key can be generated from scratch, if preferred.
The following CLI command can be used to create a 2048-bit RSA private key:
# openssl genrsa -out server.key 2048
Generating RSA private key, 2048 bit long modulus
............+++++
...........................................................+++++
e is 65537 (0x10001)
Next, create a certificate signing request:
# openssl req -new -nodes -key server.key -out server.csr
For example:
# openssl req -new -nodes -key server.key -out server.csr -reqexts SAN -config <(cat /etc/ssl/openssl.cnf <(printf "[SAN]\nsubjectAltName=DNS:isilon.com"))
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:US
State or Province Name (full name) [Some-State]:WA
Locality Name (eg, city) []:Seattle
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Isilon
Organizational Unit Name (eg, section) []:TME
Common Name (e.g. server FQDN or YOUR name) []:h7001
Email Address []:tme@isilon.com

Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:1234
An optional company name []:
#
Answer the system prompts to complete the certificate signing request, entering the pertinent location and contact information. Additionally, a ‘challenge password’ of at least 4 characters will need to be selected and entered.
As prompted, enter the information to be incorporated into the certificate request. When completed, the server.csr and server.key files will appear in the /ifs/tmp directory.
If desired, a CSR file for a Certificate Authority that includes Subject Alternative Names (SANs) can be generated. For example, additional hostname entries can be added as a comma-separated list (that is, DNS:isilon.com,DNS:www.isilon.com).
In the next article, we will look at the certificate signing, addition, and verification steps of the process.