Diary of a VFX Systems Engineer—Part 1: isi Statistics
Thu, 17 Aug 2023 20:57:36 -0000
Welcome to the first in a series of blog posts to reveal some helpful tips and tricks when supporting media production workflows on PowerScale OneFS.
OneFS has an incredible user-drivable toolset underneath the hood that can grant you access to data so valuable to your workflow that you'll wonder how you ever lived without it.
When working on productions in the past I’ve witnessed and had to troubleshoot many issues that arise in different parts of the pipeline. Often these are in the render part of the pipeline, which is what I’m going to focus on in this blog.
Render pipelines are normally fairly straightforward in their makeup, but everything has to be just right so that you don’t starve the cluster of resources. If your cluster sits at the center of all of your production operations, resource starvation can cause a studio-wide outage that impacts your creatives, costs revenue, and delays production unnecessarily.
Did you know that any command that is run on a OneFS cluster is an API call down to the OneFS API? You can observe this by adding the --debug flag to any command that you run on the CLI. As shown here, the flag displays the call that was sent to gather the requested information, which is helpful if you're integrating your own administration tools into your pipeline.
# isi --debug statistics client list
2023-06-22 10:24:41,086 DEBUG rest.py:80: >>>GET ['3', 'statistics', 'summary', 'client']
2023-06-22 10:24:41,086 DEBUG rest.py:81: args={'sort': 'operation_rate,in,out,time_avg,node,protocol,class,user.name,local_name,remote_name', 'degraded': 'False', 'timeout': '15'} body={}
2023-06-22 10:24:41,212 DEBUG rest.py:106: <<<(200, {'content-type': 'application/json', 'allow': 'GET, HEAD', 'status': '200 Ok'}, b'n{\n"client" : [ ]\n}\n')
There are so many potential applications for OneFS API calls, from monitoring statistics on the cluster to using your own tools for creating shares, and so on. (We'll go deeper into the API in a future post!)
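As a rough illustration of how the debug trace maps onto an HTTP request, the following Python sketch rebuilds a URL from the path segments and arguments shown above. The host name is hypothetical, and the /platform prefix and port 8080 are assumptions about a default cluster setup; authentication and the request itself are left out.

```python
from urllib.parse import urlencode


def build_platform_url(host, path_parts, args, port=8080):
    """Join the path segments and query arguments from a --debug trace into a URL.

    The "/platform" prefix and the default port are assumptions, not taken
    from the trace itself.
    """
    path = "/".join(path_parts)
    query = urlencode(args)
    return f"https://{host}:{port}/platform/{path}?{query}"


# Path segments and arguments copied from the trace above
# (the sort list is shortened to one field for readability).
url = build_platform_url(
    "cluster.example.com",  # hypothetical host name
    ["3", "statistics", "summary", "client"],
    {"sort": "operation_rate", "degraded": "False", "timeout": "15"},
)
print(url)
```

A monitoring tool could pass a URL like this to whichever HTTP client your pipeline already uses, together with the credentials of an account that is allowed to read statistics.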
When we are facing production-stopping activities on a cluster, they're often caused by a rogue process outside the OneFS environment that is as yet unknown to us, which means we have to figure out what that process is and what it is doing.
In walks isi statistics.
By using the isi statistics command, we can very quickly see what is happening on a cluster at any given time. It can give us live reports on which user or connection is causing an issue, how much I/O they're generating, what their IP address is, which protocol they’re using to connect, and so on.
If the cluster is experiencing a sudden slowdown (during a render, for example), we can run a couple of simple statistics commands to show us what the cluster is doing and who's hitting it the hardest. Some examples of these commands are as follows:
isi statistics system --n=all --format=top
Displays all nodes’ real-time statistics in a *NIX “top” style format:
# isi statistics system --n=all --format=top
Node    CPU  SMB  FTP  HTTP  NFS  HDFS   S3  Total  NetIn  NetOut  DiskIn  DiskOut
All   33.7%  0.0  0.0   0.0  0.0   0.0  0.0    0.0  401.6   215.6     0.0      0.0
1     33.7%  0.0  0.0   0.0  0.0   0.0  0.0    0.0  401.6   215.6     0.0      0.0
isi statistics client list --totalby=UserName --sort=Ops
This command displays all clients connected and shows their stats, including the UserName they are connected with. It places the users with the highest number of total Ops at the top so that you can track down the user or account that is hitting the storage the hardest.
# isi statistics client --totalby=UserName --sort=Ops
 Ops     In   Out  TimeAvg  Node  Proto  Class  UserName  LocalName  RemoteName
-----------------------------------------------------------------------------
12.8  12.6M  1.1k  95495.8     *      *      *      root          *           *
-----------------------------------------------------------------------------
isi statistics client --user-names=<username> --sort=Ops
This command goes a bit further and breaks down all of the Ops requested by that user by type. If you know the protocol that the user you’re investigating is using, you can also add the operator “--proto=<nfs/smb>” to the command.
# isi statistics client --user-names=root --sort=Ops
Ops     In    Out   TimeAvg  Node  Proto            Class  UserName        LocalName     RemoteName
----------------------------------------------------------------------------------------------
5.8   6.1M  487.2  142450.6     1   smb2            write      root  192.168.134.101  192.168.134.1
2.8  259.2  332.8     497.2     1   smb2       file_state      root  192.168.134.101  192.168.134.1
2.6  985.6  549.8   10255.1     1   smb2           create      root  192.168.134.101  192.168.134.1
2.6  275.0  570.6    3357.5     1   smb2   namespace_read      root  192.168.134.101  192.168.134.1
0.4   85.6   28.0    3911.5     1   smb2  namespace_write      root  192.168.134.101  192.168.134.1
----------------------------------------------------------------------------------------------
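If you want to feed this output into your own tooling, a small parser is often enough. The following Python sketch is one possible approach. It assumes that the columns are separated by whitespace and that no field contains spaces, which holds for the sample above but may not hold for every user name. The function names are illustrative only.

```python
SAMPLE = """\
Ops In Out TimeAvg Node Proto Class UserName LocalName RemoteName
----------------------------------------------------------------
5.8 6.1M 487.2 142450.6 1 smb2 write root 192.168.134.101 192.168.134.1
2.8 259.2 332.8 497.2 1 smb2 file_state root 192.168.134.101 192.168.134.1
2.6 985.6 549.8 10255.1 1 smb2 create root 192.168.134.101 192.168.134.1
2.6 275.0 570.6 3357.5 1 smb2 namespace_read root 192.168.134.101 192.168.134.1
0.4 85.6 28.0 3911.5 1 smb2 namespace_write root 192.168.134.101 192.168.134.1
----------------------------------------------------------------
"""


def parse_client_stats(text):
    """Turn the table into a list of dicts keyed by column name."""
    lines = [
        ln for ln in text.splitlines()
        if ln.strip() and not ln.lstrip().startswith(("-", "#"))
    ]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            continue  # skip anything that does not match the header layout
        row = dict(zip(header, fields))
        row["Ops"] = float(row["Ops"])
        rows.append(row)
    return rows


def busiest(rows, n=3):
    """Return the n rows with the highest Ops value."""
    return sorted(rows, key=lambda r: r["Ops"], reverse=True)[:n]


rows = parse_client_stats(SAMPLE)
top = busiest(rows, 1)[0]
print(top["Class"], top["RemoteName"], top["Ops"])
```

In this sample the write class from 192.168.134.1 comes out on top, which matches what you would read off the table by eye.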
The other useful command, particularly when troubleshooting ad hoc performance issues, is isi statistics heat.
isi statistics heat list --totalby=path --sort=Ops | head -12
This command shows the top 10 file paths that are being hit by the largest number of I/O operations.
# isi statistics heat list --totalby=path --sort=Ops | head -12
  Ops  Node  Event  Class  Path
----------------------------------------------------------------------------------------------------
141.7     *      *      *  /ifs/
127.8     *      *      *  /ifs/.ifsvar
 86.3     *      *      *  /ifs/.ifsvar/modules
 81.7     *      *      *  SYSTEM (0x0)
 33.3     *      *      *  /ifs/.ifsvar/modules/tardis
 28.6     *      *      *  /ifs/.ifsvar/modules/tardis/gconfig
 28.3     *      *      *  /ifs/.ifsvar/upgrade
 13.1     *      *      *  /ifs/.ifsvar/upgrade/logs/UpgradeLog-1.db
 11.9     *      *      *  /ifs/.ifsvar/modules/tardis/namespaces/healthcheck_schedules.sqlite
 10.5     *      *      *  /ifs/.ifsvar/modules/cloud
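On a busy cluster, system paths such as /ifs/.ifsvar can crowd out the production paths you actually care about. The following Python sketch shows one way to filter them out of the heat output. It assumes that the first four columns never contain spaces, so that everything after them is the path. The last sample row is a made-up production path added for illustration, and the function names are hypothetical.

```python
HEAT_SAMPLE = """\
Ops Node Event Class Path
-------------------------------------------------
141.7 * * * /ifs/
127.8 * * * /ifs/.ifsvar
81.7 * * * SYSTEM (0x0)
33.3 * * * /ifs/.ifsvar/modules/tardis
12.0 * * * /ifs/projects/show_a/shot 010/plate.exr
"""


def parse_heat(text):
    """Parse heat rows; the path is everything after the fourth column."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue  # separator or blank line
        try:
            ops = float(fields[0])
        except ValueError:
            continue  # header line
        rows.append({"Ops": ops, "Path": fields[4].strip()})
    return rows


def production_paths(rows):
    """Keep paths under /ifs/ that are not inside /ifs/.ifsvar."""
    return [
        r for r in rows
        if r["Path"].startswith("/ifs/")
        and not r["Path"].startswith("/ifs/.ifsvar")
    ]


heat_rows = parse_heat(HEAT_SAMPLE)
paths = [r["Path"] for r in production_paths(heat_rows)]
print(paths)
```

The root /ifs/ entry is kept on purpose, because it gives you a total to compare the individual paths against.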
Once you have all this information, you can find the user or process (based on IP, UserName, and so on) and figure out what that user is doing and what is causing the render to fail or generating the high I/O. In many situations, it will be an asset that is either sitting on a lower-performance tier of the cluster or, if you're using a front-side render cache, an asset that is sitting outside of the pre-cached path, so the spindles in the cluster are taking the I/O hit.
For more tips and tricks that can help to save you valuable time, keep checking back. In the meantime, if you have any questions, please feel free to get in touch and I'll do my best to help!
Author: Andy Copeland
Media & Entertainment Solutions Architect
Related Blog Posts
Securing PowerScale OneFS SyncIQ
Tue, 16 Apr 2024 17:55:56 -0000
In the data replication world, ensuring your PowerScale clusters' security is paramount. SyncIQ, a powerful data replication tool, requires encryption to prevent unauthorized access.
Concerns about unauthorized replication
A cluster might inadvertently become the target of numerous replication policies, potentially overwhelming its resources. There’s also the risk of an administrator mistakenly specifying the wrong cluster as the replication target.
Best practices for security
To secure your PowerScale cluster, Dell recommends enabling SyncIQ encryption as per Dell Security Advisory DSA-2020-039 (Dell EMC Isilon OneFS Security Update for a SyncIQ Vulnerability). This feature, introduced in OneFS 8.2, prevents man-in-the-middle attacks and addresses other security concerns.
Encryption in new and upgraded clusters
SyncIQ is disabled by default for new clusters running OneFS 9.1. When SyncIQ is enabled, a global encryption flag requires all SyncIQ policies to be encrypted. This flag is also set for clusters upgraded to OneFS 9.1, unless there’s an existing SyncIQ policy without encryption.
Alternative measures
For clusters running versions earlier than OneFS 8.2, configuring a SyncIQ pre-shared key (PSK) offers protection against unauthorized replication policies.
By following these security measures, administrators can ensure that their PowerScale clusters are safeguarded against unauthorized access and maintain the integrity and confidentiality of their data.
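The version guidance above can be expressed as a small decision helper. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that a release string consists of dot-separated integers such as 8.1.2 or 9.1.

```python
def parse_version(text):
    """Convert a release string such as '8.2.0' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def recommended_protection(onefs_version):
    """Pick the protection described above for a given OneFS release.

    OneFS 8.2 and later: SyncIQ encryption.
    Earlier releases: SyncIQ pre-shared key (PSK).
    """
    if parse_version(onefs_version) >= (8, 2):
        return "SyncIQ encryption"
    return "SyncIQ pre-shared key"


for release in ("8.1.2", "8.2", "9.1.0"):
    print(release, "->", recommended_protection(release))
```

Because replication involves two clusters, you would run a check like this against both the source and the target before choosing an approach.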
SyncIQ encryption: securing data in transit
Securing information as it moves between systems is paramount in the data-driven world. Dell PowerScale OneFS release 8.2 has brought a game-changing feature to the table: end-to-end encryption for SyncIQ data replication. This ensures that data is protected not only at rest but also as it traverses the network between clusters.
Why encryption matters
Data breaches can be catastrophic, and because data replication involves moving large volumes of sensitive information, encryption acts as a critical shield. With SyncIQ’s encryption, organizations can enforce a global setting that mandates encryption across all SyncIQ policies, to add an extra layer of security.
Test before you implement
It’s crucial to test SyncIQ encryption in a lab environment before deploying it in production. Although encryption introduces minimal overhead, its impact on workflow can vary based on several factors, such as network bandwidth and cluster resources.
Technical underpinnings
SyncIQ encryption is powered by X.509 certificates, TLS version 1.2, and OpenSSL version 1.0.2o. These certificates are managed within the cluster’s certificate stores, ensuring a robust and secure data replication process.
Remember, this is just the beginning of a comprehensive guide about SyncIQ encryption. Stay tuned for more insights about configuration steps and best practices for securing your data with Dell PowerScale’s innovative solutions.
Configuration
Configuring SyncIQ encryption requires a supported OneFS release, certificates, and finally, the OneFS configuration. Before enabling SyncIQ encryption in production, test it in a lab environment that mimics the production setup. Measure the impact on transmission overhead by considering network bandwidth, cluster resources, workflow, and policy configuration.
Here’s a high-level summary of the configuration steps:
1. Ensure compatibility:
   - Ensure that the source and target clusters are running OneFS 8.2 or later.
   - Upgrade and commit both clusters to OneFS release 8.2 or later.
2. Create X.509 certificates:
   - Create X.509 certificates for the source and target clusters using publicly available tools.
   - The certificate creation process results in the following components:
     - Certificate Authority (CA) certificate
     - Source certificate and private key
     - Target certificate and private key
   Note: Some certificate authorities may not generate the public and private key pairs. In that case, manually generate a Certificate Signing Request (CSR) and obtain signed certificates.
3. Transfer certificates to clusters:
   - Transfer the certificates to each cluster.
4. Activate each certificate as follows:
   - Add the source cluster certificate under Data Protection > SyncIQ > Certificates.
   - Add the target server certificate under Data Protection > SyncIQ > Settings.
   - Add the Certificate Authority under Access > TLS Certificates and select Import Authority.
5. Enforce encryption:
   - Each cluster stores its certificate and its peer’s certificate.
   - The source cluster must store the target cluster’s certificate, and vice versa.
   - Storing the peer’s certificate creates a list of approved clusters for data replication.
By following these steps, you can secure your data in transit between PowerScale clusters using SyncIQ encryption. Remember to customize the certificates and settings according to your specific environment and requirements.
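The peer-certificate requirement in the last step is easier to picture as a toy model. The following Python sketch is not OneFS code and does not call any OneFS API. The Cluster class, its fields, and the certificate identifiers are all hypothetical, and the sketch only illustrates that replication is approved when each side stores the other's certificate.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for a cluster's certificate stores (hypothetical)."""
    name: str
    own_cert: str                                  # identifier of this cluster's certificate
    peer_certs: set = field(default_factory=set)   # certificates of approved peers


def can_replicate(source, target):
    """Replication is approved only if each side stores the other's certificate."""
    return (
        target.own_cert in source.peer_certs
        and source.own_cert in target.peer_certs
    )


source = Cluster("source", own_cert="cert-src")
target = Cluster("target", own_cert="cert-tgt")

# Only one side knows its peer so far.
source.peer_certs.add(target.own_cert)
print(can_replicate(source, target))

# Both sides now store the peer's certificate.
target.peer_certs.add(source.own_cert)
print(can_replicate(source, target))
```

The first check fails and the second succeeds, which mirrors the requirement that the source stores the target's certificate and vice versa.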
For more detailed information about configuring SyncIQ encryption, see SyncIQ encryption | Dell PowerScale SyncIQ: Architecture, Configuration, and Considerations | Dell Technologies Info Hub.
SyncIQ pre-shared key
A SyncIQ pre-shared key (PSK) is configured solely on the target cluster to restrict policies from source clusters without the PSK.
Use Cases: This is recommended for environments without SyncIQ encryption, such as clusters running releases earlier than OneFS 8.2 or clusters where encryption is unavailable due to other factors.
SmartLock Compliance: Not supported by SmartLock Compliance mode clusters; upgrading and configuring SyncIQ encryption is advised.
Policy Update: After updating source cluster policies with the PSK, no further configuration is needed. Use the isi sync policies view command to verify.
Remember, configuring the PSK will cause all jobs that are replicating to the target cluster to fail, so ensure that all SyncIQ jobs are complete or canceled before proceeding.
For more detailed information about configuring a SyncIQ pre-shared key, see SyncIQ pre-shared key | Dell PowerScale SyncIQ: Architecture, Configuration, and Considerations | Dell Technologies Info Hub.
Resources
- SyncIQ encryption | Dell PowerScale SyncIQ: Architecture, Configuration, and Considerations | Dell Technologies Info Hub
- SyncIQ pre-shared key | Dell PowerScale SyncIQ: Architecture, Configuration, and Considerations | Dell Technologies Info Hub
Author: Aqib Kazi, Senior Principal Engineering Technologist
OneFS Access Control Lists Overview
Thu, 18 Jan 2024 22:29:13 -0000
As we know, when users access OneFS cluster data via different protocols, the final permission enforcement happens on the OneFS file system. In OneFS, this is achieved by the Access Control Lists (ACLs) implementation, which provides granular permission control on directories and files. In this article, we will look at the basics of OneFS ACLs.
OneFS ACL
OneFS provides a single namespace for multiprotocol access and has its own internal ACL representation to perform access control. The internal ACL is presented as protocol-specific views of permissions, so that NFSv3 clients see POSIX mode bits while NFSv4 and SMB clients see ACLs.
When connecting to a PowerScale cluster with SSH, you can manage not only POSIX mode bits but also ACLs with standard UNIX tools such as the chmod command. In addition, you can edit ACL policies through the web administration interface to configure OneFS permissions management for networks that mix Windows and UNIX systems.
The OneFS ACL design is derived from Windows NTFS ACL. As such, many of its concept definitions and operations are similar to the Windows NTFS ACL, such as ACE permissions and inheritance.
OneFS synthetic ACL and real ACL
To deliver cross-protocol file access seamlessly, OneFS stores an internal representation of a file-system object’s permissions. The internal representation can contain information from the POSIX mode bits or the ACL.
OneFS has two types of ACLs to fulfill different scenarios:
- OneFS synthetic ACL: Under the default ACL policy, if no inheritable ACL entries exist on a parent directory, a file or directory created within it (for example, through an NFS or SSH session on OneFS) will contain only POSIX mode bits permissions. OneFS uses the internal representation to generate a OneFS synthetic ACL, which is an in-memory structure that approximates the POSIX mode bits of a file or directory for an SMB or NFSv4 client.
- OneFS real ACL: Under the default ACL policy, when a file or directory is created through SMB or when the synthetic ACL of a file or directory is modified through an NFSv4 or SMB client, the OneFS real ACL is initialized and stored on disk. The OneFS real ACL can also be initialized using the OneFS enhanced chmod command tool with the +a, -a, or =a option to modify the ACL.
OneFS access control entries
In contrast to the Windows DACL and NFSv4 ACL, the OneFS ACL access control entry (ACE) adds an additional identity type. OneFS ACEs contain the following information:
- Identity name: The name of a user or group
- ACE type: The type of the ACE (allow or deny)
- ACE permissions and inheritance flags: A list of permissions and inheritance flags separated with commas
OneFS ACE permissions
Similar to the Windows permission level, OneFS divides permissions into the following three types:
- Standard ACE permissions: These apply to any object in the file system
- Generic ACE permissions: These map to a bundle of specific permissions
- Constant ACE permissions: These are specific permissions for file-system objects
The standard ACE permissions that can appear for a file-system object are shown in the following table:
ACE permission | Applies to | Description |
std_delete | Directory or file | The right to delete the object |
std_read_dac | Directory or file | The right to read the security descriptor, not including the SACL |
std_write_dac | Directory or file | The right to modify the DACL in the object's security descriptor |
std_write_owner | Directory or file | The right to change the owner in the object's security descriptor |
std_synchronize | Directory or file | The right to use the object as a thread synchronization primitive |
std_required | Directory or file | Maps to std_delete, std_read_dac, std_write_dac, and std_write_owner |
The generic ACE permissions that can appear for a file system object are shown in the following table:
ACE permission | Applies to | Description |
generic_all | Directory or file | Read, write, and execute access. Maps to file_gen_all or dir_gen_all. |
generic_read | Directory or file | Read access. Maps to file_gen_read or dir_gen_read. |
generic_write | Directory or file | Write access. Maps to file_gen_write or dir_gen_write. |
generic_exec | Directory or file | Execute access. Maps to file_gen_execute or dir_gen_execute. |
dir_gen_all | Directory | Maps to dir_gen_read, dir_gen_write, dir_gen_execute, delete_child, and std_write_owner. |
dir_gen_read | Directory | Maps to list, dir_read_attr, dir_read_ext_attr, std_read_dac, and std_synchronize. |
dir_gen_write | Directory | Maps to add_file, add_subdir, dir_write_attr, dir_write_ext_attr, std_read_dac, and std_synchronize. |
dir_gen_execute | Directory | Maps to traverse, std_read_dac, and std_synchronize. |
file_gen_all | File | Maps to file_gen_read, file_gen_write, file_gen_execute, delete_child, and std_write_owner. |
file_gen_read | File | Maps to file_read, file_read_attr, file_read_ext_attr, std_read_dac, and std_synchronize. |
file_gen_write | File | Maps to file_write, file_write_attr, file_write_ext_attr, append, std_read_dac, and std_synchronize. |
file_gen_execute | File | Maps to execute, std_read_dac, and std_synchronize. |
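Because several generic permissions map to other generic permissions, it can help to expand them into their underlying permissions. The following Python sketch transcribes the mappings from the table above into a dictionary and expands them recursively. The dictionary and function names are illustrative and are not part of OneFS.

```python
# Mappings transcribed from the generic ACE permissions table above.
GENERIC_MAP = {
    "dir_gen_all": ["dir_gen_read", "dir_gen_write", "dir_gen_execute",
                    "delete_child", "std_write_owner"],
    "dir_gen_read": ["list", "dir_read_attr", "dir_read_ext_attr",
                     "std_read_dac", "std_synchronize"],
    "dir_gen_write": ["add_file", "add_subdir", "dir_write_attr",
                      "dir_write_ext_attr", "std_read_dac", "std_synchronize"],
    "dir_gen_execute": ["traverse", "std_read_dac", "std_synchronize"],
    "file_gen_all": ["file_gen_read", "file_gen_write", "file_gen_execute",
                     "delete_child", "std_write_owner"],
    "file_gen_read": ["file_read", "file_read_attr", "file_read_ext_attr",
                      "std_read_dac", "std_synchronize"],
    "file_gen_write": ["file_write", "file_write_attr", "file_write_ext_attr",
                       "append", "std_read_dac", "std_synchronize"],
    "file_gen_execute": ["execute", "std_read_dac", "std_synchronize"],
}


def expand(permission):
    """Recursively expand a generic permission into the permissions it maps to."""
    children = GENERIC_MAP.get(permission)
    if children is None:
        return {permission}  # not a generic bundle, so return it unchanged
    result = set()
    for child in children:
        result |= expand(child)
    return result


print(sorted(expand("file_gen_execute")))
print(len(expand("file_gen_all")))
```

Expanding two ACEs this way and comparing the resulting sets is a quick way to see where their permissions overlap.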
The constant ACE permissions that can appear for a file-system object are shown in the following table:
ACE permission | Applies to | Description |
modify | File | Maps to file_write, append, file_write_ext_attr, file_write_attr, delete_child, std_delete, std_write_dac, and std_write_owner |
file_read | File | The right to read file data |
file_write | File | The right to write file data |
append | File | The right to append to a file |
execute | File | The right to execute a file |
file_read_attr | File | The right to read file attributes |
file_write_attr | File | The right to write file attributes |
file_read_ext_attr | File | The right to read extended file attributes |
file_write_ext_attr | File | The right to write extended file attributes |
delete_child | Directory or file | The right to delete children, including read-only files within a directory; this is currently not used for a file, but can still be set for Windows compatibility |
list | Directory | List entries |
add_file | Directory | The right to create a file in the directory |
add_subdir | Directory | The right to create a subdirectory |
traverse | Directory | The right to traverse the directory |
dir_read_attr | Directory | The right to read directory attributes |
dir_write_attr | Directory | The right to write directory attributes |
dir_read_ext_attr | Directory | The right to read extended directory attributes |
dir_write_ext_attr | Directory | The right to write extended directory attributes |
OneFS ACL inheritance
Inheritance allows permissions to be layered or overridden as needed in an object hierarchy and allows for simplified permissions management. The semantics of OneFS ACL inheritance are the same as Windows ACL inheritance and will feel familiar to someone versed in Windows NTFS ACL inheritance. The following table shows the ACE inheritance flags defined in OneFS:
ACE inheritance flag | Set on directory or file | Description |
object_inherit | Directory only | Indicates an ACE applies to the current directory and files within the directory |
container_inherit | Directory only | Indicates an ACE applies to the current directory and subdirectories within the directory |
inherit_only | Directory only | Indicates an ACE applies to subdirectories only, files only, or both within the directory. |
no_prop_inherit | Directory only | Indicates an ACE applies to the current directory or only the first-level contents of the directory, not the second-level or subsequent contents |
inherited_ace | File or directory | Indicates an ACE is inherited from the parent directory |
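The way these flags combine can be sketched in a few lines of Python. The functions below are a simplified reading of the table for an ACE set on a directory. The function names and scope labels are illustrative only, and this is not how OneFS evaluates ACLs internally.

```python
def ace_scope(flags):
    """Return what an ACE set on a directory applies to, based on its flags."""
    flags = set(flags)
    scope = set()
    if "inherit_only" not in flags:
        scope.add("this_directory")   # inherit_only excludes the directory itself
    if "object_inherit" in flags:
        scope.add("files")            # files within the directory
    if "container_inherit" in flags:
        scope.add("subdirectories")   # subdirectories within the directory
    return scope


def first_level_only(flags):
    """True if inheritance stops at the first-level contents (no_prop_inherit)."""
    return "no_prop_inherit" in set(flags)


print(sorted(ace_scope([])))
print(sorted(ace_scope(["object_inherit", "container_inherit"])))
print(sorted(ace_scope(["object_inherit", "container_inherit", "inherit_only"])))
```

An ACE with no inheritance flags applies only to the directory it is set on, while adding inherit_only to an inheriting ACE removes the directory itself from the scope.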
Author: Lieven Lin