Diary of a VFX Systems Engineer—Part 1: isi Statistics
Thu, 17 Aug 2023 20:57:36 -0000
Welcome to the first in a series of blog posts revealing helpful tips and tricks for supporting media production workflows on PowerScale OneFS.
OneFS has an incredibly capable user-driven toolset under the hood that can give you access to data so valuable to your workflow that you'll wonder how you ever lived without it.
When working on productions in the past, I've witnessed and had to troubleshoot many issues across different parts of the pipeline. Often these crop up in the render stage, which is what I'm going to focus on in this blog.
Render pipelines are normally fairly straightforward in their make-up, but they require everything to be just right to ensure that you don't starve the cluster of resources. If your cluster is at the center of all of your production operations, that starvation can cause a whole-studio outage, impacting your creatives, losing revenue, and introducing unnecessary delays in production.
Did you know that any command run on a OneFS cluster is really a call to the OneFS API? You can see this by adding the --debug flag to any command you run on the CLI. As shown here, it displays the call information that was sent to gather the requested data, which is helpful if you're integrating your own administration tools into your pipeline.
# isi --debug statistics client list
2023-06-22 10:24:41,086 DEBUG rest.py:80: >>>GET ['3', 'statistics', 'summary', 'client']
2023-06-22 10:24:41,086 DEBUG rest.py:81: args={'sort': 'operation_rate,in,out,time_avg,node,protocol,class,user.name,local_name,remote_name', 'degraded': 'False', 'timeout': '15'} body={}
2023-06-22 10:24:41,212 DEBUG rest.py:106: <<<(200, {'content-type': 'application/json', 'allow': 'GET, HEAD', 'status': '200 Ok'}, b'\n{\n"client" : [ ]\n}\n')
There are so many potential applications for OneFS API calls, from monitoring statistics on the cluster to using your own tools for creating shares, and so on. (We'll go deeper into the API in a future post!)
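As a quick illustration, the GET in that debug output corresponds to the /platform/3/statistics/summary/client endpoint, so you can fetch the same data from your own tooling. Here's a minimal sketch using curl, assuming the cluster's API is reachable on port 8080, your account has permission to read statistics, and the cluster is using a self-signed certificate (which is why -k is there to skip certificate verification):

curl -k -u <username> "https://<cluster>:8080/platform/3/statistics/summary/client"

Replace <username> and <cluster> with your own account and cluster address; curl will prompt for the password.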
When we face production-stopping activity on a cluster, it's often caused by a rogue process outside the OneFS environment that is, as yet, unknown to us, which means we have to figure out what that process is and what it's doing.
In walks isi statistics.
By using the isi statistics command, we can very quickly see what is happening on a cluster at any given time. It can give us live reports on which user or connection is causing an issue, how much I/O they're generating, what their IP address is, what protocol they're connecting with, and so on.
If the cluster is experiencing a sudden slowdown (during a render, for example), we can run a couple of simple statistics commands to show us what the cluster is doing and who's hitting it the hardest. Some examples of these commands are as follows:
isi statistics system --n=all --format=top
Displays all nodes’ real-time statistics in a *NIX “top” style format:
# isi statistics system --n=all --format=top
Node   CPU   SMB  FTP  HTTP  NFS  HDFS  S3   Total  NetIn  NetOut  DiskIn  DiskOut
All   33.7%  0.0  0.0  0.0   0.0  0.0   0.0  0.0    401.6  215.6   0.0     0.0
  1   33.7%  0.0  0.0  0.0   0.0  0.0   0.0  0.0    401.6  215.6   0.0     0.0
isi statistics client list --totalby=UserName --sort=Ops
This command displays all connected clients and their stats, including the UserName they're connected with. It puts the users with the highest number of total Ops at the top, so you can track down the user or account that is hitting the storage the hardest.
# isi statistics client --totalby=UserName --sort=Ops
 Ops     In   Out  TimeAvg  Node  Proto  Class  UserName  LocalName  RemoteName
-----------------------------------------------------------------------------
12.8  12.6M  1.1k  95495.8     *      *      *      root          *           *
-----------------------------------------------------------------------------
isi statistics client --user-names=<username> --sort=Ops
This command goes a bit further and breaks down ALL of the Ops being requested by that user, by type. If you know which protocol the user you’re investigating is connecting with, you can also add the “--proto=<nfs/smb>” option to the command, as shown after the example output below.
# isi statistics client --user-names=root --sort=Ops
 Ops     In    Out   TimeAvg  Node  Proto  Class            UserName  LocalName        RemoteName
----------------------------------------------------------------------------------------------
 5.8   6.1M  487.2  142450.6     1  smb2   write            root      192.168.134.101  192.168.134.1
 2.8  259.2  332.8     497.2     1  smb2   file_state       root      192.168.134.101  192.168.134.1
 2.6  985.6  549.8   10255.1     1  smb2   create           root      192.168.134.101  192.168.134.1
 2.6  275.0  570.6    3357.5     1  smb2   namespace_read   root      192.168.134.101  192.168.134.1
 0.4   85.6   28.0    3911.5     1  smb2   namespace_write  root      192.168.134.101  192.168.134.1
----------------------------------------------------------------------------------------------
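Putting it all together with the protocol filter added, the command would look something like this (with <username> and <nfs/smb> as placeholders to fill in):

isi statistics client --user-names=<username> --proto=<nfs/smb> --sort=Ops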
Another command that's useful, particularly when troubleshooting ad hoc performance issues, is isi statistics heat.
isi statistics heat list --totalby=path --sort=Ops | head -12
This command shows the top 10 file paths that are being hit by the largest number of I/O operations.
# isi statistics heat list --totalby=path --sort=Ops | head -12
  Ops  Node  Event  Class  Path
----------------------------------------------------------------------------------------------------
141.7     *      *      *  /ifs/
127.8     *      *      *  /ifs/.ifsvar
 86.3     *      *      *  /ifs/.ifsvar/modules
 81.7     *      *      *  SYSTEM (0x0)
 33.3     *      *      *  /ifs/.ifsvar/modules/tardis
 28.6     *      *      *  /ifs/.ifsvar/modules/tardis/gconfig
 28.3     *      *      *  /ifs/.ifsvar/upgrade
 13.1     *      *      *  /ifs/.ifsvar/upgrade/logs/UpgradeLog-1.db
 11.9     *      *      *  /ifs/.ifsvar/modules/tardis/namespaces/healthcheck_schedules.sqlite
 10.5     *      *      *  /ifs/.ifsvar/modules/cloud
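If the load is intermittent rather than constant, it can help to sample that same view repeatedly and watch how the hottest paths change over time. A simple loop run from a node's shell does the trick; the 5-second interval here is just an arbitrary choice:

while true; do isi statistics heat list --totalby=path --sort=Ops | head -12; sleep 5; done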
Once you have all this information, you can find the user or process (based on IP, UserName, and so on) and figure out what that user is doing that's causing the render failures or the high I/O. In many situations, it will be an asset that is either sitting on a lower-performance tier of the cluster or, if you're using a front-side render cache, sitting outside of the pre-cached path, so the spindles in the cluster are taking the I/O hit.
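If you suspect tiering is the culprit, one quick way to check where a given asset actually lives is isi get; the path here is just a hypothetical example, and the -D flag produces the detailed per-file output that includes disk pool placement:

isi get -D /ifs/<path-to-asset>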
For more tips and tricks that can help to save you valuable time, keep checking back. In the meantime, if you have any questions, please feel free to get in touch and I'll do my best to help!
Author: Andy Copeland
Media & Entertainment Solutions Architect