Diary of a VFX Systems Engineer—Part 1: isi Statistics
Thu, 17 Aug 2023 20:57:36 -0000
Welcome to the first in a series of blog posts revealing helpful tips and tricks for supporting media production workflows on PowerScale OneFS.
OneFS has an incredibly capable user-driven toolset under the hood that can give you access to data so valuable to your workflow that you'll wonder how you ever lived without it.
When working on productions in the past, I've witnessed and had to troubleshoot many issues across different parts of the pipeline. Often these crop up in the render stage, which is what I'm going to focus on in this blog.
Render pipelines are normally fairly straightforward in their make-up, but they require everything to be just right to ensure that you don't starve the cluster of resources. If your cluster is at the center of all of your production operations, that starvation can cause a whole-studio outage, impacting your creatives, losing revenue, and introducing unnecessary delays in production.
Did you know that any command run on a OneFS cluster is really a call to the OneFS API? You can see this by adding the --debug flag to any command you run on the CLI. As shown here, it displays the call information that was sent to gather the requested data, which is helpful if you're integrating your own administration tools into your pipeline.
# isi --debug statistics client list
2023-06-22 10:24:41,086 DEBUG rest.py:80: >>>GET ['3', 'statistics', 'summary', 'client']
2023-06-22 10:24:41,086 DEBUG rest.py:81: args={'sort': 'operation_rate,in,out,time_avg,node,protocol,class,user.name,local_name,remote_name', 'degraded': 'False', 'timeout': '15'} body={}
2023-06-22 10:24:41,212 DEBUG rest.py:106: <<<(200, {'content-type': 'application/json', 'allow': 'GET, HEAD', 'status': '200 Ok'}, b'\n{\n"client" : [ ]\n}\n')
There are so many potential applications for OneFS API calls, from monitoring statistics on the cluster to using your own tools for creating shares, and so on. (We'll go deeper into the API in a future post!)
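As a quick illustration, the GET in that debug output corresponds to the /platform/3/statistics/summary/client endpoint, so you can fetch the same data from your own tooling. Here's a minimal sketch using curl, assuming the cluster's API is reachable on port 8080, your account has permission to read statistics, and the cluster is using a self-signed certificate (which is why -k is there to skip certificate verification):

curl -k -u <username> "https://<cluster>:8080/platform/3/statistics/summary/client"

Replace <username> and <cluster> with your own account and cluster address; curl will prompt for the password.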
When we face production-stopping activity on a cluster, it's often caused by a rogue process outside the OneFS environment that is, as yet, unknown to us, which means we have to figure out what that process is and what it's doing.
In walks isi statistics.
By using the isi statistics command, we can very quickly see what is happening on a cluster at any given time. It can give us live reports on which user or connection is causing an issue, how much I/O they're generating, what their IP address is, what protocol they're connecting with, and so on.
If the cluster is experiencing a sudden slowdown (during a render, for example), we can run a couple of simple statistics commands to show us what the cluster is doing and who's hitting it the hardest. Some examples of these commands are as follows:
isi statistics system --n=all --format=top
Displays all nodes’ real-time statistics in a *NIX “top” style format:
# isi statistics system --n=all --format=top
Node   CPU   SMB  FTP  HTTP  NFS  HDFS  S3   Total  NetIn  NetOut  DiskIn  DiskOut
All   33.7%  0.0  0.0  0.0   0.0  0.0   0.0  0.0    401.6  215.6   0.0     0.0
  1   33.7%  0.0  0.0  0.0   0.0  0.0   0.0  0.0    401.6  215.6   0.0     0.0
isi statistics client list --totalby=UserName --sort=Ops
This command displays all connected clients and their stats, including the UserName they're connected with. It puts the users with the highest number of total Ops at the top, so you can track down the user or account that is hitting the storage the hardest.
# isi statistics client --totalby=UserName --sort=Ops
 Ops     In   Out  TimeAvg  Node  Proto  Class  UserName  LocalName  RemoteName
-----------------------------------------------------------------------------
12.8  12.6M  1.1k  95495.8     *      *      *      root          *           *
-----------------------------------------------------------------------------
isi statistics client --user-names=<username> --sort=Ops
This command goes a bit further and breaks down ALL of the Ops being requested by that user, by type. If you know which protocol the user you’re investigating is connecting with, you can also add the “--proto=<nfs/smb>” option to the command, as shown after the example output below.
# isi statistics client --user-names=root --sort=Ops
 Ops     In    Out   TimeAvg  Node  Proto  Class            UserName  LocalName        RemoteName
----------------------------------------------------------------------------------------------
 5.8   6.1M  487.2  142450.6     1  smb2   write            root      192.168.134.101  192.168.134.1
 2.8  259.2  332.8     497.2     1  smb2   file_state       root      192.168.134.101  192.168.134.1
 2.6  985.6  549.8   10255.1     1  smb2   create           root      192.168.134.101  192.168.134.1
 2.6  275.0  570.6    3357.5     1  smb2   namespace_read   root      192.168.134.101  192.168.134.1
 0.4   85.6   28.0    3911.5     1  smb2   namespace_write  root      192.168.134.101  192.168.134.1
----------------------------------------------------------------------------------------------
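Putting it all together with the protocol filter added, the command would look something like this (with <username> and <nfs/smb> as placeholders to fill in):

isi statistics client --user-names=<username> --proto=<nfs/smb> --sort=Ops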
Another command that's useful, particularly when troubleshooting ad hoc performance issues, is isi statistics heat.
isi statistics heat list --totalby=path --sort=Ops | head -12
This command shows the top 10 file paths that are being hit by the largest number of I/O operations.
# isi statistics heat list --totalby=path --sort=Ops | head -12
  Ops  Node  Event  Class  Path
----------------------------------------------------------------------------------------------------
141.7     *      *      *  /ifs/
127.8     *      *      *  /ifs/.ifsvar
 86.3     *      *      *  /ifs/.ifsvar/modules
 81.7     *      *      *  SYSTEM (0x0)
 33.3     *      *      *  /ifs/.ifsvar/modules/tardis
 28.6     *      *      *  /ifs/.ifsvar/modules/tardis/gconfig
 28.3     *      *      *  /ifs/.ifsvar/upgrade
 13.1     *      *      *  /ifs/.ifsvar/upgrade/logs/UpgradeLog-1.db
 11.9     *      *      *  /ifs/.ifsvar/modules/tardis/namespaces/healthcheck_schedules.sqlite
 10.5     *      *      *  /ifs/.ifsvar/modules/cloud
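If the load is intermittent rather than constant, it can help to sample that same view repeatedly and watch how the hottest paths change over time. A simple loop run from a node's shell does the trick; the 5-second interval here is just an arbitrary choice:

while true; do isi statistics heat list --totalby=path --sort=Ops | head -12; sleep 5; done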
Once you have all this information, you can find the user or process (based on IP, UserName, and so on) and figure out what that user is doing that's causing the render failures or the high I/O. In many situations, it will be an asset that is either sitting on a lower-performance tier of the cluster or, if you're using a front-side render cache, sitting outside of the pre-cached path, so the spindles in the cluster are taking the I/O hit.
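If you suspect tiering is the culprit, one quick way to check where a given asset actually lives is isi get; the path here is just a hypothetical example, and the -D flag produces the detailed per-file output that includes disk pool placement:

isi get -D /ifs/<path-to-asset>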
For more tips and tricks that can help to save you valuable time, keep checking back. In the meantime, if you have any questions, please feel free to get in touch and I'll do my best to help!
Author: Andy Copeland
Media & Entertainment Solutions Architect