Arguably, the most important metric to measure for performance issues is the protocol latency to the clients. Regardless of what else is happening on the cluster, poor performance in any subsystem will most often show up as higher operation latency to the client.
It must be stressed that, when measuring protocol latency, you must break it out by the protocol or protocols your clients use. Each protocol has its own acceptable range of latencies, and these ranges vary greatly. The default view of protocol latency is an aggregate of all protocols on the cluster, and this aggregate may not represent true performance because the latency values of all the protocols are mixed together. As an example, it is not uncommon for the PAPI protocol (the protocol used by InsightIQ) to take several seconds to return results for a large system-statistics request. If your cluster is not heavily utilized, the impact of PAPI alone can push your average protocol latency at the cluster level into the tens of milliseconds, even if your NFS operations average less than 4 ms.
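To make the effect concrete, the short Python sketch below uses made-up operation counts and latencies (not measured values) to show how a handful of slow PAPI requests can pull the all-protocol average into the tens of milliseconds even when NFS operations average under 4 ms.

```python
# Illustration only: hypothetical operation counts and average per-operation
# latencies, showing how an all-protocol aggregate can hide per-protocol behavior.
samples = {
    # protocol: (operation_count, average_latency_ms) -- made-up numbers
    "nfs3": (50_000, 3.5),
    "smb2": (20_000, 8.0),
    "papi": (200, 4_000.0),   # a few multi-second statistics requests
}

total_ops = sum(count for count, _ in samples.values())
aggregate_ms = sum(count * lat for count, lat in samples.values()) / total_ops

print(f"Aggregate latency across all protocols: {aggregate_ms:.1f} ms")
for proto, (count, lat) in samples.items():
    print(f"  {proto:5s}: {lat:8.1f} ms average over {count} ops")
```

With these placeholder numbers the aggregate works out to roughly 16 ms, even though the NFS average is 3.5 ms, which is why a per-protocol breakout is essential.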
In addition to breaking out latency by protocol, it is often useful to break it out by operation type and direction as well. This breakout shows which kinds of operations are taking more time and where to start assessing and investigating performance issues. For example, an NFS commit operation often has higher latency because data in the write coalescer must be written to stable storage before the operation is considered complete. In general, read operations have lower latency than write operations for both data and metadata operations.
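One way to organize such a breakout is sketched below. The record layout, protocol names, and operation classes are assumptions chosen for illustration; they are not tied to any particular statistics interface.

```python
from collections import defaultdict

# Hypothetical per-operation samples: (protocol, operation_class, latency_ms).
# The classes shown (read, write, commit, namespace_read, ...) are examples only.
ops = [
    ("nfs3", "read", 2.1),
    ("nfs3", "write", 4.8),
    ("nfs3", "commit", 11.3),   # commit waits for data to reach stable storage
    ("nfs3", "namespace_read", 1.2),
    ("smb2", "read", 3.9),
    ("smb2", "write", 7.5),
]

# Accumulate total latency and count per (protocol, operation class) pair.
buckets = defaultdict(lambda: [0.0, 0])
for proto, op_class, latency in ops:
    bucket = buckets[(proto, op_class)]
    bucket[0] += latency
    bucket[1] += 1

for (proto, op_class), (total, count) in sorted(buckets.items()):
    print(f"{proto:5s} {op_class:15s} avg {total / count:6.2f} ms over {count} ops")
```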
Table 1. Protocol latency
| Protocol | High performance | Medium performance | General performance |
|----------|------------------|--------------------|----------------------|
| SMB      | <= 10 ms         | <= 15 ms           | <= 20 ms             |
| NFS      | <= 5 ms          | <= 10 ms           | <= 20 ms             |
| HDFS     | N/A              | N/A                | <= 10 seconds        |
For the general guidelines described in Table 1, the latency is an average of all the different operation types for a given protocol. User home directories, many applications, and backup targets generally fall into the general performance category. Some applications require higher performance and fall into the medium-performance category. A high-performance workload encompasses work such as high-performance computing (HPC), simulation modeling, and large software development.
These are general guidelines. Your individual workflow may require lower latency than the examples presented here, or it may tolerate higher latencies.
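If you want to compare measured per-protocol averages against the guideline tiers in Table 1, a simple check like the sketch below can flag where a protocol lands. The thresholds come from the table; the measured values are placeholders to substitute with averages from your own monitoring.

```python
# Guideline thresholds from Table 1, in milliseconds (HDFS general tier is 10 s).
THRESHOLDS_MS = {
    "smb":  {"high": 10, "medium": 15, "general": 20},
    "nfs":  {"high": 5, "medium": 10, "general": 20},
    "hdfs": {"general": 10_000},
}

def classify(protocol: str, avg_latency_ms: float) -> str:
    """Return the best guideline tier the measured average satisfies, if any."""
    tiers = THRESHOLDS_MS[protocol]
    for tier in ("high", "medium", "general"):
        if tier in tiers and avg_latency_ms <= tiers[tier]:
            return tier
    return "above general guideline"

# Placeholder measurements -- replace with averages from your own breakout.
for proto, measured in {"nfs": 3.8, "smb": 18.0, "hdfs": 2_500.0}.items():
    print(f"{proto.upper():4s} {measured:8.1f} ms -> {classify(proto, measured)}")
```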