In a storage system, the disk subsystem is one of the most critical components. Regardless of the storage medium, it is essential to recognize when the disk subsystem is the bottleneck in a system.
The disk I/O operations count metric is measured at the file system level in OneFS and represents the number of I/Os this layer sends to the disk subsystem. This means that the reported number does not necessarily represent the number of transactions the hard drive is servicing, as is often reported by block systems. The real number of disk transactions per second can be much lower than the number reported by the file system. This difference is due to the coalescing of multiple requests into fewer requests and to other disk-subsystem-specific features. For these reasons, the disk I/O operations count metric is not necessarily a good indicator of a drive’s ability to sustain a workload.
One important disk metric is the Average Pending Disk Operations Count (also known as Queued). This metric represents the number of I/O operations that are pending at the file system level and waiting to be issued to an individual drive.
A second important metric is the Pending Disk Operations Latency (also known as Time in Queue). This metric reports the average time in milliseconds that an I/O request waits in a queue before being issued to the drive. This metric is not available in the default InsightIQ disk performance report, but it is available when creating a custom performance report.
The Average Pending Disk Operations Count metric should be paired with the Pending Disk Operations Latency metric when trying to troubleshoot performance problems. If you multiply these two numbers together for a given drive, you get an indication of the time required before the drive is ready to service a new request. Without the Pending Disk Operations Latency metric, you can still use the absolute number of queued disk I/Os to roughly estimate when a disk is overworked.
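The multiplication described above can be sketched in a few lines of Python. This is a minimal illustration, not an InsightIQ or OneFS API: the function name `estimated_queue_time_ms` and the 2.0 ms latency in the usage line are hypothetical. The pending count of 42.889 is the value quoted in this document for node 104 drive D0, and the 1 ms fallback follows the guidance given here for cases where the Time in Queue metric is unavailable.

```python
from typing import Optional


def estimated_queue_time_ms(pending_ops: float,
                            latency_ms: Optional[float] = None) -> float:
    """Estimate the time (ms) before a drive is ready to service a new request.

    pending_ops: Average Pending Disk Operations Count (Queued) for one drive.
    latency_ms:  Pending Disk Operations Latency (Time in Queue) for that drive.
    """
    if pending_ops < 0:
        raise ValueError("pending_ops must be non-negative")
    # Assumption taken from the guidance in this document: use 1 ms per
    # queued operation when the latency metric is not available.
    if latency_ms is None:
        latency_ms = 1.0
    return pending_ops * latency_ms


# Hypothetical latency of 2.0 ms paired with the documented count of 42.889.
with_latency = estimated_queue_time_ms(42.889, 2.0)
# Same drive, latency metric unavailable, so the 1 ms fallback applies.
without_latency = estimated_queue_time_ms(42.889)
```

Because the result is an indication rather than a measurement, it is best used to compare drives against each other and against the thresholds in Table 3.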
While large values for the queued operations can indicate a problem, the value must be sustained for a long time to be significant. Short spikes, on the order of minutes to an hour, that exceed the recommended values indicate a spike in workload and are not uncommon. It is the long-term, sustained high pending operations count that should be considered when looking for disk saturation.
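One way to separate a short spike from a sustained condition is to measure the longest consecutive run of samples above a threshold. The sketch below assumes evenly spaced samples exported from a performance report. The threshold of 10 pending operations, the 5-minute sample interval, and the 120-minute minimum duration are hypothetical values chosen for illustration, not recommendations from this document.

```python
from typing import Sequence


def longest_run_above(samples: Sequence[float], threshold: float) -> int:
    """Return the length of the longest consecutive run of samples above threshold."""
    longest = 0
    current = 0
    for value in samples:
        if value > threshold:
            current += 1
            longest = max(longest, current)
        else:
            # The run is broken, so start counting again.
            current = 0
    return longest


def is_sustained(samples: Sequence[float], threshold: float,
                 interval_minutes: float, min_duration_minutes: float) -> bool:
    """True if the pending count stays above threshold for at least min_duration_minutes."""
    run_minutes = longest_run_above(samples, threshold) * interval_minutes
    return run_minutes >= min_duration_minutes


# Hypothetical 5-minute samples of the pending operations count for one drive.
short_spike = [0.5] * 20 + [30.0] * 6 + [0.5] * 20     # 30-minute burst
long_plateau = [0.5] * 10 + [30.0] * 36 + [0.5] * 10   # 180-minute plateau

spike_flagged = is_sustained(short_spike, 10.0, 5, 120)
plateau_flagged = is_sustained(long_plateau, 10.0, 5, 120)
```

With these assumed settings, the 30-minute burst is not flagged and the 180-minute plateau is, which matches the distinction drawn above between workload spikes and disk saturation.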
The example below shows how averages can be misleading in terms of identifying a problem. The large chart shows the average of all disks across all nodes in the cluster. Overall, the disks are not busy with an average count under 1. However, when you go into a detailed breakout, you can see that the drives in two nodes (node 104 and node 5) have high pending operations counts. Node 104 drive D0 has a value of 42.889 pending operations, which is quite high.
Figure 3. Average pending disk operations count
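The effect shown in the figure can be reproduced with a small calculation: a cluster-wide mean hides a few busy drives. In this sketch, only the value 42.889 for node 104 drive D0 comes from the example above. The other drive values, the drive name `D3` on node 5, the threshold of 10, and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

DriveKey = Tuple[int, str]  # (node number, drive name)


def cluster_average(per_drive: Dict[DriveKey, float]) -> float:
    """Average pending operations count across every drive in the cluster."""
    return mean(per_drive.values())


def hot_drives(per_drive: Dict[DriveKey, float],
               threshold: float) -> List[Tuple[DriveKey, float]]:
    """Drives whose pending count exceeds threshold, busiest first."""
    busy = [(key, value) for key, value in per_drive.items() if value > threshold]
    return sorted(busy, key=lambda item: item[1], reverse=True)


# Hypothetical cluster: 10 nodes with 10 mostly idle drives each.
per_drive: Dict[DriveKey, float] = {
    (node, f"D{drive}"): 0.2 for node in range(1, 11) for drive in range(10)
}
per_drive[(104, "D0")] = 42.889  # value quoted in the example above
per_drive[(5, "D3")] = 20.0      # hypothetical busy drive on node 5

overall = cluster_average(per_drive)
busy = hot_drives(per_drive, threshold=10.0)
```

Here `overall` stays below 1 even though two drives are heavily queued, so the per-drive breakout is the view to check before concluding that the disks are idle.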
For most workloads, a disk I/O queue time of 1 millisecond or less is ideal, 4 milliseconds or less is good, and 8 milliseconds or less is acceptable. Values above 8 milliseconds, if sustained for long periods of time, are a sign that there may not be enough disk resources to serve the workload. If you do not have the average time in queue available, assume 1 millisecond when calculating the disk I/O queue time.
Table 3. Disk I/O
| General | Ideal | Good | Acceptable |
| --- | --- | --- | --- |
| Disk I/O queue time | <= 1 ms | <= 4 ms | <= 8 ms |
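The thresholds in Table 3 can be applied programmatically when screening many drives. The following is a minimal sketch. The function name `classify_queue_time` and the label `"investigate"` for values above 8 ms are assumptions added for illustration, and a value in that last band is only a concern if it is sustained.

```python
def classify_queue_time(queue_time_ms: float) -> str:
    """Map a disk I/O queue time in milliseconds to the bands in Table 3."""
    if queue_time_ms < 0:
        raise ValueError("queue_time_ms must be non-negative")
    if queue_time_ms <= 1:
        return "ideal"
    if queue_time_ms <= 4:
        return "good"
    if queue_time_ms <= 8:
        return "acceptable"
    # Hypothetical label: above 8 ms, check whether the value is sustained.
    return "investigate"


# Hypothetical queue times in milliseconds, one per drive.
bands = [classify_queue_time(t) for t in (0.4, 3.0, 8.0, 42.889)]
```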