A semiconductor company uses a large HPC compute cluster for parts of its EDA workflow and wants to guard against runaway jobs consuming massive amounts of storage. The company runs heavy computation jobs from a large compute farm against a ‘scratch space’ directory, housed on a performance tier of its cluster, and garbage collection runs at midnight.
Throughout the workday, it is hard for the storage admins to keep track of storage utilization. Occasionally, jobs from the compute farm run amok, tying up large swathes of fast, expensive storage resources and capacity. To help prevent this, the storage administrator:
- Sets an advisory directory quota on the scratch space at 80% utilization for advance warning of an issue.
- Configures a hard directory quota to prevent writes at 90% utilization.
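As a rough illustration of the two thresholds above, the following Python sketch converts a scratch-space size into advisory (80%) and hard (90%) limits and classifies a usage figure against them. The function names, the state labels, and the 100 TiB capacity in the usage note are hypothetical, chosen only for illustration. This is not SmartQuotas code, and it assumes the limits are reached once usage is at or above each threshold.

```python
from dataclasses import dataclass

ADVISORY_PERCENT = 80  # advisory quota: notification only
HARD_PERCENT = 90      # hard quota: further writes are refused


@dataclass(frozen=True)
class ScratchThresholds:
    advisory_bytes: int
    hard_bytes: int


def scratch_thresholds(capacity_bytes: int) -> ScratchThresholds:
    """Derive the advisory and hard limits from the scratch-space size."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    return ScratchThresholds(
        advisory_bytes=capacity_bytes * ADVISORY_PERCENT // 100,
        hard_bytes=capacity_bytes * HARD_PERCENT // 100,
    )


def quota_state(used_bytes: int, thresholds: ScratchThresholds) -> str:
    """Classify usage (hypothetical labels): 'ok', 'advisory', or 'hard'."""
    if used_bytes >= thresholds.hard_bytes:
        return "hard"
    if used_bytes >= thresholds.advisory_bytes:
        return "advisory"
    return "ok"
```

For a hypothetical 100 TiB scratch space, `scratch_thresholds(100 * 2**40)` yields an advisory limit of 80 TiB and a hard limit of 90 TiB, so a job that has pushed usage to 85 TiB would trigger the advisory notification but could still write.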
SmartQuotas best practices include:
- Leverage default quotas for ease of deployment and management at large scale. Configuration changes for linked quotas must be made on the parent quota that the linked quota is inheriting from. Changes to the parent quota are propagated to all children. To override configuration from the parent quota, you must unlink the quota first.
- Where possible, observe the best practice of a maximum number of 500,000 quotas per cluster in OneFS 8.2 and later, and 20,000 quotas per cluster in prior releases.
- Avoid creating quotas on the root directory of the default OneFS share (/ifs). A root-level quota may result in performance degradation.
- Limit quota depth to a maximum of 275 directories.
- Governing a single directory with overlapping quotas can also degrade performance.
- Directory quotas can also be used to alert on and constrain runaway jobs, preventing them from consuming massive amounts of storage space.
- The maximum tested quota limit is 400,000 (although the file system has no hard-coded limit on quotas). However, when listing many quotas, only a partial list may be returned.
- With CloudPools data, the quota is calculated based on the size of the data local to the cluster. For example, for a 100MB file tiered to a cloud provider, SmartQuotas would calculate just the size of the local stub file (8K).
- If two quotas are created on the same directory (for example, an accounting quota without Snapshots and a hard quota with Snapshots), the quota without Snapshot data overrules the limit from the quota with Snapshot data.
- SmartQuotas also provide a low-impact way to provide directory file count reports.
- A quota can only be unlinked when it is linked to a default quota.
- Quota containers compartmentalize /ifs, so that a directory with a container will appear as its own separate ‘file system slice’. For example, to configure a directory quota with a 4TB container on /ifs/data/container1, you could use the following CLI command:
# isi quota quotas create /ifs/data/container1 directory --hard-threshold 4T --container true
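Some of the limits in the list above can be checked before any quotas are created. The following Python sketch takes a planned list of quota paths and reports the ones that conflict with the stated guidance: total quota count, a quota on the /ifs root, and directory depth. The function name is hypothetical, and measuring depth as the number of path components in the absolute path is an assumption made here for illustration, not necessarily how OneFS counts quota depth.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

MAX_QUOTAS = 500_000        # OneFS 8.2 and later, per the guidance above
MAX_QUOTAS_LEGACY = 20_000  # releases prior to OneFS 8.2
MAX_DEPTH = 275             # maximum quota depth in directories
IFS_ROOT = PurePosixPath("/ifs")


def check_quota_plan(paths: Iterable[str], onefs_8_2_or_later: bool = True) -> List[str]:
    """Return human-readable warnings for a planned set of quota paths."""
    planned = list(paths)
    warnings: List[str] = []

    limit = MAX_QUOTAS if onefs_8_2_or_later else MAX_QUOTAS_LEGACY
    if len(planned) > limit:
        warnings.append(f"{len(planned)} quotas planned; recommended maximum is {limit}")

    for raw in planned:
        path = PurePosixPath(raw)
        if path == IFS_ROOT:
            warnings.append(f"{raw}: quota on the /ifs root may degrade performance")
        # Assumption: depth = number of components after the leading "/".
        depth = len(path.parts) - 1
        if depth > MAX_DEPTH:
            warnings.append(f"{raw}: depth {depth} exceeds {MAX_DEPTH}")

    return warnings
```

For example, `check_quota_plan(["/ifs", "/ifs/data/container1"])` returns a single warning about the root-level quota and nothing for the container path.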
Further information is available in the OneFS SmartQuotas white paper.
In summary, best practices on planning and managing capacity on a large cluster include the following:
- To manage the risk of adverse data delivery at very high capacity levels, we recommend a maximum of 85% capacity utilization (equivalently, a reserve capacity of 15%) on large clusters.
- At any workload level, do not exceed 90% capacity utilization.
- Examine data delivery variance at 75% and again at 80% capacity utilization; evaluate consistency of data delivery before using additional capacity.
- Consider a buffer for maintenance actions (disk rebuild) when planning reserve capacity.
- Maintain sufficient free space.
- Plan for contingencies.
- Manage your data.
- Maintain appropriate protection levels.
- Monitor cluster capacity and data ingest rate.
- Consider configuring a separate accounting quota for the ‘/ifs/home’ and ‘/ifs/data’ directories (or wherever data and home directories are provisioned) to monitor aggregate disk space usage and issue administrative alerts as necessary to avoid running low on overall capacity.
- Ensure that any snapshot, replication, and backup schedules meet the required availability and recovery objectives and fit within the overall capacity.
- Carefully manage snapshot creation and deletion schedules.
- Leverage SmartQuotas to understand, predict, control and limit storage usage.
- Use InsightIQ, CloudIQ, and DataIQ for monitoring, usage forecasting, and verifying cluster health.
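The utilization checkpoints in the summary above (review at 75% and 80%, recommended maximum of 85%, never exceed 90%) can be expressed as a simple classification, together with a naive forecast of how long a steady ingest rate leaves before the 85% mark. The following Python sketch is illustrative only: the function names and status labels are hypothetical, and the forecast assumes a constant daily ingest rate with no deletions, which real workloads rarely follow.

```python
from typing import Optional


def capacity_status(used_bytes: int, total_bytes: int) -> str:
    """Map cluster utilization onto the checkpoints listed above."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    scaled = used_bytes * 100  # compare as integers to avoid float rounding
    if scaled > 90 * total_bytes:
        return "over-limit"             # beyond the 90% ceiling
    if scaled > 85 * total_bytes:
        return "above-recommended-max"  # eating into the 15% reserve
    if scaled >= 80 * total_bytes:
        return "review-80"              # second data delivery review point
    if scaled >= 75 * total_bytes:
        return "review-75"              # first data delivery review point
    return "ok"


def days_until_percent(used_bytes: int, total_bytes: int,
                       daily_ingest_bytes: int,
                       target_percent: int = 85) -> Optional[float]:
    """Days until utilization reaches target_percent at a constant ingest rate.

    Returns None when ingest is zero or negative (no growth to forecast)
    and 0.0 when the target has already been reached.
    """
    if daily_ingest_bytes <= 0:
        return None
    remaining = total_bytes * target_percent // 100 - used_bytes
    if remaining <= 0:
        return 0.0
    return remaining / daily_ingest_bytes
```

A forecast like this is only a coarse complement to the monitoring tools named above; it is most useful as a trigger for looking at the trend in more detail.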