
OneFS Smartfail
Mon, 27 Jun 2022 21:03:17 -0000
OneFS protects data stored on failing nodes or drives in a cluster through a process called smartfail. During the process, OneFS places a device into quarantine and, depending on the severity of the issue, the data on it into a read-only state. While a device is quarantined, OneFS reprotects the data on the device by distributing the data to other devices.
After all data eviction or reconstruction is complete, OneFS logically removes the device from the cluster, and the node or drive can be physically replaced. OneFS only automatically smartfails devices as a last resort. Nodes and/or drives can also be manually smartfailed. However, it is strongly recommended to first consult Dell Technical Support.
Occasionally a device might fail before OneFS detects a problem. If a drive fails without being smartfailed, OneFS automatically starts rebuilding the data to available free space on the cluster. However, because a node might recover from a transient issue, if a node fails, OneFS does not start rebuilding data unless it is logically removed from the cluster.
A node that is unavailable and reported by isi status as ‘D’, or down, can be smartfailed. If the node is hard down, likely with a significant hardware issue, the smartfail process will take longer because data has to be recalculated from the FEC protection parity blocks. That said, it’s well worth attempting to bring the node up if at all possible – especially if the cluster, and/or node pools, is at the default +2D:1N protection. The concern here is that, with a node down, there is a risk of data loss if a drive or other component goes bad during the smartfail process.
If possible, and assuming the disk content is still intact, it can often be quicker to have the node hardware repaired. In this case, the entire node’s chassis (or compute module in the case of Gen 6 hardware) could be replaced and the old disks inserted with original content on them. This should only be performed at the recommendation and under the supervision of Dell Technical Support. If the node is down because of a journal inconsistency, it will have to be smartfailed out. In this case, engage Dell Technical Support to determine an appropriate action plan.
The recommended procedure for smartfailing a node is as follows. In this example, we’ll assume that node 4 is down:
From the CLI of any node except node 4, run the following command to smartfail out the node:
# isi devices node smartfail --node-lnn 4
Verify that the node is removed from the cluster.
# isi status -q
(An '--S-' will appear in node 4's 'Health' column to indicate that it has been smartfailed.)
Monitor the successful completion of the job engine’s MultiScan, FlexProtect/FlexProtectLIN jobs:
# isi job status
Un-cable and remove the node from the rack for disposal.
As mentioned previously, there are two primary Job Engine jobs that run as a result of a smartfail:
- MultiScan
- FlexProtect or FlexProtectLIN
MultiScan performs the work of both the AutoBalance and Collect jobs simultaneously, and it is triggered after every group change. The reason is that new file layouts and file deletions that happen during a disruption to the cluster might be imperfectly balanced or, in the case of deletions, simply lost.
The Collect job reclaims free space from previously unavailable nodes or drives. A mark and sweep garbage collector, it identifies everything valid on the filesystem in the first phase. In the second phase, the Collect job scans the drives, freeing anything that isn’t marked valid.
When node and drive usage across the cluster are out of balance, the AutoBalance job scans through all the drives looking for files to re-layout, to make use of the less filled devices.
The purpose of the FlexProtect job is to scan the file system after a device failure to ensure that all files remain protected. Incomplete protection levels are fixed, in addition to missing data or parity blocks caused by drive or node failures. This job is started automatically after smartfailing a drive or node. If a smartfailed device was the reason the job started, the device is marked gone (completely removed from the cluster) at the end of the job.
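Once a smartfail is underway, the progress of the resulting FlexProtect or FlexProtectLIN job can be tracked from the CLI. For example, a minimal check might look like the following (the job ID shown is purely illustrative):
# isi job jobs list | grep -i flexprotect
# isi job jobs view 273
The second command reports the job's state and progress, which gives a rough sense of how far along the reprotection is.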
Although a new node can be added to a cluster at any time, it’s best to avoid major group changes during a smartfail operation. This helps to avoid any unnecessary interruptions of a critical job engine data reprotection job. However, because a node is down, there is a window of risk while the cluster is rebuilding the data from that node. Under pressing circumstances, the smartfail operation can be paused, the node added, and then smartfail resumed when the new node has happily joined the cluster.
Be aware that if the node you are adding is the same node that was smartfailed, the cluster maintains a record of that node and may prevent the re-introduction of that node until the smartfail is complete. To mitigate risk, Dell Technical Support should definitely be involved to ensure data integrity.
The time for a smartfail to complete is hard to predict with any accuracy, and depends on:
Attribute | Description |
OneFS release | Determines OneFS job engine version and how efficiently it operates. |
System hardware | Drive types, CPU, RAM, and so on. |
File system | Quantity and type of data (that is, small vs. large files), protection, tunables, and so on. |
Cluster load | CPU utilization, capacity utilization, and so on. |
Typical smartfail runtimes range from minutes (for fairly empty, idle nodes with SSD and SAS drives) to days (for nodes with large SATA drives and a high capacity utilization). The FlexProtect job already runs at the highest job engine priority (value=1) and medium impact by default. As such, there isn’t much that can be done to speed up this job, beyond reducing cluster load.
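The FlexProtect job's default priority and impact policy can be confirmed from the job engine CLI. For example (the exact output fields vary by OneFS release):
# isi job types list | grep -i flexprotect
# isi job types view FlexProtect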
Smartfail is also a valuable tool for proactive cluster node replacement, such as during a hardware refresh. Provided that the cluster quorum is not broken, a smartfail can be initiated on multiple nodes concurrently – but never more than n/2 – 1 nodes (rounded up)!
If replacing an entire node pool as part of a tech refresh, a SmartPools filepool policy can be crafted to migrate the data to another node pool across the backend network. When complete, the nodes can then be smartfailed out, which should progress swiftly because they are now empty.
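As a rough sketch, such a filepool policy could be created from the CLI along the following lines. The policy name, filter, and target node pool are illustrative placeholders, and the exact filter syntax should be confirmed against the CLI reference for the OneFS release in use:
# isi filepool policies create evacuate_old_pool --begin-filter --file-type=file --end-filter --data-storage-target=new_nodepool
# isi job jobs start SmartPools
# isi filepool policies list
Running the SmartPools job applies the new policy and begins draining the data to the target node pool across the backend network.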
If multiple nodes are smartfailed simultaneously, at the final stage of the process the node removal is serialized with roughly a 60-second pause between each. The smartfail job places the selected nodes in read-only mode while it copies the protection stripes to the cluster’s free space. Using SmartPools to evacuate data from a node or set of nodes, in preparation to remove them, is generally a good idea, and usually a relatively fast process.
SmartPools’ Virtual Hot Spare (VHS) functionality helps ensure that node pools maintain enough free space to successfully re-protect data in the event of a smartfail. Though configured globally, VHS actually operates at the node pool level so that nodes with different size drives reserve the appropriate VHS space. This helps ensure that while data may move from one disk pool to another during repair, it remains on the same class of storage. VHS reservations are cluster wide and configurable, as either a percentage of total storage (0-20%), or as a number of virtual drives (1-4), with the default being 10%.
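For example, a minimal sketch of viewing and adjusting the VHS reservation from the CLI (the values shown are simply illustrative):
# isi storagepool settings view | grep -i spare
# isi storagepool settings modify --virtual-hot-spare-limit-percent 10 --virtual-hot-spare-limit-drives 2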
Note: a smartfail is not guaranteed to remove all data on a node. Any pool in a cluster that’s flagged with the ‘System’ flag can store /ifs/.ifsvar data. A filepool policy to move the regular data won’t address this data. Also, because SmartPools ‘spillover’ may have occurred at some point, there is no guarantee that an ‘empty’ node is completely devoid of data. For this reason, OneFS still has to search the tree for files that may have blocks residing on the node.
Nodes can be easily smartfailed from the OneFS WebUI by navigating to Cluster Management > Hardware Configuration and selecting ‘Actions > More > Smartfail Node’ for the desired node(s).
Similarly, the following CLI commands first initiate and then halt the node smartfail process, respectively. First, the ‘isi devices node smartfail’ command kicks off the smartfail process on a node and removes it from the cluster.
# isi devices node smartfail -h
Syntax
# isi devices node smartfail
    [--node-lnn <integer>]
    [--force | -f]
    [--verbose | -v]
If necessary, the ‘isi devices node stopfail’ command can be used to discontinue the smartfail process on a node.
# isi devices node stopfail -h
Syntax
isi devices node stopfail
    [--node-lnn <integer>]
    [--force | -f]
    [--verbose | -v]
Similarly, individual drives within a node can be smartfailed with the ‘isi devices drive smartfail’ CLI command.
# isi devices drive smartfail { <bay> | --lnum <integer> | --sled <string> }
    [--node-lnn <integer>]
    [{--force | -f}]
    [{--verbose | -v}]
    [{--help | -h}]
Author: Nick Trimbee
Related Blog Posts

PowerScale OneFS Release 9.3 now supports Secure Boot
Fri, 22 Oct 2021 20:50:20 -0000
Many organizations are looking for ways to further secure systems and processes in today's complex security environments. The grim reality is that a device is typically most susceptible to loading malware during its boot sequence.
With the introduction of OneFS 9.3, the UEFI Secure Boot feature is now supported on Isilon A2000 nodes. Not only does the release support the UEFI Secure Boot feature, but OneFS goes a step further by adding FreeBSD’s signature validation. Combining UEFI Secure Boot and FreeBSD’s signature validation helps protect the boot process from potential malware attacks.
The Unified Extensible Firmware Interface (UEFI) Forum standardizes and secures the boot sequence across devices with the UEFI specification. UEFI Secure Boot was introduced in UEFI 2.3.1, allowing only authorized EFI binaries to load.
FreeBSD’s veriexec function is used to perform signature validation for the boot loader and kernel. In addition, the PowerScale Secure Boot feature runs during the node’s bootup process only, using public-key cryptography to verify the signed code, to ensure that only trusted code is loaded on the node.
The Secure Boot feature does not impact cluster performance because the feature is only executed at bootup.
Prerequisites
The OneFS Secure Boot feature is only supported on Isilon A2000 nodes at this time. The cluster must be upgraded and committed to OneFS 9.3. After the release is committed, proceed with upgrading the Node Firmware Package to 11.3 or higher.
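Before enabling Secure Boot, it is worth confirming the running release and the upgrade commit state from any node. For example, a quick check (assuming the standard upgrade CLI is available in your release):
# isi version
# isi upgrade view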
Considerations
PowerScale nodes are not shipped with the Secure Boot feature enabled. The feature must be enabled manually on each node in a cluster. A mixed cluster is supported, where some nodes have the Secure Boot feature enabled and others have it disabled.
A license is not required for the PowerScale Secure Boot feature. The Secure Boot feature can be enabled and disabled at any point, but it requires a maintenance window to reboot the node.
Configuration
You can use IPMI or the BIOS to enable the PowerScale Secure Boot feature, but disabling the feature requires using the BIOS.
For more information about the PowerScale Secure Boot feature, and detailed configuration steps, see the Dell EMC PowerScale OneFS Secure Boot white paper.
For more great information about PowerScale, see the PowerScale Info Hub at: https://infohub.delltechnologies.com/t/powerscale-isilon-1/.
Author: Aqib Kazi

OneFS Firewall Management and Troubleshooting
Thu, 25 May 2023 14:41:59 -0000
In the final blog in this series, we’ll focus on step five of the OneFS firewall provisioning process and turn our attention to some of the management and monitoring considerations and troubleshooting tools associated with the firewall.
One can manage and monitor the firewall in OneFS 9.5 using the CLI, platform API, or WebUI. Because data security threats come from inside an environment as well as out, such as from a rogue IT employee, a good practice is to constrain the use of all-powerful ‘root’, ‘administrator’, and ‘sudo’ accounts as much as possible. Instead of granting cluster admins full rights, a preferred approach is to use OneFS’ comprehensive authentication, authorization, and accounting framework.
OneFS role-based access control (RBAC) can be used to explicitly limit who has access to configure and monitor the firewall. A cluster security administrator selects the desired access zone, creates a zone-aware role within it, assigns privileges, and then assigns members. For example, from the WebUI under Access > Membership and roles > Roles.
When these members log in to the cluster from a configuration interface (WebUI, Platform API, or CLI) they inherit their assigned privileges.
Accessing the firewall from the WebUI and CLI in OneFS 9.5 requires the new ISI_PRIV_FIREWALL administration privilege.
# isi auth privileges -v | grep -i -A 2 firewall
         ID: ISI_PRIV_FIREWALL
Description: Configure network firewall
       Name: Firewall
   Category: Configuration
 Permission: w
This privilege can be assigned one of four permission levels for a role, including:
Permission Indicator | Description |
– | No permission. |
R | Read-only permission. |
X | Execute permission. |
W | Write permission. |
By default, the built-in ‘SystemAdmin’ role is granted write privileges to administer the firewall, while the built-in ‘AuditAdmin’ role has read permission to view the firewall configuration and logs.
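This can be confirmed with the same ‘isi auth roles’ syntax used in the examples below:
# isi auth roles view SystemAdmin | grep -A 2 -i firewall
# isi auth roles view AuditAdmin | grep -A 2 -i firewall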
With OneFS RBAC, an enhanced security approach for a site could be to create two additional roles on a cluster, each with an increasing realm of trust. For example:
1. An IT ops/helpdesk role with ‘read’ access to the firewall attributes would permit monitoring and troubleshooting the firewall, but no changes:
RBAC Role | Firewall Privilege | Permission |
IT_Ops | ISI_PRIV_FIREWALL | Read |
For example:
# isi auth roles create IT_Ops
# isi auth roles modify IT_Ops --add-priv-read ISI_PRIV_FIREWALL
# isi auth roles view IT_Ops | grep -A2 -i firewall
         ID: ISI_PRIV_FIREWALL
 Permission: r
2. A Firewall Admin role would provide full firewall configuration and management rights:
RBAC Role | Firewall Privilege | Permission |
FirewallAdmin | ISI_PRIV_FIREWALL | Write |
For example:
# isi auth roles create FirewallAdmin
# isi auth roles modify FirewallAdmin --add-priv-write ISI_PRIV_FIREWALL
# isi auth roles view FirewallAdmin | grep -A2 -i firewall
         ID: ISI_PRIV_FIREWALL
 Permission: w
Note that when configuring OneFS RBAC, remember to remove the ‘ISI_PRIV_AUTH’ and ‘ISI_PRIV_ROLE’ privileges from all but the most trusted administrators.
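For example, assuming the role modify command supports a ‘--remove-priv’ option in your release, these privileges could be stripped from any role that should not hold them (the role name here is a hypothetical placeholder):
# isi auth roles modify StorageOps --remove-priv ISI_PRIV_AUTH
# isi auth roles modify StorageOps --remove-priv ISI_PRIV_ROLE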
Additionally, enterprise security management tools such as CyberArk can also be incorporated to manage authentication and access control holistically across an environment. These can be configured to change passwords on trusted accounts frequently (every hour or so), require multi-level approvals prior to retrieving passwords, and track and audit password requests and trends.
OneFS firewall limits
When working with the OneFS firewall, there are some upper bounds to the configurable attributes to keep in mind. These include:
Name | Value | Description |
MAX_INTERFACES | 500 | Maximum number of L2 interfaces including Ethernet, VLAN, LAGG interfaces on a node. |
MAX_SUBNETS | 100 | Maximum number of subnets within a OneFS cluster. |
MAX_POOLS | 100 | Maximum number of network pools within a OneFS cluster. |
DEFAULT_MAX_RULES | 100 | Default value of maximum rules within a firewall policy. |
MAX_RULES | 200 | Upper limit of maximum rules within a firewall policy. |
MAX_ACTIVE_RULES | 5000 | Upper limit of total active rules across the whole cluster. |
MAX_INACTIVE_POLICIES | 200 | Maximum number of policies that are not applied to any network subnet or pool. They will not be written into ipfw tables. |
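A simple way to gauge how close a cluster sits to these limits is to count the configured policies and rules from the CLI. For example (the totals include a few lines of header and footer output, so treat them as approximate):
# isi network firewall policies list | wc -l
# isi network firewall rules list | wc -l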
Firewall performance
Be aware that, while the OneFS firewall can greatly enhance the network security of a cluster, by nature of its packet inspection and filtering activity, it does come with a slight performance penalty (generally less than 5%).
Firewall and hardening mode
If OneFS STIG hardening (that is, from ‘isi hardening apply’) is applied to a cluster with the OneFS firewall disabled, the firewall will be automatically activated. On the other hand, if the firewall is already enabled, then there will be no change and it will remain active.
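The firewall’s current enablement state can be confirmed at any time with:
# isi network firewall settings view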
Firewall and user-configurable ports
Some OneFS services allow the TCP/UDP ports on which the daemon listens to be changed. These include:
Service | CLI Command | Default Port |
NDMP | isi ndmp settings global modify --port | 10000 |
S3 | isi s3 settings global modify --https-port | 9020, 9021 |
SSH | isi ssh settings modify --port | 22 |
The default ports for these services are already configured in the associated global policy rules. For example, for the S3 protocol:
# isi network firewall rules list | grep s3
default_pools_policy.rule_s3     55    Firewall rule on s3 service     allow
# isi network firewall rules view default_pools_policy.rule_s3
          ID: default_pools_policy.rule_s3
        Name: rule_s3
       Index: 55
 Description: Firewall rule on s3 service
    Protocol: TCP
   Dst Ports: 9020, 9021
Src Networks: -
   Src Ports: -
      Action: allow
Note that the global policies, or any custom policies, do not auto-update if these ports are reconfigured. This means that the firewall policies must be manually updated when changing ports. For example, if the NDMP port is changed from 10000 to 10001:
# isi ndmp settings global view
                       Service: False
                          Port: 10000
                           DMA: generic
          Bre Max Num Contexts: 64
MSB Context Retention Duration: 300
MSR Context Retention Duration: 600
        Stub File Open Timeout: 15
             Enable Redirector: False
              Enable Throttler: False
       Throttler CPU Threshold: 50
# isi ndmp settings global modify --port 10001
# isi ndmp settings global view | grep -i port
                          Port: 10001
The firewall’s NDMP rule port configuration must also be reset to 10001:
# isi network firewall rules list | grep ndmp
default_pools_policy.rule_ndmp     44    Firewall rule on ndmp service     allow
# isi network firewall rules modify default_pools_policy.rule_ndmp --dst-ports 10001 --live
# isi network firewall rules view default_pools_policy.rule_ndmp | grep -i dst
   Dst Ports: 10001
Note that the ‘--live’ flag is specified to enact this port change immediately.
Firewall and source-based routing
Under the hood, OneFS source-based routing (SBR) and the OneFS firewall both leverage ‘ipfw’. As such, SBR and the firewall share the single ipfw table in the kernel. However, the two features use separate ipfw table partitions.
This allows SBR and the firewall to be activated independently of each other. For example, even if the firewall is disabled, SBR can still be enabled and any configured SBR rules displayed as expected (that is, using ipfw set 0 show).
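For example, the SBR partition of the ipfw table can be displayed on its own, independent of any firewall rules:
# ipfw set 0 show
By contrast, the firewall’s own rules are installed into a separate set, as can be seen in the isi_firewall_d debug output later in this article.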
Firewall and IPv6
Note that the firewall’s global default policies have a rule allowing ICMP6 by default. For IPv6 enabled networks, ICMP6 is critical for the functioning of NDP (Neighbor Discovery Protocol). As such, when creating custom firewall policies and rules for IPv6-enabled network subnets/pools, be sure to add a rule allowing ICMP6 to support NDP. As discussed in a previous blog, an alternative (and potentially easier) approach is to clone a global policy to a new one and just customize its ruleset instead.
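As a sketch only, and assuming the rules CLI accepts a ‘--protocol’ option (matching the ‘Protocol’ field shown in the rule output later in this article), an ICMP6 allow rule for a custom policy might look something like this, where the cloned policy name and index are placeholders:
# isi network firewall policies clone default_pools_policy custom_v6_policy
# isi network firewall rules create custom_v6_policy.rule_icmp6 --index 10 --protocol ICMP6 --src-networks ::/0 --action allow --live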
Firewall and FTP
The OneFS FTP service can work in two modes: Active and Passive. Passive mode is the default, where FTP data connections are created on top of random ephemeral ports. However, because the OneFS firewall requires fixed ports to operate, it only supports the FTP service in Active mode. Attempts to enable the firewall with FTP running in Passive mode will generate the following warning:
# isi ftp settings view | grep -i active
          Active Mode: No
# isi network firewall settings modify --enabled yes
FTP service is running in Passive mode. Enabling network firewall will lead to FTP clients having their connections blocked. To avoid this, please enable FTP active mode and ensure clients are configured in active mode before retrying. Are you sure you want to proceed and enable network firewall? (yes/[no]):
To activate the OneFS firewall in conjunction with the FTP service, first ensure that the FTP service is running in Active mode before enabling the firewall. For example:
# isi ftp settings view | grep -i enable
  FTP Service Enabled: Yes
# isi ftp settings view | grep -i active
          Active Mode: No
# isi ftp settings modify --active-mode true
# isi ftp settings view | grep -i active
          Active Mode: Yes
# isi network firewall settings modify --enabled yes
Note: Verify FTP active mode support and/or firewall settings on the client side, too.
Firewall monitoring and troubleshooting
When it comes to monitoring the OneFS firewall, the following logfiles and utilities provide a variety of information and are a good source to start investigating an issue:
Utility | Description |
/var/log/isi_firewall_d.log | Main OneFS firewall log file, which includes information from firewall daemon. |
/var/log/isi_papi_d.log | Logfile for the platform API, including firewall-related handlers. |
isi_gconfig -t firewall | CLI command that displays all firewall configuration info. |
ipfw show | CLI command that displays the ipfw table residing in the FreeBSD kernel. |
Note that the preceding files and command output are automatically included in logsets generated by the ‘isi_gather_info’ data collection tool.
You can run the isi_gconfig command with the ‘-q’ flag to identify any values that are not at their default settings. For example, the stock (default) isi_firewall_d gconfig context will not report any configuration entries:
# isi_gconfig -q -t firewall
[root] {version:1}
The firewall can also be run in the foreground for additional active rule reporting and debug output. For example, first shut down the isi_firewall_d service:
# isi services -a isi_firewall_d disable
The service 'isi_firewall_d' has been disabled.
Next, start up the firewall with the ‘-f’ flag.
# isi_firewall_d -f
Acquiring kevents for flxconfig
Acquiring kevents for nodeinfo
Acquiring kevents for firewall config
Initialize the firewall library
Initialize the ipfw set
ipfw: Rule added by ipfw is for temporary use and will be auto flushed soon. Use isi firewall instead.
cmd:/sbin/ipfw set enable 0
normal termination, exit code:0
isi_firewall_d is now running
Loaded master FlexNet config (rev:312)
Update the local firewall with changed files: flx_config, Node info, Firewall config
Start to update the firewall rule...
flx_config version changed! latest_flx_config_revision: new:312, orig:0
node_info version changed! latest_node_info_revision: new:1, orig:0
firewall gconfig version changed! latest_fw_gconfig_revision: new:17, orig:0
Start to update the firewall rule for firewall configuration (gconfig)
Start to handle the firewall configure (gconfig)
Handle the firewall policy default_pools_policy
ipfw: Rule added by ipfw is for temporary use and will be auto flushed soon. Use isi firewall instead.
32043 allow tcp from any to any 10000 in
cmd:/sbin/ipfw add 32043 set 8 allow TCP from any to any 10000 in
normal termination, exit code:0
ipfw: Rule added by ipfw is for temporary use and will be auto flushed soon. Use isi firewall instead.
32044 allow tcp from any to any 389,636 in
cmd:/sbin/ipfw add 32044 set 8 allow TCP from any to any 389,636 in
normal termination, exit code:0
Snip...
If the OneFS firewall is enabled and some network traffic is blocked, either this or the ipfw show CLI command will often provide the first clues.
Please note that the ipfw command should NEVER be used to modify the OneFS firewall table!
For example, say a rule is added to the default pools policy denying traffic on port 9876 from all source networks (0.0.0.0/0):
# isi network firewall rules create default_pools_policy.rule_9876 --index=100 --dst-ports 9876 --src-networks 0.0.0.0/0 --action deny --live
# isi network firewall rules view default_pools_policy.rule_9876
          ID: default_pools_policy.rule_9876
        Name: rule_9876
       Index: 100
 Description:
    Protocol: ALL
   Dst Ports: 9876
Src Networks: 0.0.0.0/0
   Src Ports: -
      Action: deny
Running ipfw show and grepping for the port will show this new rule:
# ipfw show | grep 9876
32099    0    0 deny ip from any to any 9876 in
The ipfw show command output also reports the statistics of how many IP packets have matched each rule. This can be incredibly useful when investigating firewall issues. For example, a telnet session is initiated to the cluster on port 9876 from a client:
# telnet 10.224.127.8 9876
Trying 10.224.127.8...
telnet: connect to address 10.224.127.8: Operation timed out
telnet: Unable to connect to remote host
The connection attempt will time out because the port 9876 ‘deny’ rule will silently drop the packets. At the same time, the ipfw show command will increment its counter to report on the denied packets. For example:
# ipfw show | grep 9876
32099    9    540 deny ip from any to any 9876 in
If this behavior is not anticipated or desired, you can find the rule name by searching the rules list for the port number, in this case port 9876:
# isi network firewall rules list | grep 9876
default_pools_policy.rule_9876     100     deny
The offending rule can then be reverted to ‘allow’ traffic on port 9876:
# isi network firewall rules modify default_pools_policy.rule_9876 --action allow --live
Or easily deleted, if preferred:
# isi network firewall rules delete default_pools_policy.rule_9876 --live
Are you sure you want to delete firewall rule default_pools_policy.rule_9876? (yes/[no]): yes
Author: Nick Trimbee