OneFS Log Gather Transmission
Wed, 17 Apr 2024 15:45:51 -0000
The OneFS isi_gather_info utility is the ubiquitous method for collecting and uploading a PowerScale cluster’s context and configuration to assist with the identification and resolution of bugs and issues. As such, it performs the following roles:
- Executes many commands, scripts, and utilities on a cluster, and saves their results
- Collates, or gathers, all these files into a single ‘gzipped’ package
- Optionally transmits this log gather package back to Dell using a choice of several transport methods
By default, a log gather tarfile is written to the /ifs/data/Isilon_Support/pkg/ directory. It can also be uploaded to Dell by the following means:
| Upload mechanism | Description | TCP port | OneFS release support |
| --- | --- | --- | --- |
| SupportAssist / ESRS | Uses Dell Secure Remote Support (SRS) for gather upload. | 443/8443 | Any |
| FTP | Use FTP to upload the completed gather. | 21 | Any |
| FTPS | Use SSH-based encrypted FTPS to upload the gather. | 22 | Default in OneFS 9.5 and later |
| HTTP | Use HTTP to upload the gather. | 80/443 | Any |
As indicated in this table, OneFS 9.5 and later releases now leverage FTPS as the default option for FTP upload, thereby protecting the upload of cluster configuration and logs with an encrypted transmission session.
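To illustrate what FTP-over-TLS upload means in practice, the following is a minimal sketch using Python's standard `ftplib.FTP_TLS` class. This is purely illustrative of the FTPS concept — it is not isi_gather_info's actual code, and the function name and parameters are invented for the example:

```python
import ftplib

def upload_gather_ftps(host, user, password, local_path, remote_name):
    """Illustrative FTPS (FTP over TLS) upload of a gather tarfile.

    All names and parameters here are hypothetical; this is not OneFS code.
    """
    with ftplib.FTP_TLS(host) as ftps:
        ftps.login(user, password)
        ftps.prot_p()  # switch the data channel to encrypted mode as well
        with open(local_path, "rb") as f:
            ftps.storbinary(f"STOR {remote_name}", f)
```

The key difference from plaintext FTP is the TLS-protected control channel plus the `prot_p()` call, which ensures the file data itself is also encrypted in transit.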
Under the hood, the log gather process comprises an eight-phase workflow, in which transmission makes up the penultimate ‘Upload’ phase.
The details of each phase are as follows:
| Phase | Description |
| --- | --- |
| 1. Setup | Reads from the arguments passed in, and from any config files on disk, and sets up the config dictionary, which will be used throughout the rest of the codebase. Most of the code for this step is contained in isilon/lib/python/gather/igi_config/configuration.py. This is also the step in which the program is most likely to exit, if some config arguments end up being invalid. |
| 2. Run local | Executes all the cluster commands, which are run on the same node that is starting the gather. All these commands run in parallel (up to the current parallelism value). This is typically the second longest running phase. |
| 3. Run nodes | Executes the node commands across all of the cluster’s nodes. This runs on each node, and while these commands run in parallel (up to the current parallelism value), they do not run in parallel with the ‘Run local’ step. |
| 4. Collect | Ensures that all of the results end up on the overlord node (the node that started the gather). If the gather is using /ifs, it is very fast; if it is not using /ifs, it needs to SCP all the node results to a single node. |
| 5. Generate Extra Files | Generates nodes_info.xml and package_info.xml. These two files are present in every gather, and provide important metadata about the cluster. |
| 6. Packing | Packs (tars and gzips) all the results. This is typically the longest running phase, often by an order of magnitude. |
| 7. Upload | Transports the tarfile package to its specified destination using SupportAssist, ESRS, FTPS, FTP, HTTP, and so on. Depending on the geographic location, this phase might also be lengthy. |
| 8. Cleanup | Cleans up any intermediary files that were created on the cluster. This phase will run even if the gather fails, or is interrupted. |
Because the isi_gather_info tool is primarily intended for troubleshooting clusters with issues, it runs as root (or compadmin in compliance mode), because it needs to be able to execute under degraded conditions (for example, without GMP, during an upgrade, or during a cluster split). Given these atypical requirements, isi_gather_info is built as a standalone utility, rather than using the platform API for data collection.
While FTPS is the new default and recommended transport, the legacy plaintext FTP upload method is still available in OneFS 9.5 and later. Dell’s log server, ftp.isilon.com, supports both encrypted FTPS and plaintext FTP, so the change does not impact FTP log upload behavior from older releases.
This OneFS 9.5 FTPS security enhancement encompasses three primary areas where an FTPS option is now supported:
- Directly executing the /usr/bin/isi_gather_info utility
- Running using the isi diagnostics gather CLI command set
- Creating a diagnostics gather through the OneFS WebUI
For the isi_gather_info utility, two new options are included in OneFS 9.5 and later releases:
| New option for isi_gather_info | Description | Default value |
| --- | --- | --- |
| --ftp-insecure | Enables the gather to use unencrypted FTP transfer. | False |
| --ftp-ssl-cert | Enables the user to specify the location of a special SSL certificate file. | Empty string. Not typically required. |
Similarly, there are two corresponding options in OneFS 9.5 and later for the isi diagnostics CLI command:
| New option for isi diagnostics | Description | Default value |
| --- | --- | --- |
| --ftp-upload-insecure | Enables the gather to use unencrypted FTP transfer. | No |
| --ftp-upload-ssl-cert | Enables the user to specify the location of a special SSL certificate file. | Empty string. Not typically required. |
Based on these options, the following table provides some command syntax usage examples, for both FTPS and FTP uploads:
| FTP upload type | Description | Example isi_gather_info syntax | Example isi diagnostics syntax |
| --- | --- | --- | --- |
| Secure upload (default) | Upload cluster logs to the Dell log server (ftp.isilon.com) using encrypted FTP (FTPS). | # isi_gather_info or # isi_gather_info --ftp | # isi diagnostics gather start or # isi diagnostics gather start --ftp-upload-insecure=no |
| Secure upload | Upload cluster logs to an alternative server using encrypted FTPS. | # isi_gather_info --ftp-host <FQDN> --ftp-ssl-cert <SSL_CERT_PATH> | # isi diagnostics gather start --ftp-upload-host=<FQDN> --ftp-upload-ssl-cert=<SSL_CERT_PATH> |
| Unencrypted upload | Upload cluster logs to the Dell log server (ftp.isilon.com) using plaintext FTP. | # isi_gather_info --ftp-insecure | # isi diagnostics gather start --ftp-upload-insecure=yes |
| Unencrypted upload | Upload cluster logs to an alternative server using plaintext FTP. | # isi_gather_info --ftp-insecure --ftp-host <FQDN> | # isi diagnostics gather start --ftp-upload-host=<FQDN> --ftp-upload-insecure=yes |
Note that OneFS 9.5 and later releases provide a warning if the cluster admin elects to continue using non-secure FTP for the isi_gather_info tool. Specifically, if the --ftp-insecure option is configured, the following message is displayed, informing the user that plaintext FTP upload is being used, and that the connection and data stream will not be encrypted:
# isi_gather_info --ftp-insecure
You are performing plain text FTP logs upload.
This feature is deprecated and will be removed
in a future release. Please consider the possibility
of using FTPS for logs upload. For further information,
please contact PowerScale support
...
In addition to the command line, log gathers can also be configured using the OneFS WebUI by navigating to Cluster management > Diagnostics > Gather settings.
The Edit gather settings page in OneFS 9.5 and later has been updated to reflect FTPS as the default transport method, plus the addition of radio buttons and text boxes to accommodate the new configuration options.
If plaintext FTP upload is configured, the healthcheck command displays a warning that plaintext upload is in use and is no longer a recommended option.
For reference, the OneFS 9.5 and later isi_gather_info CLI command syntax includes the following options:
| Option | Description |
| --- | --- |
| --upload <boolean> | Enable gather upload. |
| --esrs <boolean> | Use ESRS for gather upload. |
| --noesrs | Do not attempt to upload using ESRS. |
| --supportassist | Attempt SupportAssist upload. |
| --nosupportassist | Do not attempt to upload using SupportAssist. |
| --gather-mode (incremental \| full) | Type of gather: incremental or full. |
| --http-insecure <boolean> | Enable insecure HTTP upload on completed gather. |
| --http-host <string> | HTTP host to use for HTTP upload. |
| --http-path <string> | Path on HTTP server to use for HTTP upload. |
| --http-proxy <string> | Proxy server to use for HTTP upload. |
| --http-proxy-port <integer> | Proxy server port to use for HTTP upload. |
| --ftp <boolean> | Enable FTP upload on completed gather. |
| --noftp | Do not attempt FTP upload. |
| --set-ftp-password | Interactively specify an alternate password for FTP. |
| --ftp-host <string> | FTP host to use for FTP upload. |
| --ftp-path <string> | Path on FTP server to use for FTP upload. |
| --ftp-port <string> | Specifies an alternate FTP port for upload. |
| --ftp-proxy <string> | Proxy server to use for FTP upload. |
| --ftp-proxy-port <integer> | Proxy server port to use for FTP upload. |
| --ftp-mode <value> | Mode of FTP file transfer. Valid values are both, active, and passive. |
| --ftp-user <string> | FTP user to use for FTP upload. |
| --ftp-pass <string> | Specify an alternative password for FTP. |
| --ftp-ssl-cert <string> | Specifies the SSL certificate to use in an FTPS connection. |
| --ftp-upload-insecure <boolean> | Whether to attempt a plaintext FTP upload. |
| --ftp-upload-pass <string> | Password for the FTP upload user. |
| --set-ftp-upload-pass | Specify the FTP upload password interactively. |
When a logfile gather arrives at Dell, it is automatically unpacked by a support process and analyzed using the logviewer tool.
Author: Nick Trimbee
Related Blog Posts
OneFS SupportAssist Architecture and Operation
Fri, 21 Apr 2023 16:41:36 -0000
The previous article in this series looked at an overview of OneFS SupportAssist. Now, we’ll turn our attention to its core architecture and operation.
Under the hood, SupportAssist relies on the following infrastructure and services:
| Service | Name |
| --- | --- |
| ESE | Embedded Service Enabler. |
| isi_rice_d | Remote Information Connectivity Engine (RICE). |
| isi_crispies_d | Coordinator for RICE Incidental Service Peripherals including ESE Start. |
| Gconfig | OneFS centralized configuration infrastructure. |
| MCP | Master Control Program – starts, monitors, and restarts OneFS services. |
| Tardis | Configuration service and database. |
| Transaction journal | Task manager for RICE. |
Of these, ESE, isi_crispies_d, isi_rice_d, and the Transaction Journal are new in OneFS 9.5 and exclusive to SupportAssist. By contrast, Gconfig, MCP, and Tardis are all legacy services that are used by multiple other OneFS components.
The Remote Information Connectivity Engine (RICE) is the new SupportAssist ecosystem that OneFS uses to connect to the Dell backend.
Dell’s Embedded Service Enabler (ESE) is at the core of the connectivity platform and acts as a unified communications broker between the PowerScale cluster and Dell Support. ESE runs as a OneFS service and, on startup, looks for an on-premises gateway server. If none is found, it connects back to the connectivity pipe (SRS). The collector service then interacts with ESE to send telemetry, obtain upgrade packages, transmit alerts and events, and so on.
Depending on the available resources, ESE provides a base functionality with additional optional capabilities to enhance serviceability. ESE is multithreaded, and each payload type is handled by specific threads. For example, events are handled by event threads, binary and structured payloads are handled by web threads, and so on. Within OneFS, ESE gets installed to /usr/local/ese and runs as ‘ese’ user and group.
The responsibilities of isi_rice_d include listening for network changes, getting eligible nodes elected for communication, monitoring notifications from CRISPIES, and engaging Task Manager when ESE is ready to go.
The Task Manager is a core component of the RICE engine. Its responsibility is to watch the incoming tasks that are placed into the journal and to assign workers to step through the tasks until completion. It controls the resource utilization (Python threads) and distributes tasks that are waiting on a priority basis.
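As a rough illustration of that pattern — a journal of pending tasks drained by a bounded pool of workers in priority order — the following Python sketch shows the general idea. The class and method names are invented for illustration and are not RICE's actual API:

```python
import heapq
import itertools
import threading

class TaskJournal:
    """Toy sketch of a priority-ordered task journal drained by workers."""

    def __init__(self):
        self._heap = []
        self._lock = threading.Lock()
        self._counter = itertools.count()  # FIFO tie-breaker within a priority

    def submit(self, priority, fn):
        """Record a pending task; lower priority values are taken first."""
        with self._lock:
            heapq.heappush(self._heap, (priority, next(self._counter), fn))

    def drain(self, workers=1):
        """Assign workers to step through pending tasks until none remain."""
        results = []

        def worker():
            while True:
                with self._lock:
                    if not self._heap:
                        return
                    _, _, fn = heapq.heappop(self._heap)
                results.append(fn())

        threads = [threading.Thread(target=worker) for _ in range(workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results

journal = TaskJournal()
for prio, name in [(2, "telemetry"), (0, "alert"), (1, "log-upload")]:
    journal.submit(prio, lambda n=name: n)

order = journal.drain(workers=1)  # a single worker preserves priority order
print(order)
```

The heap gives the "waiting on a priority basis" behavior, while the worker count caps resource utilization in the same spirit as the Task Manager's thread pool.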
The ‘isi_crispies_d’ service exists to ensure that ESE is only running on the RICE active node, and nowhere else. It acts, in effect, like a specialized MCP just for ESE and RICE-associated services, such as IPA. This entails starting ESE on the RICE active node, re-starting it if it crashes on the RICE active node, and stopping it and restarting it on the appropriate node if the RICE active instance moves to another node. We are using ‘isi_crispies_d’ for this, and not MCP, because MCP does not support a service running on only one node at a time.
The core responsibilities of ‘isi_crispies_d’ include:
- Starting and stopping ESE on the RICE active node
- Monitoring ESE and restarting, if necessary. ‘isi_crispies_d’ restarts ESE on the node if it crashes. It will retry a couple of times and then notify RICE if it’s unable to start ESE.
- Listening for gconfig changes and updating ESE. Stopping ESE if unable to make a change and notifying RICE.
- Monitoring other related services.
The state of ESE, and of other RICE service peripherals, is stored in the OneFS tardis configuration database so that it can be checked by RICE. Similarly, ‘isi_crispies_d’ monitors the OneFS Tardis configuration database to see which node is designated as the RICE ‘active’ node.
The ‘isi_telemetry_d’ daemon is started by MCP and runs when SupportAssist is enabled. It does not have to be running on the same node as the active RICE and ESE instance. Only one instance of ‘isi_telemetry_d’ will be active at any time, and the other nodes will be waiting for the lock.
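The "one active instance, with the other nodes waiting on the lock" pattern described here can be illustrated with a simple advisory file lock. This is a generic sketch — the lock file path and mechanism are assumptions for the example, not OneFS internals:

```python
import fcntl
import os
import tempfile

# Hypothetical lock file path, for illustration only.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "telemetry_demo.lock")

def try_become_active(lock_file):
    """Return True if this claimant becomes the single active instance."""
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False  # another holder is active; caller would wait and retry

first = open(LOCK_PATH, "w")
second = open(LOCK_PATH, "w")

print(try_become_active(first))   # the first claimant becomes active
print(try_become_active(second))  # the second claimant must wait
```

When the active holder releases the lock (or its process dies), a waiting claimant's next attempt succeeds, which is how only one instance stays active at any time.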
You can query the current status and setup of SupportAssist on a PowerScale cluster by using the ‘isi supportassist settings view’ CLI command. For example:
# isi supportassist settings view
Service enabled: Yes
Connection State: enabled
OneFS Software ID: ELMISL08224764
Network Pools: subnet0:pool0
Connection mode: direct
Gateway host: -
Gateway port: -
Backup Gateway host: -
Backup Gateway port: -
Enable Remote Support: Yes
Automatic Case Creation: Yes
Download enabled: Yes
You can also do this from the WebUI by navigating to Cluster management > General settings > SupportAssist:
You can enable or disable SupportAssist by using the ‘isi services’ CLI command set. For example:
# isi services isi_supportassist disable
The service 'isi_supportassist' has been disabled.
# isi services isi_supportassist enable
The service 'isi_supportassist' has been enabled.
# isi services -a | grep supportassist
isi_supportassist    SupportAssist Monitor    Enabled
You can check the core services, as follows:
# ps -auxw | grep -e 'rice' -e 'crispies' | grep -v grep
root  8348  9.4  0.0  109844  60984  -  Ss  22:14  0:00.06  /usr/libexec/isilon/isi_crispies_d /usr/bin/isi_crispies_d
root  8183  8.8  0.0  108060  64396  -  Ss  22:14  0:01.58  /usr/libexec/isilon/isi_rice_d /usr/bin/isi_rice_d
Note that when a cluster is provisioned with SupportAssist, ESRS can no longer be used. However, customers that have not previously connected their clusters to Dell Support can still provision ESRS, but will be presented with a message encouraging them to adopt the best practice of using SupportAssist.
Additionally, SupportAssist in OneFS 9.5 does not currently support IPv6 networking, so clusters deployed in IPv6 environments should continue to use ESRS until SupportAssist IPv6 integration is introduced in a future OneFS release.
Author: Nick Trimbee
OneFS Logfile Collection with isi-gather-info
Sun, 18 Dec 2022 19:11:11 -0000
The previous blog outlining the investigation and troubleshooting of OneFS deadlocks and hang-dumps generated several questions about OneFS logfile gathering. So it seemed like a germane topic to explore in an article.
The OneFS ‘isi_gather_info’ utility has long been a cluster staple for collecting and collating context and configuration that primarily aids support in the identification and resolution of bugs and issues. As such, it is arguably OneFS’ primary support tool and, in terms of actual functionality, it performs the following roles:
- Executes many commands, scripts, and utilities on the cluster, and saves their results.
- Gathers all these files into a single ‘gzipped’ package.
- Transmits the gather package back to Dell, using one of several optional transport methods.
By default, a log gather tarfile is written to the /ifs/data/Isilon_Support/pkg/ directory. It can also be uploaded to Dell using the following means:
| Transport Mechanism | Description | TCP Port |
| --- | --- | --- |
| ESRS | Uses Dell EMC Secure Remote Support (ESRS) for gather upload. | 443/8443 |
| FTP | Use FTP to upload the completed gather. | 21 |
| HTTP | Use HTTP to upload the gather. | 80/443 |
More specifically, the ‘isi_gather_info’ CLI command syntax includes the following options:
| Option | Description |
| --- | --- |
| --upload <boolean> | Enable gather upload. |
| --esrs <boolean> | Use ESRS for gather upload. |
| --gather-mode (incremental \| full) | Type of gather: incremental or full. |
| --http-insecure-upload <boolean> | Enable insecure HTTP upload on completed gather. |
| --http-upload-host <string> | HTTP host to use for HTTP upload. |
| --http-upload-path <string> | Path on HTTP server to use for HTTP upload. |
| --http-upload-proxy <string> | Proxy server to use for HTTP upload. |
| --http-upload-proxy-port <integer> | Proxy server port to use for HTTP upload. |
| --clear-http-upload-proxy-port | Clear the proxy server port to use for HTTP upload. |
| --ftp-upload <boolean> | Enable FTP upload on completed gather. |
| --ftp-upload-host <string> | FTP host to use for FTP upload. |
| --ftp-upload-path <string> | Path on FTP server to use for FTP upload. |
| --ftp-upload-proxy <string> | Proxy server to use for FTP upload. |
| --ftp-upload-proxy-port <integer> | Proxy server port to use for FTP upload. |
| --clear-ftp-upload-proxy-port | Clear the proxy server port to use for FTP upload. |
| --ftp-upload-user <string> | FTP user to use for FTP upload. |
| --ftp-upload-ssl-cert <string> | Specifies the SSL certificate to use in an FTPS connection. |
| --ftp-upload-insecure <boolean> | Whether to attempt a plaintext FTP upload. |
| --ftp-upload-pass <string> | Password for the FTP upload user. |
| --set-ftp-upload-pass | Specify the FTP upload password interactively. |
When the gather arrives at Dell, it is automatically unpacked by a support process and analyzed using the ‘logviewer’ tool.
Under the hood, there are two principal components responsible for running a gather. These are:
| Component | Description |
| --- | --- |
| Overlord | The manager process, triggered by the user, which oversees all the isi_gather_info tasks that are executed on a single node. |
| Minion | The worker process, which runs a series of commands (specified by the overlord) on a specific node. |
The ‘isi_gather_info’ utility is primarily written in Python, with its configuration under the purview of MCP, and RPC services provided by the isi_rpc_d daemon.
For example:
# isi_gather_info&
# ps -auxw | grep -i gather
root  91620  4.4  0.1  125024  79028  1  I+  16:23  0:02.12  python /usr/bin/isi_gather_info (python3.8)
root  91629  3.2  0.0   91020  39728  -  S   16:23  0:01.89  isi_rpc_d: isi.gather.minion.minion.GatherManager (isi_rpc_d)
root  93231  0.0  0.0   11148   2692  0  D+  16:23  0:00.01  grep -i gather
The overlord uses isi_rdo (the OneFS remote command execution daemon) to start up the minion processes and informs them of the commands to be executed via an ephemeral XML file, typically stored at /ifs/.ifsvar/run/<uuid>-gather_commands.xml. The minion then spins up an executor and a command for each entry in the XML file.
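To make the "command for each entry in the XML file" step concrete, here is a short Python sketch that parses a hypothetical command file of this kind with the standard library. The XML schema shown is invented for illustration — the real format used by isi_gather_info is internal and may differ entirely:

```python
import xml.etree.ElementTree as ET

# A *hypothetical* example of an ephemeral command file; the real schema
# used by isi_gather_info is internal and may differ entirely.
doc = """<gather_commands>
  <command name="isi_stat" timeout="60">isi status</command>
  <command name="df" timeout="30">df -h</command>
</gather_commands>"""

# Build (name, command line, timeout) tuples, one per <command> entry.
commands = [
    (c.get("name"), c.text, int(c.get("timeout")))
    for c in ET.fromstring(doc).findall("command")
]
print(commands[0])
```

Each parsed tuple would then be handed to an executor to run, which is the role the minion's executor plays.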
The parallel process executor (the default) acts as a pool, launching commands until the configured parallelism value is reached. The commands themselves take care of running and processing results, checking frequently to ensure that the timeout threshold has not been passed.
The executor also keeps track of which commands are currently running, and how many are complete, and writes them to a file so that the overlord process can display useful information. When this is complete, the executor returns the runtime information to the minion, which records the benchmark file. The executor will also safely shut itself down if the isi_gather_info lock file disappears, such as if the isi_gather_info process is killed.
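A minimal sketch of that "bounded pool of shell commands with a per-command timeout" behavior, using only the Python standard library, looks like the following. It is purely illustrative — the real executor's interfaces are internal to isi_gather_info:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_command(cmd, timeout=10):
    """Run one shell command, enforcing a timeout, and capture its output."""
    try:
        out = subprocess.run(cmd, shell=True, capture_output=True,
                             text=True, timeout=timeout)
        return cmd, out.stdout.strip()
    except subprocess.TimeoutExpired:
        return cmd, "<timed out>"

def run_parallel(commands, parallelism=4, timeout=10):
    """Run commands in a pool, at most `parallelism` at a time."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return dict(pool.map(lambda c: run_command(c, timeout), commands))

# The slow command exceeds the 2-second timeout and is reported as such,
# without blocking the fast commands.
results = run_parallel(["echo alpha", "echo beta", "sleep 30"], timeout=2)
print(results["echo alpha"])
```

The pool size plays the role of the parallelism value, and the per-command timeout mirrors the executor's frequent timeout checks.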
During a gather, the minion returns nothing to the overlord process, because the output of its work is written to disk.
Architecturally, the ‘gather’ process comprises an eight-phase workflow.
The details of each phase are as follows:
| Phase | Description |
| --- | --- |
| 1. Setup | Reads from the arguments passed in, and from any config files on disk, and sets up the config dictionary, which will be used throughout the rest of the codebase. Most of the code for this step is contained in isilon/lib/python/gather/igi_config/configuration.py. This is also the step where the program is most likely to exit, if some config arguments end up being invalid. |
| 2. Run local | Executes all the cluster commands, which are run on the same node that is starting the gather. All these commands run in parallel (up to the current parallelism value). This is typically the second longest running phase. |
| 3. Run nodes | Executes the node commands across all of the cluster’s nodes. This runs on each node, and while these commands run in parallel (up to the current parallelism value), they do not run in parallel with the local step. |
| 4. Collect | Ensures that all results end up on the overlord node (the node that started the gather). If the gather is using /ifs, it is very fast, but if it is not, it needs to SCP all the node results to a single node. |
| 5. Generate Extra Files | Generates nodes_info and package_info.xml. These two files are present in every single gather, and provide important metadata about the cluster. |
| 6. Packing | Packs (tars and gzips) all the results. This is typically the longest running phase, often by an order of magnitude. |
| 7. Upload | Transports the tarfile package to its specified destination. Depending on the geographic location, this phase might also be lengthy. |
| 8. Cleanup | Cleans up any intermediary files that were created on the cluster. This phase will run even if the gather fails or is interrupted. |
Because the isi_gather_info tool is primarily intended for troubleshooting clusters with issues, it runs as root (or compadmin in compliance mode), because it needs to be able to execute under degraded conditions (for example, without GMP, during an upgrade, or during a cluster split). Given these atypical requirements, isi_gather_info is built as a stand-alone utility, rather than using the platform API for data collection.
The time it takes to complete a gather is typically determined by cluster configuration, rather than size. For example, a gather on a small cluster with a large number of NFS shares will take significantly longer than on a large cluster with a similar NFS configuration. Incremental gathers are not recommended, because the base that is required to check against in the log store may be deleted. By default, gathers only persist for two weeks in the log processor.
On completion of a gather, a tarred and gzipped logset is generated and placed under the cluster’s /ifs/data/Isilon_Support/pkg directory by default. A standard gather tarfile unpacks to the following top-level structure:
# du -sh *
536M    IsilonLogs-powerscale-f900-cl1-20220816-172533-3983fba9-3fdc-446c-8d4b-21392d2c425d.tgz
320K    benchmark
24K     celog_events.xml
24K     command_line
128K    complete
449M    local
24K     local.log
24K     nodes_info
24K     overlord.log
83M     powerscale-f900-cl1-1
24K     powerscale-f900-cl1-1.log
119M    powerscale-f900-cl1-2
24K     powerscale-f900-cl1-2.log
134M    powerscale-f900-cl1-3
24K     powerscale-f900-cl1-3.log
In this case, for a three node F900 cluster, the compressed tarfile is 536 MB in size. The bulk of the data, which is primarily CLI command output, logs, and sysctl output, is contained in the ‘local’ and individual node directories (powerscale-f900-cl1-*). Each node directory contains a tarfile, varlog.tar, containing all the pertinent logfiles for that node.
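To see this layout in action without a real cluster, the following shell sketch builds a mock of the structure described above (a per-node directory containing varlog.tar, plus benchmark and local entries), packs it the same way (tar + gzip), and unpacks it. All names here are illustrative:

```shell
# Build a mock of the gather layout (names are illustrative only).
mkdir -p gather/local gather/node-1
echo "command runtimes" > gather/benchmark
touch gather/node-1/messages
# Each node directory contains a varlog.tar of its logfiles.
tar -cf gather/node-1/varlog.tar -C gather/node-1 messages
rm gather/node-1/messages
# Pack the whole gather as a gzipped tarfile, as isi_gather_info does.
tar -czf IsilonLogs-demo.tgz -C gather .
# Unpack and inspect the top-level structure.
mkdir unpacked
tar -xzf IsilonLogs-demo.tgz -C unpacked
ls unpacked
```

Unpacking a real gather with `tar -xzf <tarfile>` yields the same kind of top-level layout, with the bulk of the data in the ‘local’ and per-node directories.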
The root directory of the tarfile file includes the following:
| Item | Description |
| --- | --- |
| benchmark | Runtimes for all commands executed by the gather. |
| celog_events.xml | Cluster and node names, node serial numbers, configuration ID, OneFS version info, and events. |
| complete | Lists of the completed commands run across the cluster and on individual nodes. |
| local | Output of the cluster-wide commands executed on the node that ran the gather. |
| nodes_info | Metadata about the cluster’s nodes. |
| overlord.log | Gather execution and issue log. |
| package_info.xml | Cluster version details, GUID, serial number, and customer info (name, phone, email, and so on). |
| command_line | The command line with which the gather was invoked. |
Notable contents of the ‘local’ directory (all the cluster-wide commands that are executed on the node running the gather) include:
| Local contents item | Description |
| --- | --- |
| isi_alerts_history | History of the cluster’s alerts. |
| isi_job_list | List of the cluster’s Job Engine jobs. |
| isi_job_schedule | The Job Engine’s job schedule. |
| isi_license | Cluster license information. |
| isi_network_interfaces | State and configuration of all the cluster’s network interfaces. |
| isi_nfs_exports | Configuration detail for all the cluster’s NFS exports. |
| isi_services | Listing of all the OneFS services and whether they are enabled or disabled. More detailed configuration for each service (for example, SnapshotIQ) is contained in separate files. |
| isi_smb | Detailed configuration info for all the cluster’s SMB shares. |
| isi_stat | Overall status of the cluster, including networks, drives, and so on. |
| isi_statistics | CPU, protocol, and disk IO stats. |
Contents of each node’s directory include:

| Node contents item | Description |
| --- | --- |
| df | Output of the df command (filesystem capacity and usage). |
| du | Output of the du command (directory space usage). |
| isi_alerts | List of outstanding alerts on the node. |
| ps and ps_full | Lists of all running processes at the time that isi_gather_info was executed. |
As the isi_gather_info command runs, status is provided in the interactive CLI session:
# isi_gather_info
Configuring                      COMPLETE
running local commands           IN PROGRESS \
Progress of local
[########################################################                ] 147/152 files written \
Some active commands are: ifsvar_modules_jobengine_cp, isi_statistics_heat, ifsvar_modules
When the gather has completed, the location of the tarfile on the cluster itself is reported as follows:
# isi_gather_info
Configuring                      COMPLETE
running local commands           COMPLETE
running node commands            COMPLETE
collecting files                 COMPLETE
generating package_info.xml      COMPLETE
tarring gather                   COMPLETE
uploading gather                 COMPLETE

The path to the tar-ed gather is:
/ifs/data/Isilon_Support/pkg/IsilonLogs-h5001-20220830-122839-23af1154-779c-41e9-b0bd-d10a026c9214.tgz
If the gather upload services are unavailable, errors are displayed on the console, as shown here:
...
uploading gather                 FAILED
ESRS failed - ESRS has not been provisioned
FTP failed - pycurl error: (28, 'Failed to connect to ftp.isilon.com port 21 after 81630 ms: Operation timed out')
Author: Nick Trimbee