OneFS Logfile Collection with isi-gather-info
Sun, 18 Dec 2022 19:11:11 -0000
The previous blog outlining the investigation and troubleshooting of OneFS deadlocks and hang-dumps generated several questions about OneFS logfile gathering. So it seemed like a germane topic to explore in an article.
The OneFS ‘isi_gather_info’ utility has long been a cluster staple for collecting and collating context and configuration that primarily aids support in the identification and resolution of bugs and issues. As such, it is arguably OneFS’ primary support tool and, in terms of actual functionality, it performs the following roles:
- Executes many commands, scripts, and utilities on the cluster, and saves their results.
- Gathers all these files into a single ‘gzipped’ package.
- Transmits the gather package back to Dell, using several optional transport methods.
By default, a log gather tarfile is written to the /ifs/data/Isilon_Support/pkg/ directory. It can also be uploaded to Dell using the following means:
| Transport Mechanism | Description | TCP Port |
|---|---|---|
| ESRS | Uses Dell EMC Secure Remote Support (ESRS) for gather upload. | 443/8443 |
| FTP | Use FTP to upload completed gather. | 21 |
| HTTP | Use HTTP to upload gather. | 80/443 |
More specifically, the ‘isi_gather_info’ CLI command syntax includes the following options:
| Option | Description |
|---|---|
| --upload <boolean> | Enable gather upload. |
| --esrs <boolean> | Use ESRS for gather upload. |
| --gather-mode (incremental or full) | Type of gather: incremental or full. |
| --http-insecure-upload <boolean> | Enable insecure HTTP upload on completed gather. |
| --http-upload-host <string> | HTTP host to use for HTTP upload. |
| --http-upload-path <string> | Path on HTTP server to use for HTTP upload. |
| --http-upload-proxy <string> | Proxy server to use for HTTP upload. |
| --http-upload-proxy-port <integer> | Proxy server port to use for HTTP upload. |
| --clear-http-upload-proxy-port | Clear the proxy server port used for HTTP upload. |
| --ftp-upload <boolean> | Enable FTP upload on completed gather. |
| --ftp-upload-host <string> | FTP host to use for FTP upload. |
| --ftp-upload-path <string> | Path on FTP server to use for FTP upload. |
| --ftp-upload-proxy <string> | Proxy server to use for FTP upload. |
| --ftp-upload-proxy-port <integer> | Proxy server port to use for FTP upload. |
| --clear-ftp-upload-proxy-port | Clear the proxy server port used for FTP upload. |
| --ftp-upload-user <string> | FTP user to use for FTP upload. |
| --ftp-upload-ssl-cert <string> | SSL certificate to use in FTPS connection. |
| --ftp-upload-insecure <boolean> | Whether to attempt a plain-text FTP upload. |
| --ftp-upload-pass <string> | Password for the FTP upload user. |
| --set-ftp-upload-pass | Specify the FTP upload password interactively. |
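By way of illustration, here are a few representative invocations sketched from the options above. These are illustrative only: the FTP host and user names are placeholder values.

A full gather, uploaded through ESRS:

# isi_gather_info --gather-mode full --esrs true

A gather with upload disabled, leaving the tarfile under /ifs/data/Isilon_Support/pkg/:

# isi_gather_info --upload false

A gather uploaded to an FTP server, prompting interactively for the password:

# isi_gather_info --ftp-upload-host ftp.example.com --ftp-upload-user gather --set-ftp-upload-pass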
When the gather arrives at Dell, it is automatically unpacked by a support process and analyzed using the ‘logviewer’ tool.
Under the hood, there are two principal components responsible for running a gather. These are:
| Component | Description |
|---|---|
| Overlord | The manager process, triggered by the user, which oversees all the isi_gather_info tasks that are executed on a single node. |
| Minion | The worker process, which runs a series of commands (specified by the overlord) on a specific node. |
The ‘isi_gather_info’ utility is primarily written in Python, with its configuration under the purview of MCP, and RPC services provided by the isi_rpc_d daemon.
For example:
# isi_gather_info&
# ps -auxw | grep -i gather
root 91620  4.4 0.1 125024 79028  1  I+   16:23   0:02.12 python /usr/bin/isi_gather_info (python3.8)
root 91629  3.2 0.0  91020 39728  -  S    16:23   0:01.89 isi_rpc_d: isi.gather.minion.minion.GatherManager (isi_rpc_d)
root 93231  0.0 0.0  11148  2692  0  D+   16:23   0:00.01 grep -i gather
The overlord uses isi_rdo (the OneFS remote command execution daemon) to start up the minion processes and informs them of the commands to be executed via an ephemeral XML file, typically stored at /ifs/.ifsvar/run/<uuid>-gather_commands.xml. The minion then spins up an executor and a command for each entry in the XML file.
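While a gather is in flight, this ephemeral command file can be observed at the path above. Since the <uuid> portion varies per gather, a wildcard listing is the simplest check:

# ls /ifs/.ifsvar/run/*-gather_commands.xml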
The parallel process executor (the default) acts as a pool, launching commands until the specified degree of parallelism is reached. The commands themselves take care of running and processing their results, checking frequently to ensure that the timeout threshold has not been exceeded.
The executor also keeps track of which commands are currently running and how many are complete, and writes this information to a file so that the overlord process can display useful status. When execution is complete, the executor returns the runtime information to the minion, which records it in the benchmark file. The executor will also safely shut itself down if the isi_gather_info lock file disappears, such as when the isi_gather_info process is killed.
During a gather, the minion returns nothing to the overlord process, because the output of its work is written to disk.
Architecturally, the ‘gather’ process comprises an eight-phase workflow. The details of each phase are as follows:
| Phase | Description |
|---|---|
| 1. Setup | Reads from the arguments passed in, and from any config files on disk, and sets up the config dictionary, which will be used throughout the rest of the codebase. Most of the code for this step is contained in isilon/lib/python/gather/igi_config/configuration.py. This is also the step where the program is most likely to exit, if some config arguments end up being invalid. |
| 2. Run local | Executes all the cluster commands, which are run on the same node that is starting the gather. All these commands run in parallel (up to the current parallelism value). This is typically the second longest running phase. |
| 3. Run nodes | Executes the node commands across all of the cluster’s nodes. These commands run in parallel on each node (up to the current parallelism value), but they do not run in parallel with the local step. |
| 4. Collect | Ensures that all results end up on the overlord node (the node that started the gather). If the gather is using /ifs, this is very fast; if not, it needs to SCP all the node results to a single node. |
| 5. Generate Extra Files | Generates nodes_info and package_info.xml. These two files are present in every single gather, and contain important metadata about the cluster. |
| 6. Packing | Packs (tars and gzips) all the results. This is typically the longest running phase, often by an order of magnitude. |
| 7. Upload | Transports the tarfile package to its specified destination. Depending on the geographic location, this phase might also be lengthy. |
| 8. Cleanup | Cleans up any intermediary files that were created on the cluster. This phase runs even if the gather fails or is interrupted. |
Because the isi_gather_info tool is primarily intended for troubleshooting clusters with issues, it runs as root (or compadmin in compliance mode), because it needs to be able to execute under degraded conditions (that is, without GMP, during upgrade, and under cluster splits, and so on). Given these atypical requirements, isi_gather_info is built as a stand-alone utility, rather than using the platform API for data collection.
The time it takes to complete a gather is typically determined by cluster configuration, rather than size. For example, a gather on a small cluster with a large number of NFS exports may take significantly longer than one on a larger cluster with a simpler configuration. Incremental gathers are not recommended, because the base that’s required to check against in the log store may have been deleted. By default, gathers only persist for two weeks in the log processor.
On completion of a gather, a tarred and zipped logset is generated and placed under the cluster’s /ifs/data/Isilon_Support/pkg directory by default. A standard gather tarfile unpacks to the following top-level structure:
# du -sh *
536M    IsilonLogs-powerscale-f900-cl1-20220816-172533-3983fba9-3fdc-446c-8d4b-21392d2c425d.tgz
320K    benchmark
 24K    celog_events.xml
 24K    command_line
128K    complete
449M    local
 24K    local.log
 24K    nodes_info
 24K    overlord.log
 83M    powerscale-f900-cl1-1
 24K    powerscale-f900-cl1-1.log
119M    powerscale-f900-cl1-2
 24K    powerscale-f900-cl1-2.log
134M    powerscale-f900-cl1-3
 24K    powerscale-f900-cl1-3.log
In this case, for a three-node F900 cluster, the compressed tarfile is 536 MB in size. The bulk of the data, which is primarily CLI command output, logs, and sysctl output, is contained in the ‘local’ and individual node directories (powerscale-f900-cl1-*). Each node directory contains a tarfile, varlog.tar, containing all the pertinent logfiles for that node.
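Since varlog.tar is a standard tar archive, each node’s logs can be listed or extracted with ordinary tar syntax. For example, using the unpacked gather above (the extraction directory here is arbitrary):

# tar -tf powerscale-f900-cl1-1/varlog.tar | head
# mkdir -p /ifs/tmp/node1_logs
# tar -xf powerscale-f900-cl1-1/varlog.tar -C /ifs/tmp/node1_logs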
The root directory of the tarfile includes the following:

| Item | Description |
|---|---|
| benchmark | Runtimes for all commands executed by the gather. |
| celog_events.xml | Cluster/node names, node serial numbers, configuration ID, OneFS version info, and events. |
| complete | Lists of the completed commands run across the cluster and on individual nodes. |
| local | Output of the cluster-wide commands executed on the node that ran the gather. |
| nodes_info | Metadata about the cluster’s nodes. |
| overlord.log | Gather execution and issue log. |
| package_info.xml | Cluster version details, GUID, S/N, and customer info (name, phone, email, and so on). |
| command_line | |
Notable contents of the ‘local’ directory (all the cluster-wide commands that are executed on the node running the gather) include:

| Local Contents Item | Description |
|---|---|
| isi_alerts_history | History of the cluster’s alerts. |
| isi_job_list | Listing of the cluster’s Job Engine jobs. |
| isi_job_schedule | The cluster’s Job Engine job schedule. |
| isi_license | Licensing status of the cluster’s OneFS modules. |
| isi_network_interfaces | State and configuration of all the cluster’s network interfaces. |
| isi_nfs_exports | Configuration detail for all the cluster’s NFS exports. |
| isi_services | Listing of all the OneFS services and whether they are enabled or disabled. More detailed configuration for each service is contained in separate files. |
| isi_smb | Detailed configuration info for all the cluster’s SMB shares. |
| isi_stat | Overall status of the cluster, including networks, drives, and so on. |
| isi_statistics | CPU, protocol, and disk IO stats. |
Contents of each individual node directory include:

| Node Contents Item | Description |
|---|---|
| df | Output of the df command. |
| du | Output of the du command. |
| isi_alerts | List of outstanding alerts on the node. |
| ps and ps_full | Lists of all running processes at the time that isi_gather_info was executed. |
As the isi_gather_info command runs, status is provided in the interactive CLI session:
# isi_gather_info
Configuring                                               COMPLETE
running local commands                                    IN PROGRESS
\ Progress of local
[########################################################    ] 147/152 files written
\ Some active commands are: ifsvar_modules_jobengine_cp, isi_statistics_heat, ifsvar_modules
When the gather has completed, the location of the tarfile on the cluster itself is reported as follows:
# isi_gather_info
Configuring                                               COMPLETE
running local commands                                    COMPLETE
running node commands                                     COMPLETE
collecting files                                          COMPLETE
generating package_info.xml                               COMPLETE
tarring gather                                            COMPLETE
uploading gather                                          COMPLETE
The path to the tarred gather is:
/ifs/data/Isilon_Support/pkg/IsilonLogs-h5001-20220830-122839-23af1154-779c-41e9-b0bd-d10a026c9214.tgz
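If needed, the package contents can be sanity-checked in place with standard tar syntax, for example:

# tar -tzf /ifs/data/Isilon_Support/pkg/IsilonLogs-h5001-20220830-122839-23af1154-779c-41e9-b0bd-d10a026c9214.tgz | head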
If the gather upload services are unavailable, errors are displayed on the console, as shown here:
…
uploading gather                                          FAILED
ESRS failed - ESRS has not been provisioned
FTP failed - pycurl error: (28, 'Failed to connect to ftp.isilon.com port 21 after 81630 ms: Operation timed out')
Author: Nick Trimbee
Related Blog Posts
PowerScale OneFS 9.7
Wed, 13 Dec 2023 13:55:00 -0000
Dell PowerScale is already powering up the holiday season with the launch of the innovative OneFS 9.7 release, which shipped today (13th December 2023). This new 9.7 release is an all-rounder, introducing PowerScale innovations in Cloud, Performance, Security, and ease of use.
After the debut of APEX File Storage for AWS earlier this year, OneFS 9.7 extends and simplifies PowerScale’s public cloud offering, delivering more features on more instance types across more regions.
In addition to providing the same OneFS software platform on-prem and in the cloud, customer managed for full control, APEX File Storage for AWS in OneFS 9.7 sees a 60% capacity increase, providing linear capacity and performance scaling up to six SSD nodes and 1.6 PiB per namespace/cluster, and up to 10GB/s reads and 4GB/s writes per cluster. This makes it a solid fit for traditional file shares and home directories, for vertical workloads such as M&E, healthcare, life sciences, and finserv, and for next-gen AI, ML, and analytics applications.
Enhancements to APEX File Storage for AWS
PowerScale’s scale-out architecture can be deployed on customer-managed AWS EC2 and EBS infrastructure, providing the scale and performance needed to run a variety of unstructured workflows in the public cloud. Plus, OneFS 9.7 provides an ‘easy button’ for streamlined AWS infrastructure provisioning and deployment.
Once in the cloud, you can further leverage existing PowerScale investments by accessing and orchestrating your data through the platform's multi-protocol access and APIs.
This includes the common OneFS control plane (CLI, WebUI, and platform API), and the same enterprise features: Multi-protocol, SnapshotIQ, SmartQuotas, Identity management, and so on.
With OneFS 9.7, APEX File Storage for AWS also sees the addition of support for HDFS and FTP protocols, in addition to NFS, SMB, and S3. Granular performance prioritization and throttling is also enabled with SmartQoS, allowing admins to configure limits on the maximum number of protocol operations that NFS, S3, SMB, or mixed protocol workloads can consume on an APEX File Storage for AWS cluster.
Security
With data integrity and protection being top of mind in this era of unprecedented cyber threats, OneFS 9.7 brings a bevy of new features and functionality to keep your unstructured data and workloads more secure than ever. These new OneFS 9.7 security enhancements help address US Federal and DoD mandates, such as FIPS 140-2 and DISA STIGs – in addition to general enterprise data security requirements. Included in the new OneFS 9.7 release is a simple cluster configuration backup and restore utility, address space layout randomization, and single sign-on (SSO) lookup enhancements.
Data mobility
On the data replication front, SmartSync sees the introduction of GCP as an object storage target in OneFS 9.7, in addition to ECS, AWS and Azure. The SmartSync data mover allows flexible data movement and copying, incremental resyncs, push and pull data transfer, and one-time file to object copy.
Performance improvements
Building on the streaming read performance delivered in a prior release, OneFS 9.7 also unlocks dramatic write performance enhancements, particularly for the all-flash NVMe platforms - plus infrastructure support for future node hardware platform generations. A sizable boost in throughput to a single client helps deliver performance for the most demanding GenAI workloads, particularly for the model training and inferencing phases. Additionally, the scale-out cluster architecture enables performance to scale linearly as GPUs are increased, allowing PowerScale to easily support AI workflows from small to large.
Cluster support for InsightIQ 5.0
The new InsightIQ 5.0 software expands PowerScale monitoring capabilities, including a new user interface, automated email alerts, and added security. InsightIQ 5.0 is available today for all existing and new PowerScale customers at no additional charge. These innovations are designed to simplify management, expand scale and security, and automate operations for PowerScale performance monitoring for AI, GenAI, and all other workloads.
In summary, OneFS 9.7 brings new cloud, security, data mobility, performance, and monitoring features and functionality to the Dell PowerScale ecosystem.
We’ll be taking a deeper look at these new features and functionality in blog articles over the course of the next few weeks.
Meanwhile, the new OneFS 9.7 code is available on the Dell Support site, as both an upgrade and reimage file, allowing both installation and upgrade of this new release.
OneFS SSL Certificate Renewal – Part 1
Thu, 16 Nov 2023 04:57:00 -0000
When using either the OneFS WebUI or platform API (pAPI), all communication sessions are encrypted using SSL (Secure Sockets Layer), also known as Transport Layer Security (TLS). In this series, we will look at how to replace or renew the SSL certificate for the OneFS WebUI.
SSL requires a certificate that serves two principal functions: it enables encrypted communication by using Public Key Infrastructure (PKI), and it authenticates the identity of the certificate’s holder.
Architecturally, SSL consists of four fundamental components:
| SSL Component | Description |
|---|---|
| Alert | Reports issues. |
| Change cipher spec | Implements negotiated crypto parameters. |
| Handshake | Negotiates crypto parameters for SSL session. Can be used for many SSL/TCP connections. |
| Record | Provides encryption and MAC. |
These components sit between the application and transport layers in the protocol stack.
The basic handshake process begins with a client requesting an HTTPS WebUI session to the cluster. OneFS then returns the SSL certificate and public key. The client creates a session key, encrypted with the public key it received from OneFS. At this point, only the client knows the session key. The client then sends its encrypted session key to the cluster, which decrypts it with the private key; now both the client and OneFS know the session key. Finally, the session, encrypted using the symmetric session key, can be established. OneFS automatically defaults to the best supported version of SSL, based on the client request.
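To see the certificate a cluster presents during this exchange, a standard openssl client query can be run from any host that can reach the cluster. This is a generic openssl technique, not an OneFS-specific tool: cluster.example.com is a placeholder, and 8080 is the cluster’s default WebUI HTTPS port.

# openssl s_client -connect cluster.example.com:8080 </dev/null 2>/dev/null | openssl x509 -noout -subject -dates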
A PowerScale cluster initially contains a self-signed certificate, which can be used as-is or replaced with a certificate issued by a third-party certificate authority (CA). If the self-signed certificate is used, upon expiry it must be replaced with either a third-party (public or private) CA-issued certificate or another self-signed certificate generated on the cluster. The following are the default locations for the server.crt and server.key files.
| File | Location |
|---|---|
| SSL certificate | /usr/local/apache2/conf/ssl.crt/server.crt |
| SSL certificate key | /usr/local/apache2/conf/ssl.key/server.key |
The ‘isi certificate settings view’ CLI command displays all of the certificate-related configuration options. For example:
# isi certificate settings view
         Certificate Monitor Enabled: Yes
Certificate Pre Expiration Threshold: 4W2D
Default HTTPS Certificate
                                  ID: default
                             Subject: C=US, ST=Washington, L=Seattle, O="Isilon", OU=Isilon, CN=Dell, emailAddress=tme@isilon.com
                              Status: valid
The ‘certificate monitor enabled’ and ‘certificate pre expiration threshold’ configuration options above govern a nightly cron job, which monitors the expiration of each managed certificate and fires a CELOG alert if a certificate is set to expire within the configured threshold. Note that the default threshold is 30 days (4W2D, which represents 4 weeks plus 2 days). The ‘ID: default’ configuration option indicates that this certificate is the default TLS certificate.
The basic certificate renewal or creation flow is as follows:
The steps below include options to complete a self-signed certificate replacement or renewal, or to request an SSL replacement or renewal from a Certificate Authority (CA).
Backing up the existing SSL certificate
The first task is to obtain the list of certificates by running the following CLI command, and identify the appropriate one to renew:
# isi certificate server list
ID      Name     Status  Expires
-------------------------------------------
eb0703b default  valid   2025-10-11T10:45:52
-------------------------------------------
It’s always prudent to save a backup of the original certificate and key. This can easily be accomplished using the following CLI commands, which, in this case, create the directory ‘/ifs/data/ssl_bkup’, set its permissions to root-only access, and copy the original key and certificate to it:
# mkdir -p /ifs/data/ssl_bkup
# chmod 700 /ifs/data/ssl_bkup
# cp /usr/local/apache24/conf/ssl.crt/server.crt /ifs/data/ssl_bkup
# cp /usr/local/apache24/conf/ssl.key/server.key /ifs/data/ssl_bkup
# cd !$
cd /ifs/data/ssl_bkup
# ls
server.crt      server.key
Renewing or creating a certificate
The next step in the process involves either the renewal of an existing certificate or creation of a certificate from scratch. In either case, first, create a temporary directory, for example /ifs/tmp:
# mkdir /ifs/tmp; cd /ifs/tmp
a) Renew an existing self-signed Certificate.
The following syntax creates a renewal certificate based on the existing ssl.key. The value of the ‘-days’ parameter can be adjusted to generate a certificate with the desired expiration date. For example, the following command will create a one-year certificate.
# cp /usr/local/apache2/conf/ssl.key/server.key ./
# openssl req -new -days 365 -nodes -x509 -key server.key -out server.crt
Answer the system prompts to complete the self-signed SSL certificate generation process, entering the pertinent location and contact information. For example:
Country Name (2 letter code) [AU]:US
State or Province Name (full name) [Some-State]:Washington
Locality Name (eg, city) []:Seattle
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Isilon
Organizational Unit Name (eg, section) []:TME
Common Name (e.g. server FQDN or YOUR name) []:isilon.com
Email Address []:tme@isilon.com
When all the information has been successfully entered, the renewed server.crt and the server.key files will be present in the /ifs/tmp directory.
Optionally, the attributes and integrity of the certificate can be verified with the following syntax:
# openssl x509 -text -noout -in server.crt
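It is also prudent to confirm that the certificate and private key actually match. One common openssl technique is to compare the modulus digest of each; the two values should be identical:

# openssl x509 -noout -modulus -in server.crt | openssl md5
# openssl rsa -noout -modulus -in server.key | openssl md5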
Next, proceed directly to the ‘Add the certificate to the cluster’ steps in section 4 of this article.
b) Alternatively, a certificate and key can be generated from scratch, if preferred.
The following CLI command can be used to create a 2048-bit RSA private key:
# openssl genrsa -out server.key 2048
Generating RSA private key, 2048 bit long modulus
............+++++
...........................................................+++++
e is 65537 (0x10001)
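The integrity of the new key can then be confirmed with a standard openssl check:

# openssl rsa -check -noout -in server.key
RSA key ok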
Next, create a certificate signing request:
# openssl req -new -nodes -key server.key -out server.csr
For example:
# openssl req -new -nodes -key server.key -out server.csr -reqexts SAN -config <(cat /etc/ssl/openssl.cnf <(printf "[SAN]\nsubjectAltName=DNS:isilon.com"))
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:US
State or Province Name (full name) [Some-State]:WA
Locality Name (eg, city) []:Seattle
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Isilon
Organizational Unit Name (eg, section) []:TME
Common Name (e.g. server FQDN or YOUR name) []:h7001
Email Address []:tme@isilon.com

Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:1234
An optional company name []:
#
Answer the system prompts to complete the CSR generation process, entering the pertinent location and contact information. Additionally, a ‘challenge password’ of at least four characters must be selected and entered.
As prompted, enter the information to be incorporated into the certificate request. When completed, the server.csr and server.key files will appear in the /ifs/tmp directory.
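Before submitting it to a CA, the CSR’s contents and signature can be verified with standard openssl syntax:

# openssl req -text -noout -verify -in server.csr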
If desired, a CSR that includes Subject Alternative Names (SANs) can be generated for a Certificate Authority. Additional hostname entries are added as a comma-separated list (for example, DNS:isilon.com,DNS:www.isilon.com), as shown in the sketch below.
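For example, here is the earlier CSR command extended with a second hostname (both names are illustrative):

# openssl req -new -nodes -key server.key -out server.csr -reqexts SAN -config <(cat /etc/ssl/openssl.cnf <(printf "[SAN]\nsubjectAltName=DNS:isilon.com,DNS:www.isilon.com"))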
In the next article, we will look at the certificate signing, addition, and verification steps of the process.