Accelerating and Optimizing AI Operations with Infrastructure as Code
Fri, 03 May 2024 12:00:00 -0000
Achieving maturity in a DevOps organization requires overcoming various barriers and following specific steps. The level of maturity attained depends on the short-term and long-term goals set for the infrastructure. In the short term, IT teams must focus on upskilling their resources and integrating tools for containerization and automation throughout the operating lifecycle, from Day 0 to Day 2. Any progress made in scaling up containerized environments and automating processes significantly enhances the long-term economic viability and sustainability of the company. In the long term, maturity involves deploying these solutions across multicloud, multisite landscapes and effectively balancing workloads.
The optimization of your AI applications, and by extension other high-value workloads, hinges on the velocity, scalability, and efficacy of your infrastructure, as well as the maturity of your DevOps processes. Before the AI boom, recent survey results indicated that less than 50% of infrastructure operations workflows were automated overall; pair that with a twofold increase in application counts, and organizations may struggle against the waves of change[1].
From compute capabilities to storage density and speed across unstructured, block, and file formats, there are fundamental elements of automation ready for swift integration to establish a robust foundation. Layering pre-built integration tools and a complementary portfolio of products at each stage can ease the journey toward ramping up AI.
There are important considerations regarding the various hardware infrastructure components for a generative AI system, including high-performance computing, high-speed networking, and scalable, high-capacity, low-latency storage, to name a few. The infrastructure requirements for AI/ML workloads are dynamic and depend on several factors, including the nature of the task, the size of the dataset, the complexity of the model, and the desired performance levels. There is no one-size-fits-all solution when it comes to GenAI infrastructure, as different tasks and projects may demand unique configurations. Central to the success of generative AI initiatives is the adoption of Infrastructure-as-Code (IaC) principles, which facilitate the automation and orchestration of underlying infrastructure components. By leveraging IaC tools like Red Hat Ansible and HashiCorp Terraform, organizations can streamline the deployment and management of hardware resources, ensuring seamless integration with GenAI workloads.
At the base of this foundation are the Red Hat Ansible modules for Dell, which speed up the provisioning of servers and storage for quick AI application workload mobility.
Ansible playbooks can automate server configuration, provisioning, deployments, and updates seamlessly while data is being collected. Because Ansible playbooks are declarative and can be modified at any time, they can be changed in real time without interrupting processes or end users.
Compute
On the compute front, a lot goes into configuring servers for the different AI and ML operations:
- GPU Drivers and CUDA Toolkit Installation: Install appropriate GPU drivers for the server's GPU hardware. For example, install the CUDA Toolkit and drivers to enable GPU acceleration for deep learning frameworks such as TensorFlow and PyTorch.
- Deep Learning Framework Installation: Install popular deep learning frameworks such as TensorFlow or PyTorch, along with their associated dependencies.
- Containerization: Consider using containerization technologies such as Docker or Kubernetes to encapsulate AI workloads and their dependencies into portable and isolated containers. Containerization facilitates reproducibility, scalability, and resource isolation, making it easier to deploy and manage GenAI workloads across different environments.
- Performance Optimization: Optimize server configurations, kernel parameters, and system settings to maximize performance and resource utilization for GenAI workloads. Tune CPU and GPU settings, memory allocation, disk I/O, and network configurations based on workload characteristics and hardware capabilities.
- Monitoring and Management: Implement monitoring and management tools to track server performance metrics, resource utilization, and workload behavior in real time.
- Security Hardening: Ensure server security by applying security best practices, installing security patches and updates, configuring firewalls, and implementing access controls. Protect sensitive data and AI models from unauthorized access, tampering, or exploitation by following security guidelines and compliance standards.
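A few of the configuration steps listed above could be expressed as an Ansible playbook along the following lines. This is only a minimal sketch using general-purpose modules, not a Dell-provided playbook: the inventory group name, the package names, the sysctl value, and the choice of firewalld are all assumptions that depend on the operating system and vendor repositories in use. The sysctl task also assumes the ansible.posix collection is installed.

```yaml
---
# Sketch only: group name, package names, and tuning values are assumptions.
- name: Prepare GPU servers for AI workloads
  hosts: gpu_servers            # hypothetical inventory group
  become: true
  vars:
    gpu_packages:               # assumed names; depend on distro and vendor repo
      - cuda-toolkit
    container_packages:         # assumed container runtime package
      - podman
  tasks:
    - name: Install GPU driver and toolkit packages
      ansible.builtin.package:
        name: "{{ item }}"
        state: present
      loop: "{{ gpu_packages }}"

    - name: Install a container runtime
      ansible.builtin.package:
        name: "{{ item }}"
        state: present
      loop: "{{ container_packages }}"

    - name: Tune a kernel parameter (illustrative value)
      ansible.posix.sysctl:
        name: vm.swappiness
        value: "10"
        state: present
        reload: true

    - name: Keep the firewall service enabled and running
      ansible.builtin.service:
        name: firewalld         # assumes a firewalld-based distribution
        state: started
        enabled: true
```

Running a playbook like this with `ansible-playbook --check` first reports what would change without applying it, which is a reasonable way to review tuning changes before they reach production servers.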
The Dell OpenManage Ansible collection offers modules and roles at both the iDRAC/Redfish interface level and the OpenManage Enterprise (OME) level for configuring servers such as the PowerEdge XE9680, which is designed to collect, develop, train, and deploy large language models (LLMs).
The following is a summary of the OME and iDRAC modules and roles as part of the openmanage collection:
Storage
When it comes to AI and storage, customers rely on scalable and simple access to the file systems that hold the growing volumes of data used for data processing and training. AI requires unstructured data storage for the rich context and nuance that will be accessed during the building phase. User access requirements also vary widely, and Ansible automation playbooks can help teams change and adapt quickly.
Dell PowerScale is the world’s leading scale-out NAS platform, and it recently became the first Ethernet storage certified on NVIDIA SuperPOD. When it comes to Ansible automation, PowerScale comes with an extensive set of modules that covers a wide range of platform operations:
Software-defined storage
Hyperconverged platforms like PowerFlex offer highly scalable and configurable compute and storage clusters. In addition to common Day 2 tasks like storage provisioning, data protection, and user management, the Ansible collection for PowerFlex can be used for cluster deployment and expansion. Here is a summary of what the Ansible collection for PowerFlex offers:
Conclusion
The one thing agreed upon is that generative AI tools need scale, repeatability, and reliability beyond anything previously built across software and the data center combined. This is precisely what building infrastructure-as-code practices into a multisite operation is designed to do. From PowerEdge to PowerScale, the level of capacity and performance is unmatched. This allows AI operations and generative AI to absorb, grow, and provide the intelligence that organizations need to be competitive and innovative.
[1] Infrastructure-as-code and DevOps Automation: The Keys to Unlocking Innovation and Resilience, September 2023
Other resources:
GenAI Acceleration Depends on Infrastructure as Code
Authors: Jennifer Aspesi, Parasar Kodati
Related Blog Posts
For the Year 2022: Ansible Integration Enhancements for the Dell Infrastructure Solutions Portfolio
Mon, 29 Apr 2024 19:20:40 -0000
The Dell infrastructure portfolio spans the entire hybrid cloud, from storage to compute to networking, along with all the software functionality to deploy, manage, and monitor different application stacks, from traditional databases to containerized applications deployed on Kubernetes. When it comes to integrating the infrastructure portfolio with third-party IT operations platforms, Ansible is at the top of the list in terms of expanding the scope and depth of integration.
Here is a summary of the enhancements we made to the various Ansible modules across the Dell portfolio in 2022:
- The Ansible plugin for PowerStore had four different releases (1.5, 1.6, 1.7, and 1.8) with the following capabilities:
- New modules:
- dellemc.powerstore.ldap_account – To manage LDAP account on Dell PowerStore
- dellemc.powerstore.ldap_domain - To manage LDAP domain on Dell PowerStore
- dellemc.powerstore.dns - To manage DNS on Dell PowerStore
- dellemc.powerstore.email - To manage email on Dell PowerStore
- dellemc.powerstore.ntp - To manage NTP on Dell PowerStore
- dellemc.powerstore.remote_support – To manage remote support: get the details, modify the attributes, verify the connection, and send a test alert
- dellemc.powerstore.remote_support_contact – To manage remote support contact on Dell PowerStore
- dellemc.powerstore.smtp_config – To manage SMTP config
- Added support for the host connectivity option to host and host group
- Added support for cluster creation and validating cluster creation attributes
- Data operations:
- Added support to clone, refresh, and restore a volume
- Added support to configure/remove the metro relationship for a volume
- Added support to modify the role of replication sessions
- Added support to clone, refresh, and restore a volume group
- File system:
- Added support to associate/disassociate a protection policy to/from a NAS server
- Added support to handle filesystem and NAS server replication sessions
- Ansible execution:
- Added an execution environment manifest file to support building an execution environment with Ansible Builder
- Enabled check_mode support for Info modules
- Updated modules to adhere to Ansible community guidelines
- The Info module is enhanced to list DNS servers, email notification destinations, NTP servers, remote support configuration, remote support contacts and SMTP configuration, LDAP domain, and LDAP accounts.
- Visit this GitHub page to go through release history: https://github.com/dell/ansible-powerstore/blob/main/CHANGELOG.rst
- The Ansible plugin for PowerFlex had four different releases (1.2, 1.3, 1.4, and 1.5) with the following capabilities:
- New modules:
- dellemc.powerflex.replication_consistency_group – To manage replication consistency groups on Dell PowerFlex
- dellemc.powerflex.mdm_cluster – To manage a MDM cluster on Dell PowerFlex
- dellemc.powerflex.protection_domain – To manage a Protection Domain on Dell PowerFlex
- The info module is enhanced to support listing the replication consistency groups, volumes, and storage pools with the statistics data.
- Storage management:
- The storage pool module is enhanced to get the details with the statistics data.
- The volume module is enhanced to get the details with the statistics data.
- Ansible execution:
- Added an execution environment manifest file to support building an execution environment with Ansible Builder
- Enabled check_mode support for the Info module
- Updated modules to adhere to Ansible community guidelines
- Visit this GitHub page to go through release history: https://github.com/dell/ansible-powerflex/blob/main/CHANGELOG.rst
- The Ansible plugin for PowerMax had four different releases (1.7, 1.8, 2.0 and 2.1) with the following capabilities:
- New module: dellemc.powermax.initiator – To manage initiators
- Host operations:
- Added case-insensitive handling of the host WWN to the host and masking view modules.
- Enhanced the host module to add or remove initiators to or from the host using an alias.
- Data operations:
- Enhanced storage group module to support
- Moving volumes to destination storage groups.
- Making volume name an optional parameter while adding a new volume to a storage group.
- Setting host I/O limits for existing storage groups and added the ability to force move devices between storage groups with SRDF protection.
- Enhanced volume module to support
- A cylinders option to specify size while creating a LUN, and added the ability to create volumes with identifier_name and volume_id.
- Renaming volumes that were created without a name.
- Enhanced the RDF group module to get volume pair information for an SRDF group.
- Added an execution environment manifest file to support building an execution environment with Ansible Builder
- Added rotating file handler for log files
- Enhanced the info module to list the initiators, get volume details and masking view connection information
- Enhanced the verifycert parameter in all modules to support file paths for custom certificate location.
- Renamed items:
- Names of previously released modules have been changed from dellemc_powermax_<module name> to <module name>
- The Gatherfacts module is renamed to Info
- Renamed metro DR module input parameters
- Visit this GitHub page to go through release history: https://github.com/dell/ansible-powermax/blob/main/CHANGELOG.rst
- The Ansible plugin for PowerScale had four different releases (1.5, 1.6, 1.7, and 1.8) with the following capabilities:
- New modules:
- dellemc.powerscale.nfs_alias – To manage NFS aliases on PowerScale
- dellemc.powerscale.filepoolpolicy – To manage the file pool policy on PowerScale
- dellemc.powerscale.storagepooltier – To manage storage pool tiers on PowerScale
- dellemc.powerscale.networksettings – To manage Network settings on PowerScale
- dellemc.powerscale.smartpoolsettings – To manage Smartpool settings on PowerScale
- Security support:
- Support for security flavors while creating and modifying NFS export.
- Access Zone, SMB, SmartQuota, User and Group modules are enhanced to support the NIS authentication provider.
- The Filesystem module is enhanced to support ACL and container parameters.
- The ADS module is enhanced to support the machine_account and organizational_unit parameters while creating an ADS provider.
- File management:
- The Info module is enhanced to support the listing of NFS aliases.
- Support to create and modify additional parameters of an SMB share in the SMB module.
- Support for recursive force deletion of filesystem directories.
- Ansible execution:
- Added an execution environment manifest file to support building an execution environment with Ansible Builder.
- Check mode is supported for the Info, Filepool Policy, and Storagepool Tier modules.
- Added rotating file handlers for log files.
- Removal of the dellemc_powerscale prefix from all module names.
- Other module enhancements:
- The SyncIQ Policy module is enhanced to support accelerated_failback and restrict_target_network of a policy.
- The Info module is enhanced to support NodePools and Storagepool Tiers Subsets.
- The SmartQuota module is enhanced to support container parameter and float values for Quota Parameters.
- Visit this GitHub page to go through release history: https://github.com/dell/ansible-powerscale/blob/main/CHANGELOG.rst
- The Ansible plugin for Dell OpenManage had 13 releases this year, some of which were major releases. Here is a brief summary:
- v7.1: Support for retrieving smart fabric and smart fabric uplink information, support for IPv6 addresses for OMSDK dependent iDRAC modules, and OpenManage Enterprise inventory plugin.
- v7.0: Rebranded from Dell EMC to Dell, enhanced the idrac_firmware module to support proxy, and added support to retrieve iDRAC local user details.
- v6.3: Support for the LockVirtualDisk operation and to configure Remote File Share settings using the idrac_virtual_media module.
- v6.2: Added support to clear pending BIOS attributes, reset BIOS to default settings, and configure BIOS attributes using Redfish in the idrac_bios module.
- v6.1: Support for device-specific operations on OpenManage Enterprise and configuring boot settings on iDRAC.
- v6.0: Added collection metadata for creating execution environments, deprecation of share parameters, and support for configuring iDRAC attributes using the idrac_attributes module.
- v5.5: Support to generate certificate signing request, import, and export certificates on iDRAC.
- v5.4: Enhanced the idrac_server_config_profile module to support export, import, and preview of the SCP configuration using Redfish and added support for check mode.
- v5.3: Added check mode support for redfish_storage_volume, idempotency for the ome_smart_fabric_uplink module, and support for debug logs in ome_diagnostics.
- v5.2: Support to configure console preferences on OpenManage Enterprise.
- v5.1: Support for OpenManage Enterprise Modular server interface management.
- v5.0.1: Support to provide custom or organizational CA signed certificates for SSL validation from the environment variable.
- v5.0: HTTPS SSL support for all modules and quick deploy settings.
- Visit this GitHub page to go through release history: https://github.com/dell/dellemc-openmanage-ansible-modules/releases.
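Several of the collections above added an execution environment manifest for use with Ansible Builder. As a rough illustration of the idea, and not the manifest shipped in any of the Dell repositories, a definition in the Ansible Builder version 1 schema can point at a Galaxy requirements file that lists the collections to include. The two file names are the conventional ones and the choice of collections is an assumption for this sketch.

```yaml
# execution-environment.yml (Ansible Builder version 1 schema)
---
version: 1
dependencies:
  galaxy: requirements.yml      # collections to install into the image

# requirements.yml (separate file, referenced above)
---
collections:
  - name: dellemc.powerstore
  - name: dellemc.powerflex
  - name: dellemc.powermax
  - name: dellemc.powerscale
```

With both files in the same directory, `ansible-builder build` produces a container image that bundles Ansible with the listed collections, so every automation run uses the same dependency set. Newer Ansible Builder releases also accept later schema versions, so the exact layout may differ in your environment.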
For all Ansible projects you can track the progress, contribute, or report issues on individual repositories.
You can also join our DevOps and Automation community at: https://www.dell.com/community/Automation/bd-p/Automation.
Happy New Year and happy upgrades!
Authors: Parasar Kodati and Florian Coulombel
Introduction to Ansible Network Collection for Dell SmartFabric Services
Wed, 19 Oct 2022 18:54:48 -0000
SFS Ansible collection
With Dell OS10.5.4.0, the OS10 Ansible automation journey continues. Ansible helps DevOps and NetOps reduce time and effort when designing, managing, and monitoring OS10 networks in enterprise IT environments. Updates to Ansible collections provide new automation benefits for increased operational efficiencies. This blog provides a quick introduction to the Ansible network collection for Dell SmartFabric Services (SFS).
The Ansible network collection for SFS allows you to provision and manage OS10 network switches in SmartFabric Services mode. The collection includes core modules and plugins and supports network_cli and httpapi connections. Supported versions include Ansible 2.10 or later. Sample playbooks and documentation are also included to show how you can use the collection.
With this introduction, there are now additional SFS automation choices. The following table lists some examples of the included modules:
Module Name | Module Description |
---|---|
sfs_setup | Manage configuration of L3 Fabric setup |
sfs_backup_restore | Manage backup restore configuration |
sfs_virtual_network | Manage virtual network configuration |
sfs_preferred_master | Manage preferred master configuration |
sfs_nodes | Manage nodes configuration |
More SFS automation choices
The SmartFabric OS10 SFS feature provides network fabric automation. The SFS leaf and spine personality is integrated with systems including VxRail and PowerStore. SFS delivers autonomous fabric deployment, expansion, and life cycle management. SFS automatically configures leaf and spine fabrics. SFS for leaf and spine is supported on S-series and Z-series Dell PowerSwitches. See the Support matrix for a complete list of supported platforms. For more details see Dell SmartFabric Services User Guide, Release 10.5.4.
You can access the collection by searching for dellemc.sfs on the Ansible Galaxy website. For more information, see the Dell OS10 SmartFabric Services Ansible collection.