Preparation
To prepare for installing Slurm, first create dedicated users and groups for MUNGE and Slurm.
#export MUNGEUSER=3456
#groupadd -g $MUNGEUSER munge
#useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
Choose a UID and GID that are unused and identical on every node in the cluster.
#export SLURMUSER=3457
#groupadd -g $SLURMUSER slurm
#useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
Perform this procedure on all cluster nodes.
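The MUNGE and Slurm accounts must have the same UID and GID on every node. A quick way to confirm this is to compare the id output across the cluster; the hostnames below are placeholders for your nodes:
#for host in <node1> <node2> <node3>; do ssh $host "hostname; id munge; id slurm"; done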
Note: The installation steps below are for Debian/Ubuntu; the procedure differs on other Linux distributions. See the upstream MUNGE project (dun/munge) for distribution-specific instructions.
#apt install munge libmunge2 libmunge-dev
#sudo systemctl enable munge
#sudo systemctl start munge
#sudo systemctl status munge
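The package names above are for Debian/Ubuntu. On an RPM-based distribution, the MUNGE packages are typically provided by EPEL; a sketch, assuming a RHEL-compatible system where the EPEL repository can be enabled:
#dnf install epel-release
#dnf install munge munge-libs munge-devel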
#dd if=/dev/urandom of=/etc/munge/munge.key bs=1c count=4M
#ls -l /etc/munge/munge.key
#chmod a-r /etc/munge/munge.key
#chmod u-w /etc/munge/munge.key
#chmod u+r /etc/munge/munge.key
#chown munge:munge /etc/munge/munge.key
#munge -n | unmunge | grep STATUS
STATUS: Success (0)
#scp -rp /etc/munge/munge.key <hostname>:/etc/munge/
#sudo systemctl enable munge
#sudo systemctl restart munge
#munge -n | ssh <hostname> unmunge
STATUS: Success (0)
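To verify MUNGE authentication against every node in one pass, the same test can be looped over all hosts; the hostnames are placeholders:
#for host in <node1> <node2> <node3>; do echo $host; munge -n | ssh $host unmunge | grep STATUS; done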
#sudo apt install mariadb-server
#sudo mysql -u root
create database slurm_acct_db;
create database slurm_jobcomp_db;
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = password('slurmdbpass');
grant usage on *.* to 'slurm'@'localhost';
grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
grant all privileges on slurm_jobcomp_db.* to 'slurm'@'localhost';
flush privileges;
exit
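The databases and the slurm user created above are referenced from slurmdbd.conf (under /etc/slurm-llnl/ on the Ubuntu packages). A minimal sketch, assuming the database name, user, and password shown above, a local MariaDB instance, and Ubuntu-style log and PID paths (adjust for your distribution):
AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=slurmdbpass
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/run/slurm-llnl/slurmdbd.pid
The file should be owned by the slurm user and readable only by that user (chmod 600), and slurmdbd must be restarted after editing it.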
To install Slurm, do the following.
The latest version can be downloaded from the Slurm site. One of the compute servers acts as the head node, which runs the centralized manager (slurmctld) to monitor resources and job progress. Each compute server (node) runs a Slurm daemon (slurmd).
The slurm-wlm package will install the slurmctld and slurmd daemons.
#sudo apt update
#sudo apt install slurm-wlm slurm-wlm-doc
The Slurm documentation installed with these packages includes the HTML-based configuration tool. One way to reach it is to serve the documentation directory with a simple HTTP server:
#cd /usr/share/doc/slurmctld/
#python3 -m http.server
Open http://<ip address>:8000 in a browser and use the configuration tool to generate slurm.conf with the settings listed further below.
Copy the generated slurm.conf to every node:
#scp -rp /etc/slurm-llnl/slurm.conf <hostname>:/etc/slurm-llnl/
#sudo systemctl enable slurmctld
#sudo systemctl enable slurmdbd
#sudo systemctl enable slurmd
#sudo systemctl restart slurmctld
#sudo systemctl restart slurmdbd
#sudo systemctl restart slurmd
The following settings were used in the configuration tool to generate slurm.conf:
Cluster name : android
Control Machine : <hostname of head node>
Compute Machine : <slurmd -C shows detailed information for each node>
Slurm user : slurm
Scheduling : backfill
Interconnect : none
Default MPI Type : none
Process Tracking : LinuxProc
Resource Selection : Cons_res (individual sockets, cores, and threads may be allocated)
SelectTypeParameters : CR_Core
Task Launch : Affinity
Job Completion Logging : MySQL
Job Accounting Gather : Linux
Job Accounting Storage : slurmdbd
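For reference, a slurm.conf generated from these choices contains lines along the following sketch. The host, node, and partition definitions are examples drawn from this cluster, SlurmctldHost is written as ControlMachine on older Slurm releases, and other parameters generated by the tool (timers, logging, and so on) are left at their defaults here:
ClusterName=android
SlurmctldHost=<hostname of head node>
SlurmUser=slurm
SchedulerType=sched/backfill
MpiDefault=none
ProctrackType=proctrack/linuxproc
SelectType=select/cons_res
SelectTypeParameters=CR_Core
TaskPlugin=task/affinity
JobCompType=jobcomp/mysql
JobCompHost=localhost
JobCompLoc=slurm_jobcomp_db
JobCompUser=slurm
JobCompPass=slurmdbpass
JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/slurmdbd
NodeName=hop-r940-0[1,3,4] CPUs=96 Sockets=4 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=309506 State=UNKNOWN
PartitionName=android_build Nodes=hop-r940-0[1,3,4] Default=YES MaxTime=INFINITE State=UP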
On the head node:
#sudo systemctl restart slurmctld
#sudo systemctl restart slurmdbd
#sudo systemctl restart slurmd
On each compute node:
#sudo systemctl enable slurmd
#sudo systemctl restart slurmd
On the head node:
root@hop-r940-01:~# sacctmgr add cluster android
Adding Cluster(s)
Name = android
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
root@hop-r940-01:~#
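To confirm that the cluster was registered with the accounting database, list the clusters known to slurmdbd:
#sacctmgr show cluster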
Note: scontrol commands that modify cluster state must be executed as root (or as the configured SlurmUser); read-only commands such as sinfo and scontrol show can be run by any user.
root@hop-r940-01:~# sinfo -lN
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
hop-r940-01 1 android_build* idle 96 4:24:1 309506 0 1 (null) none
hop-r940-03 1 android_build* idle 96 4:24:1 309506 0 1 (null) none
hop-r940-04 1 android_build* idle 96 4:24:1 309506 0 1 (null) none
root@hop-r940-01:~#
root@hop-r940-01:~# scontrol show partition
PartitionName=android_build
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=hop-r940-0[1,3,4]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=288 TotalNodes=3 SelectTypeParameters=NONE
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
root@hop-r940-01:~#
In this example, we execute /bin/hostname on three nodes (-N3) and include task numbers (-l).
One task per node will be used by default.
root@hop-r940-01:~# srun -N3 -l /bin/hostname
2: hop-r940-04
0: hop-r940-01
1: hop-r940-03
root@hop-r940-01:~#
In this example, we execute /bin/hostname on two nodes (-N2).
root@hop-r940-01:~# srun -N2 -l /bin/hostname
0: hop-r940-01
1: hop-r940-03
root@hop-r940-01:~#
By default, squeue reports the running jobs in priority order, followed by the pending jobs in priority order.
In this example, we execute the ‘sleep 30’ command on three nodes, check the queue, then cancel tasks from all three nodes.
srun command from node 3
root@hop-r940-03:~# srun -N3 -l sleep 30
squeue/scancel from node 1
root@hop-r940-01:~# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
498 android_b sleep root R 0:02 3 hop-r940-[01,03-04]
root@hop-r940-01:~# scancel 498
root@hop-r940-01:~# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
root@hop-r940-01:~#