Preparation
To prepare for installing Slurm, first create dedicated users and groups for MUNGE and Slurm.
#export MUNGEUSER=3456
#groupadd -g $MUNGEUSER munge
#useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
Choose a UID and GID that are unused and identical on every node in the cluster.
#export SLURMUSER=3457
#groupadd -g $SLURMUSER slurm
#useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
Perform this procedure on all cluster nodes.
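The MUNGE and Slurm accounts must have the same UID and GID on every node. A quick way to confirm this is to compare the id output across the cluster; the hostnames below are placeholders for your nodes:
#for host in <node1> <node2> <node3>; do ssh $host "hostname; id munge; id slurm"; done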
Note: The installation steps below are for Debian/Ubuntu; the procedure differs on other Linux distributions. See the upstream MUNGE project (dun/munge) for distribution-specific instructions.
#apt install munge libmunge2 libmunge-dev
#sudo systemctl enable munge
#sudo systemctl start munge
#sudo systemctl status munge
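The package names above are for Debian/Ubuntu. On an RPM-based distribution, the MUNGE packages are typically provided by EPEL; a sketch, assuming a RHEL-compatible system where the EPEL repository can be enabled:
#dnf install epel-release
#dnf install munge munge-libs munge-devel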
#dd if=/dev/urandom of=/etc/munge/munge.key bs=1c count=4M
#ls -l /etc/munge/munge.key
#chmod a-r /etc/munge/munge.key
#chmod u-w /etc/munge/munge.key
#chmod u+r /etc/munge/munge.key
#chown munge:munge /etc/munge/munge.key
#munge -n | unmunge | grep STATUS
STATUS: Success (0)
#scp -rp /etc/munge/munge.key <hostname>:/etc/munge/
#sudo systemctl enable munge
#sudo systemctl restart munge
#munge -n | ssh <hostname> unmunge
STATUS: Success (0)
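To verify MUNGE authentication against every node in one pass, the same test can be looped over all hosts; the hostnames are placeholders:
#for host in <node1> <node2> <node3>; do echo $host; munge -n | ssh $host unmunge | grep STATUS; done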
#sudo apt install mariadb-server
#sudo mysql -u root
create database slurm_acct_db;
create database slurm_jobcomp_db;
create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = password('slurmdbpass');
grant usage on *.* to 'slurm'@'localhost';
grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
grant all privileges on slurm_jobcomp_db.* to 'slurm'@'localhost';
flush privileges;
exit
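The databases and the slurm user created above are referenced from slurmdbd.conf (under /etc/slurm-llnl/ on the Ubuntu packages). A minimal sketch, assuming the database name, user, and password shown above, a local MariaDB instance, and Ubuntu-style log and PID paths (adjust for your distribution):
AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=slurmdbpass
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/run/slurm-llnl/slurmdbd.pid
The file should be owned by the slurm user and readable only by that user (chmod 600), and slurmdbd must be restarted after editing it.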
To install Slurm, do the following.
The latest version can be downloaded from the Slurm site. One of the compute servers acts as the head node, which runs the centralized manager (slurmctld) to monitor resources and job progress. Each compute server (node) runs a Slurm daemon (slurmd).
The slurm-wlm package will install the slurmctld and slurmd daemons.
#sudo apt update
#sudo apt install slurm-wlm slurm-wlm-doc
The Slurm documentation installed with these packages includes the HTML-based configuration tool. One way to reach it is to serve the documentation directory with a simple HTTP server:
#cd /usr/share/doc/slurmctld/
#python3 -m http.server
Open http://<ip address>:8000 in a browser and use the configuration tool to generate slurm.conf with the settings listed further below.
Copy the generated slurm.conf to every node:
#scp -rp /etc/slurm-llnl/slurm.conf <hostname>:/etc/slurm-llnl/
#sudo systemctl enable slurmctld
#sudo systemctl enable slurmdbd
#sudo systemctl enable slurmd
#sudo systemctl restart slurmctld
#sudo systemctl restart slurmdbd
#sudo systemctl restart slurmd
The following settings were used in the configuration tool to generate slurm.conf:
Cluster name : android
Control Machine : <hostname of head node>
Compute Machine : <slurmd -C shows detailed information for each node>
Slurm user : slurm
Scheduling : backfill
Interconnect : none
Default MPI Type : none
Process Tracking : LinuxProc
Resource Selection : Cons_res (individual sockets, cores, and threads may be allocated)
SelectTypeParameters : CR_Core
Task Launch : Affinity
Job Completion Logging : MySQL
Job Accounting Gather : Linux
Job Accounting Storage : slurmdbd
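For reference, a slurm.conf generated from these choices contains lines along the following sketch. The host, node, and partition definitions are examples drawn from this cluster, SlurmctldHost is written as ControlMachine on older Slurm releases, and other parameters generated by the tool (timers, logging, and so on) are left at their defaults here:
ClusterName=android
SlurmctldHost=<hostname of head node>
SlurmUser=slurm
SchedulerType=sched/backfill
MpiDefault=none
ProctrackType=proctrack/linuxproc
SelectType=select/cons_res
SelectTypeParameters=CR_Core
TaskPlugin=task/affinity
JobCompType=jobcomp/mysql
JobCompHost=localhost
JobCompLoc=slurm_jobcomp_db
JobCompUser=slurm
JobCompPass=slurmdbpass
JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/slurmdbd
NodeName=hop-r940-0[1,3,4] CPUs=96 Sockets=4 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=309506 State=UNKNOWN
PartitionName=android_build Nodes=hop-r940-0[1,3,4] Default=YES MaxTime=INFINITE State=UP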
On the head node:
#sudo systemctl restart slurmctld
#sudo systemctl restart slurmdbd
#sudo systemctl restart slurmd
On each compute node:
#sudo systemctl enable slurmd
#sudo systemctl restart slurmd
On the head node:
root@hop-r940-01:~# sacctmgr add cluster android
Adding Cluster(s)
Name = android
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
root@hop-r940-01:~#
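To confirm that the cluster was registered with the accounting database, list the clusters known to slurmdbd:
#sacctmgr show cluster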
Note: scontrol commands that modify cluster state must be executed as root (or as the configured SlurmUser); read-only commands such as sinfo and scontrol show can be run by any user.
root@hop-r940-01:~# sinfo -lN
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
hop-r940-01 1 android_build* idle 96 4:24:1 309506 0 1 (null) none
hop-r940-03 1 android_build* idle 96 4:24:1 309506 0 1 (null) none
hop-r940-04 1 android_build* idle 96 4:24:1 309506 0 1 (null) none
root@hop-r940-01:~#
root@hop-r940-01:~# scontrol show partition
PartitionName=android_build
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=hop-r940-0[1,3,4]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=288 TotalNodes=3 SelectTypeParameters=NONE
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
root@hop-r940-01:~#
In this example, we execute /bin/hostname on three nodes (-N3) and include task numbers (-l).
One task per node will be used by default.
root@hop-r940-01:~# srun -N3 -l /bin/hostname
2: hop-r940-04
0: hop-r940-01
1: hop-r940-03
root@hop-r940-01:~#
In this example, we execute /bin/hostname on two nodes (-N2).
root@hop-r940-01:~# srun -N2 -l /bin/hostname
0: hop-r940-01
1: hop-r940-03
root@hop-r940-01:~#
By default, squeue reports the running jobs in priority order, followed by the pending jobs in priority order.
In this example, we execute the ‘sleep 30’ command on three nodes, check the queue, then cancel tasks from all three nodes.
srun command from node 3
root@hop-r940-03:~# srun -N3 -l sleep 30
squeue/scancel from node 1
root@hop-r940-01:~# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
498 android_b sleep root R 0:02 3 hop-r940-[01,03-04]
root@hop-r940-01:~# scancel 498
root@hop-r940-01:~# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
root@hop-r940-01:~#