Instructions for deploying a GPU cluster with Slurm
- Control system to run the install process
- One server to act as the Slurm controller/login node
- One or more servers to act as the Slurm compute nodes

- Install a supported operating system on all nodes.
  Use a third-party solution (e.g. MAAS, Foreman) or the provided OS install container.
- Set up your provisioning machine.
  This will install Ansible and other software on the provisioning machine, which will be used to deploy all other software to the cluster. For more information on Ansible and why we use it, consult the Ansible Guide.

      # Install software prerequisites and copy default configuration
      ./scripts/setup.sh

- Create and edit the Ansible inventory.
  Ansible uses an inventory which outlines the servers in your cluster. The setup script from the previous step will copy an example inventory configuration to the `config` directory.

  Edit the inventory:

      # Edit inventory
      # Add Slurm controller/login host to `slurm-master` group
      # Add Slurm worker/compute hosts to the `slurm-node` groups
      vi config/inventory

      # (optional) Modify `config/group_vars/*.yml` to set configuration parameters

- Verify the configuration.

      ansible all -m raw -a "hostname"

- Install Slurm.

      # NOTE: If SSH requires a password, add: `-k`
      # NOTE: If sudo on remote machine requires a password, add: `-K`
      # NOTE: If SSH user is different than current user, add: `-u ubuntu`
      ansible-playbook -l slurm-cluster playbooks/slurm-cluster.yml

- Verify Pyxis and Enroot can run GPU jobs across all nodes.

      # NOTE: This will use Pyxis to download a container and verify GPU functionality across all compute nodes
      ansible-playbook -l slurm-cluster playbooks/slurm-validation.yml -e '{num_gpus: 1}'

Now that Slurm is installed, try a "Hello World" example using MPI.
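
For reference, the inventory edited in the "Create and edit the Ansible inventory" step might look roughly like the sketch below. It writes a hypothetical example to a scratch file rather than to `config/inventory`. The host names and IP addresses are placeholders, and the `slurm-cluster` parent group is an assumption inferred from the `-l slurm-cluster` limit used in the commands above; check the example inventory copied by the setup script for the authoritative layout.

```shell
#!/bin/sh
# Write a hypothetical example inventory to a scratch file so that
# config/inventory is left untouched. Host names and IPs are placeholders.
example_inventory="$(mktemp)"

cat > "$example_inventory" <<'EOF'
# Slurm controller/login host
[slurm-master]
login01 ansible_host=10.0.0.10

# Slurm worker/compute hosts
[slurm-node]
gpu01 ansible_host=10.0.0.11
gpu02 ansible_host=10.0.0.12

# Assumed parent group matching the `-l slurm-cluster` limit
[slurm-cluster:children]
slurm-master
slurm-node
EOF

# Count the host entries as a quick sanity check
host_count="$(grep -c 'ansible_host=' "$example_inventory")"
echo "hosts defined: $host_count"
```

Once the real inventory is in place, the `ansible all -m raw -a "hostname"` check from the verification step should print one host name per entry.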
As part of the Slurm installation, Grafana and Prometheus are both deployed.
The services can be reached at the following addresses:
- Grafana: http://<slurm-master>:3000
- Prometheus: http://<slurm-master>:9090
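
A quick way to confirm both services are up is to probe their health endpoints with `curl`. The sketch below is one possible check, not part of the deployment itself. The function names are hypothetical, and it assumes the default ports listed above plus the standard health paths (`/api/health` for Grafana, `/-/healthy` for Prometheus).

```shell
#!/bin/sh
# monitoring_urls HOST: print the health-check URL for each service.
monitoring_urls() {
    echo "http://$1:3000/api/health"   # Grafana
    echo "http://$1:9090/-/healthy"    # Prometheus
}

# check_monitoring HOST: probe each URL and report whether it answered.
# Usage: check_monitoring my-head-node   (substitute your slurm-master host)
check_monitoring() {
    for url in $(monitoring_urls "$1"); do
        if curl -fs --max-time 5 "$url" >/dev/null 2>&1; then
            echo "OK          $url"
        else
            echo "UNREACHABLE $url"
        fi
    done
}
```

If either service reports as unreachable, it is worth checking firewall rules on the controller node before assuming the deployment failed.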
For information about configuring a shared NFS filesystem on your Slurm cluster, see the documentation on Slurm and NFS.
You may optionally choose to install a tool for managing additional packages on your Slurm cluster. See the documentation on software modules for information on how to set this up.
Open OnDemand can be installed by setting the `install_open_ondemand` variable to `yes` before running the `slurm-cluster.yml` playbook.
Pyxis and Enroot are installed by default and can be disabled by setting `slurm_install_enroot` and `slurm_install_pyxis` to `no`. Singularity can be installed by setting the `slurm_cluster_install_singularity` variable to `yes` before running the `slurm-cluster.yml` playbook.
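
These variables would normally be set in the `config/group_vars/*.yml` files. As an alternative sketch, they can also be passed on the command line with `-e`, in the same form the validation playbook uses for `num_gpus`; Ansible gives extra vars passed this way precedence over group variables. The values chosen below are only an example, and the script prints the command for review instead of running it.

```shell
#!/bin/sh
# Assemble the install command with optional component toggles.
# The variable names come from the text above; the values are only examples.
extra_vars='{install_open_ondemand: yes, slurm_cluster_install_singularity: yes}'
cmd="ansible-playbook -l slurm-cluster playbooks/slurm-cluster.yml -e '${extra_vars}'"

# Print the command for review instead of running it
echo "$cmd"
```

Setting the values in `group_vars` is usually preferable for anything permanent, since a command-line override has to be repeated on every run.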