Found the following error with a single compute node when launching 32 compute nodes at once:
Sep 29 20:40:20 flight-149 systemd[1]: clusterware-slurm-slurmd.service: control process exited, code=exited status=1
Sep 29 20:40:20 flight-149 systemd[1]: Failed to start Alces Clusterware Slurm compute node daemon.
Sep 29 20:40:20 flight-149 systemd[1]: Unit clusterware-slurm-slurmd.service entered failed state.
Sep 29 20:40:20 flight-149 systemd[1]: clusterware-slurm-slurmd.service failed.
Sep 29 20:40:21 flight-149 systemd[1]: clusterware-slurm-slurmd.service holdoff time over, scheduling restart.
Sep 29 20:40:21 flight-149 systemd[1]: start request repeated too quickly for clusterware-slurm-slurmd.service
Sep 29 20:40:21 flight-149 systemd[1]: Failed to start Alces Clusterware Slurm compute node daemon.
Restarting the service on the affected node fixes it.
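A minimal sketch of that workaround, to be run as root on the affected compute node. The unit name is taken from the log above. The `reset-failed` step is an assumption: it clears systemd's failed state and start-rate counter in case the "start request repeated too quickly" limit is still in effect, whereas the report itself only confirms that a restart is enough. The `recover_slurmd` function name and the `RUN` dry-run variable are hypothetical helpers for illustration.

```shell
# Unit name as it appears in the journal output above.
UNIT=clusterware-slurm-slurmd.service

# Hypothetical helper: clear the failed state, restart slurmd, report status.
# Set RUN=echo to print the commands instead of executing them (dry run).
recover_slurmd() {
    # Assumption: clear the failed state and start-rate counter first.
    ${RUN:-} systemctl reset-failed "$UNIT"
    # The workaround confirmed in this report.
    ${RUN:-} systemctl restart "$UNIT"
    # Prints "active" and exits 0 if the daemon came up.
    ${RUN:-} systemctl is-active "$UNIT"
}
```

Calling `RUN=echo recover_slurmd` shows the three commands without touching the service; calling `recover_slurmd` runs them.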
Steps to reproduce:
- Start a cluster using the `2016.3rc6` template (professional edition)
- Select the `slurm` scheduler type
- Launch 32 nodes
- Node(s) may appear in `sinfo -N` in the `unknown` state
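To spot affected nodes quickly after a large launch, the last step can be scripted. This is a sketch only: it assumes `sinfo` is on the path of the node where it runs, and that the state column reads `unknown`, possibly with a trailing flag such as `*`. The `unknown_nodes` function name is hypothetical.

```shell
# Print the names of nodes whose Slurm state starts with "unknown".
# Reads "<node> <state>" pairs on stdin, e.g. from: sinfo -N -h -o "%N %T"
unknown_nodes() {
    # sort -u collapses nodes that sinfo -N lists once per partition.
    awk '$2 ~ /^unknown/ { print $1 }' | sort -u
}

# Only query the scheduler when sinfo is actually installed.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -N -h -o "%N %T" | unknown_nodes
fi
```

Each node name printed is a candidate for a restart of the slurmd service.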