Slurm: "start request repeated too quickly" #212

@vlj91

Found the following error on a single compute node when launching 32 compute nodes at once:

Sep 29 20:40:20 flight-149 systemd[1]: clusterware-slurm-slurmd.service: control process exited, code=exited status=1
Sep 29 20:40:20 flight-149 systemd[1]: Failed to start Alces Clusterware Slurm compute node daemon.
Sep 29 20:40:20 flight-149 systemd[1]: Unit clusterware-slurm-slurmd.service entered failed state.
Sep 29 20:40:20 flight-149 systemd[1]: clusterware-slurm-slurmd.service failed.
Sep 29 20:40:21 flight-149 systemd[1]: clusterware-slurm-slurmd.service holdoff time over, scheduling restart.
Sep 29 20:40:21 flight-149 systemd[1]: start request repeated too quickly for clusterware-slurm-slurmd.service
Sep 29 20:40:21 flight-149 systemd[1]: Failed to start Alces Clusterware Slurm compute node daemon.

Restarting the service fixes it.
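
When launching many nodes at once it can help to find which ones hit this condition. The following is a minimal sketch, not part of Clusterware, that scans journal text for the "start request repeated too quickly" message shown in the log above. The function name `find_start_limit_hits` is hypothetical, and the regular expression assumes the syslog-style line layout seen in this report.

```python
import re

# Matches the systemd message from the log excerpt in this report.
# Assumes the "Mon DD HH:MM:SS host systemd[pid]: ..." line layout.
START_LIMIT_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+systemd\[\d+\]:\s+"
    r"start request repeated too quickly for (?P<unit>\S+)$"
)


def find_start_limit_hits(log_text):
    """Return (host, unit) pairs for every start-limit message in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = START_LIMIT_RE.match(line.strip())
        if match:
            hits.append((match.group("host"), match.group("unit")))
    return hits


# Three lines copied from the log excerpt in this report.
SAMPLE = """\
Sep 29 20:40:21 flight-149 systemd[1]: clusterware-slurm-slurmd.service holdoff time over, scheduling restart.
Sep 29 20:40:21 flight-149 systemd[1]: start request repeated too quickly for clusterware-slurm-slurmd.service
Sep 29 20:40:21 flight-149 systemd[1]: Failed to start Alces Clusterware Slurm compute node daemon.
"""

if __name__ == "__main__":
    for host, unit in find_start_limit_hits(SAMPLE):
        print(f"{host}: {unit} hit the systemd start limit")
```

Any node reported this way is a candidate for the manual service restart described above.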

Steps to reproduce:

  • Start a cluster using the 2016.3rc6 template (professional edition)
  • Select the Slurm scheduler type
  • Launch 32 nodes
  • One or more nodes may appear in sinfo -N in an unknown state
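
The last step can be checked programmatically. This is a rough sketch that parses text captured from sinfo -N and lists nodes whose state looks unknown. It assumes the output has a header row containing NODELIST and STATE columns and that an unknown state is printed with the prefix "unk"; both are assumptions to verify against the Slurm version in use. The function name `unknown_nodes`, the sample output, and the partition name in it are hypothetical.

```python
def unknown_nodes(sinfo_output):
    """Return sorted node names whose STATE column starts with 'unk'.

    Assumes a whitespace-separated table with a header row that
    contains NODELIST and STATE columns.
    """
    rows = [line.split() for line in sinfo_output.splitlines() if line.strip()]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    name_col = header.index("NODELIST")
    state_col = header.index("STATE")
    nodes = set()
    for row in body:
        if len(row) <= max(name_col, state_col):
            continue  # skip short or malformed rows
        if row[state_col].lower().startswith("unk"):
            nodes.add(row[name_col])
    return sorted(nodes)


# Hypothetical sample output, not taken from the affected cluster.
SAMPLE_SINFO = """\
NODELIST    NODES PARTITION STATE
flight-148      1      all* idle
flight-149      1      all* unk*
"""

if __name__ == "__main__":
    print(unknown_nodes(SAMPLE_SINFO))
```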
