Skip to content

slurm: prevent nodehost column truncation in get_cluster_info#946

Open
samoz83 wants to merge 1 commit into
OSC:masterfrom
samoz83:bug-fix/cluster-info-gpus-945
Open

slurm: prevent nodehost column truncation in get_cluster_info#946
samoz83 wants to merge 1 commit into
OSC:masterfrom
samoz83:bug-fix/cluster-info-gpus-945

Conversation

@samoz83
Copy link
Copy Markdown

@samoz83 samoz83 commented May 28, 2026

Please review our Contributing Guide before submitting a pull request.

What does this PR do?

Fixes OodCore::Job::Adapters::Slurm::Batch#get_cluster_info so that GPU
counts are parsed correctly on clusters with node FQDNs longer than
20 characters.

sinfo -O uses a default width of 20 characters for the nodehost field.
When a hostname exceeds that width, sinfo truncates it and emits no
separating whitespace before the next column, causing line.split to
return 3 tokens instead of 4. The gres and gresused columns are then
read from the wrong indices: total_gpus ends up summing GresUsed (the
in-use count) and active_gpus tries to parse the state string
("mixed", "idle", …), which the GPU regex never matches, so it
evaluates to 0. The dashboard's System Status widget consequently renders
"GPUs Available" as the number of GPUs currently in use.

The fix sizes the nodehost column dynamically from sinfo -ho %n,
mirroring how gres_length is already calculated, so the format string
always reserves enough space to keep the columns separated regardless of
hostname length.

Related issue

Closes #945

Testing

  • Tests included
  • No tests needed — reason: ___

Checklist

  • Follows project code style and conventions
  • Documentation provided (if new feature, adapter or behavior change)
  • This is a large feature and was discussed in an issue first (if applicable)

Anything else?

The fix has been validated on a real cluster running OOD 4.2.2 with
ood_core 0.31.1: applied as a dashboard initializer monkey-patch first
(to confirm), then upstreamed here. After applying, the System Status
widget shows the correct configured − in_use GPU count rather than the
in-use count.

@johrstrom
Copy link
Copy Markdown
Contributor

I'm really unsatisfied with the fact that we make multiple calls to find the max length. Can't we just set the limit to some super high number like 100? This would save us the additional sinfo call. And yes we likely need to do it for the gres_length as well, just set it to some arbitrarily high number up front without having to make that sinfo call too.

The sinfo -O nodehost field uses a default 20-char column width.
On clusters whose node FQDNs exceed 20 characters, sinfo truncates the
hostname to 20 chars with no separating whitespace before the next
column. line.split then yields one fewer token than expected, shifting
the gresused string into the slot the code reads as the configured
gres, and the state string into the slot read as gresused.

Net effect: total_gpus is computed from GresUsed (the in-use count) and
active_gpus parses a state word and evaluates to 0. The dashboard's
System Status widget then displays "GPUs Available" as the number of
GPUs currently in use.

Pass explicit widths to every sinfo -O field so columns can't run
together regardless of hostname or gres string length:

  - nodehost: 100
  - gres / gresused: 512 (the existing test fixture already contains
    multi-GRES strings up to 238 chars; 512 gives ~2x headroom)

This also drops the auxiliary 'sinfo -o %G' call that previously
computed a dynamic gres column width, reducing per-refresh subprocess
overhead from three sinfo calls to two.
@samoz83 samoz83 force-pushed the bug-fix/cluster-info-gpus-945 branch from 2d4be34 to 31f3b36 Compare May 28, 2026 15:40
@samoz83
Copy link
Copy Markdown
Author

samoz83 commented May 28, 2026

Pushed an update that drops both dynamic-sizing sinfo calls and uses fixed widths instead. Picked 512 for gres/gresused as it was double the longest gres string in the existing test fixture.

Also added a regression test so future refactors don't silently re-trigger the truncation case.

@johrstrom johrstrom self-requested a review May 29, 2026 15:41
Copy link
Copy Markdown
Contributor

@johrstrom johrstrom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This looks good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

System Status GPUs Available shows in-use count when node FQDNs exceed 20 chars

2 participants