A jq error may set the status of a model to FAILED when vLLM seems to successfully launch #195

@JinyueF

Describe the bug

A model's status can show as FAILED (no base-url available) while the Slurm job is still up and running, taking up resources. In these cases, vLLM appears to have launched successfully. Here is the output of the .err file when this happens:

jq: error: writing output failed: Stale file handle
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:05,  1.72s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:03<00:03,  1.87s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:05<00:01,  1.94s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00,  1.87s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00,  1.87s/it]
(EngineCore_DP0 pid=72)
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:06<00:00,  7.40it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:04<00:00,  7.97it/s]
(APIServer pid=14) INFO:     Started server process [14]
(APIServer pid=14) INFO:     Waiting for application startup.
(APIServer pid=14) INFO:     Application startup complete.
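
"Stale file handle" is the message for the ESTALE error, which typically comes from a network filesystem such as NFS. One possible mitigation is sketched below in Python. It assumes, without confirmation from the vec-inf source, that the failing jq call was updating a JSON status file on a shared filesystem. The function name `update_json_atomically` and the field names in the usage note are hypothetical.

```python
import errno
import json
import os
import tempfile
import time


def update_json_atomically(path, updates, retries=3, delay=0.5):
    """Merge `updates` into the JSON file at `path`, retrying on ESTALE.

    Hypothetical helper: not part of vec-inf.
    """
    for attempt in range(retries):
        try:
            with open(path) as f:
                data = json.load(f)
            data.update(updates)
            # Write to a temp file in the same directory, then rename it over
            # the original so readers never see a half-written file.
            directory = os.path.dirname(os.path.abspath(path))
            fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
            try:
                with os.fdopen(fd, "w") as f:
                    json.dump(data, f)
                os.replace(tmp, path)
            except BaseException:
                # Clean up the temp file before propagating the error.
                if os.path.exists(tmp):
                    os.unlink(tmp)
                raise
            return data
        except OSError as exc:
            # Only ESTALE is retried; anything else, or the last attempt, raises.
            if exc.errno != errno.ESTALE or attempt == retries - 1:
                raise
            time.sleep(delay)
```

For example, `update_json_atomically("model.json", {"status": "READY"})` would merge a hypothetical `status` field into the file. Whether a retry actually clears the stale handle depends on the filesystem, so this is a sketch to try rather than a confirmed fix.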

To Reproduce

I used the standard Python API call (client.launch_model). The behaviour is intermittent and does not show up consistently.

The following is the most recent occurrence on Killarney:
slurmstepd: error: *** JOB 2241787 ON kn120 CANCELLED AT 2026-02-20T22:41:49 ***

Expected behavior

According to the output of vec-inf logs, the models are loaded and running. If this is indeed the case, the status should be READY.
Otherwise, if the launch really FAILED, the job should end and release its computing resources.
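
The expected behaviour above can be sketched in Python as a small reconciliation step. This is only an illustration: `server_started` and `reconcile` are hypothetical names, not vec-inf functions. Using the "Application startup complete." line from the .err output as the readiness signal is an assumption. `scancel` is the standard Slurm command for cancelling a job.

```python
import subprocess

# Assumption: this line in the .err log means the API server is up.
STARTUP_MARKER = "Application startup complete."


def server_started(log_text):
    """Return True if the log shows the API server finished starting."""
    return STARTUP_MARKER in log_text


def _scancel(job_id):
    # Cancel the Slurm job so it stops holding resources.
    subprocess.run(["scancel", str(job_id)], check=False)


def reconcile(status, log_text, job_id, cancel=_scancel):
    """Return the corrected status; cancel the job if the launch really failed."""
    if status != "FAILED":
        return status
    if server_started(log_text):
        # The server is running, so FAILED was a false alarm.
        return "READY"
    cancel(job_id)
    return "FAILED"
```

Passing `cancel` as a parameter keeps the decision logic testable without a Slurm cluster. A real fix would also need to recover the base URL, which this sketch does not attempt.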

Version

v0.8.1
