[Harbor] Proper cleanup of Trials (especially address sandbox leakage) upon RL run fails / KeyboardInterrupt #1194

@CharlieFRuan

Currently, when an RL run fails (for whatever reason) or the user presses Ctrl-C during an experiment, there is a high chance of leaked sandboxes. They stay alive until a pre-configured timeout expires (say ~30 minutes).

While this is mostly benign, it becomes a problem when:

  • A user only has a limited concurrency budget from a sandbox provider (e.g. 1000 sandboxes).
  • An experiment needs 1000 sandboxes and the user has a retry script that relaunches the SkyRL experiment from the previous checkpoint. The relaunch needs a new set of 1000 concurrent sandboxes while the leaked ones are still alive, so the user runs out of concurrency budget and the retry fails.

While Harbor has Trial cleanup logic implemented, the intricacies of uv run and Ray make it relatively hard to let that cleanup logic actually run.

This PR, which switches from tqdm.gather() to TaskGroup, partially solves the issue: #1193

Related change in Harbor required to fully solve the issue: harbor-framework/harbor#819

These two branches (slightly different approach), claude-coded, can solve the issue but don't look very elegant:

Alternatively, use Harbor's Orchestrator construct and implement an API with the semantics of "clean up all the sandboxes created during this session" that we run at the experiment level. This might be the cleanest solution. Will revisit when we migrate HarborGenerator to use Orchestrator.
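The session-level semantics could look roughly like the sketch below. Every name here (`FakeSandbox`, `SessionSandboxRegistry`, `cleanup_all`, `run_experiment`) is hypothetical and not part of Harbor's API. The idea is only that every sandbox is registered on creation and a single experiment-level `finally` stops whatever is still live.

```python
import asyncio


class FakeSandbox:
    """Hypothetical sandbox handle; a real provider client would go here."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.stopped = False

    async def stop(self) -> None:
        await asyncio.sleep(0)  # stand-in for a provider API call
        self.stopped = True


class SessionSandboxRegistry:
    """Tracks every sandbox created during one experiment session (hypothetical)."""

    def __init__(self) -> None:
        self._live: dict[str, FakeSandbox] = {}

    def register(self, sandbox: FakeSandbox) -> None:
        self._live[sandbox.name] = sandbox

    def unregister(self, sandbox: FakeSandbox) -> None:
        self._live.pop(sandbox.name, None)

    def live_count(self) -> int:
        return len(self._live)

    async def cleanup_all(self) -> int:
        """Stop all live sandboxes; return how many stopped successfully."""
        sandboxes = list(self._live.values())
        results = await asyncio.gather(
            *(s.stop() for s in sandboxes), return_exceptions=True
        )
        stopped = 0
        for sandbox, result in zip(sandboxes, results):
            if not isinstance(result, BaseException):
                self._live.pop(sandbox.name, None)
                stopped += 1
        return stopped


async def run_experiment(registry: SessionSandboxRegistry, n: int) -> None:
    """Simulate a run that creates n sandboxes and then fails."""
    try:
        for i in range(n):
            registry.register(FakeSandbox(f"sb-{i}"))
        raise RuntimeError("simulated run failure")
    finally:
        # Experiment-level cleanup: one call, regardless of which trial leaked.
        await registry.cleanup_all()
```

Sandboxes whose `stop()` raises stay in the registry in this sketch, so a later call to `cleanup_all()` can retry them.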
