Skip to content

fix(register): add Phase 3 to the bearer registration-session flow#134

Open
christophergeyer wants to merge 2 commits into
mainfrom
fix/bearer-coordinator-phase3
Open

fix(register): add Phase 3 to the bearer registration-session flow#134
christophergeyer wants to merge 2 commits into
mainfrom
fix/bearer-coordinator-phase3

Conversation

@christophergeyer
Copy link
Copy Markdown
Member

What

Adds the missing Phase 3 step to roar's bearer-flow registration coordinator. Companion to glaas-api PR #50 which adds the matching server endpoint.

Why

roar's 4-phase register flow:

Phase Standard coordinator Bearer coordinator (before this PR)
2 — create jobs
3 — register artifacts (full metadata) POST /api/v1/artifacts/batch ❌ no step, no endpoint
4 — link artifacts to jobs

The bearer coordinator had no Phase 3. Artifact rows came into existence only by Phase 4 link requests falling back to size: BigInt(artifact.size ?? 0) server-side.

The auth-based routing in roar/application/publish/session.py:391 routes essentially every modern user (bearer token OR SSH keys OR anonymous_public-capable server) to the bearer flow. So this missing Phase 3 is the actual root cause of the 247 zero-byte artifact stubs on the M1 nanochat DAG (ef174b91…) — 100 % of M1-new artifacts.

Change

File Change
integrations/glaas/client.py New register_artifacts_batch_under_registration_session. Hits the new endpoint with bearer auth, strips session_hash from per-artifact payload.
integrations/glaas/registration/artifact.py New ArtifactRegistrationService.register_batch_under_registration_session mirroring register_batch's validation + size-based batching + error reporting, just targeting the new transport.
integrations/glaas/registration/coordinator.py register_lineage_under_registration_session takes an artifacts parameter and runs Phase 3 between Phase 2 and Phase 4. Phase 4 still plows ahead regardless of Phase 3 outcome (parity with the standard coordinator).
application/publish/register_execution.py The bearer-flow call site runs prepare_batch_registration_artifacts (same helper the standard flow uses) and passes the result to the coordinator.

Backwards compat

  • artifacts is optional on the coordinator method. Existing callers that don't pass it get the previous behavior (no Phase 3; the link endpoint's permissive stub-create remains server-side).
  • New roar versions against old glaas (no staged-artifact endpoint): Phase 3 returns a network/404 error which lands in errors[], but Phase 4 still tries — same shape as today's bug. Functional, just doesn't fix anything.
  • New roar against new glaas (with PR feat: Route roar auth and project flows through GLaaS #50 merged): Phase 3 succeeds; Phase 4 link finds existing rows; sizes are correct.

A follow-up glaas-api PR will flip registerRegistrationSessionJobInputs/Outputs to PR #49-style strict behavior (no implicit stub-create). Once that lands AND this roar version is adopted, the M1 bug class is closed entirely. The phasing means there's no breaking-change window — adoption can be at a normal pace.

Tests

tests/unit/test_coordinator.py — 3 new tests in a new TestRegisterLineageUnderRegistrationSession class:

  • test_phase3_runs_with_artifacts — Phase 3 is called with the artifacts list before Phase 4 link runs.
  • test_phase3_skipped_when_no_artifacts — backwards-compat: omitted artifacts ⇒ no Phase 3 call.
  • test_phase3_errors_collected_but_does_not_block_phase4 — Phase 3 errors surface in errors[] but Phase 4 still runs (parity with the standard register_lineage).

All 9 coordinator tests + 137 broader publish tests pass.

🤖 Generated with Claude Code

chrisgeyertreqs and others added 2 commits May 30, 2026 02:03
Companion to https://github.com/treqs-inc/glaas-api/pull/50 which adds the
matching server endpoint.

# Background

roar's 4-phase register flow has, for the standard session coordinator:

  Phase 2: create jobs (no I/O)
  Phase 3: POST /api/v1/artifacts/batch     # full metadata (hash, size, …)
  Phase 4: POST /sessions/:h/jobs/:uid/inputs|outputs

The bearer (registration-session) coordinator had Phase 2 and Phase 4
but NEVER had a Phase 3 — there was simply no server endpoint, no
client method, no coordinator step. Artifact rows came into existence
only by Phase 4 link requests falling back to
'size: BigInt(artifact.size ?? 0)' server-side.

The session-selection logic in roar/application/publish/session.py:391
routes essentially every modern user (bearer token OR SSH keys OR
anonymous_public-capable server) to the bearer flow. So this missing
Phase 3 was the actual root cause of the 247 0-byte artifact stubs on
the M1 nanochat DAG (ef174b91…) — 100% of M1-new artifacts.

# Change

- GlaasClient.register_artifacts_batch_under_registration_session():
  new client method. Hits the new endpoint
  /api/v1/registration-sessions/:id/artifacts/batch with bearer auth.
  Strips 'session_hash' from each artifact payload since the URL scopes
  the request.

- ArtifactRegistrationService.register_batch_under_registration_session():
  mirrors register_batch (validation, size-based batching, error
  reporting) but targets the new transport. Reuses
  validate_artifact_registration; the registration_session_id stands in
  for session_hash to satisfy the non-placeholder check, then is
  stripped at the transport layer.

- RegistrationCoordinator.register_lineage_under_registration_session():
  accepts a new 'artifacts' parameter; runs Phase 3 between Phase 2
  (create_jobs_batch_under_registration_session) and Phase 4
  (link_job_artifacts_under_registration_session). Phase 4 still plows
  ahead regardless of Phase 3 outcome — matching the standard flow's
  behavior — so failures surface as link 404s once the corresponding
  glaas-api change lands.

- RegisterService.register() (register_execution.py): the call site
  now runs prepare_batch_registration_artifacts (same helper the
  standard flow uses) on lineage.artifacts and passes them to the
  coordinator.

# Backwards compat

- artifacts is optional on the coordinator method (default None).
  Old callers that don't pass it get the previous behavior (no
  Phase 3, link endpoint still permissively stub-creates server-side).
- Server-side: glaas-api PR #50 is additive; old roar versions continue
  to work against new glaas. New roar versions safely no-op against old
  glaas that don't have the staged endpoint yet (Phase 3 returns a
  network error which lands in errors[] but Phase 4 still tries — same
  shape as today).

A follow-up server PR will flip
registerRegistrationSessionJobInputs/Outputs to PR #49-style strict
behavior (no stub-create). Once the new roar version is adopted, that
PR closes the M1 bug class entirely.

# Tests

tests/unit/test_coordinator.py:
  - test_phase3_runs_with_artifacts: Phase 3 is called with the
    artifacts list before Phase 4 link runs.
  - test_phase3_skipped_when_no_artifacts: backwards-compat — omitted
    artifacts mean no Phase 3 call.
  - test_phase3_errors_collected_but_does_not_block_phase4: Phase 3
    errors surface in errors[] but Phase 4 still runs (parity with the
    standard register_lineage).

All 9 coordinator tests + 137 broader publish tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When roar (with the new Phase 3 step) talks to a glaas instance that
doesn't yet have the staged-artifact endpoint
(https://github.com/treqs-inc/glaas-api/pull/50), the POST returns 404.
Previously this surfaced as 'Staged batch registration error: HTTP 404'
and failed the whole 'roar register' invocation.

Fall back silently instead: log an info, return clean from Phase 3, and
let Phase 4's implicit stub-create work as the legacy fallback. (Yes,
that's still the M1 bug — but it preserves the *current* permissive
behavior for old glaas instances, which is no worse than before.)

The 404-skip only applies on the FIRST batch with no prior successes.
A 404 after successful batches would be a real anomaly (endpoint can't
disappear mid-register) and gets surfaced normally.

Tests (tests/unit/test_artifact_registration_phase3_fallback.py, 3 tests):
- 404 on first batch ⇒ no errors surfaced
- non-404 errors still surface
- 404 after a successful batch surfaces normally

This makes the failing CI test
test_register_private_without_project_binding_uses_current_user_scope
pass — its stub server returned 404 for the unknown new endpoint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants