
sandbox-image: bridge CLI to sandbox-proxy sign_blob#698

Open
sgirones wants to merge 10 commits into main from salvador/dataplane-presign-url

@sgirones sgirones commented May 26, 2026

CLI side of the versioned-response rollout for sandbox-image builds.

Platform-api is moving away from returning a pre-signed upload block
from prepare-rootfs-build in favor of snapshotRelPath. When the CLI
sees the new shape it calls sandbox-proxy's new POST /api/v1/blob/sign
and splices the result back in as upload before handing the spec to
the in-sandbox builder. The legacy shape still works unchanged so we
can roll this out without lockstep deploys.
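
A minimal sketch of that branch, assuming a hypothetical `plan_upload` helper and `UploadPlan` enum (neither name comes from this PR). It treats a non-empty `snapshotRelPath` as the branch key and falls back to the embedded `upload` block. Returning an error when neither is present is an assumption about the desired behavior.

```rust
/// How the CLI obtains the upload block (illustrative names, not the SDK's).
#[derive(Debug, PartialEq)]
pub enum UploadPlan {
    /// Legacy shape: platform-api already embedded a pre-signed `upload` block.
    UseEmbeddedUpload,
    /// New shape: ask sandbox-proxy's POST /api/v1/blob/sign to mint the URLs.
    SignViaProxy { rel_path: String },
}

/// Decide which path to take from the prepared-spec response.
/// `has_upload_block` says whether the raw passthrough JSON carries `upload`.
pub fn plan_upload(
    snapshot_rel_path: Option<&str>,
    has_upload_block: bool,
) -> Result<UploadPlan, String> {
    match snapshot_rel_path {
        // The rel-path is the branch key, so the new shape wins when present.
        Some(rel) if !rel.is_empty() => Ok(UploadPlan::SignViaProxy {
            rel_path: rel.to_string(),
        }),
        // No rel-path: fall back to the legacy embedded block.
        _ if has_upload_block => Ok(UploadPlan::UseEmbeddedUpload),
        // Assumption: a response with neither field is treated as an error.
        _ => Err("prepared spec has neither snapshotRelPath nor upload".to_string()),
    }
}
```

Because the decision only reads one typed field plus the presence of `upload`, the rest of the prepared spec can stay untyped.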

snapshotUri is also relaxed to optional on the prepared spec for the
same forward-compat reason; the completion path now resolves it from
the builder's metadata.json first, the prepared spec second, and errors
if neither has it.
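
The resolution order can be sketched as follows. `resolve_snapshot_uri` and `UriSource` are hypothetical names for illustration, and treating an empty string as missing is an assumption made here so that an empty URI is never sent to the complete endpoint.

```rust
/// Which source supplied the final snapshot URI (illustrative type).
#[derive(Debug, PartialEq)]
pub enum UriSource {
    BuilderMetadata,
    PreparedSpec,
}

/// Treat absent, empty, or whitespace-only values as missing.
fn non_empty(value: Option<&str>) -> Option<&str> {
    value.map(str::trim).filter(|s| !s.is_empty())
}

/// Prefer the builder's metadata.json, then the prepared spec, else fail.
pub fn resolve_snapshot_uri(
    metadata_uri: Option<&str>,
    prepared_uri: Option<&str>,
) -> Result<(String, UriSource), String> {
    if let Some(uri) = non_empty(metadata_uri) {
        return Ok((uri.to_string(), UriSource::BuilderMetadata));
    }
    if let Some(uri) = non_empty(prepared_uri) {
        return Ok((uri.to_string(), UriSource::PreparedSpec));
    }
    Err("snapshotUri missing from both metadata.json and the prepared spec".to_string())
}
```

Returning the source alongside the URI is optional; it is included here only because it makes the precedence easy to log and test.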

Notes for review

  • The upload block stays opaque JSON in the passthrough Value. It is
    deliberately not typed, so future fields in the platform-api ↔
    in-sandbox-builder contract don't force an SDK release.
  • Always multipart on the new path. That's a rollout design choice,
    not something this PR decides. SinglePut is still on the wire as
    a variant.
  • Bindings (Python, Node) are untouched on purpose — they only see
    the final registered-template JSON, and #[serde(default)] on the
    new/relaxed fields keeps them compiling.

Test plan

  • Manual smoke against legacy platform-api (response with upload).
  • Manual smoke against new platform-api + sandbox-proxy sign_blob.
  • Sanity-check Python and Node bindings produce identical final-template JSON on the legacy path.

Feature: dataplane presigned URLs for image builder

This PR is part of a three-repo feature. The same explanation is appended to all three so reviewers can pick up cold from any of them.

Related PRs

Why

Move S3 URL signing for sandbox-template rootfs builds out of platform-api and into the regional dataplane. The dataplane already owns blob-store credentials for its region; this removes the last piece of S3 from platform-api and lets each region sign against its own bucket.

Components

  • platform-api — orchestrates the build, owns the rel-path namespace (projects/{project}/sandbox-template-builds/{build}/{snapshot}.tlsnap).
  • compute-engine-internal (dataplane) — exposes POST /api/v1/blob/sign on the HTTP proxy; composes the bucket URI from its own config + the caller's authenticated namespace, then signs.
  • tensorlake (CLI / SDK) — bridges the two: calls /prepare, asks the sandbox proxy to mint URLs, runs the in-sandbox builder.

Old flow (pre-signed at /prepare)

CLI ──/prepare──▶ platform-api
                    │ - allocates buildId / snapshotId
                    │ - signs upload (S3 multipart + complete/abort)
                    │ - signs parent-snapshot download (if diff build)
                    │ - bakes URLs into `prepared.upload` / `prepared.parent.download`
                    ▼
CLI ◀── prepared spec (with signed URLs) ──┘

CLI ──spec──▶ in-sandbox builder ──S3──▶ bucket
CLI ──/complete (snapshotUri == prepared.snapshotUri)──▶ platform-api

In the old flow, platform-api over-provisioned parts by roughly 3× (≈ 528 parts by default, up to 10 000) because it could not see the CLI's --disk value.

New flow (signed at the dataplane)

CLI ──/prepare──▶ platform-api
                    │ - allocates buildId / snapshotId
                    │ - returns `snapshotRelPath` (no signed URLs)
                    ▼
CLI ◀── prepared spec (rel-path only) ──┘

CLI ──/api/v1/blob/sign {rel_path, op: MultipartPut{parts, part_size_bytes}}──▶ sandbox-proxy ──▶ dataplane
                                                              │ - validates X-Tensorlake-Sandbox-Id
                                                              │ - resolves caller's namespace locally
                                                              │ - enforces prefix: rel_path must start with
                                                              │   `projects/{namespace}/sandbox-template-builds/`
                                                              │ - composes {output_bucket}/{rel_path}, signs
                                                              ▼
CLI ◀── { uploadId, partUrls[], completeUrl, abortUrl } ──────┘

# If diff build, separately sign the parent manifest as a full URI:
CLI ──/api/v1/blob/sign {uri, op: SingleGet}──▶ dataplane ──▶ signed GET URL

CLI splices the signed blocks into the opaque builder spec
CLI ──spec──▶ in-sandbox builder ──S3──▶ bucket
CLI ──/complete (snapshotUri may differ from prepare, but rel-path suffix must match)──▶ platform-api
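
The suffix rule on the final `/complete` step can be illustrated with a small check. This is a sketch, not platform-api's implementation: `snapshot_uri_matches` is a hypothetical name, and requiring a `/` boundary before the rel-path is an assumption added here to avoid partial-segment matches.

```rust
/// Returns true when `declared_uri` ends with `rel_path` on a path-segment
/// boundary, e.g. "s3://any-bucket/" + rel_path.
pub fn snapshot_uri_matches(declared_uri: &str, rel_path: &str) -> bool {
    if rel_path.is_empty() {
        return false;
    }
    match declared_uri.strip_suffix(rel_path) {
        // Assumption: the part before the rel-path must end with '/', so
        // "s3://bucket/xprojects/..." does not match "projects/...".
        Some(head) => head.ends_with('/'),
        None => false,
    }
}
```

Note that the check says nothing about which bucket the URI points at, only that the object key ends with the expected rel-path.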

The CLI now sizes the part count from the actual rootfs disk budget (typically a few hundred 64 MiB parts), so the dataplane cap matches S3's 10 000-part ceiling without over-allocating.
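
A minimal sketch of that sizing, assuming ceiling division over the disk budget (the rounding mode is not stated in this description) and a hypothetical `multipart_part_count` helper. The 64 MiB part size and the 10 000-part cap are taken from the text.

```rust
/// Part size used on the new path (64 MiB).
pub const MULTIPART_PART_SIZE_BYTES: u64 = 64 * 1024 * 1024;
/// S3's ceiling on parts per multipart upload.
pub const MAX_MULTIPART_PARTS: u64 = 10_000;

/// Number of part URLs to request for a given rootfs disk budget.
pub fn multipart_part_count(rootfs_disk_bytes: u64) -> u32 {
    // Ceiling division written so it cannot overflow, even for u64::MAX.
    let full_parts = rootfs_disk_bytes / MULTIPART_PART_SIZE_BYTES;
    let has_remainder = rootfs_disk_bytes % MULTIPART_PART_SIZE_BYTES != 0;
    let parts = full_parts + u64::from(has_remainder);
    // Always ask for at least one part, and never more than S3 allows.
    parts.clamp(1, MAX_MULTIPART_PARTS) as u32
}
```

Under these assumptions a 10 GiB budget yields 160 parts. The cap is reached at 625 GiB (10 000 × 64 MiB), so a larger budget would need a bigger part size rather than more parts.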

Wire contract

  • The SDK treats the /api/v1/blob/sign response as opaque serde_json::Value and splices it verbatim into spec.upload / spec.parent.download. The platform-api ↔ in-sandbox-builder contract stays unchanged.
  • MAX_MULTIPART_PARTS = 10_000 is enforced on both sides (indexify/crates/dataplane/src/sign_blob.rs and tensorlake/crates/cloud-sdk/src/sandbox_images.rs); keep these in sync.
  • Presigned URL TTL: 7 days (S3 max), long enough for slow builds without mid-flight expiry.

Security posture

  • Namespace scoping: for rel_path signing, the dataplane substitutes the caller's authenticated namespace into the prefix — clients cannot sign URLs for another project's prefix.
  • Full-URI signing is SingleGet-only, used for parent snapshots which may live outside the caller's prefix. The dataplane will sign any URI its IAM identity can read; treat parent URIs as effectively public.
  • X-Tensorlake-Sandbox-Id is set (overwriting, not appending) by sandbox-proxy after authn. The dataplane HTTP proxy must not be directly reachable from inside sandboxes, or the header can be spoofed.
  • /complete still trusts a CLI-declared snapshotUri as long as its suffix matches the reconstructed rel-path. Tracked as phase-b follow-up: allowlist bucket origins against the dataplane fleet or have the dataplane attest completion.
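
The first two rules above can be sketched as a single authorization step. All names here (`BlobTarget`, `BlobOp`, `authorize_sign`) are hypothetical and are not the dataplane's actual types; the `SinglePut` variant is omitted for brevity, and allowing any op on a rel-path target is an assumption.

```rust
/// What the caller wants signed: a namespace-scoped rel-path or a full URI.
#[derive(Debug, Clone, PartialEq)]
pub enum BlobTarget {
    RelPath(String),
    Uri(String),
}

/// Requested operation (illustrative subset).
#[derive(Debug, Clone, PartialEq)]
pub enum BlobOp {
    MultipartPut { parts: u32, part_size_bytes: u64 },
    SingleGet,
}

/// Returns the full URI to sign, or an error if the request is not allowed.
/// `namespace` must come from the authenticated caller, never from the body.
pub fn authorize_sign(
    namespace: &str,
    output_bucket: &str,
    target: &BlobTarget,
    op: &BlobOp,
) -> Result<String, String> {
    match target {
        BlobTarget::RelPath(rel_path) => {
            // The prefix is built from the caller's own namespace, so a
            // request for another project's prefix is rejected.
            let prefix = format!("projects/{namespace}/sandbox-template-builds/");
            if !rel_path.starts_with(&prefix) {
                return Err(format!("rel_path must start with {prefix}"));
            }
            Ok(format!("{}/{}", output_bucket.trim_end_matches('/'), rel_path))
        }
        BlobTarget::Uri(uri) => match op {
            // Full-URI signing is read-only: parent snapshots only.
            BlobOp::SingleGet => Ok(uri.clone()),
            _ => Err("full-URI signing only supports SingleGet".to_string()),
        },
    }
}
```

This sketch does only the prefix comparison. Any further normalization of `rel_path`, such as rejecting `..` segments, is outside what this description specifies.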

Adds a CLI bridge so `build_sandbox_image` works against both the legacy
platform-api response (embedded pre-signed `upload` block) and the new
versioned-response shape (`snapshotRelPath` only). On the new path the
CLI calls the sandbox-proxy `POST /api/v1/blob/sign` endpoint and
splices the returned upload spec into the raw prepared spec before
handing spec.json to the in-sandbox rootfs builder.

The branch key (`snapshot_rel_path`) is the only field added to the
typed `PreparedSandboxTemplateBuild`. Everything else — including the
`upload` block from either path — stays opaque inside the raw
passthrough `Value`, preserving the property that future fields added
to the platform-api ↔ in-sandbox-builder contract don't require an SDK
release.

Always multipart on the new path with 100 MB parts, clamped to ≥ 1 and
saturated at u32::MAX; size hint reuses the existing
`rootfs_disk_bytes` precedence (explicit --disk_mb → parent's
rootfsDiskBytes for diff builds → default). Bindings (Python, Node)
are unchanged — they only see the final registered-template JSON.

Co-authored-by: Cursor <cursoragent@cursor.com>

Platform-api is moving the snapshot location off `snapshotUri` and onto
`snapshotRelPath` (the rel-path then gets resolved client-side via
`SandboxProxyClient::sign_blob`). Stop requiring `snapshotUri` on the
prepared-spec response so the CLI keeps deserializing once platform-api
drops the field.

The completion path now prefers the in-sandbox builder's metadata.json
for the final URI (it always knows where it landed the upload), falls
back to the prepared value for the legacy path, and errors clearly if
neither source provides one — instead of POSTing an empty string to
platform-api's complete endpoint.

Co-authored-by: Cursor <cursoragent@cursor.com>

`pick_upload_op` always returned `MultipartPut` — it "picked" nothing.
The whole helper, plus `disk_mb_for_upload`, plus the four boundary
tests, were just wrapping a one-line part-count computation around the
sole call site in `build_sandbox_image`. Inline it.

The splice now reuses the `rootfs_disk_bytes` value already computed
just upstream for builder sizing, so we don't recompute the same
precedence (explicit --disk_mb → parent rootfsDiskBytes for diff → default).

`MULTIPART_PART_SIZE_MB` stays as the one tunable, and the clamp /
saturation rationale moves into the comment at the call site.

Net -42 lines.

Co-authored-by: Cursor <cursoragent@cursor.com>
@sgirones sgirones changed the title from "sandbox-image: bridge CLI to sandbox-proxy sign_blob for upload presign" to "sandbox-image: bridge CLI to sandbox-proxy sign_blob" on May 26, 2026
sgirones and others added 2 commits May 26, 2026 11:54
Drop `#[serde(rename_all = "camelCase")]` so `rel_path` goes on the wire
as `rel_path` to match the sandbox-proxy's expected payload shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop the part size from 100 MiB to 64 MiB and cap the requested part
count at S3's 10,000-part limit so absurd disk budgets don't ask the
proxy to mint an invalid multipart op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sgirones sgirones requested a deployment to test-pypi May 26, 2026 10:44 — with GitHub Actions Abandoned
sgirones and others added 3 commits May 26, 2026 13:29
Extend SignBlobRequest to accept either a rel_path or a full uri and
add a SingleGet BlobOp so the proxy can presign downloads. When a
prepared spec includes a parent, fetch a signed download for the
parent manifest URI and inject it into the prepared spec.

Cross-reference MAX_MULTIPART_PARTS in the dataplane's sign_blob
endpoint so a future change to either side flags the other.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>