
ci(gpu): add separate GPU test workflows #773

Open
pimlock wants to merge 7 commits into main from OS-13-gpu-runner-smoke-test

pimlock (Collaborator) commented Apr 6, 2026

Summary

  • add a dedicated test-gpu.yml workflow for validating the new shared NVIDIA GPU runner infrastructure
  • keep the existing branch E2E and branch checks workflows unchanged for now
  • add the repository-side copy-pr-bot configuration required for workflows that run on NVIDIA-hosted workers

Why copy-pr-bot is part of this PR

The new GPU runners are NVIDIA-hosted workers, so on this public repository we cannot rely on ordinary pull_request jobs to run code on them. The trusted path is:

  • a pull request is reviewed
  • copy-pr-bot syncs the exact PR head SHA into pull-request/<number> in the source repository
  • the GPU workflow runs from that trusted push event

This PR sets up that path in OpenShell. Anything else we later want to run on these hosted workers will need the same trusted pull-request/* mechanism.
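As a rough illustration of the trusted path above, a workflow can be limited to pushes on the mirrored branches. This is a minimal sketch and not the contents of test-gpu.yml: the workflow name, job name, and runner label are placeholders, and only the pull-request/<number> branch naming comes from the description.

```yaml
# Hypothetical sketch: trigger only on branches that copy-pr-bot creates.
name: GPU trigger sketch

on:
  push:
    branches:
      # Matches pull-request/773 and similar; ordinary pull_request events
      # are intentionally not listed, so untrusted PR code never starts this job.
      - "pull-request/[0-9]+"

jobs:
  trusted-entry:
    # Placeholder label; the real NVIDIA-hosted runner labels are not given here.
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Show which mirrored ref triggered the run
        run: echo "Running from ${GITHUB_REF_NAME} at ${GITHUB_SHA}"
```

Because the job is keyed to a push event in the source repository, the code that runs is whatever SHA was synced to that branch after review.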

GPU test scope right now

The GPU coverage here is intentionally a smoke test, not a deep framework validation:

  • it boots the cluster with GPU passthrough enabled
  • it runs the existing GPU E2E test that checks device visibility via nvidia-smi

Deeper validation, such as a PyTorch-backed CUDA test with a dedicated community GPU image, is tracked separately in OS-26.
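For a sense of what a device-visibility check can look like, here is a hedged sketch of a single workflow step. It is not the existing GPU E2E test, which runs against the cluster; the step name is hypothetical, and it assumes nvidia-smi is on the PATH of the environment where the step runs.

```yaml
# Hypothetical step fragment: fail the job if no GPU is visible.
steps:
  - name: Check GPU visibility (sketch)
    shell: bash
    run: |
      set -euo pipefail
      # `nvidia-smi -L` prints one line per visible GPU, each starting with "GPU ".
      # `|| true` keeps the script alive so the explicit check below reports the failure.
      gpu_count="$(nvidia-smi -L | grep -c '^GPU ' || true)"
      if [ "${gpu_count}" -lt 1 ]; then
        echo "No GPU visible to nvidia-smi" >&2
        exit 1
      fi
      echo "Visible GPUs: ${gpu_count}"
```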

Why this is separate from existing branch checks

This is deliberately separate from the current branch checks and standard branch E2E flow. The goal of this PR is to prove out the new shared GPU infra and the trusted pull-request/* execution model first. Once that path is stable, we can decide whether to consolidate other checks onto the same mechanism.

Changes

  • add .github/workflows/test-gpu.yml as the top-level GPU workflow
  • add .github/workflows/e2e-gpu-test.yaml as the reusable GPU matrix workflow
  • keep .github/workflows/branch-e2e.yml and .github/workflows/e2e-test.yml focused on the existing standard E2E path
  • add .github/copy-pr-bot.yaml so this repo can use the trusted pull-request/* sync flow
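The split between a top-level workflow and a reusable one can be sketched as follows. The file paths are the ones listed above, but the job names, runner label, and step contents are assumptions for illustration and do not reproduce the files in this PR.

```yaml
# --- .github/workflows/test-gpu.yml (hypothetical contents) ---
name: GPU Test
on:
  push:
    branches:
      - "pull-request/[0-9]+"
jobs:
  e2e-gpu:
    # Delegate to the reusable workflow in the same repository and commit.
    uses: ./.github/workflows/e2e-gpu-test.yaml
---
# --- .github/workflows/e2e-gpu-test.yaml (hypothetical contents) ---
name: GPU E2E (reusable)
on:
  workflow_call:
jobs:
  gpu-e2e:
    runs-on: ubuntu-latest  # placeholder; real GPU runner labels are not shown in this PR description
    steps:
      - uses: actions/checkout@v4
      - run: echo "GPU E2E steps would run here"
```

Keeping the trigger in the caller and the test logic in the workflow_call file leaves the standard E2E workflows untouched while the GPU path is proven out.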

Testing

  • validated workflow YAML locally
  • verified the trusted pull-request/* path by mirroring this PR head to pull-request/773
  • verified the GPU workflow gates on PR metadata from nv-gha-runners/get-pr-info
  • verification run: https://github.com/NVIDIA/OpenShell/actions/runs/24049186680
  • Linux arm64 and Linux amd64 smoke-test legs passed on the trusted branch run; WSL remains experimental and non-blocking
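One common way to keep an experimental leg non-blocking is a matrix entry flagged with continue-on-error. The sketch below assumes that approach; the matrix keys, leg names, and runner labels are placeholders, and the PR may implement "okay to fail" differently.

```yaml
# Hypothetical job fragment: two required legs plus one experimental leg.
jobs:
  gpu-smoke:
    strategy:
      fail-fast: false  # let every leg finish even if another one fails
      matrix:
        include:
          - name: linux-amd64
            runner: placeholder-gpu-amd64
            experimental: false
          - name: linux-arm64
            runner: placeholder-gpu-arm64
            experimental: false
          - name: wsl
            runner: placeholder-gpu-wsl
            experimental: true
    name: gpu-smoke (${{ matrix.name }})
    runs-on: ${{ matrix.runner }}
    # A failure in an experimental leg does not fail the workflow run.
    continue-on-error: ${{ matrix.experimental }}
    steps:
      - run: echo "Smoke test for ${{ matrix.name }}"
```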

Checklist

  • Scoped to GPU workflow and trusted-runner setup
  • No secrets or credentials added
  • Standard branch E2E remains unchanged for now
  • copy-pr-bot repo config added
  • Trusted pull-request/* GPU workflow path verified

@pimlock pimlock added the test:e2e-gpu Requires GPU end-to-end coverage label Apr 6, 2026
@pimlock pimlock removed the test:e2e-gpu Requires GPU end-to-end coverage label Apr 6, 2026
@pimlock pimlock self-assigned this Apr 6, 2026
@pimlock pimlock added the test:e2e-gpu Requires GPU end-to-end coverage label Apr 6, 2026
@pimlock pimlock changed the title ci(actions): add separate GPU test workflows ci(gpu): add separate GPU test workflows Apr 6, 2026
pimlock (Collaborator, Author) commented Apr 6, 2026

Note: the failing WSL job is expected. It's marked as "okay to fail" for now and should start working once this PR is approved and #608 is merged.

@pimlock pimlock marked this pull request as ready for review April 6, 2026 20:36
@pimlock pimlock requested a review from a team as a code owner April 6, 2026 20:36
@@ -0,0 +1,3 @@
enabled: true

Note: this is a config for the copy-pr-bot, which is an extra security measure for running tests when the code may be coming from an external fork (see https://docs.gha-runners.nvidia.com/platform/apps/copy-pr-bot/).

The same mechanism applies to regular branches as well, but there the process is automatic as long as the commit is signed.

@pimlock pimlock requested a review from drew April 6, 2026 20:43
