From 546646878819c089c9e46a82af649d98e70043f5 Mon Sep 17 00:00:00 2001
From: JacobPEvans <20714140+JacobPEvans@users.noreply.github.com>
Date: Fri, 15 May 2026 08:42:14 -0400
Subject: [PATCH 1/2] feat(infra-standards): add self-hosted-runners skill for RunsOn migration

Activates on .github/workflows/*.yml file edits to document when a
workflow should target self-hosted RunsOn runners vs github-hosted ones,
the v3 label catalog used across JacobPEvans repos, and the prereq that
the RunsOn GitHub App must have the repo in its allowlist.

This is the authoring-time companion to
terraform-runs-on/docs/migration-guide.md (the deployment-time playbook).
The skill covers what reusable workflows in JacobPEvans/.github do with
the runner_label input, how to identify whether a run actually landed on
RunsOn (RUNS_ON_VERSION env var), and which workloads explicitly do NOT
migrate (macOS, gh-aw lock files, disabled-schedule workflows).

Assisted-by: Claude
---
 infra-standards/.claude-plugin/plugin.json    |   5 +-
 infra-standards/README.md                     |   4 +
 .../skills/self-hosted-runners/SKILL.md       | 115 ++++++++++++++++++
 3 files changed, 122 insertions(+), 2 deletions(-)
 create mode 100644 infra-standards/skills/self-hosted-runners/SKILL.md

diff --git a/infra-standards/.claude-plugin/plugin.json b/infra-standards/.claude-plugin/plugin.json
index 5e184e5..6a07ba1 100644
--- a/infra-standards/.claude-plugin/plugin.json
+++ b/infra-standards/.claude-plugin/plugin.json
@@ -7,8 +7,9 @@
   },
   "license": "MIT",
   "repository": "https://github.com/JacobPEvans/claude-code-plugins",
-  "keywords": ["infrastructure", "terraform", "ansible", "proxmox", "nix", "devshell"],
+  "keywords": ["infrastructure", "terraform", "ansible", "proxmox", "nix", "devshell", "runs-on", "github-actions"],
   "skills": [
-    "./skills/infrastructure-standards"
+    "./skills/infrastructure-standards",
+    "./skills/self-hosted-runners"
   ]
 }
diff --git a/infra-standards/README.md b/infra-standards/README.md
index a651268..2d74eaa 100644
--- a/infra-standards/README.md
+++ b/infra-standards/README.md
@@ -5,6 +5,9 @@ Infrastructure standards for Proxmox, Terraform, Ansible including deployment pi
 ## Skills
 
 - **`/infrastructure-standards`** - Deployment pipeline, VMID/IP mapping, dev shells, Doppler/SOPS, Terraform inventory
+- **`/self-hosted-runners`** - When to target RunsOn vs github-hosted runners in
+  `.github/workflows/*.yml`, the v3 label catalog used across the org, the required
+  `github.run_id` segment, and the GitHub App allowlist prereq
 
 ## Installation
 
@@ -16,6 +19,7 @@ claude plugins add jacobpevans-cc-plugins/infra-standards
 
 ```text
 /infrastructure-standards
+/self-hosted-runners
 ```
 
 ## License
diff --git a/infra-standards/skills/self-hosted-runners/SKILL.md b/infra-standards/skills/self-hosted-runners/SKILL.md
new file mode 100644
index 0000000..4867b82
--- /dev/null
+++ b/infra-standards/skills/self-hosted-runners/SKILL.md
@@ -0,0 +1,115 @@
+---
+name: self-hosted-runners
+description: Use when editing GitHub Actions workflow files (.github/workflows/*.yml) in JacobPEvans repos. Documents when to target self-hosted RunsOn runners vs github-hosted runners, the v3 label catalog used across the org, the required github.run_id segment, and the prerequisite GitHub App allowlist.
+---
+
+# Self-Hosted Runners (RunsOn)
+
+JacobPEvans repos use self-hosted RunsOn runners deployed by
+[terraform-runs-on](https://github.com/JacobPEvans/terraform-runs-on) for
+Linux GitHub Actions jobs. The infrastructure is paid for whether it runs
+or not (~$3.50/month of fixed App Runner + CloudWatch). Workflows that
+stay on `ubuntu-latest` waste GitHub Actions minutes that don't need to
+be spent. Migrate any Linux job in the org that isn't covered by the
+"do not migrate" list below.
+
+## When to target RunsOn
+
+| Workload | Decision |
+| --- | --- |
+| Linux job (lint, validate, build, test) | **RunsOn** — almost always |
+| Job that needs `nix flake check --all-systems` | **RunsOn** with more RAM (see catalog) |
+| Job that runs on `macos-latest` | **GitHub-hosted** — RunsOn EC2 Mac has a 24-hour minimum allocation, costs more than `macos-latest` for short jobs |
+| Job that runs on `windows-latest` | **RunsOn** supports Windows; treat case-by-case |
+| Job generated by `gh-aw compile` (`*.lock.yml`) | **GitHub-hosted** — lock file is regenerated; runner label must flow through the `.md` companion (gh-aw doesn't expose this yet) |
+| Job with disabled `schedule:` (manual dispatch only, rarely runs) | **GitHub-hosted** — migration saves nothing |
+| Job in a repo that hasn't been added to the RunsOn GitHub App allowlist | **GitHub-hosted** — install the app first |
+
+## Runner label catalog
+
+Use the single-string format. The leading `runs-on=${{ github.run_id }}`
+segment is **required** so the RunsOn control plane can correlate the
+GitHub Actions `workflow_job` webhook back to the originating run.
+Omitting it makes the job hang in `queued` forever.
+
+| Workload | Label string |
+| --- | --- |
+| Standard step (lint, validate, small build) | `runs-on=${{ github.run_id }}/runner=2cpu-linux-x64` |
+| Nix `flake check` (Linux only) | `runs-on=${{ github.run_id }}/cpu=4/ram=16/family=m7+c7/extras=s3-cache` |
+| Build with large dependency cache | `runs-on=${{ github.run_id }}/cpu=4/ram=16/volume=80gb:gp3:500mbs:4000iops/extras=s3-cache` |
+| Heavy CPU (terraform plan over many modules) | `runs-on=${{ github.run_id }}/cpu=8/ram=32/family=c7a` |
+| GPU (Hugging Face, MLX cross-eval) | `runs-on=${{ github.run_id }}/family=g4dn.xlarge/image=ubuntu22-gpu-x64` |
+
+## Pattern in YAML
+
+```yaml
+jobs:
+  validate:
+    runs-on: "runs-on=${{ github.run_id }}/runner=2cpu-linux-x64"
+    steps:
+      - uses: actions/checkout@v6
+      - run: ...
+```
+
+For reusable workflows in `JacobPEvans/.github`, callers pass the label
+through the `runner_label` input (default `ubuntu-latest`):
+
+```yaml
+jobs:
+  markdown-lint:
+    uses: JacobPEvans/.github/.github/workflows/_markdown-lint.yml@main
+    with:
+      runner_label: "runs-on=${{ github.run_id }}/runner=2cpu-linux-x64"
+```
+
+Each reusable workflow's `runs-on:` line then expands the input:
+
+```yaml
+jobs:
+  lint:
+    runs-on: ${{ inputs.runner_label }}
+```
+
+This keeps `ubuntu-latest` as the safe default for consumers that haven't
+opted in.
+
+## Prerequisites for a new repo
+
+1. The RunsOn CloudFormation stack must be applied (`terraform-runs-on/main`).
+2. The RunsOn GitHub App must be installed on the target repo (either
+   organization-wide with "All repositories" selected, or the repo added
+   individually under the App settings).
+3. The first migrated workflow should be a low-risk canary (a lint or
+   validate job, not the whole `Merge Gate`). Watch one run end-to-end
+   before migrating the rest.
+
+## Identifying RunsOn vs github-hosted in a run
+
+In the GitHub Actions UI, expand the `Set up runner` group on any step.
+A RunsOn run prints:
+
+```text
+RUNS_ON_VERSION: v3.x.x
+RUNS_ON_INSTANCE_ID: i-...
+RUNS_ON_INSTANCE_TYPE: m8i.large
+RUNS_ON_INSTANCE_LIFECYCLE: spot
+```
+
+If these are missing despite the `runs-on=...` label, GitHub silently fell
+back to the default github-hosted pool because the label didn't match a
+configured runner (most often: the repo isn't in the App allowlist, or
+`${{ github.run_id }}` was missing from the label).
+
+## Cost allocation
+
+Every RunsOn-launched EC2 instance is tagged with `runs-on=...`. AWS Cost
+Explorer can be filtered by that tag group to attribute spend per repo,
+workflow, and job. No per-workflow setup is needed — the tag is applied
+by the RunsOn control plane.
+
+## Related
+
+- [Migration guide](https://github.com/JacobPEvans/terraform-runs-on/blob/main/docs/migration-guide.md) — the canonical playbook for migrating a single repo
+- [RunsOn v3 job labels](https://runs-on.com/configuration/job-labels/) — upstream label spec
+- [terraform-runs-on README](https://github.com/JacobPEvans/terraform-runs-on/blob/main/README.md) — infra deployment
+- `ci-cd-policy` rule (auto-loaded org rule) — billing/runner policy

From a50b1877336901ec847f58be2d5332440b1f2252 Mon Sep 17 00:00:00 2001
From: JacobPEvans <20714140+JacobPEvans@users.noreply.github.com>
Date: Fri, 15 May 2026 09:12:04 -0400
Subject: [PATCH 2/2] fix(self-hosted-runners): address review feedback on SKILL.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three corrections caught in review of #308:

1. **YAML frontmatter description** was 306 chars on one line, exceeding
   the repo's markdownlint MD013 line-length limit (160). Wrapped onto
   multiple lines with the `>-` folded scalar form. The skill still
   triggers on the same content; only the line breaks change.

2. **"Silent fallback" diagnostic was wrong.** GitHub does NOT silently
   route a job back to github-hosted compute when a custom `runs-on=...`
   label is unmatched — the job hangs in `queued` state waiting for a
   matching runner. Rewrote the "identifying RunsOn vs github-hosted"
   section to describe the actual failure mode (queue stall) and point
   at the `_ci-gate.yml` watchdog as the safety net.

3. **Hardcoded $3.50/month estimate** would rot as App Runner /
   CloudWatch pricing changes. Replaced with a link to
   terraform-runs-on/README.md (where the canonical estimate lives).

Also reworded the "do not migrate list below" reference to point at the
decision table rather than a nonexistent named list.

Assisted-by: Claude
---
 .../skills/self-hosted-runners/SKILL.md       | 33 +++++++++++++------
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/infra-standards/skills/self-hosted-runners/SKILL.md b/infra-standards/skills/self-hosted-runners/SKILL.md
index 4867b82..878aecd 100644
--- a/infra-standards/skills/self-hosted-runners/SKILL.md
+++ b/infra-standards/skills/self-hosted-runners/SKILL.md
@@ -1,17 +1,23 @@
 ---
 name: self-hosted-runners
-description: Use when editing GitHub Actions workflow files (.github/workflows/*.yml) in JacobPEvans repos. Documents when to target self-hosted RunsOn runners vs github-hosted runners, the v3 label catalog used across the org, the required github.run_id segment, and the prerequisite GitHub App allowlist.
+description: >-
+  Use when editing GitHub Actions workflow files (.github/workflows/*.yml)
+  in JacobPEvans repos. Documents when to target self-hosted RunsOn runners
+  vs GitHub-hosted runners, the v3 label catalog used across the org, the
+  required github.run_id segment, and the GitHub App allowlist prereq.
 ---
 
 # Self-Hosted Runners (RunsOn)
 
 JacobPEvans repos use self-hosted RunsOn runners deployed by
 [terraform-runs-on](https://github.com/JacobPEvans/terraform-runs-on) for
-Linux GitHub Actions jobs. The infrastructure is paid for whether it runs
-or not (~$3.50/month of fixed App Runner + CloudWatch). Workflows that
-stay on `ubuntu-latest` waste GitHub Actions minutes that don't need to
-be spent. Migrate any Linux job in the org that isn't covered by the
-"do not migrate" list below.
+Linux GitHub Actions jobs. The control plane has a fixed monthly cost
+whether or not jobs run (App Runner + CloudWatch — see
+[terraform-runs-on/README.md](https://github.com/JacobPEvans/terraform-runs-on/blob/main/README.md)
+for the current estimate). Workflows that stay on `ubuntu-latest` waste
+GitHub Actions minutes that don't need to be spent. Migrate any Linux job
+in the org that isn't covered by the **GitHub-hosted** rows in the decision
+table below.
 
 ## When to target RunsOn
 
@@ -95,10 +101,17 @@ RUNS_ON_INSTANCE_TYPE: m8i.large
 RUNS_ON_INSTANCE_LIFECYCLE: spot
 ```
 
-If these are missing despite the `runs-on=...` label, GitHub silently fell
-back to the default github-hosted pool because the label didn't match a
-configured runner (most often: the repo isn't in the App allowlist, or
-`${{ github.run_id }}` was missing from the label).
+If those variables are missing despite the `runs-on=...` label, the job
+didn't land on RunsOn. GitHub does **not** silently fall back to github-hosted
+when a custom label is unmatched — the job sits in `queued` state waiting
+for a runner that never picks it up. Most common causes: the repo isn't in
+the RunsOn GitHub App allowlist, the `${{ github.run_id }}` segment is
+missing from the label so the control plane can't correlate the
+`workflow_job` webhook, or AWS spot capacity for the requested family is
+briefly exhausted (RunsOn v3's spot circuit breaker handles this but a
+queue stall can still happen during the fallback). The `_ci-gate.yml`
+watchdog in `JacobPEvans/.github` cancels any job stuck in `queued` after
+`queue_timeout_minutes` so the merge gate isn't blocked indefinitely.
 
 ## Cost allocation
 