Merged
5 changes: 3 additions & 2 deletions infra-standards/.claude-plugin/plugin.json
@@ -7,8 +7,9 @@
   },
   "license": "MIT",
   "repository": "https://github.com/JacobPEvans/claude-code-plugins",
-  "keywords": ["infrastructure", "terraform", "ansible", "proxmox", "nix", "devshell"],
+  "keywords": ["infrastructure", "terraform", "ansible", "proxmox", "nix", "devshell", "runs-on", "github-actions"],
   "skills": [
-    "./skills/infrastructure-standards"
+    "./skills/infrastructure-standards",
+    "./skills/self-hosted-runners"
   ]
 }
4 changes: 4 additions & 0 deletions infra-standards/README.md
@@ -5,6 +5,9 @@ Infrastructure standards for Proxmox, Terraform, Ansible including deployment pi
## Skills

- **`/infrastructure-standards`** - Deployment pipeline, VMID/IP mapping, dev shells, Doppler/SOPS, Terraform inventory
- **`/self-hosted-runners`** - When to target RunsOn vs github-hosted runners in
  `.github/workflows/*.yml`, the v3 label catalog used across the org, the required
  `github.run_id` segment, and the GitHub App allowlist prereq

## Installation

@@ -16,6 +19,7 @@ claude plugins add jacobpevans-cc-plugins/infra-standards

```text
/infrastructure-standards
/self-hosted-runners
```

## License
128 changes: 128 additions & 0 deletions infra-standards/skills/self-hosted-runners/SKILL.md
@@ -0,0 +1,128 @@
---
name: self-hosted-runners
description: >-
  Use when editing GitHub Actions workflow files (.github/workflows/*.yml)
  in JacobPEvans repos. Documents when to target self-hosted RunsOn runners
  vs GitHub-hosted runners, the v3 label catalog used across the org, the
  required github.run_id segment, and the GitHub App allowlist prereq.
---

# Self-Hosted Runners (RunsOn)

JacobPEvans repos use self-hosted RunsOn runners deployed by
[terraform-runs-on](https://github.com/JacobPEvans/terraform-runs-on) for
Linux GitHub Actions jobs. The control plane has a fixed monthly cost
whether or not jobs run (App Runner + CloudWatch; see
[terraform-runs-on/README.md](https://github.com/JacobPEvans/terraform-runs-on/blob/main/README.md)
for the current estimate). Because that cost is already paid, workflows
that stay on `ubuntu-latest` spend GitHub Actions minutes unnecessarily.
Migrate any Linux job in the org that isn't covered by the
**GitHub-hosted** rows in the decision table below.

## When to target RunsOn

| Workload | Decision |
| --- | --- |
| Linux job (lint, validate, build, test) | **RunsOn**: almost always |
| Job that needs `nix flake check --all-systems` | **RunsOn** with more RAM (see catalog) |
| Job that runs on `macos-latest` | **GitHub-hosted**: EC2 Mac instances have a 24-hour minimum allocation, so RunsOn costs more than `macos-latest` for short jobs |
| Job that runs on `windows-latest` | **Case by case**: RunsOn supports Windows |
| Job generated by `gh-aw compile` (`*.lock.yml`) | **GitHub-hosted**: the lock file is regenerated, so the runner label must flow through the `.md` companion (gh-aw doesn't expose this yet) |
| Job with disabled `schedule:` (manual dispatch only, rarely runs) | **GitHub-hosted**: migration saves nothing |
| Job in a repo that hasn't been added to the RunsOn GitHub App allowlist | **GitHub-hosted**: install the app first |
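
As a sketch of how the table could apply to one job that runs on both Linux and macOS, the matrix below sends the Linux leg to RunsOn and keeps the macOS leg on a GitHub-hosted runner. The job name, the matrix keys (`os`, `runner`), and the `echo` command are illustrative assumptions, not taken from any repo in the org.

```yaml
jobs:
  test:
    strategy:
      matrix:
        include:
          # Linux leg: RunsOn, using the standard 2-CPU label
          - os: linux
            runner: "runs-on=${{ github.run_id }}/runner=2cpu-linux-x64"
          # macOS leg: stays GitHub-hosted, per the decision table
          - os: macos
            runner: macos-latest
    # Each matrix leg resolves its own runner label
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v6
      # Placeholder command (assumption); substitute the real test step
      - run: echo "testing on ${{ matrix.os }}"
```

Keeping the label in the matrix entry rather than in a conditional expression means each leg's runner is visible at a glance when reviewing the workflow.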

## Runner label catalog

Use the single-string format. The leading `runs-on=${{ github.run_id }}`
segment is **required** so the RunsOn control plane can correlate the
GitHub Actions `workflow_job` webhook back to the originating run.
Omitting it makes the job hang in `queued` forever.

| Workload | Label string |
| --- | --- |
| Standard step (lint, validate, small build) | `runs-on=${{ github.run_id }}/runner=2cpu-linux-x64` |
| Nix `flake check` (Linux only) | `runs-on=${{ github.run_id }}/cpu=4/ram=16/family=m7+c7/extras=s3-cache` |
| Build with large dependency cache | `runs-on=${{ github.run_id }}/cpu=4/ram=16/volume=80gb:gp3:500mbs:4000iops/extras=s3-cache` |
| Heavy CPU (terraform plan over many modules) | `runs-on=${{ github.run_id }}/cpu=8/ram=32/family=c7a` |
| GPU (Hugging Face, MLX cross-eval) | `runs-on=${{ github.run_id }}/family=g4dn.xlarge/image=ubuntu22-gpu-x64` |

## Pattern in YAML

```yaml
jobs:
  validate:
    runs-on: "runs-on=${{ github.run_id }}/runner=2cpu-linux-x64"
    steps:
      - uses: actions/checkout@v6
      - run: ...
```

For reusable workflows in `JacobPEvans/.github`, callers pass the label
through the `runner_label` input (default `ubuntu-latest`):

```yaml
jobs:
  markdown-lint:
    uses: JacobPEvans/.github/.github/workflows/_markdown-lint.yml@main
    with:
      runner_label: "runs-on=${{ github.run_id }}/runner=2cpu-linux-x64"
```

Each reusable workflow's `runs-on:` line then expands the input:

```yaml
jobs:
  lint:
    runs-on: ${{ inputs.runner_label }}
```

This keeps `ubuntu-latest` as the safe default for consumers that haven't
opted in.
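
For the input to resolve, the reusable workflow has to declare it under `on.workflow_call.inputs`. The following is a minimal sketch of what that declaration might look like; the `description` text and the step list are assumptions, and the actual files in `JacobPEvans/.github` may differ.

```yaml
on:
  workflow_call:
    inputs:
      runner_label:
        # Description wording is illustrative (assumption)
        description: Runner label for the lint job
        type: string
        required: false
        # Callers that pass nothing stay on the GitHub-hosted runner
        default: ubuntu-latest

jobs:
  lint:
    runs-on: ${{ inputs.runner_label }}
    steps:
      - uses: actions/checkout@v6
```

Because the `${{ github.run_id }}` expression sits in the caller's `with:` block, the reusable workflow receives an already-expanded string and needs no RunsOn-specific logic of its own.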

## Prerequisites for a new repo

1. The RunsOn CloudFormation stack must be applied (`terraform-runs-on/main`).
2. The RunsOn GitHub App must be installed on the target repo (either
   organization-wide with "All repositories" selected, or the repo added
   individually under the App settings).
3. The first migrated workflow should be a low-risk canary (a lint or
   validate job, not the whole `Merge Gate`). Watch one run end-to-end
   before migrating the rest.

## Identifying RunsOn vs github-hosted in a run

In the GitHub Actions UI, expand the `Set up runner` group on any step.
A RunsOn run prints:

```text
RUNS_ON_VERSION: v3.x.x
RUNS_ON_INSTANCE_ID: i-...
RUNS_ON_INSTANCE_TYPE: m8i.large
RUNS_ON_INSTANCE_LIFECYCLE: spot
```

If those variables are missing despite the `runs-on=...` label, the job
didn't land on RunsOn. GitHub does **not** silently fall back to github-hosted
when a custom label is unmatched; the job sits in `queued` state waiting
for a runner that never picks it up. The most common causes:

- The repo isn't in the RunsOn GitHub App allowlist.
- The `${{ github.run_id }}` segment is missing from the label, so the
  control plane can't correlate the `workflow_job` webhook.
- AWS spot capacity for the requested family is briefly exhausted (RunsOn
  v3's spot circuit breaker handles this, but a queue stall can still
  happen during the fallback).

The `_ci-gate.yml` watchdog in `JacobPEvans/.github` cancels any job stuck
in `queued` after `queue_timeout_minutes`, so the merge gate isn't blocked
indefinitely.

## Cost allocation

Every RunsOn-launched EC2 instance is tagged with `runs-on=...`. AWS Cost
Explorer can be filtered by that tag group to attribute spend per repo,
workflow, and job. No per-workflow setup is needed — the tag is applied
by the RunsOn control plane.

## Related

- [Migration guide](https://github.com/JacobPEvans/terraform-runs-on/blob/main/docs/migration-guide.md) — the canonical playbook for migrating a single repo
- [RunsOn v3 job labels](https://runs-on.com/configuration/job-labels/) — upstream label spec
- [terraform-runs-on README](https://github.com/JacobPEvans/terraform-runs-on/blob/main/README.md) — infra deployment
- `ci-cd-policy` rule (auto-loaded org rule) — billing/runner policy