
Migrate e2e tests from Buildkite to GitHub Actions #781

Open
nickvanw wants to merge 9 commits into main from ci/migrate-to-github-actions

Conversation

@nickvanw
Contributor

Summary

Moves the eight end-to-end tests (upgrade, backup-restore, three backup-schedule variants, vtorc-vtadmin, unmanaged-tablet, hpa) off the Buildkite public queue and onto GitHub Actions. Each test runs as a matrix job on ubuntu-latest-8-cores (8 vCPU / 32 GB), all in parallel, with no per-test concurrency gate.

Why

Buildkite's agent model put every job inside a docker:latest container talking to a shared host Docker daemon, which forced a lot of incidental complexity on us: a 15-line Alpine bootstrap in every step to install Go from a tarball, a sibling-container networking hack to make kubectl reach kind, a pre-exit hook that ran sudo fix-buildkite-agent-builds-permissions to undo chown damage from the operator container mutating the shared checkout, and a concurrency: 1 gate per test to keep two jobs on the same agent from fighting over the hardcoded vitess-operator-pr:latest image tag and the fixed 15306/14001/15999 port-forward ports.

On GitHub-hosted runners every job gets a fresh VM with its own Docker daemon and its own localhost, so all four problems evaporate for free. The Alpine bootstrap becomes actions/setup-go, the sibling-container hack is deleted, the permission fix-up is unnecessary on ephemeral runners, and the concurrency gate is gone because the resources the tests share on Buildkite aren't shared on GHA.

Changes

  • New .github/workflows/e2e-test.yaml — matrix over the 8 test targets, fail-fast: false, 40-minute timeout per job, installs mysql-client + chromium-browser (the latter is only needed by vtorc-vtadmin but harmless to preinstall everywhere).
  • test/endtoend/utils.sh
    • Renamed BUILDKITE_JOB_ID to CI_JOB_ID (still defaults to 0 so make upgrade-test on a laptop is unchanged). The workflow sets it to ${{ github.run_id }}-${{ github.run_attempt }}-${{ matrix.test.target }}.
    • Deleted setupKubectlAccessForCI entirely — the Buildkite-specific docker network connect kind + kind get kubeconfig --internal dance is not needed when kind runs directly on the runner host.
    • The docker build --progress plain branch now keys off $CI (GHA sets this automatically).
  • Deleted .buildkite/pipeline.yml and .buildkite/hooks/pre-exit.
  • docs/release-process.md — updated one stale "buildkite" reference.
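The changes above imply a workflow of roughly this shape. This is a sketch, not the merged file: the step names, the setup-go version, and the make target names are assumptions, while the runner label, timeout, fail-fast setting, CI_JOB_ID expression, and checkout@v6 are taken from the PR itself.

```yaml
# Sketch of .github/workflows/e2e-test.yaml (illustrative, not the merged file)
name: e2e-test
on: [pull_request]
jobs:
  e2e:
    runs-on: ubuntu-latest-8-cores
    timeout-minutes: 40
    strategy:
      fail-fast: false
      matrix:
        test:
          - { name: Upgrade Test, target: upgrade-test }
          - { name: Backup Restore Test, target: backup-restore-test }
          # ...six more matrix entries, one per e2e target
    env:
      CI_JOB_ID: ${{ github.run_id }}-${{ github.run_attempt }}-${{ matrix.test.target }}
    steps:
      - name: Check out code
        uses: actions/checkout@v6
      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version-file: go.mod
      - name: Install dependencies
        run: sudo apt-get update && sudo apt-get install -y mysql-client chromium-browser
      - name: Run e2e test
        run: make ${{ matrix.test.target }}
```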

Follow-ups (outside this PR)

  1. Verify ubuntu-latest-8-cores resolves — if the org hasn't provisioned larger runners, the first run will fail fast and we can swap the label to ubuntu-latest (4 vCPU / 16 GB) or whatever larger label is available.
  2. After the first green run, update branch protection on main and release-** to require the new e2e matrix check names and drop the Buildkite required checks.
  3. Turn off the Buildkite pipeline in the Buildkite UI so it stops running on PRs.
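For follow-up 2, the currently required checks can be read before touching branch protection. A sketch using the GitHub CLI; the `{owner}/{repo}` placeholders are resolved by gh from the current repository, and this is the standard read-only branch-protection endpoint:

```shell
# List status checks currently required on main, so the Buildkite check
# names can be swapped for the new e2e matrix job names.
gh api repos/{owner}/{repo}/branches/main/protection/required_status_checks/contexts
```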

Move the eight end-to-end tests (upgrade, backup-restore, three
backup-schedule variants, vtorc-vtadmin, unmanaged-tablet, hpa) from the
Buildkite public queue onto GitHub Actions using ubuntu-latest-8-cores
runners. Each test runs as its own matrix job, in parallel, on a fresh
VM per job — which lets us drop Buildkite's per-test concurrency gate
since the collisions it was guarding against (shared vitess-operator-pr
image tag, fixed localhost port-forward ports, shared kind docker
network on sibling-container agents) no longer exist when each job gets
its own runner and Docker daemon.

utils.sh loses the BUILDKITE_JOB_ID coupling: the variable is renamed to
CI_JOB_ID (set from github.run_id + run_attempt + matrix target), the
sibling-container networking hack in setupKubectlAccessForCI is removed
because kind now runs directly on the runner host, and the docker build
progress flag keys off $CI instead. The pre-exit hook that reset
perms and cleaned up the shared Docker state is no longer needed — GHA
runners are ephemeral.

Branch protection on main and release-** will need the new e2e check
names added and the Buildkite checks removed once the first run is
green.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
Copilot AI review requested due to automatic review settings April 16, 2026 19:40
Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>

Copilot AI left a comment


Pull request overview

Migrates the repository’s end-to-end CI coverage from Buildkite to GitHub Actions by introducing a new GHA matrix workflow and removing Buildkite-specific pipelines/hooks, while adjusting the e2e test harness to be CI-agnostic.

Changes:

  • Added a new GitHub Actions workflow that runs 8 e2e tests as parallel matrix jobs.
  • Updated e2e test utilities to use a generic CI_JOB_ID and removed Buildkite-only kubectl/kind networking setup.
  • Removed Buildkite pipeline/hook files and updated release documentation to remove a stale Buildkite reference.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

  • test/endtoend/utils.sh — Replaces Buildkite-specific job ID handling with CI_JOB_ID, removes the Buildkite-only kubectl access workaround, and tweaks CI log output behavior.
  • docs/release-process.md — Rewords a Buildkite-specific note to refer generically to end-to-end tests.
  • .github/workflows/e2e-test.yaml — Introduces the new GHA matrix workflow to run all e2e test targets in parallel.
  • .buildkite/pipeline.yml — Deletes the old Buildkite pipeline configuration.
  • .buildkite/hooks/pre-exit — Deletes the old Buildkite pre-exit cleanup/permission workaround hook.


      CI_JOB_ID: ${{ github.run_id }}-${{ github.run_attempt }}-${{ matrix.test.target }}
    steps:
      - name: Check out code
        uses: actions/checkout@v6

Copilot AI Apr 16, 2026


actions/checkout is referenced by a mutable tag (@v6). For supply-chain safety and reproducibility, pin this action to a specific commit SHA (similar to how actions/setup-go is pinned) so workflow runs can't change behavior unexpectedly if the tag is moved.

Suggested change:

    - uses: actions/checkout@v6
    + uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332 # pinned for supply-chain safety

Comment thread .github/workflows/e2e-test.yaml Outdated
Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
Two issues surfaced on the first GHA run:

1. The `Backup Schedule vtctldclient Method Test` failed kind cluster
   creation with `sethostname: invalid argument` because the container
   hostname `kind-<CI_JOB_ID>-control-plane` exceeded the Linux 64-char
   HOST_NAME_MAX. The `github.run_id`+`run_attempt`+target scheme I
   used was overkill anyway — each GHA job runs in its own ephemeral VM,
   so there's no collision risk from reusing a short cluster name.
   Shorten CI_JOB_ID to just the matrix target.

2. The `Unmanaged Tablet Test` failed at `apt install chromium-browser`
   when the runner's snap layer hit an apparmor error on mesa-2404.
   That test doesn't even need chromium — only `vtorc-vtadmin-test`
   uses headless chromium. Narrow the install to only that matrix entry
   and route through the well-maintained `browser-actions/setup-chrome`
   action, aliasing the chrome binary to `chromium-browser` so the
   existing `getChromiumBinaryName` discovery in utils.sh still works.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
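The HOST_NAME_MAX overflow in issue 1 is easy to reproduce with arithmetic alone. A sketch using an illustrative 11-digit run_id, attempt 1, and a hypothetical long target name (the exact values on the failed run may differ):

```shell
# Hypothetical run_id/attempt/target; kind names the control-plane
# container "<cluster>-control-plane".
CI_JOB_ID="12345678901-1-backup-schedule-vtctldclient-test"
hostname="kind-${CI_JOB_ID}-control-plane"
echo "${#hostname}"   # 66 -- over the Linux 64-char HOST_NAME_MAX
```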
Every e2e test running under the new GHA runner group failed with
mysqld pods in CrashLoopBackOff. The pattern (mysqld exiting ~17ms
after spawn, before producing any InnoDB output) and the runner image
tag (ubuntu-24.04 / Noble) point to the AppArmor userns restriction
that Ubuntu 23.10 introduced:

  kernel.apparmor_restrict_unprivileged_userns=1 (default)

This blocks processes inside nested containers from creating their own
user namespaces, which mysqld depends on during startup. Buildkite's
public queue runs on an older base, so it doesn't hit this.

Workaround is the standard one for kind-in-Ubuntu-24.04 CI: sysctl the
restriction off at the start of the job. Also bump inotify limits,
which kind wants for its file watchers once a cluster has more than a
couple of pods running.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
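The workaround step described above would look roughly like this in the workflow. The inotify values are the ones commonly used for kind in CI, not necessarily the exact values in the PR:

```shell
# Allow unprivileged user namespaces inside nested containers (blocked by
# default since Ubuntu 23.10), and raise inotify limits for kind's watchers.
sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0
sudo sysctl -w fs.inotify.max_user_watches=524288
sudo sysctl -w fs.inotify.max_user_instances=512
```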
Previous apparmor_restrict_unprivileged_userns=0 alone didn't unblock
mysqld inside the vttablet pods — 6 of 8 tests still CrashLoopBackOff
with the same my.cnf symptom after kind comes up fine. Notably
Unmanaged Tablet Test passed (the one test that does not run a
vitess-operator-managed mysqld inside the cluster), which pins the
remaining breakage on something specific to mysqld-in-nested-container.

Ubuntu 24.04 ships multiple layers of restriction on unprivileged user
namespaces plus a broader AppArmor profile for Docker. Belt-and-
suspenders: also clear apparmor_restrict_unprivileged_unconfined,
enable unprivileged_userns_clone, raise user.max_user_namespaces, and
tear AppArmor down entirely. The runner VM is ephemeral, so neutering
AppArmor for the job has zero blast radius.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
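A sketch of the extra knobs this commit reaches for, on top of the earlier apparmor_restrict_unprivileged_userns sysctl (the max_user_namespaces value is illustrative; it just needs headroom):

```shell
sudo sysctl -w kernel.apparmor_restrict_unprivileged_unconfined=0
sudo sysctl -w kernel.unprivileged_userns_clone=1
sudo sysctl -w user.max_user_namespaces=28633   # illustrative value
# Ephemeral runner VM, so tearing AppArmor down has no blast radius.
sudo systemctl stop apparmor.service
sudo aa-teardown
```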
Stopping apparmor.service and running aa-teardown in the previous
commit unloads all AppArmor profiles from the kernel, including the
docker-default profile that BuildKit applies to build containers. The
operator image build then died with `runc run failed ... unable to
apply apparmor profile`, vitess-operator-pr:latest never got built,
kind load fell through, and every pod sat at ErrImageNeverPull.

Keep the userns sysctls (they were the actual target of the fix) and
leave the AppArmor service and profiles alone so Docker keeps working.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
Previous attempt to neuter AppArmor broke the docker build because
BuildKit requires the docker-default profile to launch build
containers. Previous attempt to keep AppArmor on while just setting
userns sysctls left mysqld still crashing in nested pods (same
my.cnf symptom, 5 of 8 e2e tests red).

Split the work across two workflow steps:

  1. Build operator image — runs with AppArmor still up so BuildKit
     is happy and produces vitess-operator-pr:latest.
  2. Disable AppArmor before kind — stops apparmor.service and runs
     aa-teardown now that no more docker builds need to happen.

setupBuildContainerImage in utils.sh gains a docker-image-inspect
short-circuit so the in-test build call is a no-op when CI already
pre-built the image. Locally (no pre-built image) it still builds as
before.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
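The docker-image-inspect short-circuit might look like this. A sketch, not the exact diff: the Dockerfile path and build flags are assumptions, but the skip-if-already-built logic is what the commit describes.

```shell
# Sketch of the short-circuit added to setupBuildContainerImage in utils.sh:
# skip the in-test build when CI has already produced the image, otherwise
# build as before (the local/laptop path is unchanged).
setupBuildContainerImage() {
  if docker image inspect vitess-operator-pr:latest >/dev/null 2>&1; then
    echo "vitess-operator-pr:latest already present, skipping docker build"
    return 0
  fi
  # Dockerfile path is an assumption for illustration.
  docker build -t vitess-operator-pr:latest .
}
```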
@mattlord mattlord requested review from mattlord and mhamza15 April 16, 2026 22:16
Collaborator

@mattlord mattlord left a comment


❤️

3 participants