Migrate e2e tests from Buildkite to GitHub Actions#781
Move the eight end-to-end tests (upgrade, backup-restore, three backup-schedule variants, vtorc-vtadmin, unmanaged-tablet, hpa) from the Buildkite public queue onto GitHub Actions using ubuntu-latest-8-cores runners. Each test runs as its own matrix job, in parallel, on a fresh VM per job. That lets us drop Buildkite's per-test concurrency gate, since the collisions it was guarding against (shared vitess-operator-pr image tag, fixed localhost port-forward ports, shared kind docker network on sibling-container agents) no longer exist when each job gets its own runner and Docker daemon.

utils.sh loses the BUILDKITE_JOB_ID coupling: the variable is renamed to CI_JOB_ID (set from github.run_id + run_attempt + matrix target), the sibling-container networking hack in setupKubectlAccessForCI is removed because kind now runs directly on the runner host, and the docker build progress flag keys off $CI instead. The pre-exit hook that reset perms and cleaned up the shared Docker state is no longer needed because GHA runners are ephemeral.

Branch protection on main and release-** will need the new e2e check names added and the Buildkite checks removed once the first run is green.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
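As a rough illustration of the utils.sh changes described in this commit, the job-id default and the `$CI`-keyed progress flag might look like the sketch below. The helper names `dockerProgressArgs` and `clusterName` are hypothetical and not taken from the repository; only `CI_JOB_ID`, `$CI`, the `--progress plain` flag, and the `kind-<CI_JOB_ID>` naming come from this PR.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not copied from test/endtoend/utils.sh.

# Default to 0 so local runs keep working; CI overrides it through the environment.
CI_JOB_ID="${CI_JOB_ID:-0}"

# Print extra `docker build` arguments for the current environment.
# GitHub Actions sets CI=true, so plain (non-interactive) progress output is used there.
dockerProgressArgs() {
  if [[ -n "${CI:-}" ]]; then
    echo "--progress plain"
  fi
}

# Derive the kind cluster name from the job id.
clusterName() {
  echo "kind-${CI_JOB_ID}"
}
```

On a laptop with neither variable set, `clusterName` yields `kind-0` and `dockerProgressArgs` prints nothing, so the default Docker progress output is kept.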
Pull request overview
Migrates the repository’s end-to-end CI coverage from Buildkite to GitHub Actions by introducing a new GHA matrix workflow and removing Buildkite-specific pipelines/hooks, while adjusting the e2e test harness to be CI-agnostic.
Changes:
- Added a new GitHub Actions workflow that runs 8 e2e tests as parallel matrix jobs.
- Updated e2e test utilities to use a generic `CI_JOB_ID` and removed Buildkite-only kubectl/kind networking setup.
- Removed Buildkite pipeline/hook files and updated release documentation to remove a stale Buildkite reference.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `test/endtoend/utils.sh` | Replaces Buildkite-specific job ID handling with `CI_JOB_ID`, removes Buildkite-only kubectl access workaround, and tweaks CI log output behavior. |
| `docs/release-process.md` | Rewords a Buildkite-specific note to refer generically to end-to-end tests. |
| `.github/workflows/e2e-test.yaml` | Introduces the new GHA matrix workflow to run all e2e test targets in parallel. |
| `.buildkite/pipeline.yml` | Deletes the old Buildkite pipeline configuration. |
| `.buildkite/hooks/pre-exit` | Deletes the old Buildkite pre-exit cleanup/permission workaround hook. |
          CI_JOB_ID: ${{ github.run_id }}-${{ github.run_attempt }}-${{ matrix.test.target }}
        steps:
          - name: Check out code
            uses: actions/checkout@v6
actions/checkout is referenced by a mutable tag (@v6). For supply-chain safety and reproducibility, pin this action to a specific commit SHA (similar to how actions/setup-go is pinned) so workflow runs can't change behavior unexpectedly if the tag is moved.
Suggested change:

    -  uses: actions/checkout@v6
    +  uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332 # pinned for supply-chain safety
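One way to check mechanically whether a `uses:` reference is pinned is to test that the part after `@` is a full 40-character commit SHA. The sketch below assumes a hypothetical helper, `isPinnedToSha`, that does not exist in this repository.

```shell
#!/usr/bin/env bash
# Sketch: isPinnedToSha is a hypothetical helper, not part of this repository.
# Succeeds when a reference such as owner/repo@ref ends in a full
# 40-character lowercase hex commit SHA, and fails for tags or branches.
isPinnedToSha() {
  local ref="${1##*@}"   # text after the last '@'
  [[ "$ref" =~ ^[0-9a-f]{40}$ ]]
}
```

With this helper, `isPinnedToSha "actions/checkout@v6"` fails, while the SHA form from the suggestion above succeeds.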
Two issues surfaced on the first GHA run:

1. The `Backup Schedule vtctldclient Method Test` failed kind cluster creation with `sethostname: invalid argument` because the container hostname `kind-<CI_JOB_ID>-control-plane` exceeded the Linux 64-char HOST_NAME_MAX. The `github.run_id`+`run_attempt`+target scheme I used was overkill anyway: each GHA job runs in its own ephemeral VM, so there's no collision risk from reusing a short cluster name. Shorten CI_JOB_ID to just the matrix target.
2. The `Unmanaged Tablet Test` failed at `apt install chromium-browser` when the runner's snap layer hit an apparmor error on mesa-2404. That test doesn't even need chromium; only `vtorc-vtadmin-test` uses headless chromium. Narrow the install to only that matrix entry and route through the well-maintained `browser-actions/setup-chrome` action, aliasing the chrome binary to `chromium-browser` so the existing `getChromiumBinaryName` discovery in utils.sh still works.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
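The hostname-length failure in item 1 can be checked ahead of time with a small guard. This is a sketch under assumptions: `controlPlaneHostnameFits` is a hypothetical helper, and the job ids in the usage lines are made-up values, not real matrix targets.

```shell
#!/usr/bin/env bash
# Sketch: controlPlaneHostnameFits is a hypothetical helper.
HOST_NAME_MAX=64   # Linux limit on hostname length, in characters

# Succeeds when kind-<job id>-control-plane fits within HOST_NAME_MAX.
controlPlaneHostnameFits() {
  local hostname="kind-${1}-control-plane"
  (( ${#hostname} <= HOST_NAME_MAX ))
}

# Usage with made-up ids: the long form yields a 66-character hostname
# and is rejected; the short form yields 52 characters and is accepted.
controlPlaneHostnameFits "12345678901-1-backup-schedule-vtctldclient-test" || echo "too long"
controlPlaneHostnameFits "backup-schedule-vtctldclient-test" && echo "ok"
```

Because the `kind-` prefix and `-control-plane` suffix already consume 19 characters, the job id itself has at most 45 characters to work with.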
Every e2e test running under the new GHA runner group failed with mysqld pods in CrashLoopBackOff. The pattern (mysqld exiting ~17ms after spawn, before producing any InnoDB output) and the runner image tag (ubuntu-24.04 / Noble) point to the AppArmor userns restriction that Ubuntu 23.10 introduced:

    kernel.apparmor_restrict_unprivileged_userns=1 (default)

This blocks processes inside nested containers from creating their own user namespaces, which mysqld depends on during startup. Buildkite's public queue runs on an older base, so it doesn't hit this.

Workaround is the standard one for kind-in-Ubuntu-24.04 CI: sysctl the restriction off at the start of the job. Also bump inotify limits, which kind wants for its file watchers once a cluster has more than a couple of pods running.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
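The workaround in this commit could be expressed as a short shell step along the lines below. This is a sketch, not the workflow's actual step: the function name `relaxRunnerLimits` and the `DRY_RUN` switch are illustrative, and the inotify values are assumptions, since the commit message does not state them.

```shell
#!/usr/bin/env bash
# Sketch: relaxRunnerLimits and DRY_RUN are illustrative names.
# The inotify values are assumptions; the commit message gives no numbers.
relaxRunnerLimits() {
  local settings=(
    "kernel.apparmor_restrict_unprivileged_userns=0"
    "fs.inotify.max_user_watches=524288"
    "fs.inotify.max_user_instances=512"
  )
  local s
  for s in "${settings[@]}"; do
    if [[ -n "${DRY_RUN:-}" ]]; then
      # Print the command instead of running it.
      echo "sudo sysctl -w $s"
    else
      sudo sysctl -w "$s"
    fi
  done
}
```

Running `DRY_RUN=1 relaxRunnerLimits` prints the three `sysctl` commands without touching the host, which is handy for checking the list before it runs on a runner.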
Previous apparmor_restrict_unprivileged_userns=0 alone didn't unblock mysqld inside the vttablet pods: 6 of 8 tests still CrashLoopBackOff with the same my.cnf symptom after kind comes up fine. Notably Unmanaged Tablet Test passed (the one test that does not run a vitess-operator-managed mysqld inside the cluster), which pins the remaining breakage on something specific to mysqld-in-nested-container.

Ubuntu 24.04 ships multiple layers of restriction on unprivileged user namespaces plus a broader AppArmor profile for Docker. Belt-and-suspenders: also clear apparmor_restrict_unprivileged_unconfined, enable unprivileged_userns_clone, raise user.max_user_namespaces, and tear AppArmor down entirely. The runner VM is ephemeral, so neutering AppArmor for the job has zero blast radius.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
Stopping apparmor.service and running aa-teardown in the previous commit unloads all AppArmor profiles from the kernel, including the docker-default profile that BuildKit applies to build containers. The operator image build then died with `runc run failed ... unable to apply apparmor profile`, vitess-operator-pr:latest never got built, kind load fell through, and every pod sat at ErrImageNeverPull.

Keep the userns sysctls (they were the actual target of the fix) and leave the AppArmor service and profiles alone so Docker keeps working.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
Previous attempt to neuter AppArmor broke the docker build because
BuildKit requires the docker-default profile to launch build
containers. Previous attempt to keep AppArmor on while just setting
userns sysctls left mysqld still crashing in nested pods (same
my.cnf symptom, 5 of 8 e2e tests red).
Split the work across two workflow steps:
1. Build operator image — runs with AppArmor still up so BuildKit
is happy and produces vitess-operator-pr:latest.
2. Disable AppArmor before kind — stops apparmor.service and runs
aa-teardown now that no more docker builds need to happen.
setupBuildContainerImage in utils.sh gains a docker-image-inspect
short-circuit so the in-test build call is a no-op when CI already
pre-built the image. Locally (no pre-built image) it still builds as
before.
Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
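The short-circuit described in this commit might look roughly like the following. Only the function name `setupBuildContainerImage`, the image tag, and the use of `docker image inspect` come from the commit message; the `IMAGE` variable and the plain `docker build` invocation are assumptions, and the real function may build the image differently.

```shell
#!/usr/bin/env bash
# Sketch: the function body and the IMAGE variable are assumptions.
IMAGE="${IMAGE:-vitess-operator-pr:latest}"

setupBuildContainerImage() {
  # `docker image inspect` exits non-zero when the image is absent locally.
  if docker image inspect "$IMAGE" >/dev/null 2>&1; then
    echo "Image $IMAGE already present, skipping build"
    return 0
  fi
  echo "Building $IMAGE"
  docker build --tag "$IMAGE" .
}
```

In CI the workflow's earlier build step leaves the image in the local daemon, so the function returns early; on a laptop with no pre-built image it falls through to the build.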
Summary
Moves the eight end-to-end tests (upgrade, backup-restore, three backup-schedule variants, vtorc-vtadmin, unmanaged-tablet, hpa) off the Buildkite `public` queue and onto GitHub Actions. Each test runs as a matrix job on `ubuntu-latest-8-cores` (8 vCPU / 32 GB), all in parallel, with no per-test concurrency gate.

Why

Buildkite's agent model put every job inside a `docker:latest` container talking to a shared host Docker daemon, which forced a lot of incidental complexity on us: a 15-line Alpine bootstrap in every step to install Go from a tarball, a sibling-container networking hack to make `kubectl` reach `kind`, a `pre-exit` hook that ran `sudo fix-buildkite-agent-builds-permissions` to undo chown damage from the operator container mutating the shared checkout, and a `concurrency: 1` gate per test to keep two jobs on the same agent from fighting over the hardcoded `vitess-operator-pr:latest` image tag and the fixed `15306`/`14001`/`15999` port-forward ports.

On GitHub-hosted runners every job gets a fresh VM with its own Docker daemon and its own localhost, so all four problems evaporate for free. The Alpine bootstrap becomes `actions/setup-go`, the sibling-container hack is deleted, the permission fix-up is unnecessary on ephemeral runners, and the concurrency gate is gone because the resources the tests share on Buildkite aren't shared on GHA.

Changes

- `.github/workflows/e2e-test.yaml`: matrix over the 8 test targets, `fail-fast: false`, 40-minute timeout per job, installs `mysql-client` + `chromium-browser` (the latter is only needed by `vtorc-vtadmin` but harmless to preinstall everywhere).
- `test/endtoend/utils.sh`:
  - `BUILDKITE_JOB_ID` → `CI_JOB_ID` (still defaults to `0` so `make upgrade-test` on a laptop is unchanged). The workflow sets it to `${{ github.run_id }}-${{ github.run_attempt }}-${{ matrix.test.target }}`.
  - Removed `setupKubectlAccessForCI` entirely: the Buildkite-specific `docker network connect kind` + `kind get kubeconfig --internal` dance is not needed when kind runs directly on the runner host.
  - The `docker build --progress plain` branch now keys off `$CI` (GHA sets this automatically).
- Deleted `.buildkite/pipeline.yml` and `.buildkite/hooks/pre-exit`.
- `docs/release-process.md`: updated one stale "buildkite" reference.

Follow-ups (outside this PR)

- Confirm `ubuntu-latest-8-cores` resolves: if the org hasn't provisioned larger runners, the first run will fail fast and we can swap the label to `ubuntu-latest` (4 vCPU / 16 GB) or whatever larger label is available.
- Update branch protection on `main` and `release-**` to require the new `e2e` matrix check names and drop the Buildkite required checks.