
Migrate e2e tests from Buildkite to GitHub Actions #781

Open
nickvanw wants to merge 9 commits into main from ci/migrate-to-github-actions

Conversation

@nickvanw
Contributor

Summary

Moves the eight end-to-end tests (upgrade, backup-restore, three backup-schedule variants, vtorc-vtadmin, unmanaged-tablet, hpa) off the Buildkite public queue and onto GitHub Actions. Each test runs as a matrix job on ubuntu-latest-8-cores (8 vCPU / 32 GB), all in parallel, with no per-test concurrency gate.

Why

Buildkite's agent model put every job inside a docker:latest container talking to a shared host Docker daemon, which forced a lot of incidental complexity on us: a 15-line Alpine bootstrap in every step to install Go from a tarball, a sibling-container networking hack to make kubectl reach kind, a pre-exit hook that ran sudo fix-buildkite-agent-builds-permissions to undo chown damage from the operator container mutating the shared checkout, and a concurrency: 1 gate per test to keep two jobs on the same agent from fighting over the hardcoded vitess-operator-pr:latest image tag and the fixed 15306/14001/15999 port-forward ports.

On GitHub-hosted runners every job gets a fresh VM with its own Docker daemon and its own localhost, so all four problems evaporate for free. The Alpine bootstrap becomes actions/setup-go, the sibling-container hack is deleted, the permission fix-up is unnecessary on ephemeral runners, and the concurrency gate is gone because the resources the tests share on Buildkite aren't shared on GHA.

Changes

  • New .github/workflows/e2e-test.yaml — matrix over the 8 test targets, fail-fast: false, 40-minute timeout per job, installs mysql-client + chromium-browser (the latter is only needed by vtorc-vtadmin but harmless to preinstall everywhere).
  • test/endtoend/utils.sh
    • Renamed BUILDKITE_JOB_ID to CI_JOB_ID (still defaults to 0 so make upgrade-test on a laptop is unchanged). The workflow sets it to ${{ github.run_id }}-${{ github.run_attempt }}-${{ matrix.test.target }}.
    • Deleted setupKubectlAccessForCI entirely — the Buildkite-specific docker network connect kind + kind get kubeconfig --internal dance is not needed when kind runs directly on the runner host.
    • The docker build --progress plain branch now keys off $CI (GHA sets this automatically).
  • Deleted .buildkite/pipeline.yml and .buildkite/hooks/pre-exit.
  • docs/release-process.md — updated one stale "buildkite" reference.
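The changes above imply a workflow of roughly this shape. This is a sketch, not the merged file: the step names, the setup-go version, and the make target names are assumptions, while the runner label, timeout, fail-fast setting, CI_JOB_ID expression, and checkout@v6 are taken from the PR itself.

```yaml
# Sketch of .github/workflows/e2e-test.yaml (illustrative, not the merged file)
name: e2e-test
on: [pull_request]
jobs:
  e2e:
    runs-on: ubuntu-latest-8-cores
    timeout-minutes: 40
    strategy:
      fail-fast: false
      matrix:
        test:
          - { name: Upgrade Test, target: upgrade-test }
          - { name: Backup Restore Test, target: backup-restore-test }
          # ...six more matrix entries, one per e2e target
    env:
      CI_JOB_ID: ${{ github.run_id }}-${{ github.run_attempt }}-${{ matrix.test.target }}
    steps:
      - name: Check out code
        uses: actions/checkout@v6
      - name: Set up Go
        uses: actions/setup-go@v5
        with:
          go-version-file: go.mod
      - name: Install dependencies
        run: sudo apt-get update && sudo apt-get install -y mysql-client chromium-browser
      - name: Run e2e test
        run: make ${{ matrix.test.target }}
```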

Follow-ups (outside this PR)

  1. Verify ubuntu-latest-8-cores resolves — if the org hasn't provisioned larger runners, the first run will fail fast and we can swap the label to ubuntu-latest (4 vCPU / 16 GB) or whatever larger label is available.
  2. After the first green run, update branch protection on main and release-** to require the new e2e matrix check names and drop the Buildkite required checks.
  3. Turn off the Buildkite pipeline in the Buildkite UI so it stops running on PRs.
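For follow-up 2, the currently required checks can be read before touching branch protection. A sketch using the GitHub CLI; the `{owner}/{repo}` placeholders are resolved by gh from the current repository, and this is the standard read-only branch-protection endpoint:

```shell
# List status checks currently required on main, so the Buildkite check
# names can be swapped for the new e2e matrix job names.
gh api repos/{owner}/{repo}/branches/main/protection/required_status_checks/contexts
```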

Move the eight end-to-end tests (upgrade, backup-restore, three
backup-schedule variants, vtorc-vtadmin, unmanaged-tablet, hpa) from the
Buildkite public queue onto GitHub Actions using ubuntu-latest-8-cores
runners. Each test runs as its own matrix job, in parallel, on a fresh
VM per job — which lets us drop Buildkite's per-test concurrency gate
since the collisions it was guarding against (shared vitess-operator-pr
image tag, fixed localhost port-forward ports, shared kind docker
network on sibling-container agents) no longer exist when each job gets
its own runner and Docker daemon.

utils.sh loses the BUILDKITE_JOB_ID coupling: the variable is renamed to
CI_JOB_ID (set from github.run_id + run_attempt + matrix target), the
sibling-container networking hack in setupKubectlAccessForCI is removed
because kind now runs directly on the runner host, and the docker build
progress flag keys off $CI instead. The pre-exit hook that reset
perms and cleaned up the shared Docker state is no longer needed — GHA
runners are ephemeral.

Branch protection on main and release-** will need the new e2e check
names added and the Buildkite checks removed once the first run is
green.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
Copilot AI review requested due to automatic review settings April 16, 2026 19:40
Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>

Copilot AI left a comment


Pull request overview

Migrates the repository’s end-to-end CI coverage from Buildkite to GitHub Actions by introducing a new GHA matrix workflow and removing Buildkite-specific pipelines/hooks, while adjusting the e2e test harness to be CI-agnostic.

Changes:

  • Added a new GitHub Actions workflow that runs 8 e2e tests as parallel matrix jobs.
  • Updated e2e test utilities to use a generic CI_JOB_ID and removed Buildkite-only kubectl/kind networking setup.
  • Removed Buildkite pipeline/hook files and updated release documentation to remove a stale Buildkite reference.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

  • test/endtoend/utils.sh — Replaces Buildkite-specific job ID handling with CI_JOB_ID, removes the Buildkite-only kubectl access workaround, and tweaks CI log output behavior.
  • docs/release-process.md — Rewords a Buildkite-specific note to refer generically to end-to-end tests.
  • .github/workflows/e2e-test.yaml — Introduces the new GHA matrix workflow to run all e2e test targets in parallel.
  • .buildkite/pipeline.yml — Deletes the old Buildkite pipeline configuration.
  • .buildkite/hooks/pre-exit — Deletes the old Buildkite pre-exit cleanup/permission workaround hook.


      CI_JOB_ID: ${{ github.run_id }}-${{ github.run_attempt }}-${{ matrix.test.target }}
    steps:
      - name: Check out code
        uses: actions/checkout@v6

Copilot AI Apr 16, 2026


actions/checkout is referenced by a mutable tag (@v6). For supply-chain safety and reproducibility, pin this action to a specific commit SHA (similar to how actions/setup-go is pinned) so workflow runs can't change behavior unexpectedly if the tag is moved.

Suggested change:

    - uses: actions/checkout@v6
    + uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332 # pinned for supply-chain safety

Comment thread .github/workflows/e2e-test.yaml Outdated
Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
Two issues surfaced on the first GHA run:

1. The `Backup Schedule vtctldclient Method Test` failed kind cluster
   creation with `sethostname: invalid argument` because the container
   hostname `kind-<CI_JOB_ID>-control-plane` exceeded the Linux 64-char
   HOST_NAME_MAX. The `github.run_id`+`run_attempt`+target scheme I
   used was overkill anyway — each GHA job runs in its own ephemeral VM,
   so there's no collision risk from reusing a short cluster name.
   Shorten CI_JOB_ID to just the matrix target.

2. The `Unmanaged Tablet Test` failed at `apt install chromium-browser`
   when the runner's snap layer hit an apparmor error on mesa-2404.
   That test doesn't even need chromium — only `vtorc-vtadmin-test`
   uses headless chromium. Narrow the install to only that matrix entry
   and route through the well-maintained `browser-actions/setup-chrome`
   action, aliasing the chrome binary to `chromium-browser` so the
   existing `getChromiumBinaryName` discovery in utils.sh still works.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
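The HOST_NAME_MAX overflow in issue 1 is easy to reproduce with arithmetic alone. A sketch using an illustrative 11-digit run_id, attempt 1, and a hypothetical long target name (the exact values on the failed run may differ):

```shell
# Hypothetical run_id/attempt/target; kind names the control-plane
# container "<cluster>-control-plane".
CI_JOB_ID="12345678901-1-backup-schedule-vtctldclient-test"
hostname="kind-${CI_JOB_ID}-control-plane"
echo "${#hostname}"   # 66 -- over the Linux 64-char HOST_NAME_MAX
```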
Every e2e test running under the new GHA runner group failed with
mysqld pods in CrashLoopBackOff. The pattern (mysqld exiting ~17ms
after spawn, before producing any InnoDB output) and the runner image
tag (ubuntu-24.04 / Noble) point to the AppArmor userns restriction
that Ubuntu 23.10 introduced:

  kernel.apparmor_restrict_unprivileged_userns=1 (default)

This blocks processes inside nested containers from creating their own
user namespaces, which mysqld depends on during startup. Buildkite's
public queue runs on an older base, so it doesn't hit this.

Workaround is the standard one for kind-in-Ubuntu-24.04 CI: sysctl the
restriction off at the start of the job. Also bump inotify limits,
which kind wants for its file watchers once a cluster has more than a
couple of pods running.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
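The workaround step described above would look roughly like this in the workflow. The inotify values are the ones commonly used for kind in CI, not necessarily the exact values in the PR:

```shell
# Allow unprivileged user namespaces inside nested containers (blocked by
# default since Ubuntu 23.10), and raise inotify limits for kind's watchers.
sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0
sudo sysctl -w fs.inotify.max_user_watches=524288
sudo sysctl -w fs.inotify.max_user_instances=512
```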
Previous apparmor_restrict_unprivileged_userns=0 alone didn't unblock
mysqld inside the vttablet pods — 6 of 8 tests still CrashLoopBackOff
with the same my.cnf symptom after kind comes up fine. Notably
Unmanaged Tablet Test passed (the one test that does not run a
vitess-operator-managed mysqld inside the cluster), which pins the
remaining breakage on something specific to mysqld-in-nested-container.

Ubuntu 24.04 ships multiple layers of restriction on unprivileged user
namespaces plus a broader AppArmor profile for Docker. Belt-and-
suspenders: also clear apparmor_restrict_unprivileged_unconfined,
enable unprivileged_userns_clone, raise user.max_user_namespaces, and
tear AppArmor down entirely. The runner VM is ephemeral, so neutering
AppArmor for the job has zero blast radius.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
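A sketch of the extra knobs this commit reaches for, on top of the earlier apparmor_restrict_unprivileged_userns sysctl (the max_user_namespaces value is illustrative; it just needs headroom):

```shell
sudo sysctl -w kernel.apparmor_restrict_unprivileged_unconfined=0
sudo sysctl -w kernel.unprivileged_userns_clone=1
sudo sysctl -w user.max_user_namespaces=28633   # illustrative value
# Ephemeral runner VM, so tearing AppArmor down has no blast radius.
sudo systemctl stop apparmor.service
sudo aa-teardown
```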
Stopping apparmor.service and running aa-teardown in the previous
commit unloads all AppArmor profiles from the kernel, including the
docker-default profile that BuildKit applies to build containers. The
operator image build then died with `runc run failed ... unable to
apply apparmor profile`, vitess-operator-pr:latest never got built,
kind load fell through, and every pod sat at ErrImageNeverPull.

Keep the userns sysctls (they were the actual target of the fix) and
leave the AppArmor service and profiles alone so Docker keeps working.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
Previous attempt to neuter AppArmor broke the docker build because
BuildKit requires the docker-default profile to launch build
containers. Previous attempt to keep AppArmor on while just setting
userns sysctls left mysqld still crashing in nested pods (same
my.cnf symptom, 5 of 8 e2e tests red).

Split the work across two workflow steps:

  1. Build operator image — runs with AppArmor still up so BuildKit
     is happy and produces vitess-operator-pr:latest.
  2. Disable AppArmor before kind — stops apparmor.service and runs
     aa-teardown now that no more docker builds need to happen.

setupBuildContainerImage in utils.sh gains a docker-image-inspect
short-circuit so the in-test build call is a no-op when CI already
pre-built the image. Locally (no pre-built image) it still builds as
before.

Signed-off-by: Nick Van Wiggeren <nick@planetscale.com>
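The docker-image-inspect short-circuit might look like this. A sketch, not the exact diff: the Dockerfile path and build flags are assumptions, but the skip-if-already-built logic is what the commit describes.

```shell
# Sketch of the short-circuit added to setupBuildContainerImage in utils.sh:
# skip the in-test build when CI has already produced the image, otherwise
# build as before (the local/laptop path is unchanged).
setupBuildContainerImage() {
  if docker image inspect vitess-operator-pr:latest >/dev/null 2>&1; then
    echo "vitess-operator-pr:latest already present, skipping docker build"
    return 0
  fi
  # Dockerfile path is an assumption for illustration.
  docker build -t vitess-operator-pr:latest .
}
```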
@mattlord mattlord requested review from mattlord and mhamza15 April 16, 2026 22:16
Collaborator

@mattlord mattlord left a comment


❤️

3 participants