ian/adding_k8_backend#36

Draft
ianhodge wants to merge 2 commits into main from 03-19-ian_adding_k8_backend

ianhodge commented Mar 19, 2026

Summary

This PR adds a Kubernetes execution backend to oz-agent-worker and includes the deployment and hardening work needed to make that backend practical to run in a customer Kubernetes environment.

At a high level, the worker can now execute tasks by creating Kubernetes Jobs instead of running them via Docker or the direct backend. The PR also adds a namespace-scoped Helm chart, updates the docs for customer deployment, and tightens the production path with CI coverage, safer chart defaults, and runtime/container hardening.

What changed

Kubernetes backend

  • added a new Kubernetes backend implementation in internal/worker/kubernetes.go
  • added config parsing / merge support for backend.kubernetes.* in internal/config/config.go and main.go
  • executes each task as a Kubernetes Job / Pod in a target namespace
  • supports setup and teardown hooks for task execution
  • propagates configured environment variables into task Jobs
  • supports Kubernetes-specific execution settings, including:
    • namespace / kubeconfig selection
    • image pull secret / image pull policy
    • task job service account
    • node selectors / tolerations / resource requests and limits
    • extra labels / annotations
    • active deadline / termination grace period / workspace size limit
    • configurable unschedulable timeout
    • configurable startup preflight_image
  • added a startup dry-run Job preflight so policy / RBAC / admission issues surface before task execution begins
  • removed the need for a cluster-scoped namespace read at startup; validation stays namespaced
  • added stable hash-based labels and job naming to avoid selector collisions after sanitization
  • updated worker shutdown cleanup to use a fresh context for backend cleanup

Helm chart

  • added a namespace-scoped chart at charts/oz-agent-worker
  • chart deploys:
    • long-lived worker Deployment
    • ServiceAccount
    • namespaced Role / RoleBinding
    • worker config ConfigMap
    • optional API key Secret
  • chart is designed for in-cluster auth by default
  • chart distinguishes between:
    • the worker Deployment service account
    • the optional task Job service account configured via backend.kubernetes.service_account
  • chart now requires an explicit image.tag so installs pin a worker image rather than defaulting to latest
  • chart defaults the long-lived worker Deployment to a non-root security context with conservative resource requests
  • added kubernetesBackend.preflightImage so restricted clusters can override the startup preflight image

CI / packaging / docs

  • updated CI to:
    • use the Go version from go.mod
    • run go test ./...
    • lint and render the Helm chart in CI
  • fixed .gitignore so the top-level binary is ignored without accidentally ignoring charts/oz-agent-worker/**
  • hardened the runtime Dockerfile to run the worker as a non-root user on a pinned Alpine base image
  • expanded README.md with:
    • Kubernetes backend configuration and caveats
    • Helm installation flow
    • production notes for explicit image pinning
    • non-root worker defaults
    • preflight image override guidance
    • the distinction between worker and task-job service accounts

Operational notes

  • this backend does not require CRDs
  • this backend does not create cluster-scoped RBAC resources
  • the worker Deployment is intended to run long-term in-cluster
  • each task is executed as a Kubernetes Job
  • the worker Deployment defaults to non-root, but the task namespace must still allow creating Jobs with a root init container because sidecar materialization currently depends on that pattern
  • keep replicaCount=1 for a given worker.workerId; scale by creating multiple releases with distinct worker IDs instead of scaling a single release horizontally
  • if cluster policy restricts allowed registries/images, set preflight_image / kubernetesBackend.preflightImage to an allowlisted image

Validation

  • gofmt -w on modified Go files
  • go test ./...
  • go build ./...
  • helm lint charts/oz-agent-worker --set worker.workerId=my-worker --set image.tag=v1.2.3
  • helm template oz-agent-worker charts/oz-agent-worker --namespace agents --set worker.workerId=my-worker --set image.tag=v1.2.3
  • helm lint + helm template again with richer override values to exercise optional chart branches including secret creation, annotations, node selectors, tolerations, resources, setup/teardown hooks, environment entries, and kubernetesBackend.preflightImage
  • docker build to verify the hardened runtime image still builds successfully

Reviewer notes

The highest-risk / highest-value areas to review are:

  • internal/worker/kubernetes.go for job lifecycle, startup preflight behavior, and failure detection
  • main.go + internal/config/config.go for config merge / validation behavior
  • charts/oz-agent-worker/* for install ergonomics and namespaced deployment assumptions
  • README.md for customer-facing deployment guidance and caveats

Artifacts

Co-Authored-By: Oz <oz-agent@warp.dev>

ianhodge force-pushed the 03-18-ian_adding_customizable_idle_on_complete branch from 43b8290 to 8b0459f March 19, 2026 20:28
ianhodge force-pushed the 03-19-ian_adding_k8_backend branch 2 times, most recently from 5a67f57 to 1c8df93 March 19, 2026 20:37
ianhodge force-pushed the 03-18-ian_adding_customizable_idle_on_complete branch 2 times, most recently from 7cfae36 to 5b80342 March 19, 2026 20:41
ianhodge force-pushed the 03-19-ian_adding_k8_backend branch 2 times, most recently from 4a8394b to 1964392 March 19, 2026 20:45
Base automatically changed from 03-18-ian_adding_customizable_idle_on_complete to main March 19, 2026 20:51
ianhodge force-pushed the 03-19-ian_adding_k8_backend branch from 1964392 to cbff910 March 19, 2026 21:28
…robe, Helm fixes

- Replace 2s poll loop with Kubernetes Watch for Job and Pod status,
  with 30s safety-net fallback poll for watch disconnects
- Bound container log reads to 1 MiB (LimitBytes + io.LimitReader)
- Sort env vars for deterministic Pod specs
- Gate Events API calls behind pod failure signals (Pending/Failed only)
- Add exec liveness probe to Helm Deployment (kill -0 1)
- Fix ConfigMap and ServiceAccount template whitespace (use {{- trimming)
- Add watch verb to RBAC for jobs and pods
- Add tests for handleJobState, watch lifecycle, and pod watch events

Co-Authored-By: Oz <oz-agent@warp.dev>
- # Install ca-certificates for HTTPS connections
- RUN apk --no-cache add ca-certificates
+ # Install ca-certificates for HTTPS connections and create a non-root runtime user
+ RUN apk --no-cache add ca-certificates \
These Dockerfile changes are needed for the Helm chart.
