
refactor/total → main #127

Merged
jacaudi merged 304 commits into main from refactor/total on May 22, 2026

Owner

@jacaudi jacaudi commented May 21, 2026

refactor/total → main

Initial integration of the operator's v2alpha1 API and complete
controller stack. This is the long-pending refactor branch's first
merge to main; main previously held only project scaffolding.

refactor/total is 291 commits ahead of main, touching 364 files (+44,579 / -38,938).

What's in this PR

CRD surface — v2alpha1

Five CRDs in the cloudflare.io/v2alpha1 API group:

| CRD | Purpose |
| --- | --- |
| CloudflareZone | Onboard a domain as a Cloudflare zone |
| CloudflareZoneConfig | Zone-level settings (SSL, security, performance, network, DNS, bot management) |
| CloudflareRuleset | Phase-scoped rulesets (custom WAF, transforms, rate-limit, redirect, …) |
| CloudflareDNSRecord | DNS records with Observe / Managed modes + TXT-verified Adopt |
| CloudflareTunnel | Cloudflare Tunnel + operator-managed cloudflared dataplane |

Plus annotation-driven attachment from Gateway API and Services:
cloudflare.io/tunnel: "true" on a Gateway / Service opts it in.

Architecture

  • Meta-operator topology: a top-level manager that spawns zone +
    tunnel sub-operators based on chart values.
  • Source controllers for Service / Gateway / HTTPRoute / TLSRoute
    watch their respective object types and emit CloudflareDNSRecord CRs
    pointing at the tunnel CNAME.
  • TXT companion registry marks every operator-managed DNS record
    with a paired cf-txt.<hostname> TXT entry (plaintext JSON, or
    optional AES-256-GCM via TxtRegistryKeySecretRef). Adoption is
    refused if the companion doesn't identify the adopting CR.
  • Feature F force-reconcile via cloudflare.io/reconcile-at
    annotation — restart-immune ack via Status.LastReconcileToken.
  • Cross-namespace zone references via
    cloudflare.io/zone-ref-namespace.
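
The force-reconcile ack can be pictured as a token comparison. The following is a minimal sketch, assuming that a non-empty cloudflare.io/reconcile-at value counts as pending until the same value appears in Status.LastReconcileToken. The helper name and the comparison rule are illustrative assumptions, not the operator's code.

```go
package main

import "fmt"

// reconcileAtAnnotation is the annotation named in the PR description.
const reconcileAtAnnotation = "cloudflare.io/reconcile-at"

// forceReconcilePending reports whether a force-reconcile request is still
// unacknowledged. Hypothetical helper: it assumes the ack is recorded by
// copying the annotation value into Status.LastReconcileToken.
func forceReconcilePending(annotations map[string]string, lastReconcileToken string) bool {
	token, ok := annotations[reconcileAtAnnotation]
	if !ok || token == "" {
		return false // nothing requested
	}
	return token != lastReconcileToken
}

func main() {
	ann := map[string]string{reconcileAtAnnotation: "2026-05-21T12:00:00Z"}
	fmt.Println(forceReconcilePending(ann, ""))                     // request not yet acked
	fmt.Println(forceReconcilePending(ann, "2026-05-21T12:00:00Z")) // already acked
}
```

Under this assumption the ack survives a restart because the token lives in the CR's status rather than in operator memory.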

Reconciliation model

  • Unified terminal status-write epilogue: reconcile.UpdateStatusIfChanged[T]
    used by all 5 reconcilers, with the universal in-memory-rollback
    invariant on the no-write branch.
  • reconcile.HaltWith for Ready=False short-circuits (sibling to
    HaltDependency / HaltCredentialsUnavailable).
  • LRU *cfgo.Client cache (32-entry, 30-min absolute TTL) keyed by
    sha256(token || accountID) to reuse HTTP/2 connection pools.
  • Hash-gated cloudflared dataplane patches via two new Status fields
    on CloudflareTunnel (ObservedDataplane{Deployment,Service}Hash).
  • Label-scoped Secret cache (app.kubernetes.io/part-of=cloudflare-operator)
    to avoid a cluster-wide LIST/WATCH on every Secret.
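
The client-cache key can be sketched as follows. This is a minimal illustration, assuming plain byte concatenation of the token and account ID as written above; the helper name and the hex encoding of the digest are assumptions for readability.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// clientCacheKey derives a cache key as sha256(token || accountID).
// Hypothetical helper: the concatenation mirrors the PR description, and
// the hex encoding of the digest is an assumption.
func clientCacheKey(token, accountID string) string {
	h := sha256.New()
	h.Write([]byte(token))
	h.Write([]byte(accountID))
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := clientCacheKey("token-a", "acct-1")
	b := clientCacheKey("token-b", "acct-1")
	fmt.Println(len(a), a == b)
}
```

Hashing keeps the raw token out of the cache's key space, while a changed token or account ID maps to a different entry.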

Documentation

  • Top-level README.md with project blurb, Quickstart, Disclaimer,
    Features, Documentation table, Acknowledgements.
  • docs/ carries 9 in-depth pages (8 hand-authored + 1 auto-generated):
    eight architecture-adjacent topics (credentials, gateway-api, tunnels,
    adopting-existing-records, annotations, reconciliation, troubleshooting,
    txt-registry) plus the auto-generated crd-reference.md.
  • chart/README.md is now a pure values reference, auto-generated.
  • examples/ carries 9 ready-to-apply manifests (5 CRD samples + 4
    annotation-driven attachment samples).

Test coverage

  • Unit tests: every package carries _test.go files; go test ./... -race -count=1 passes in 12/12 packages.
  • Envtest: 22 integration test files at test/envtest/ exercising
    the full reconcile loop against a real kube-apiserver. The chart's
    CI workflow runs them with setup-envtest + Kubernetes 1.30 assets.
  • golangci-lint clean (4 known issues — gofmt × 2 + gosec G115 +
    staticcheck QF1008 — documented as accepted).

Build + release artifacts

  • Operator image: ghcr.io/jacaudi/cloudflare-operator
  • Helm chart: oci://ghcr.io/jacaudi/charts/cloudflare-operator
  • Both built per-commit via the test-build.yml workflow on
    refactor/total. Latest test build:
    v0.0.0-alpha.refactor-total.e05390f (green).

Verification

Reviewer checklist before merging:

  • Triggered the full ci-cd.yml pipeline against refactor/total
    (smoke-deploy gate — explicit user gate per project memory).
  • Confirmed home-ops MR !5957 (the OCIRepository repoint to the
    latest refactor/total chart tag) has been reviewed.
  • Confirmed deploy-side state on the live cluster matches expectations.
  • Spot-checked the operator's behavior against docs/troubleshooting.md:
    Phase=Ready, the Status.LastReconcileToken ack roundtrip works, and
    ConnectorReady is True on attached tunnels.

Notes

  • This PR sets the initial v2alpha1 baseline on main. Subsequent
    PRs land incrementally; the refactor/total branch will continue
    as the integration branch until v1 cutover.
  • Once merged, the chart's Chart.yaml version line should bump to
    a release-track number (currently 0.1.0).
  • LICENSE file at repo root is still a pending follow-up — source
    headers carry Apache 2.0, but no top-level LICENSE file exists yet.
  • chart/README.md restructure from chronological-slice-log to pure
    values reference is captured by the new make generate tooling
    (helm-docs + crd-ref-docs). Hand-edits to chart/README.md
    or docs/crd-reference.md are overwritten on the next run.

Out of scope

  • Stable tagged release. After this lands on main, a separate
    release-prep PR will bump version + chart appVersion to 0.1.0
    proper and publish a non-alpha tag.
  • AES TXT-registry codec operator-wiring (the aesCodec is
    code-present but the operator hardcodes nil keyref today). Tracked
    in the local backlog as a separate Design→Plan→Execute track.

jacaudi and others added 30 commits May 15, 2026 10:10
Strict equality with annotation value 'true' — every other value defaults
to direct-create-equivalent (no auto-GC). Pure, side-effect-free. (Design §3.2.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two mutually-exclusive predicates serving different reconcile-flow
positions:
- needsOwnerTransfer (early): sources exist, owner gone -> promote one.
- isOrphaned (late): no sources, no owner, auto-created -> candidate
  for self-delete subject to two-tick grace.

(Design §3.2.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
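
The two predicates described in this commit can be sketched as pure functions over a flattened state. This is an illustration only: the tunnelState struct and its field names are hypothetical, and a later commit in this PR additionally gates needsOwnerTransfer on the auto-created marker.

```go
package main

import "fmt"

// tunnelState is a hypothetical, flattened view of what the predicates
// inspect; the real reconciler reads these from the CR and its status.
type tunnelState struct {
	AutoCreated     bool // cloudflare.io/auto-created is exactly "true"
	AttachedSources int  // number of sources currently attached
	Owners          int  // number of ownerReferences on the CR
}

// needsOwnerTransfer (early): sources exist but the owner is gone.
func needsOwnerTransfer(s tunnelState) bool {
	return s.AttachedSources > 0 && s.Owners == 0
}

// isOrphaned (late): no sources, no owner, and the CR was auto-created.
func isOrphaned(s tunnelState) bool {
	return s.AutoCreated && s.AttachedSources == 0 && s.Owners == 0
}

func main() {
	// Auto-created, one source, no owner: transfer is needed, not orphaned.
	s := tunnelState{AutoCreated: true, AttachedSources: 1, Owners: 0}
	fmt.Println(needsOwnerTransfer(s), isOrphaned(s)) // true false
}
```

The predicates stay mutually exclusive because one requires AttachedSources > 0 and the other requires AttachedSources == 0.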
The 3 original cases never stressed the invariant: under the regression
of dropping isOrphaned's AttachedSources==0 clause, none flipped
nt && io to true. Add the discriminating state (auto-created + sources +
no owners) where correct code gives nt=true/io=false but the regression
gives both true. Verified the new case fails under the simulated
regression and passes with the real implementation. (Per-task review finding.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lex-smallest live remaining attacher promoted to controller-owner via an
optimistic-lock client.MergeFrom Patch. Filters DeletionTimestamp'd
candidates; direct typed Get catches NotFound between snapshot and Patch;
bounded at 5 attempts; Conflict -> (false,nil) so the next reconcile
retries; emits ReasonOwnerTransferred on success.

Adaptation vs plan-literal: reuses reconcile.SetControllerOwner
(controllerutil.SetControllerReference) for the owner-ref instead of a
hand-built apiutil.GVKForObject OwnerReference -- project code-reuse
mandate, identical Controller/BlockOwnerDeletion/GVK shape, drops the
apiutil dependency. needsOwnerTransfer guarantees zero pre-existing
owner refs so the single-owner replace cannot conflict.

(Design §4.2.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
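
The candidate-selection step can be sketched on its own, without the Patch. This is a minimal illustration assuming candidates are ordered by name; the attacher type and the sort key are hypothetical, since the commit message does not say which field defines the lexicographic order.

```go
package main

import (
	"fmt"
	"sort"
)

// attacher is a hypothetical stand-in for an attached source object.
type attacher struct {
	Name     string
	Deleting bool // true when DeletionTimestamp is set
}

// pickNewOwner returns the lexicographically smallest live attacher name.
// The boolean is false when no live candidate remains.
func pickNewOwner(candidates []attacher) (string, bool) {
	live := make([]string, 0, len(candidates))
	for _, c := range candidates {
		if !c.Deleting {
			live = append(live, c.Name)
		}
	}
	if len(live) == 0 {
		return "", false
	}
	sort.Strings(live)
	return live[0], true
}

func main() {
	name, ok := pickNewOwner([]attacher{
		{Name: "svc-c"},
		{Name: "svc-a", Deleting: true},
		{Name: "svc-b"},
	})
	fmt.Println(name, ok) // svc-b true
}
```

A deterministic choice means concurrent reconciles that see the same candidate set converge on the same new owner.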
…cked

The plan's reference code used plain client.MergeFrom(tn), which adds NO
resourceVersion precondition — the apiserver never returns 409 Conflict,
so the IsConflict->(false,nil) retry branch was dead in production and
two concurrent owner-transfers could silently last-write-wins clobber
each other. Design §4.2 (and the plan's own prose) require optimistic
locking. Switch to MergeFromWithOptions(tn, MergeFromWithOptimisticLock{})
so the patch carries tn.ResourceVersion and a stale Patch is rejected
with Conflict, which the existing branch turns into a clean next-reconcile
retry. Plan-internal-contradiction drift resolved in favor of the design
(per-task review finding; same drift-handling pattern as P3).

Also documents that cross-namespace attachers surface as (false,err) via
SetControllerReference and are out of scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
needsOwnerTransfer -> TransferOwnershipIfNeeded runs between the
credentials-halt branch and the status snapshot (design §4.1 step 5) so
all subsequent reconcile work sees a valid OwnerReference. Successful
transfer requeues immediately for a fresh view; no-live-candidates falls
through to orphan-state management (Task 9). Finalizer is already ensured
before this point.

(Design §4.1.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-created tunnel CRs with no owner and no attached sources stamp
Status.LastOrphanedAt on first observation (persisted via the existing
change-detection gate; LastOrphanedAt is unmasked) and self-delete on a
later reconcile once the 60s two-tick grace window elapses (Warning
TerminalNoSources event + terminal Ready=False, then Delete). Source
reattach / owner promotion within the window clears the stamp.
Direct-create CRs are skipped entirely (isAutoCreated gate). A
pendingRequeueAfter local overrides the default interval during the
grace window (explicit local, no defer).

(Design §4.1 step 10, §3.1.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hook

Adds CloudflareTunnelReconciler.PendingDeletionGrace (optional; zero =>
60s pendingDeletionGrace const) so envtests run with a short two-tick
window instead of real 60s sleeps — brief-mandated adaptation of the
plan's pendingDeletionGraceForTest mirror; also a legitimate operator
knob. Production behaviour unchanged when unset (Task-9 unit tests pass
as-is). setupServiceEnv sets a 3s grace (verified harmless to existing
callers: none deletes the last source of an auto-created tunnel then
asserts persistence).

TestEnvtest_CascadeGC_OwnerTransfer: two annotated Services share a
tunnel-name; deleting the owner Service transfers ownership to the
remaining Service; the auto-created tunnel persists.

(Design §8.2.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single annotated Service -> auto-created tunnel; Service deleted; after
the (short, harness-overridden) two-tick grace the tunnel CR stamps
LastOrphanedAt then self-deletes (Warning TerminalNoSources, drain via
the finalizer path, CR gone). Reuses Task-10's GC-emulation (envtest has
no garbage collector, so the dead owner's dangling ownerReference is
stripped to let the production isOrphaned path fire — the stamp/grace/
self-delete are all production code).

(Design §8.2.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The GC-emulation strip returned as soon as ownerRefs hit zero. If the
tunnel reconciler then fired before the ServiceSourceReconciler cleared
the deleted Service from the shared cache, observeAttachedSources still
saw the source, isOrphaned was false, and the reconciler requeued at
defaultTunnelInterval (30m) — so the test timed out in strict isolation
(it only passed in the full suite because concurrent managers warmed the
scheduler so the service reconciler won the race).

Gate the strip's completion on BOTH ownerRefs==0 AND observed
AttachedSources==0. Bump a test-only label (cloudflare.io/test-strip-tick)
on every iteration to force a real k8s write (real rv change → real watch
event) on each tick: k8s 1.30 does not bump resourceVersion for no-op
writes, so a plain Update with already-empty ownerRefs would produce no
watch event, leaving the tunnel reconciler stuck on its 30m requeue. The
label value changes every tick; the tunnel reconciler ignores labels
entirely. The repeated real writes keep re-triggering the tunnel
reconciler until it observes the drained cache and the isOrphaned path
can fire. Test-only; production unchanged; still mutation-proven
non-vacuous (neutering self-delete fails the test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two scenarios + a shared GC-emulation helper:
- TwoTickRaceProtection: source deleted, LastOrphanedAt stamped, a new
  annotated Service reattaches within the grace window -> tunnel is NOT
  self-deleted, LastOrphanedAt cleared, ownership transfers to the new
  source.
- DirectCreateNeverGCd: a user-authored CloudflareTunnel (no
  cloudflare.io/auto-created) adopts then loses its only attaching
  Service -> never stamps LastOrphanedAt, never self-deletes (isAutoCreated
  gate skips the orphan path). Adopt path does not backfill the marker.
- Extracted gcEmulateStripDeadOwner shared by the last-source +
  two-tick-race tests (code-reuse; envtest has no GC). Renamed the
  strip-tick label out of the reserved cloudflare.io/* prefix to
  envtest.local/strip-tick (test-only mechanism, prior-review nit).

(Design §3.1, §8.2.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The require.Never passed because the tunnel reconciler was quiescent
(no Service watch; source-deletion only clears the cache, no tunnel-CR
write -> no watch event -> next requeue is the 30m defaultTunnelInterval),
NOT because the isAutoCreated gate blocked self-delete: mutating
isAutoCreated->true left the test green.

Two interacting issues kept the test vacuous:
1. No retriggering: without forcing reconcile events during the Never
   window, the orphan block was never re-evaluated.
2. Dangling ownerRef: needsOwnerTransfer promotes the first attaching
   source (the Service) to controller-owner on a subsequent reconcile;
   envtest has no garbage collector, so after the Service is deleted its
   ownerRef dangles — len(OwnerReferences)>0 permanently blocked
   isOrphaned from evaluating, regardless of retriggering.

Fix: each require.Never tick now strips any dangling ownerReference whose
owner is the deleted Service (envtest GC emulation, same pattern as T10
and T11) AND bumps a benign envtest.local/retrigger-tick label so the
production orphan block is actively re-evaluated while the source is gone.
With the isAutoCreated gate intact the unannotated direct-create tunnel is
skipped every tick (no stamp, no self-delete). With the gate mutated to
return true the tunnel self-deletes within the 12s window and
require.Never trips — the isAutoCreated->true mutation now FAILS the test
(verified: require.Never tripped at 5.52s). Duration extended 8s->12s
to comfortably cover stamp + 3s grace + drain. No-backfill assertions
preserved; docstring corrected to describe the retrigger+strip mechanism
and the harness-fidelity note about needsOwnerTransfer. Test-only;
production unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
needsOwnerTransfer was ungated, so a Service annotation-attaching to a
user-authored direct-create CloudflareTunnel triggered owner-transfer,
stamping the Service as Controller+BlockOwnerDeletion owner of the user's
CR — and deleting that Service let Kubernetes GC delete the user's tunnel,
violating design §7 ("direct-create tunnel ... CR persists") and the
"user-authored CRs are never auto-GC'd" invariant (our isAutoCreated
self-delete gate does not stop k8s GC of an operator-attached
controller-ref). Gate needsOwnerTransfer on isAutoCreated, symmetric with
isOrphaned: the entire cascade-GC machinery (owner-transfer rebalancing +
self-delete) is now inert for direct-create CRs; the operator never takes
controller-ownership of a user's tunnel. TransferOwnershipIfNeeded (the
mechanism) stays ungated — needsOwnerTransfer is the single policy gate.
Surfaced by the Task-12 envtest deep review (mutation-proven). New unit +
envtest pin the invariant; rippled Task-6 true-case fixture annotated.

(Design §7, §3.2.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Batched Minor findings from the final comprehensive review, zero
behavior change:
- drop bare plan-task references from a tunnel_controller.go comment and
  from cascade-GC envtest godoc (Pattern-13: no plan artifacts in
  shipped code/comments)
- correct stale post-§7-gate harness comments (direct-create CRs never
  acquire a controller-ownerRef in the unmutated run; the per-tick strip
  is mutation-only scaffolding)
- gcEmulateStripDeadOwner: ctx-first parameter order (Go convention)
- DirectCreateNeverGCd: reuse bumpRetriggerTick instead of an inline copy

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These six files accumulated gofmt drift across the prior squash-merges
onto refactor/total (trailing blank lines, struct/comment column
re-alignment, one import-group reorder). They predate and are unrelated
to any in-flight feature branch, but `make lint` (gofmt -l . | (! read))
fails repo-wide on their account — blocking CI for every branch cut from
refactor/total. Pure `gofmt -w`; zero behavior change (build, vet, full
unit suite green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chore: gofmt the 6 pre-existing unformatted files on refactor/total
P4: tunnel cascade GC + owner transfer (design §12.6)
RegistryPayload (v1 schema), AffixName helper (apex + subdomain +
multi-dot collapse), ErrUnrecognizedCodec sentinel. Pure (no client
deps); Codec interface + impls land in tasks 2-4. (Design §3.2.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…xName

Adds TestRegistryPayload_JSONTags (verifies compact tags + h omitempty —
the wire-format contract was previously untested), collapses the
redundant len==2 branch in AffixName to the uniform form, and locks
ErrUnrecognizedCodec's message. Addresses P5-T1 review findings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bare-JSON encoder/decoder + Codec interface; v1alpha1 default codec.
Rejects unknown schema versions and malformed JSON via
ErrUnrecognizedCodec wrapping. (Design §4.1.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ecode

Both plaintext Decode-failure tests now assert errors.Is(err,
ErrUnrecognizedCodec) — the sentinel Task 9/12 branch on for
AdoptRefusedNoTXT. Adds a compile-time Codec satisfaction guard for
plaintextCodec. Addresses P5-T2 review findings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Encrypted codec, v1:<nonce>:<ciphertext> wire format, fresh GCM nonce
per Encode. All decode failures (wrong key, tampering, malformed
envelope, bad version) wrap ErrUnrecognizedCodec so callers collapse
to a single AdoptRefusedNoTXT condition. (Design §4.2.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pattern-13 cleanup: the Codec interface doc and aesCodec comment named
autoDetectingCodec before it exists. Trimmed to describe only the
currently-implemented codecs; the read-side dispatcher documents itself
when it lands. Addresses P5-T3 review findings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sniffs the v1: prefix to dispatch plaintext vs AES decode. Encode
panics (read-only wrapper). v1: input with no key configured surfaces
ErrUnrecognizedCodec so the reconciler refuses adoption. Restores the
Codec interface doc to list all three impls now that the type exists.
(Design §4.3.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spec.Mode: Managed (default) | Observe. omitempty is safe (string with
non-empty kubebuilder default). Designed for reuse on other CRDs that
grow read-only semantics; shared reconcile.ShouldMutate helper lands in
task 7. CRD YAML regenerated to internal/bootstrap/crds (no make
manifests target — make generate does deepcopy+crd). (Design §3.3.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The plan placed +kubebuilder:validation:Enum on both the RecordMode
type and the Mode field, producing a redundant allOf with two
identical enum blocks in the generated CRD. The named type carries the
enum; the field inherits it. Keeping the marker only on the type
yields a single clean enum. Plan-vs-idiomatic adaptation. Addresses
P5-T5 review finding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds TxtRecordID, TxtAffix, ObservedTXT (typed ObservedTXTPayload
mirroring the decoded codec form + RawContent fallback). All additive
omitempty; v1alpha1-safe. Deepcopy + CRD YAML regenerated to
internal/bootstrap/crds. (Design §3.3.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-line predicate: false iff mode == "Observe". String-typed so
future CRDs with different enum names reuse it without per-CRD wiring.
(Design §3.4.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
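
The predicate as described fits in one expression. A minimal sketch follows; the lowercase name marks it as a stand-in for reconcile.ShouldMutate, whose exact signature is not shown in this PR.

```go
package main

import "fmt"

// shouldMutate is false only for the literal mode "Observe". It takes a
// plain string so any CRD's mode enum can be passed after a conversion.
func shouldMutate(mode string) bool {
	return mode != "Observe"
}

func main() {
	fmt.Println(shouldMutate("Managed"), shouldMutate("Observe")) // true false
}
```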
AdoptRefusedNoTXT, AdoptRefusedForeign, TxtRegistryKeyUnavailable,
Observing. Kept 1:1 across the const block, ZoneReasons(), and the
TestZoneReasons_Registered enumerating test in the same commit (per
the project's recurring conventions-registry-sync finding). Consumed
by the DNSRecord reconciler in P5 tasks 10-14. (Design §3.3.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jacaudi and others added 26 commits May 21, 2026 11:47
Self-contained sample: applying examples/annotated_httproute.yaml now
materializes both the tunnel-enabled Gateway and the attached HTTPRoute
in one go. The standalone annotated_gateway.yaml remains as a
Gateway-only reference; annotated_httproute_override.yaml depends on
this Gateway (cross-referenced in its header).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
config/crd/bases/ is purely a kubebuilder/controller-gen output —
make generate recreates it from api/v2alpha1/ Go types and copies the
result into chart/templates/crd-*.yaml (the published source). Keeping
the intermediate config/crd/bases/ in VCS leaves the door open for
drift between Go types and committed CRDs, so untrack it and add
config/ to .gitignore. Verified make generate still works post-untrack
and the chart-template CRDs still get refreshed.

While here, fix the README's 'Local development' section. The prior
'make install' / 'make deploy' commands were kubebuilder boilerplate —
those targets never existed in this Makefile. Replaced with the
actual flow:

  make generate                          # regenerate CRD bundles
  kubectl apply -f config/crd/bases/     # apply locally
  go run ./cmd/manager                   # run the operator

Plus a chart-from-local-checkout install option for users who want
to test chart changes without publishing to OCI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit untracked config/crd/bases/ but make generate still
recreated the directory locally on every run. Now make generate writes
the intermediate CRD bundles straight to bin/crd-staging/ (already
gitignored via bin/), so the config/ directory is never produced.

Changes:
- Makefile: output:crd:artifacts:config=bin/crd-staging (was: config/crd/bases)
- Makefile: chart-template copy loop reads from bin/crd-staging/ (was: config/crd/bases/)
- .gitignore: drop the now-unneeded config/ entry
- README: local-dev kubectl apply path updated to bin/crd-staging/

Verified: make generate → bin/crd-staging/ populated, chart/templates/crd-*.yaml refreshed; no config/ anywhere in the tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the (API token, account ID) credential pair end-to-end:

  - The credential precedence: per-CR spec.cloudflare → operator-level
    default → ErrAccountIDUnset.
  - The token Secret shape (SecretReference: name + namespace + key)
    with sensible defaults (namespace = CR's namespace; key = 'token').
  - The label-scope requirement: every Secret the operator should be
    able to read must carry app.kubernetes.io/part-of=cloudflare-operator.
    This is the #1 first-time-setup failure mode; without the label,
    Get returns NotFound and resolve fails with ErrSecretNotFound.
  - Inline accountID vs accountIDSecretRef (the XValidation rule) +
    the 'key defaults to token, not accountID' footgun.
  - Credential rotation: the cfgo.Client cache's 30-minute absolute TTL
    and how to force immediate adoption via cloudflare.io/reconcile-at.
  - Common errors with concrete fixes: ErrSecretNotFound,
    ErrSecretKeyMissing, ErrAccountIDUnset, 401/403 from Cloudflare.
  - Pointer to multiple-accounts.md (future) for multi-tenant patterns.

Linked from README's Documentation table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The canonical reference for every cloudflare.io/* annotation the
operator reads or writes. Pairs with examples/annotated_*.yaml — the
samples demonstrate; this doc explains.

Groups annotations by purpose:
  - Tunnel attachment (cloudflare.io/tunnel, /tunnel-name,
    /gateway-service, /gateway-apex, /hostnames)
  - Inherited by emitted DNSRecord (Route > Gateway > operator default
    precedence chain; covers zone-ref, zone-ref-namespace, proxied,
    ttl, no-tls-verify, origin-server-name, scheme, adopt)
  - DNS-only (Service path: dns-record, dns-target, port)
  - Force-reconcile (cloudflare.io/reconcile-at, restart-immune)
  - Operator-managed (auto-created; don't set manually)

Also documents the ParseTruthy vocabulary (true/yes/enable/enabled +
their negatives, case-insensitive, whitespace-trimmed) and the
'precedence: Route > Gateway > operator default' rule.

Cross-linked from README's Documentation table. References reconciliation.md
+ credentials.md + (future) gateway-api.md + adopting-existing-records.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
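
The truthy vocabulary can be sketched as a normalising switch. This is a stand-in for the operator's ParseTruthy, not its implementation: the negative spellings (false/no/disable/disabled) and the (value, ok) return shape are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTruthy normalises an annotation value: case-insensitive and
// whitespace-trimmed. ok is false for values outside the vocabulary.
// The negative spellings are assumed to mirror the positive ones.
func parseTruthy(s string) (value bool, ok bool) {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "true", "yes", "enable", "enabled":
		return true, true
	case "false", "no", "disable", "disabled":
		return false, true
	}
	return false, false
}

func main() {
	v, ok := parseTruthy("  Enabled ")
	fmt.Println(v, ok) // true true
}
```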
The headline integration story. Covers:

  - What the operator materializes when a Gateway is opted into tunneling:
    auto-created CloudflareTunnel CR, cloudflared Deployment, metrics
    Service, connector-token Secret, N emitted CloudflareDNSRecord CRs.
  - The 'opt-in lives ONLY on Gateway, not Route' invariant (Slice 1
    correctness fix); Routes attach via parentRefs and inherit the
    tunnel decision.
  - Required annotations on the Gateway: cloudflare.io/tunnel,
    cloudflare.io/gateway-service (no label fallback), cloudflare.io/zone-ref.
  - Optional inheritance annotations and the Route > Gateway > default
    precedence chain.
  - HTTPRoute vs TLSRoute attachment shapes; the scheme-from-listener
    inheritance.
  - Per-Route override patterns (grey-cloud one hostname, self-signed
    backend cert, apex bypass).
  - The full generated-object inventory (Kubernetes side + Cloudflare side).
  - Cascade-GC: auto-created vs direct-create lifecycle.
  - Common gotchas: GatewayServiceUnspecified, wildcard-only Gateway
    requiring gateway-apex, Route attached but no DNS record (3 causes),
    records appear but tunnel doesn't serve (3 causes).

Cross-linked from README's Documentation table and from
annotations.md / reconciliation.md / credentials.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Explains the TXT-companion-verified adopt flow: why it exists, what
the three outcomes are, the recommended migration flow, and why the
operator never silently backfills.

Structure:
  - Why adopt exists (the take-over-without-outage use case).
  - Why the TXT companion exists (operator-wars + accidental-clobber
    failure modes that motivate ownership marking).
  - The companion's shape: cf-txt.<hostname> sibling DNS name carrying
    a JSON ownership payload (namespace + name + UID).
  - The three adopt outcomes:
      Adopted — companion identifies THIS CR.
      AdoptRefusedNoTXT — primary exists, companion doesn't.
      AdoptRefusedForeign — companion identifies a different CR.
  - The recommended flow: Observe (reconnaissance) → optional manual
    companion injection → Adopt:true + Mode:Managed.
  - The pragmatic 'just delete and recreate' migration path for
    pre-feature records, with the brief-outage trade-off documented.
  - Why no silent backfill (distinguishes pre-feature records from
    other-operator records; refuses to clobber).
  - Status / condition reference for the adopt-refusal reasons.
  - When NOT to use adopt (greenfield, observe-first, post-delete
    recreate via the S1 self-heal path).

Cross-linked to credentials.md (resolve failures), reconciliation.md
(force-reconcile after fix), annotations.md (cloudflare.io/adopt for
Gateway-API path), and the future txt-registry.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The operational pair to reconciliation.md. Where reconciliation.md
answers 'when does the next reconcile fire?', this answers 'the
reconcile fired and something is wrong, now what?'

Structured as a diagnostic sequence of seven steps (Step 0 through Step 6):

  Step 0 — confirm operator pods are alive
  Step 1 — read Status (Phase + Conditions[Ready] + LastSyncedAt /
           LastReconcileToken)
  Step 2 — match the Ready=False reason to a fix:
    - Credential reasons: CredentialsUnavailable, CredentialsInsufficient
    - Adoption reasons: AdoptRefusedNoTXT, AdoptRefusedForeign,
      AdoptedExistingRecord
    - Tunnel reasons: GatewayServiceUnspecified, DuplicateHostname,
      ControllerOffline
    - Zone/DNSRecord reasons: DependencyMissing, DriftDetected,
      OwnershipCompanionFailed, Ignored
    - ZoneConfig/Ruleset: SettingsApplied, SettingsApplyFailed,
      PlanTierInsufficient
  Step 3 — read operator logs (meta + zone + tunnel sub-operators)
  Step 4 — check Events
  Step 5 — did the reconcile actually run? (annotation-ack trick)
  Step 6 — read the Cloudflare side (dashboard / dig)

Plus a quick-reference cheat sheet of the canonical kubectl
invocations and a 'when to involve a maintainer' template.

Cross-linked from README's Documentation table. References every
other docs/* page; designed to be the entry point when a user has a
problem.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the standalone CloudflareTunnel CR end-to-end — the
CRD-first counterpart to gateway-api.md's annotation-driven view:

  - Two creation paths: direct-create (user lifecycle) vs auto-created
    (operator lifecycle via cloudflare.io/tunnel: true on a source).
  - The 52-char naming budget and why it exists (cloudflared-<name>
    must fit DNS-1123 63-char label limit; rename is unsupported per
    the CEL immutability rule).
  - What the operator materializes: tunnel + remote config on the
    Cloudflare side; Deployment + metrics Service + connector-token
    Secret on the Kubernetes side (all owner-referenced to the CR).
  - The connector spec surface: replicas (1-25, no HPA), protocol
    (auto/http2/quic), logLevel, gracePeriodSeconds, resources, and
    scheduling pass-throughs; plus the cloudflared image-override
    precedence chain (per-CR > chart-default > compile-time pin).
  - Routing defaults: fallback (httpStatus vs url XOR) and the
    originRequest passthrough (with a heads-up on the limited
    TunnelOriginRequest CRD shape — only noTLSVerify +
    originServerName today).
  - The Status surface field-by-field, including the Slice 2 H
    ObservedDataplane{Deployment,Service}Hash optimization fields
    and the G ObservedIngress used for skip-when-snap-matches.
  - Cascade-GC rules: auto-created tunnels self-delete when last
    source detaches; direct-create tunnels never auto-GC.
  - Common gotchas: 'Ready=True but no traffic flows',
    ConnectorReady=False, naming conflicts, OOM tuning.

Cross-linked from README's Documentation table. References every
other docs/* page and the example file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the four forward-references that existed from
adopting-existing-records.md and credentials.md. Covers the TXT
ownership-marker mechanism end-to-end:

  - Companion naming: cf-txt.<hostname> as a fresh leftmost DNS
    label (post-S1 affix scheme). Wildcards rewrite to _wildcard
    sentinel. Legacy cf-txt-<host> form is read-recognised but
    never written; pruned opportunistically.
  - RegistryPayload JSON shape (v=1, k=kind, ns=namespace,
    n=name, h=optional content hash). Identifier is
    (kind, namespace, name) — NOT UID — so deleting and recreating
    a CR with the same NS/name lets the new CR adopt cleanly.
  - AES-256-GCM wire format: 'v1:<base64-nonce>:<base64-ciphertext>'.
    Random 12-byte GCM nonce per encode. Enabled via
    CloudflareCredentialRef.TxtRegistryKeySecretRef pointing at a
    32-byte AES key Secret.
  - The encrypt-vs-plaintext trade-offs: AES hides owner identity
    from public DNS readers; doesn't help against API-token-holders
    (who see primaries directly) or Kubernetes-API-readers (who see
    CRs directly).
  - Switching between modes: free, because the read side autodetects
    the format. There is no key-rotation primitive today; work around
    it by cycling to plaintext and then to a fresh AES key.
  - Inspecting companions: dig from outside, Status.txtRecordID
    from inside, operator log grep.
  - The engineering migration procedure for adopting pre-companion
    records without an outage (vs. the simpler delete-and-recreate
    flow in adopting-existing-records.md).
  - Common gotchas: OwnershipCompanionFailed root causes, missing
    companions, AES decrypt failures after key rotation.
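
The companion naming and payload shape from the list above can be sketched
in Go as follows. The JSON keys (v, k, ns, n, h) and the cf-txt prefix come
from the text; the function names, the Go field names, and the exact
position of the _wildcard label are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// RegistryPayload mirrors the compact JSON keys listed above.
// The Go field names are hypothetical.
type RegistryPayload struct {
	Version   int    `json:"v"`
	Kind      string `json:"k"`
	Namespace string `json:"ns"`
	Name      string `json:"n"`
	Hash      string `json:"h,omitempty"` // optional content hash
}

// encode renders the plaintext JSON form of the payload.
func (p RegistryPayload) encode() (string, error) {
	b, err := json.Marshal(p)
	return string(b), err
}

// ownedBy matches on (kind, namespace, name), deliberately not on UID,
// so a recreated CR with the same namespace and name still matches.
func (p RegistryPayload) ownedBy(kind, namespace, name string) bool {
	return p.Kind == kind && p.Namespace == namespace && p.Name == name
}

// companionName prepends "cf-txt" as a new leftmost label.
// Assumption: a leading "*" label is swapped for the "_wildcard" sentinel.
func companionName(hostname string) string {
	if strings.HasPrefix(hostname, "*.") {
		hostname = "_wildcard." + strings.TrimPrefix(hostname, "*.")
	}
	return "cf-txt." + hostname
}

func main() {
	p := RegistryPayload{Version: 1, Kind: "CloudflareDNSRecord", Namespace: "default", Name: "web"}
	s, err := p.encode()
	if err != nil {
		panic(err)
	}
	fmt.Println(companionName("app.example.com"), s)
	fmt.Println(companionName("*.example.com"))
	fmt.Println(p.ownedBy("CloudflareDNSRecord", "default", "web"))
}
```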

Cross-linked from README's Documentation table; closes the
'(future)' placeholders in adopting-existing-records.md and
credentials.md.
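
The AES-256-GCM wire format described in the list above can be sketched
with the Go standard library. This is a minimal sketch under stated
assumptions: the source gives the 'v1:<base64-nonce>:<base64-ciphertext>'
shape, the 12-byte random nonce, and the 32-byte key, while the choice of
standard (not URL-safe) base64, the function names, and the plaintext
passthrough used for autodetection are assumptions.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// newGCM builds an AES-256-GCM AEAD and rejects keys that are not 32 bytes.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("txt registry key must be 32 bytes (AES-256)")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encodeCompanion seals plaintext under a fresh random nonce per call.
func encodeCompanion(key, plaintext []byte) (string, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize()) // 12 bytes for standard GCM
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	ct := gcm.Seal(nil, nonce, plaintext, nil)
	enc := base64.StdEncoding // assumption: standard alphabet
	return "v1:" + enc.EncodeToString(nonce) + ":" + enc.EncodeToString(ct), nil
}

// decodeCompanion autodetects the mode: values without the "v1:" prefix
// are treated as plaintext and returned unchanged.
func decodeCompanion(key []byte, wire string) ([]byte, error) {
	if !strings.HasPrefix(wire, "v1:") {
		return []byte(wire), nil
	}
	parts := strings.Split(wire, ":")
	if len(parts) != 3 {
		return nil, errors.New("malformed v1 companion value")
	}
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce, err := base64.StdEncoding.DecodeString(parts[1])
	if err != nil || len(nonce) != gcm.NonceSize() {
		return nil, errors.New("invalid nonce")
	}
	ct, err := base64.StdEncoding.DecodeString(parts[2])
	if err != nil {
		return nil, err
	}
	return gcm.Open(nil, nonce, ct, nil) // fails if the key is wrong
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	wire, err := encodeCompanion(key, []byte(`{"v":1,"k":"CloudflareDNSRecord","ns":"default","n":"web"}`))
	if err != nil {
		panic(err)
	}
	pt, err := decodeCompanion(key, wire)
	fmt.Println(string(pt), err)
}
```

Because the nonce is random per encode, two encodings of the same payload
differ, and a value sealed under an old key fails to open after the key
changes, which matches the decrypt-failure gotcha noted in the list.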

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chart/README.md and docs/crd-reference.md are now generated by
make generate, not hand-maintained. Two new tools added to the
make tools target:

  - github.com/norwoodj/helm-docs renders chart/README.md from
    chart/values.yaml + chart/README.md.gotmpl. Field-level
    descriptions live in values.yaml using the helm-docs '# --'
    convention.
  - github.com/elastic/crd-ref-docs renders docs/crd-reference.md
    from the api/v2alpha1 Go types. Templates forked under
    hack/crd-ref-docs-templates/ so CRDs (types with a GVK)
    promote to H2 with a horizontal-rule separator above each, and
    non-Kinds collect at the bottom under a single '### Sub-types'
    section. The fork is two files (gv_details.tpl + type.tpl);
    gv_list.tpl + type_members.tpl are upstream-identical so a
    later upstream pull can refresh just those two.

chart/values.yaml is restructured to use the helm-docs '# --'
field-level annotation convention. Every documented field now has
its description directly above; helm-docs picks them up into the
generated values table. Behavior unchanged — only the comment
shape changed.

hack/crd-ref-docs-config.yaml filters out the *List wrapper types
(CloudflareDNSRecordList, etc.) which are never user-instantiated.

The regenerated output (chart/README.md, docs/crd-reference.md)
lands in a follow-up commit so bisecting works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Result of running 'make generate' against the auto-gen tooling
landed in the prior commit. chart/README.md goes from 683 lines
of hand-maintained chronological-slice content to a 47-line pure
values reference; docs/crd-reference.md is a new 886-line field-by-field
reference for all 5 CRDs and their 27 sub-types.

README's Documentation table updated:
  - chart/README.md: replaced the now-stale 'chronological behavior-change
    notes' description with 'auto-generated from chart/values.yaml'.
  - docs/crd-reference.md: new row (auto-generated from api/v2alpha1).

Hand-edits to these files will be overwritten by the next
make generate. The chronological behavior-change content that
previously lived in chart/README.md is preserved in the
docs/<discrete-domain>.md pages shipped earlier (gateway-api.md,
tunnels.md, txt-registry.md, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moves the 7 local-only docs subdirs (design/, plans/, prompt/, follow/,
issues/, review/, summary/) under a single docs/appendix/ parent and
collapses the .gitignore from 7 individual rules to one
(docs/appendix/). Committable in-depth docs continue to live as flat
files at docs/<discrete-domain>.md per the established convention.

Also refreshes the .dockerignore explanatory comment to reflect the
post-config-removal state: drops the now-stale 'config/' reference
and mentions examples/ + .private-journal/ which are also auto-excluded
by the **-then-allowlist pattern. The allowlist semantics are
unchanged (Docker build still includes only cmd/ + api/ + internal/
+ go.mod/go.sum).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restore CHANGELOG.md from main so release automation continues to own it,
and align chart/Chart.yaml with main's metadata (version 0.18.3, kubeVersion,
keywords, sources, maintainers) while keeping the refactor's common 5.0.1
bump. Description rewritten to reflect the meta-operator + bundled
controllers shape and Gateway API integration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Track future Cloudflare surfaces — likely CRD shapes and open design
questions for each — without committing to scope. Includes a top-of-page
TOC and an explicit "what this page is not" disclaimer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cross-checked R2, Email Service, and Workers CRD sketches against the
cloudflare-go v6 SDK + public API reference and tightened the likely-CRD
lists to match real endpoint shapes (R2 lifecycle/CORS sub-resources,
Workers version→deployment model, Email Routing's zone-scoped rules vs
account-scoped addresses). Added Containers as a fourth track with an
explicit "SDK surface unverified" caveat since cloudflare-go didn't
surface it in the docs query — design work must re-confirm the paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cloudflare's docs now explicitly recommend Workers Static Assets over
Pages for new static sites, SPAs, and full-stack apps. Rework the Workers
section accordingly: call out the three modes (pure static / SPA /
full-stack), extend Worker.spec with an assets sub-spec, and replace the
"Pages as a separate track" open question with the now-relevant artifact
delivery problem (scripts are one file, asset bundles are directory
trees — OCI artifact is the obvious fit). Adds a residual Pages-legacy
question for managing existing Pages projects.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- test/envtest: CRDDirectoryPaths references the old config/crd/bases tree
  that 7cfad92 removed. Point it at bin/crd-staging instead, matching the
  Makefile's controller-gen output path. Unblocks every push since 7cfad92.

- txt_canonical: silence gosec G115 false positive on byte(val). The
  preceding `if val > 255 { return raw }` bounds-checks val into the byte
  range; gosec can't see the guard.

- tunnel_controller_test + tlsroute_source_controller_test: apply gofmt —
  colon-alignment on an ObjectMeta literal and an indented `drainLoop:`
  label that should sit at column 1.
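
The bounds-check-then-convert pattern from the txt_canonical item above can
be sketched as follows. The function decodeDDD and its decimal-escape use
case are hypothetical; only the `if val > 255 { return raw }` guard and the
byte(val) conversion come from the commit message. The inline `#nosec`
comment is gosec's suppression syntax, and using it here is an assumption
about how the finding was silenced.

```go
package main

import (
	"fmt"
	"strconv"
)

// decodeDDD converts the digits of a decimal escape such as "065" into the
// single byte they denote. Input that is not a valid byte value is returned
// unchanged.
func decodeDDD(raw string) string {
	val, err := strconv.ParseUint(raw, 10, 16)
	if err != nil {
		return raw
	}
	if val > 255 {
		return raw
	}
	// The guard above keeps val within the byte range, so the conversion
	// cannot truncate.
	return string([]byte{byte(val)}) // #nosec G115 -- bounds-checked above
}

func main() {
	fmt.Println(decodeDDD("065"))
	fmt.Println(decodeDDD("300"))
}
```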

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Split out a manifests target (controller-gen for CRDs only) and make both
generate and test depend on it. The envtest harness reads CRDs from
bin/crd-staging — without this dep, `make test` in CI ran before any CRD
generation and panicked with "no such file or directory".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…subsections

Context7 confirmed cloudflare-go is still on v6 (no v7 indexed) and that
Cloudflare Containers ships as a Worker binding (Durable-Object-style
class, config declared in the Worker code) rather than as a standalone
REST product. There's no client.Containers.* surface in the typed Go SDK.

Restructure the roadmap accordingly:

- Remove the standalone Containers section; its substance moves into
  Worker.spec.containers[] per binding (image, defaultPort, sleepAfter,
  envVars, entrypoint, egress policy).
- Add a fourth Worker mode (container-backed) alongside pure-static /
  SPA / full-stack, with a matching paragraph on the [[containers]]
  upload payload.
- Add two new subsections under Workers — "static assets (the Pages
  replacement)" and "containers" — so the deeper detail on each mode is
  navigable and not buried in one giant CRD bullet.
- Extend the open questions with container image source, egress
  ergonomics, and the management-API gap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… D semantics

Bug D (commit 8082d5f, 2026-05-19) intentionally changed the source
controllers' chain-CNAME semantics via chainContentFor (apex.go): when
the parent Gateway has a concrete listener and no cloudflare.io/gateway-
apex annotation, the chain content is now tn.Status.TunnelCNAME, not the
listener hostname. The HTTPRoute + TLSRoute envtest assertions were
written before Bug D and kept asserting the literal listener hostname
("ext.example.com"), so they timed out at 15s on Eventually after Bug D
landed. The mismatch was masked by an unrelated CRD-path break (fixed in
9460b9d) until envtest could run again.

This change is test-only. The controllers are correct.

- A1: update the two failing tests to capture tn.Status.TunnelCNAME from
  the parent tunnel CR (after the gateway-creation helper waits on it)
  and assert the emitted chain DNSRecord's Content matches.
- C: add two new tests (HTTPRoute + TLSRoute) that exercise the apex-
  override branch of chainContentFor — Gateway carries
  cloudflare.io/gateway-apex: "apex.example.com" and the chain content
  is asserted to equal that override. Previously had no envtest coverage.
- Refactor: createGatewayForRouteTest + createTLSGatewayForRouteTest now
  take a gatewayApex string ("" = no annotation); existing call sites
  pass "".

Verified locally: make test passes; the four target cases pass in 0.7-3.0s
each (down from 15.5s timeouts).
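
The two branches the tests now cover can be sketched in Go. This is a
hypothetical simplification, not the chainContentFor in apex.go: the real
function's signature and its handling of other listener shapes are not
shown in this message, and the hostnames below are placeholders.

```go
package main

import "fmt"

// Annotation key taken from the commit message above.
const gatewayApexAnnotation = "cloudflare.io/gateway-apex"

// chainContent returns the apex override when the Gateway carries the
// annotation, and the tunnel CNAME otherwise. The listener hostname is
// intentionally not consulted.
func chainContent(gatewayAnnotations map[string]string, tunnelCNAME string) string {
	if apex := gatewayAnnotations[gatewayApexAnnotation]; apex != "" {
		return apex
	}
	return tunnelCNAME
}

func main() {
	cname := "example-id.cfargotunnel.com" // placeholder tunnel CNAME
	fmt.Println(chainContent(nil, cname))
	fmt.Println(chainContent(map[string]string{gatewayApexAnnotation: "apex.example.com"}, cname))
}
```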

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add a paragraph under the Disclaimer making the authorship pattern
explicit, so adopters can weigh that when evaluating fit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add a top-level LICENSE file (MIT, Copyright (c) 2026 jacaudi), update
the Kubebuilder boilerplate template + every Go file's header, update
the README's license section, and add artifacthub.io/license: MIT to
the chart annotations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ExternalSecret + SecretStore wiring that materializes the
`cloudflare-credentials` Secret the operator consumes. Provider-agnostic
— callers swap the SecretStore.provider stanza for their backend
(1Password Connect, Vault, AWS SM, GCP SM, …) without touching the
ExternalSecret.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Show that both credential-resolution patterns are valid: per-CR override
via spec.cloudflare (kept on cloudflarednsrecord, cloudflaretunnel, and
the ESO example) and operator-level default via the chart's
credentials.{existingSecret,tokenKey,accountIDKey} (now demonstrated by
cloudflarezone, cloudflareruleset, and cloudflarezoneconfig).
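
The choice between the two patterns can be sketched as a simple resolution
order. This is an illustration only: CredentialRef, its fields, and
resolveCredentials are hypothetical names loosely modelled on
spec.cloudflare and the chart's credentials.* values, and the rule that a
per-CR reference always wins over the operator default is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// CredentialRef is a hypothetical reference to a credentials Secret.
type CredentialRef struct {
	SecretName   string
	TokenKey     string
	AccountIDKey string
}

// resolveCredentials prefers the per-CR override and falls back to the
// operator-level default; it fails when neither is configured.
func resolveCredentials(perCR *CredentialRef, operatorDefault CredentialRef) (CredentialRef, error) {
	if perCR != nil {
		return *perCR, nil
	}
	if operatorDefault.SecretName == "" {
		return CredentialRef{}, errors.New("no per-CR credentials and no operator default")
	}
	return operatorDefault, nil
}

func main() {
	def := CredentialRef{SecretName: "cloudflare-credentials", TokenKey: "token", AccountIDKey: "accountID"}
	ref, err := resolveCredentials(nil, def)
	fmt.Println(ref.SecretName, err)
}
```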

Also adds the ESO + CR example (external_secret_cloudflare_credentials.yaml)
held over from the previous round.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jacaudi jacaudi merged commit 9c705de into main May 22, 2026
7 checks passed
@jacaudi jacaudi deleted the refactor/total branch May 22, 2026 05:10
@wall-e-one
Contributor

wall-e-one Bot commented May 22, 2026

🎉 This PR is included in version 0.19.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@wall-e-one wall-e-one Bot added the released label May 22, 2026