refactor(spartan): unify deployment config sources #22847
Force-pushed 14ff0fc to 25c157d.
- Add `_deploy_defaults` block to `network-defaults.yml` mirroring the
static defaults that previously lived in `deploy_network.sh`'s
`${VAR:-default}` block. Loader seeds the `deploy:` baseline from it.
- Strip the duplicated bash defaults block in `deploy_network.sh`;
keep only deploy-time-derived values (RESOURCE_PROFILE cascade,
validator addresses, mnemonic ranges, tf_str helpers, namespace
required-var assertion).
- Rename `network_deploy.sh` -> `deploy_network_with_env.sh` to make
the entrypoint vs inner-script split obvious. Update callers in
`bootstrap.sh`, the `deploy-network.yml` workflow, and `test_kind.sh`
(which now uses the YAML loader instead of `.env` sourcing).
Codegen output (CLI / ethereum / slasher) is byte-identical.
The Python `resolve_secrets` helper inside `load_network_config.sh` wrote `::add-mask::SECRET` workflow commands to stdout. That stream is captured into `merged_json` and then piped to `jq`, so the mask lines prefixed the JSON and broke every downstream `jq` pipeline with a "parse error: Expected string key before ':'", leaving env-var placeholders unresolved and triggering GHA's "the word REPLACE_WITH_GCP_SECRET is invalid" errors during deploys.
- Route mask commands and diagnostics to stderr only.
- Split JSON-array secrets (`ETHEREUM_RPC_URLS`, consensus arrays) into per-element mask commands, matching the legacy setup_gcp_secrets.sh behavior.
- Cache fetched secrets to avoid repeated gcloud calls (R2 account id is now used for snapshot/blob/tx URL construction below).
- Construct STORE_SNAPSHOT_URL / BLOB_FILE_STORE_UPLOAD_URL / TX_FILE_STORE_URL from the corresponding *_BUCKET_DIRECTORY inputs + the r2-account-id secret -- another piece that was missing from the YAML loader and was breaking real deploys.
- Delete the now-orphan `setup_gcp_secrets.sh` (last consumer was the removed `.env` path; only referenced by comments now).
Cuts ~900 lines from terraform/deploy-aztec-infra/variables.tf by replacing
all ~70 legacy individual `variable "X"` declarations with three structured
inputs:
variable "deploy" { type = any } # YAML deploy: + computed
variable "env" { type = map(string) } # YAML env: pod baseline
variable "releases" { type = any } # YAML per-release blocks
main.tf now reads everything via local.d.<KEY> (alias for var.deploy) with
appropriate tonumber()/tobool()/try() coercions at the boundary -- the YAML
loader emits all values as strings.
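A minimal sketch of that boundary, with illustrative key names (the real set of keys and defaults lives in main.tf):

```hcl
locals {
  # Alias so the rest of main.tf can read config as local.d.<KEY>.
  d = var.deploy
}

locals {
  # The YAML loader emits every value as a string, so numbers and booleans
  # are coerced exactly once, here at the module boundary.
  validator_replica_count = tonumber(try(local.d.VALIDATOR_REPLICA_COUNT, "1"))
  p2p_enabled             = tobool(try(local.d.P2P_ENABLED, "false"))
  otel_endpoint           = try(local.d.OTEL_COLLECTOR_ENDPOINT, "")
}
```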
deploy_network.sh writes terraform.tfvars.json (instead of HCL tfvars) by
running load_network_config.sh --format=tfvars, then jq-merging deploy-time
computed values (cluster context, image overrides, contract addresses,
admin API key hash, mnemonic plumbing, P2P cluster gating, L1 endpoints)
into the deploy block. R2-derived URLs from resolve_secrets are promoted
from env: into deploy: where main.tf gates helm releases on them.
Plumbing:
- deploy_network.sh now takes <network> arg (passed by deploy_network_with_env.sh
and test_kind.sh) so it can re-invoke the loader for the structured JSON.
Validation:
- terraform validate passes.
- terraform plan against generated tfvars.json resolves all variables and
proceeds to k8s provider connect (failing only on cluster reachability,
as expected without a real kube context).
The previous commit's `inline_values = [yamlencode(var.releases[each.key])]`
pass-through was wrong for two reasons that surfaced as 30-min helm-release
timeouts on next-net (rpc / validators / prover never became Ready):
1. Subchart aliasing. aztec-validator and aztec-bot wrap aztec-node as a
subchart aliased `validator` / `bot`. Values must be nested under that
key for the env ConfigMap (.Values.env -> *-env-from-values) to render.
The loader emits release blocks at the top level, so passing them
through verbatim landed env on the wrapper chart instead of the
aztec-node subchart -- pods came up with NO env vars from the loader
pipeline, crashlooped, and helm timed out waiting for Ready.
2. validators-* / validator name mismatch. main.tf creates helm release
keys `validators` / `validators-ha-N` (plural + HA index), but the
loader emits a single `validator` (singular) block matching the YAML's
`_release_defaults.validator`. `var.releases.validators` was always
absent, so validator pods got nothing.
Fix: introduce `local.release_values_from_loader` that maps each helm
release name to its properly-wrapped values, then yamlencode that.
- validators* -> { validator = var.releases["validator"] }
- bot_* -> { bot = var.releases["bot_*"] }
- prover -> direct (loader output already nests node/broker/agent)
- rpc, archive, blob_sink, full_node, fisherman, p2p_bootstrap -> direct
(single aztec-node chart, top-level .Values.env is correct)
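Sketched as a Terraform local (the release-name set and the startswith matching are illustrative; the real release keys are the ones listed above):

```hcl
locals {
  # Wrap the loader's release blocks under the alias key each wrapper chart
  # expects, and map plural/HA release names back to the singular loader block.
  release_values_from_loader = {
    for name in local.helm_release_names : name => (
      startswith(name, "validators") ? { validator = var.releases["validator"] } :
      startswith(name, "bot_")       ? { bot = var.releases[name] } :
      var.releases[name] # prover already nests node/broker/agent; plain aztec-node charts read top-level env
    )
  }
}
```

Each helm_release then consumes `yamlencode(local.release_values_from_loader[each.key])` instead of the raw `yamlencode(var.releases[each.key])` pass-through.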
Verified via `helm template`: <release>-env-from-values ConfigMap now
renders for validator/prover/bot/rpc with all expected env keys (e.g.
AZTEC_SLOT_DURATION, FUNDING_PRIVATE_KEY, ...), and pod envFrom
correctly references it.
Gate the Slack/ClaudeBox failure notification on `steps.checkout-ref.outputs.ref == 'refs/heads/next'` so ad-hoc deploys from feature branches (and release tags) don't page the team.
Commit 14ff0fc's `_pod-template.yaml` simplification dropped the `{{- range $key, $value := .Values.node.env }}` loop in favour of the new generic `<release>-env-from-values` ConfigMap built from `.Values.env`. That broke every per-release deploy-time-computed env var that flows through Terraform `set` blocks under `<chart>.node.env.*`:
- validators: PUBLISHER_KEY_INDEX_START (per-replica HA offset), VALIDATOR_HA_REPLICA_INDEX, VALIDATOR_HA_SIGNING_ENABLED, VALIDATOR_HA_DATABASE_URL, VALIDATOR_HA_POOL_MAX, VALIDATOR_HA_OLD_DUTIES_MAX_AGE_H
- prover: PUBLISHER_KEY_INDEX_START
These can't live in the shared `env-from-values` ConfigMap because their values differ across `validators-ha-N` releases that share the same chart -- we'd collide on the ConfigMap name. next-net pods crashed in setup-attester-keystore.sh with `PUBLISHER_KEY_INDEX_START: unbound variable`, helm waited 30 minutes for Ready, then the deploy timed out.
Restore the inline `node.env` loop. The new `env-from-values` ConfigMap (for shared/baseline env from the YAML loader) coexists with it: the ConfigMap is mounted via envFrom while the restored loop emits inline `env` entries, so per-release deploy-time values still take effect.
The previous commit put deploy-time env vars (REGISTRY_CONTRACT_ADDRESS,
L1_CHAIN_ID, ETHEREUM_HOSTS, OTEL endpoints, ...) in a single
`common_inline_values = yamlencode({env = ...})` that was used for every
helm release. For aztec-node charts (rpc, archive, ...) this correctly
maps to .Values.env -> env-from-values ConfigMap. For wrapper charts it
does NOT:
- aztec-validator aliases aztec-node as `validator`, so env must be
at `validator.env.*` to reach the subchart's .Values.env.
- aztec-prover-stack aliases aztec-node three times (node/broker/agent),
so env must be at `node.env.*`, `broker.env.*`, `agent.env.*`.
- aztec-bot aliases aztec-node as `bot`.
Result on next-net: validators crashed with
"L1 registry address is required to start Aztec Node"
because REGISTRY_CONTRACT_ADDRESS was in the wrapper's .Values.env
(which nothing reads) rather than .Values.validator.env.
Fix: factor common_env_block as a plain map, then generate per-chart-type
inline values:
common_inline_values -> { env: ... } (aztec-node charts)
common_inline_values_validator -> { validator: { env } } (aztec-validator)
common_inline_values_prover -> { node/broker/agent: { env } } (prover-stack)
common_inline_values_bot -> { bot: { env } } (aztec-bot)
Selected per release key in the helm_release values concat.
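In Terraform terms, something like the following (the env keys shown are just the ones named above, and the referenced locals are illustrative; the full map lives in main.tf):

```hcl
locals {
  # Deploy-time env shared by every release, kept as a plain map so it can be
  # re-nested per chart type below. Right-hand sides are illustrative.
  common_env_block = {
    REGISTRY_CONTRACT_ADDRESS = local.registry_contract_address
    L1_CHAIN_ID               = local.d.L1_CHAIN_ID
    ETHEREUM_HOSTS            = local.ethereum_hosts
  }

  # aztec-node charts (rpc, archive, ...): top-level .Values.env is correct.
  common_inline_values = yamlencode({ env = local.common_env_block })

  # Wrapper charts: nest under the subchart alias so aztec-node actually sees it.
  common_inline_values_validator = yamlencode({ validator = { env = local.common_env_block } })
  common_inline_values_bot       = yamlencode({ bot = { env = local.common_env_block } })
  common_inline_values_prover = yamlencode({
    node   = { env = local.common_env_block }
    broker = { env = local.common_env_block }
    agent  = { env = local.common_env_block }
  })
}
```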
Verified with `helm template config-net-validator aztec-validator` that
REGISTRY_CONTRACT_ADDRESS and ETHEREUM_HOSTS now appear in the
env-from-values ConfigMap data section.
The blob/tx filestore client uses the AWS SDK, which reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY by default. All YAML network files and network-defaults.yml stored credentials as R2_ACCESS_KEY_ID / R2_SECRET_ACCESS_KEY, which were renamed to AWS_* only via an explicit Terraform set block for rpc/full_node/blob_sink -- validators and other charts never got the rename and crashed with CredentialsProviderError at startup.
Rename everywhere to AWS_* so no rename mapping is needed:
- All spartan/environments/networks/*.yml and network-defaults.yml
- load_network_config.sh secret_name_for mapping (key name changes; GCP secret names r2-access-key-id / r2-secret-access-key stay the same)
- main.tf local.d.R2_* references updated to local.d.AWS_*
…ty string
node.env.AWS_ACCESS_KEY_ID and node.env.AWS_SECRET_ACCESS_KEY were being set via Terraform custom_settings for rpc, full_node, and blob_sink as:
try(local.d.AWS_ACCESS_KEY_ID, "")
`local.d` is var.deploy; the AWS credentials live in var.env (the `env:` YAML block), not `deploy:`. So try() always returned "", and the explicit `set` block emitted an inline env: entry on the pod that overrides envFrom. Kubernetes inline env takes precedence over envFrom, so the credentials from the env-from-values ConfigMap (resolved from GCP secrets by the loader) were silently clobbered with empty strings, causing CredentialsProviderError.
The fix: remove the three set blocks. Credentials now flow entirely via the YAML loader's env: block -> env-from-values ConfigMap -> pod envFrom.
Force-pushed c5f59a0 to ec149cc.
…col/aztec-packages into spy/deployment-config
Force-pushed b810ad5 to ab7ed4b.
- Merge source_env_basic.sh into source_network_env.sh; old path is a shim.
- Remove BOOTSTRAP_NODES="asdf" placeholder from broker/agent defaults.
- Drop unused empty-string defaults from _deploy_defaults.
- Fail loudly if mnemonic *_START_INDEX is set under env: instead of deploy:.
Remove ~30 lines of bash pre-computation from deploy_network.sh:
- Drop the 8 *_RESOURCE_PROFILE cascade vars (kept RESOURCE_PROFILE for eth-devnet)
- Drop the P2P_NODEPORT_ENABLED/P2P_PUBLIC_IP if/else block
- Drop PROVER_REAL_PROOFS (was never consumed by Terraform)
- Collapse 4 mnemonic alias args (VALIDATOR/PROVER/BOT/FISHERMAN_MNEMONIC) down to a single LABS_INFRA_MNEMONIC passed through DEPLOY_OVERRIDES

In main.tf, derive the removed values from existing context (see the sketch below):
- p2p_nodeport_enabled / p2p_public_ip from local.is_kind
- Per-release resource profiles via try(local.d.X, is_kind ? "dev" : "prod"), so per-network YAML overrides (e.g. testnet prod-spot) still take effect
- All mnemonic references now use local.d.LABS_INFRA_MNEMONIC directly
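For example, the resource-profile fallback becomes a one-liner per release (key names illustrative; `local.is_kind` is the existing kind-cluster flag):

```hcl
locals {
  # Per-network YAML override wins (e.g. testnet's prod-spot); otherwise pick
  # a profile based on whether we are deploying to a local kind cluster.
  validator_resource_profile = try(local.d.VALIDATOR_RESOURCE_PROFILE, local.is_kind ? "dev" : "prod")
  prover_resource_profile    = try(local.d.PROVER_RESOURCE_PROFILE, local.is_kind ? "dev" : "prod")
}
```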
SPONSORED_FPC, TEST_ACCOUNTS, and REAL_VERIFIER live under env: in network YAMLs (so pods get them), but deploy-rollup-contracts/main.tf was reading them from var.deploy — falling back to _deploy_defaults (all false/true) instead of the per-network value. Result: contracts deployed without --test-accounts/--sponsored-fpc but validators running with TEST_ACCOUNTS=true/SPONSORED_FPC=true computed a different genesis archive root, putting the network in standby. Fix: prefer var.env for these three flags (env wins, deploy is fallback).
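The resulting precedence in deploy-rollup-contracts/main.tf looks roughly like this (local names and the literal fallbacks are illustrative):

```hcl
locals {
  # env: is what the pods actually run with, so it wins; deploy: (seeded from
  # _deploy_defaults) is only the fallback. Keeps the contract deploy flags and
  # the validators' genesis computation in agreement.
  test_accounts = tobool(try(var.env.TEST_ACCOUNTS, var.deploy.TEST_ACCOUNTS, "false"))
  sponsored_fpc = tobool(try(var.env.SPONSORED_FPC, var.deploy.SPONSORED_FPC, "false"))
  real_verifier = tobool(try(var.env.REAL_VERIFIER, var.deploy.REAL_VERIFIER, "true"))
}
```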
…n.tf b10ff73 accidentally renamed both keys to a non-existent LABS_INFRA_MNEMONIC_START_INDEX (confusing it with the LABS_INFRA_MNEMONIC secret env var, which has no _START_INDEX sibling). The loader still emits the original keys.
alexghr left a comment
I would just replace the python scripts with nodejs for consistency and get rid of yaml's bash-style env var override in favour of overriding from the environment.
# need to source this for CLUSTER
source "./environments/${{ inputs.network }}.env"
# Source for CLUSTER (prefers YAML loader, falls back to legacy .env).
I think the .env files are gone?
@@ -0,0 +1,63 @@
deploy:
  NAMESPACE: "${NAMESPACE:-alpha-net}"
Hm, can this be bash or Nodejs rather than Python? IIUC it's only used to offset mnemonic indices for devnet releases.
Same, I'd make this bash/nodejs.
I would also consider replacing this script and piecemeal env overloading with top-level JS-style overloading e.g. {...valuesWeGotFromYaml, ...valuesWeGotFromEnv}
next PR: we replace this script with external secrets
yeah, thought about doing it but I think this PR is big enough already 😅
network                    = try(local.d.NETWORK, "")
store_snapshot_url         = try(var.env.STORE_SNAPSHOT_URL, "")
blob_file_store_upload_url = try(var.env.BLOB_FILE_STORE_UPLOAD_URL, "")
prover_agent_image_str     = try(local.d.PROVER_AGENT_DOCKER_IMAGE, "")
validator_ha_image_str     = try(local.d.VALIDATOR_HA_DOCKER_IMAGE, "")
otel_endpoint              = try(local.d.OTEL_COLLECTOR_ENDPOINT, "")
rpc_cloud_armor            = try(local.d.RPC_CLOUD_ARMOR_POLICY_NAME, "")
rpc_session_affinity       = try(local.d.RPC_INGRESS_SESSION_AFFINITY, "")
external_bootnodes         = try(local.d.EXTERNAL_BOOTNODES, [])
This is a step in the right direction but eventually I'd want this to become just a pass-through: take a YAML string and TF just pass it on to helm
…override model
- Replace expand_placeholders.py, apply_derived.py, resolve_secrets.py with TS equivalents
- Remove `${VAR:-default}` placeholder syntax from all network YAMLs; values are now plain literals
- Add env spread in apply_derived.ts: shell env wins for any key in deploy:/env: blocks
- Add derived: YAML sections for computed values (devnet ingress/bucket paths, TX collection URLs)
- apply_derived.ts is now generic -- no network-specific logic hardcoded in TS
Why
Pre-PR, deployment config was scattered across:
- `spartan/environments/<network>.env` (one `.env` file per network, sourced as bash)
- `spartan/terraform/deploy-aztec-infra/variables.tf` (~944 lines of `variable "FOO_BAR"` declarations + defaults)
- `spartan/scripts/setup_gcp_secrets.sh` (case-by-case `REPLACE_WITH_GCP_SECRET` resolution)
- `main.tf` to translate `var.X` → Helm `set` block → pod env

Adding or changing a single setting often touched 3 of those files. There was no single place to look up "what does network X actually deploy with".
What changes
- `spartan/environments/network-defaults.yml` defines `_deploy_defaults`, `_release_defaults`, and `networks.<name>.env`. Per-network YAMLs at `spartan/environments/networks/<name>.yml` declare three blocks: `deploy:` (consumed by deploy script + Terraform `var.deploy`), `env:` (pod env baseline), and per-release blocks (mirrors Helm chart shape, e.g. `validator: { replicaCount, env }`, `prover: { node, broker, agent }`).
- `.env` files removed. The 16 `spartan/environments/*.env` files are replaced by `spartan/environments/networks/*.yml`.
- The loader (`scripts/load_network_config.sh`) deep-merges defaults + preset env baseline + per-network YAML, expands `${VAR}` placeholders, applies derived values (devnet's `MNEMONIC_INDEX_OFFSET`), resolves GCP secrets, and emits in `--format=env` / `--format=json` / `--format=tfvars`. The three transform stages live as small Python scripts under `scripts/` (`expand_placeholders.py`, `apply_derived.py`, `resolve_secrets.py`); each reads JSON on stdin, writes JSON on stdout.
- `aztec-node` gains a generic `env.configmap.yaml` built from `.Values.env` and mounted via `envFrom`. Wrapper charts (`aztec-validator`, `aztec-bot`) delegate to the subchart's env via the alias key (`validator.env`, `bot.env`); their redundant per-chart env templates are removed.
- `variables.tf` shrinks from 944 → 22 lines. The module reads three structured inputs, `var.deploy` (any), `var.env` (map), `var.releases` (any), and forwards per-release values via `yamlencode(var.releases[<release>])`. Same shape applied to `deploy-rollup-contracts/`.
- `source_network_env.sh` and `source_env_basic.sh` wrap the loader; existing callers (`bootstrap.sh`, `test_kind.sh`, GHA workflows) keep their CLI signatures.
- `setup_gcp_secrets.sh` is gone; `resolve_secrets.py` does the same job inline.
- CLI/codegen consumers now read network defaults from the `.networks.<name>.env.*` path (CLI `generate.sh`, `noir-contracts/bootstrap.sh`, `l1-contracts/scripts/load_network_defaults.sh`).

Where to look first
- `spartan/environments/network-defaults.yml`: the unified defaults file, including comments at the top describing how it's consumed.
- `spartan/scripts/load_network_config.sh`: the merge pipeline (env baseline → preset → per-network YAML → placeholder expansion → derived values → secret resolution → format emission).
- `spartan/terraform/deploy-aztec-infra/main.tf`: see how `var.deploy` / `var.env` / `var.releases` flow into `helm_release` resources.
- `spartan/environments/networks/devnet.yml` (or `testnet.yml`): examples of a real per-network override.
- `spartan/CLAUDE.md`: updated docs describing the new flow.

Risk / non-obvious behavior
- The `_prodlike` baseline was expanded for consistency. CLI consumers updated to the `.networks.<name>.env.*` path.
- `networks.<name>.env` is the joint codegen + runtime baseline. Editing values there changes both shipped TS artifacts AND deployed pod env; regenerate after changes (`yarn-project/{ethereum,slasher,cli} && yarn generate`; `cd l1-contracts && ./bootstrap.sh`).
- `apply_derived.py` fails loudly if `VALIDATOR_MNEMONIC_START_INDEX` (and friends) appear under `env:` instead of `deploy:`, since the namespace-derived `MNEMONIC_INDEX_OFFSET` shift only applies to the deploy block; silently mis-placing them would cause concurrent devnets to collide on L1 nonces.