
refactor(spartan): unify deployment config sources#22847

Open
spypsy wants to merge 38 commits into merge-train/spartan from spy/deployment-config

Conversation

Member

spypsy commented Apr 29, 2026

Why

Pre-PR, deployment config was scattered across:

  • spartan/environments/<network>.env (one .env file per network, sourced as bash)
  • spartan/terraform/deploy-aztec-infra/variables.tf (~944 lines of variable "FOO_BAR" declarations + defaults)
  • spartan/scripts/setup_gcp_secrets.sh (case-by-case REPLACE_WITH_GCP_SECRET resolution)
  • per-variable camelCase Helm value mappings hand-written in main.tf, each translating var.X → Helm set block → pod env

Adding or changing a single setting often touched three of those files, and there was no single place to look up what a given network actually deploys with.

What changes

  • YAML is the single source of truth. spartan/environments/network-defaults.yml defines _deploy_defaults, _release_defaults, and networks.<name>.env. Per-network YAMLs at spartan/environments/networks/<name>.yml declare three blocks: deploy: (consumed by deploy script + Terraform var.deploy), env: (pod env baseline), and per-release blocks (mirrors Helm chart shape, e.g. validator: { replicaCount, env }, prover: { node, broker, agent }).
  • All .env files removed. The 16 spartan/environments/*.env files are replaced by spartan/environments/networks/*.yml.
  • Loader (scripts/load_network_config.sh) deep-merges defaults + preset env baseline + per-network YAML, expands ${VAR} placeholders, applies derived values (devnet's MNEMONIC_INDEX_OFFSET), resolves GCP secrets, and emits in --format=env / --format=json / --format=tfvars. The three transform stages live as small Python scripts under scripts/ (expand_placeholders.py, apply_derived.py, resolve_secrets.py) — each reads JSON on stdin, writes JSON on stdout.
  • Charts: aztec-node gains a generic env.configmap.yaml built from .Values.env and mounted via envFrom. Wrapper charts (aztec-validator, aztec-bot) delegate to the subchart's env via the alias key (validator.env, bot.env); their redundant per-chart env templates are removed.
  • Terraform: variables.tf shrinks from 944 → 22 lines. The module reads three structured inputs — var.deploy (any), var.env (map), var.releases (any) — and forwards per-release values via yamlencode(var.releases[<release>]). Same shape applied to deploy-rollup-contracts/.
  • Scripts/CI: source_network_env.sh and source_env_basic.sh wrap the loader; existing callers (bootstrap.sh, test_kind.sh, GHA workflows) keep their CLI signatures. setup_gcp_secrets.sh is gone — resolve_secrets.py does the same job inline.
  • Codegen consumers updated for the .networks.<name>.env.* path (CLI generate.sh, noir-contracts/bootstrap.sh, l1-contracts/scripts/load_network_defaults.sh).
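For illustration, a per-network file under the new layout might look like this (all values below are hypothetical, composed from the block shapes described above):

```yaml
# spartan/environments/networks/<name>.yml -- hypothetical sketch
deploy:                      # consumed by the deploy script + Terraform var.deploy
  NAMESPACE: "example-net"
  CLUSTER: "example-cluster"
env:                         # pod env baseline, merged over network-defaults.yml
  L1_CHAIN_ID: "1337"
validator:                   # per-release block, mirrors the Helm chart shape
  replicaCount: 3
  env:
    VALIDATOR_HA_POOL_MAX: "10"
prover:
  node:   { env: {} }
  broker: { env: {} }
  agent:  { env: {} }
```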

Where to look first

  1. spartan/environments/network-defaults.yml — the unified defaults file, including comments at the top describing how it's consumed.
  2. spartan/scripts/load_network_config.sh — the merge pipeline (env baseline → preset → per-network YAML → placeholder expansion → derived values → secret resolution → format emission).
  3. spartan/terraform/deploy-aztec-infra/main.tf — see how var.deploy / var.env / var.releases flow into helm_release resources.
  4. spartan/environments/networks/devnet.yml (or testnet.yml) — examples of a real per-network override.
  5. spartan/CLAUDE.md — updated docs describing the new flow.
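Each transform stage follows the same contract: read JSON on stdin, write JSON on stdout, so stages compose as a shell pipeline. A minimal Python sketch of an expand_placeholders-style stage (the regex and lookup order here are illustrative, not the actual script):

```python
import json
import os
import re
import sys

# Matches ${VAR} and ${VAR:-default} placeholders.
PLACEHOLDER = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def expand(value, scope):
    """Recursively expand placeholders: merged-config keys win, then the
    shell environment, then the inline default, then empty string."""
    if isinstance(value, dict):
        return {k: expand(v, scope) for k, v in value.items()}
    if isinstance(value, list):
        return [expand(v, scope) for v in value]
    if isinstance(value, str):
        return PLACEHOLDER.sub(
            lambda m: str(scope.get(m.group(1),
                                    os.environ.get(m.group(1), m.group(2) or ""))),
            value,
        )
    return value

def main():
    # Stage contract: JSON in on stdin, JSON out on stdout.
    config = json.load(sys.stdin)
    scope = {k: v for k, v in config.items() if isinstance(v, str)}
    json.dump(expand(config, scope), sys.stdout, indent=2)
```

Because every stage speaks JSON over stdio, the loader can pipe them together with no shared state between stages.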

Risk / non-obvious behavior

  • Some generated CLI defaults differ intentionally where the _prodlike baseline was expanded for consistency; CLI consumers now read the .networks.<name>.env.* path.
  • networks.<name>.env is the joint codegen + runtime baseline. Editing values there changes both shipped TS artifacts AND deployed pod env — regenerate after changes (in yarn-project/{ethereum,slasher,cli}: yarn generate; in l1-contracts: ./bootstrap.sh).
  • apply_derived.py fails loudly if VALIDATOR_MNEMONIC_START_INDEX (and friends) appear under env: instead of deploy:, since the namespace-derived MNEMONIC_INDEX_OFFSET shift only applies to the deploy block — silently mis-placing them would cause concurrent devnets to collide on L1 nonces.

spypsy changed the base branch from next to merge-train/spartan on April 29, 2026 13:16
spypsy force-pushed the spy/deployment-config branch from 14ff0fc to 25c157d on April 29, 2026 13:27
spypsy added 15 commits April 29, 2026 15:50
- Add `_deploy_defaults` block to `network-defaults.yml` mirroring the
  static defaults that previously lived in `deploy_network.sh`'s
  `${VAR:-default}` block. Loader seeds the `deploy:` baseline from it.
- Strip the duplicated bash defaults block in `deploy_network.sh`;
  keep only deploy-time-derived values (RESOURCE_PROFILE cascade,
  validator addresses, mnemonic ranges, tf_str helpers, namespace
  required-var assertion).
- Rename `network_deploy.sh` -> `deploy_network_with_env.sh` to make
  the entrypoint vs inner-script split obvious. Update callers in
  `bootstrap.sh`, the `deploy-network.yml` workflow, and `test_kind.sh`
  (which now uses the YAML loader instead of `.env` sourcing).

Codegen output (CLI / ethereum / slasher) is byte-identical.
The Python `resolve_secrets` helper inside `load_network_config.sh`
wrote `::add-mask::SECRET` workflow commands to stdout. That stream is
captured into `merged_json` and then piped to `jq`, so the mask lines
prefixed the JSON and broke every downstream `jq` pipeline with a
"parse error: Expected string key before ':'", leaving env-var
placeholders unresolved and triggering GHA's "the word
REPLACE_WITH_GCP_SECRET is invalid" errors during deploys.

- Route mask commands and diagnostics to stderr only.
- Split JSON-array secrets (`ETHEREUM_RPC_URLS`, consensus arrays) into
  per-element mask commands, matching the legacy setup_gcp_secrets.sh
  behavior.
- Cache fetched secrets to avoid repeated gcloud calls (R2 account id
  is now used for snapshot/blob/tx URL construction below).
- Construct STORE_SNAPSHOT_URL / BLOB_FILE_STORE_UPLOAD_URL /
  TX_FILE_STORE_URL from the corresponding *_BUCKET_DIRECTORY inputs +
  the r2-account-id secret -- another piece that was missing from the
  YAML loader and was breaking real deploys.
- Delete the now-orphan `setup_gcp_secrets.sh` (last consumer was the
  removed `.env` path; only referenced by comments now).
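The core discipline of the fix — workflow commands go to stderr, stdout stays valid JSON — can be sketched like this (function names and the secret-selection logic are illustrative, not the actual resolve_secrets code):

```python
import json
import sys

def mask(value):
    """Workflow commands go to stderr only -- stdout must remain valid JSON
    for the downstream jq pipeline that captures it."""
    print(f"::add-mask::{value}", file=sys.stderr)

def mask_secret(value):
    """Mask a resolved secret; JSON-array secrets (e.g. RPC URL lists) get
    one mask command per element, matching legacy setup_gcp_secrets.sh."""
    try:
        parsed = json.loads(value)
    except (json.JSONDecodeError, TypeError):
        parsed = None
    if isinstance(parsed, list):
        for element in parsed:
            mask(str(element))
    else:
        mask(value)
```

With masks on stderr, `merged_json=$(... | load_stage)` captures only the JSON and jq parses it cleanly.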
Cuts ~900 lines from terraform/deploy-aztec-infra/variables.tf by replacing
all ~70 legacy individual `variable "X"` declarations with three structured
inputs:

  variable "deploy"   { type = any }            # YAML deploy: + computed
  variable "env"      { type = map(string) }    # YAML env: pod baseline
  variable "releases" { type = any }            # YAML per-release blocks

main.tf now reads everything via local.d.<KEY> (alias for var.deploy) with
appropriate tonumber()/tobool()/try() coercions at the boundary -- the YAML
loader emits all values as strings.
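A hypothetical sketch of that boundary coercion (key names invented for illustration; the pattern is the point):

```hcl
locals {
  d = var.deploy # shorthand alias used throughout main.tf

  # The loader emits everything as strings; cast once, at the boundary.
  validator_replicas = tonumber(try(local.d.VALIDATOR_REPLICA_COUNT, "1"))
  p2p_enabled        = tobool(try(local.d.P2P_ENABLED, "false"))
  external_bootnodes = try(local.d.EXTERNAL_BOOTNODES, [])
}
```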

deploy_network.sh writes terraform.tfvars.json (instead of HCL tfvars) by
running load_network_config.sh --format=tfvars, then jq-merging deploy-time
computed values (cluster context, image overrides, contract addresses,
admin API key hash, mnemonic plumbing, P2P cluster gating, L1 endpoints)
into the deploy block. R2-derived URLs from resolve_secrets are promoted
from env: into deploy: where main.tf gates helm releases on them.

Plumbing:
- deploy_network.sh now takes <network> arg (passed by deploy_network_with_env.sh
  and test_kind.sh) so it can re-invoke the loader for the structured JSON.

Validation:
- terraform validate passes.
- terraform plan against generated tfvars.json resolves all variables and
  proceeds to k8s provider connect (failing only on cluster reachability,
  as expected without a real kube context).
The previous commit's `inline_values = [yamlencode(var.releases[each.key])]`
pass-through was wrong for two reasons that surfaced as 30-min helm-release
timeouts on next-net (rpc / validators / prover never became Ready):

1. Subchart aliasing. aztec-validator and aztec-bot wrap aztec-node as a
   subchart aliased `validator` / `bot`. Values must be nested under that
   key for the env ConfigMap (.Values.env -> *-env-from-values) to render.
   The loader emits release blocks at the top level, so passing them
   through verbatim landed env on the wrapper chart instead of the
   aztec-node subchart -- pods came up with NO env vars from the loader
   pipeline, crashlooped, and helm timed out waiting for Ready.

2. validators-* / validator name mismatch. main.tf creates helm release
   keys `validators` / `validators-ha-N` (plural + HA index), but the
   loader emits a single `validator` (singular) block matching the YAML's
   `_release_defaults.validator`. `var.releases.validators` was always
   absent, so validator pods got nothing.

Fix: introduce `local.release_values_from_loader` that maps each helm
release name to its properly-wrapped values, then yamlencode that.
- validators*  -> { validator = var.releases["validator"] }
- bot_*        -> { bot = var.releases["bot_*"] }
- prover       -> direct (loader output already nests node/broker/agent)
- rpc, archive, blob_sink, full_node, fisherman, p2p_bootstrap -> direct
  (single aztec-node chart, top-level .Values.env is correct)
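In HCL, the mapping might be sketched like this (local names such as `validator_release_keys` are invented; the release/alias shapes come from the commit message):

```hcl
locals {
  release_values_from_loader = merge(
    # aztec-validator wraps aztec-node under the `validator` alias, and the
    # loader's single `validator` block serves validators / validators-ha-N.
    { for k in local.validator_release_keys :
      k => { validator = var.releases["validator"] } },
    # aztec-bot wraps aztec-node under the `bot` alias.
    { for k in local.bot_release_keys :
      k => { bot = var.releases[k] } },
    # Prover stack and single aztec-node charts take loader output directly.
    { for k in ["prover", "rpc", "archive", "blob_sink", "full_node"] :
      k => var.releases[k] },
  )
}

# Each helm_release then consumes:
#   yamlencode(local.release_values_from_loader[each.key])
```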

Verified via `helm template`: <release>-env-from-values ConfigMap now
renders for validator/prover/bot/rpc with all expected env keys (e.g.
AZTEC_SLOT_DURATION, FUNDING_PRIVATE_KEY, ...), and pod envFrom
correctly references it.
Gate the Slack/ClaudeBox failure notification on
`steps.checkout-ref.outputs.ref == 'refs/heads/next'` so ad-hoc deploys
from feature branches (and release tags) don't page the team.
Commit 14ff0fc's `_pod-template.yaml` simplification dropped the
`{{- range $key, $value := .Values.node.env }}` loop in favour of the
new generic `<release>-env-from-values` ConfigMap built from
`.Values.env`. That broke every per-release deploy-time-computed env
var that flows through Terraform `set` blocks under `<chart>.node.env.*`:

- validators: PUBLISHER_KEY_INDEX_START (per-replica HA offset),
  VALIDATOR_HA_REPLICA_INDEX, VALIDATOR_HA_SIGNING_ENABLED,
  VALIDATOR_HA_DATABASE_URL, VALIDATOR_HA_POOL_MAX,
  VALIDATOR_HA_OLD_DUTIES_MAX_AGE_H
- prover: PUBLISHER_KEY_INDEX_START

These can't live in the shared `env-from-values` ConfigMap because
their values differ across `validators-ha-N` releases that share the
same chart -- we'd collide on the ConfigMap name.

next-net pods crashed in setup-attester-keystore.sh with
`PUBLISHER_KEY_INDEX_START: unbound variable`, helm waited 30 minutes
for Ready, then the deploy timed out.

Restore the inline `node.env` loop. The new `env-from-values`
ConfigMap (for shared/baseline env from the YAML loader) coexists
with it: the ConfigMap is mounted via envFrom, while the loop emits
inline `env` entries, so per-release deploy-time values still apply.
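A sketch of how the two mechanisms coexist in the pod template (illustrative; the real _pod-template.yaml may differ in detail):

```yaml
# Inline env restored from Terraform set blocks -- values differ per
# validators-ha-N release, so they cannot share one ConfigMap name.
env:
  {{- range $key, $value := .Values.node.env }}
  - name: {{ $key }}
    value: {{ $value | quote }}
  {{- end }}
# Shared/baseline env from the YAML loader.
envFrom:
  - configMapRef:
      name: {{ .Release.Name }}-env-from-values
```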
The previous commit put deploy-time env vars (REGISTRY_CONTRACT_ADDRESS,
L1_CHAIN_ID, ETHEREUM_HOSTS, OTEL endpoints, ...) in a single
`common_inline_values = yamlencode({env = ...})` that was used for every
helm release. For aztec-node charts (rpc, archive, ...) this correctly
maps to .Values.env -> env-from-values ConfigMap. For wrapper charts it
does NOT:

- aztec-validator aliases aztec-node as `validator`, so env must be
  at `validator.env.*` to reach the subchart's .Values.env.
- aztec-prover-stack aliases aztec-node three times (node/broker/agent),
  so env must be at `node.env.*`, `broker.env.*`, `agent.env.*`.
- aztec-bot aliases aztec-node as `bot`.

Result on next-net: validators crashed with
  "L1 registry address is required to start Aztec Node"
because REGISTRY_CONTRACT_ADDRESS was in the wrapper's .Values.env
(which nothing reads) rather than .Values.validator.env.

Fix: factor common_env_block as a plain map, then generate per-chart-type
inline values:
  common_inline_values           -> { env: ... }           (aztec-node charts)
  common_inline_values_validator -> { validator: { env } } (aztec-validator)
  common_inline_values_prover    -> { node/broker/agent: { env } } (prover-stack)
  common_inline_values_bot       -> { bot: { env } }       (aztec-bot)

Selected per release key in the helm_release values concat.

Verified with `helm template config-net-validator aztec-validator` that
REGISTRY_CONTRACT_ADDRESS and ETHEREUM_HOSTS now appear in the
env-from-values ConfigMap data section.
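The four shapes can be sketched as follows (the `common_env_block` contents and local value references are illustrative):

```hcl
locals {
  common_env_block = {
    REGISTRY_CONTRACT_ADDRESS = local.registry_address
    L1_CHAIN_ID               = local.d.L1_CHAIN_ID
    ETHEREUM_HOSTS            = local.ethereum_hosts
  }

  # aztec-node charts read .Values.env directly.
  common_inline_values           = yamlencode({ env = local.common_env_block })
  # Wrapper charts need the env nested under the subchart alias.
  common_inline_values_validator = yamlencode({ validator = { env = local.common_env_block } })
  common_inline_values_bot       = yamlencode({ bot = { env = local.common_env_block } })
  common_inline_values_prover = yamlencode({
    node   = { env = local.common_env_block }
    broker = { env = local.common_env_block }
    agent  = { env = local.common_env_block }
  })
}
```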
The blob/tx filestore client uses the AWS SDK which reads AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY by default. All YAML network files and
network-defaults.yml stored credentials as R2_ACCESS_KEY_ID / R2_SECRET_ACCESS_KEY,
which was renamed to AWS_* only via an explicit Terraform set block for
rpc/full_node/blob_sink -- validators and other charts never got the rename
and crashed with CredentialsProviderError at startup.

Rename everywhere to AWS_* so no rename mapping is needed:
- All spartan/environments/networks/*.yml and network-defaults.yml
- load_network_config.sh secret_name_for mapping (key name changes; GCP
  secret names r2-access-key-id / r2-secret-access-key stay the same)
- main.tf local.d.R2_* references updated to local.d.AWS_*
…ty string

node.env.AWS_ACCESS_KEY_ID and node.env.AWS_SECRET_ACCESS_KEY were being
set via Terraform custom_settings for rpc, full_node, and blob_sink as:
  try(local.d.AWS_ACCESS_KEY_ID, "")

`local.d` is var.deploy; the AWS credentials live in var.env (the `env:`
YAML block), not `deploy:`. So try() always returned "", and the explicit
`set` block emitted an inline env: entry on the pod that overrides envFrom.
Kubernetes inline env takes precedence over envFrom, so the credentials
from the env-from-values ConfigMap (resolved from GCP secrets by the loader)
were silently clobbered with empty strings, causing CredentialsProviderError.

The fix: remove the three set blocks. Credentials now flow entirely via
the YAML loader's env: block -> env-from-values ConfigMap -> pod envFrom.
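The precedence rule behind the bug is standard Kubernetes behavior: keys listed under a container's `env` override the same keys supplied via `envFrom`. A minimal illustration of the failure mode:

```yaml
containers:
  - name: node
    envFrom:
      - configMapRef:
          name: rpc-env-from-values   # AWS_ACCESS_KEY_ID resolved from GCP secrets
    env:
      # An inline entry with the same name wins over the envFrom value --
      # here an empty string silently clobbers the real credential.
      - name: AWS_ACCESS_KEY_ID
        value: ""
```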
spypsy force-pushed the spy/deployment-config branch from c5f59a0 to ec149cc on May 6, 2026 14:10
spypsy force-pushed the spy/deployment-config branch from b810ad5 to ab7ed4b on May 12, 2026 15:02
spypsy added 3 commits May 12, 2026 15:14
- Merge source_env_basic.sh into source_network_env.sh; old path is a shim.
- Remove BOOTSTRAP_NODES="asdf" placeholder from broker/agent defaults.
- Drop unused empty-string defaults from _deploy_defaults.
- Fail loudly if mnemonic *_START_INDEX is set under env: instead of deploy:.
spypsy marked this pull request as ready for review May 12, 2026 15:27
spypsy requested a review from charlielye as a code owner May 12, 2026 15:27
spypsy added 5 commits May 12, 2026 15:36
Remove ~30 lines of bash pre-computation from deploy_network.sh:
- Drop the 8 *_RESOURCE_PROFILE cascade vars (kept RESOURCE_PROFILE for eth-devnet)
- Drop the P2P_NODEPORT_ENABLED/P2P_PUBLIC_IP if/else block
- Drop PROVER_REAL_PROOFS (was never consumed by Terraform)
- Collapse 4 mnemonic alias args (VALIDATOR/PROVER/BOT/FISHERMAN_MNEMONIC)
  down to a single LABS_INFRA_MNEMONIC passed through DEPLOY_OVERRIDES

In main.tf, derive the removed values from existing context:
- p2p_nodeport_enabled / p2p_public_ip from local.is_kind
- Per-release resource profiles via try(local.d.X, is_kind ? "dev" : "prod"),
  so per-network YAML overrides (e.g. testnet prod-spot) still take effect
- All mnemonic references now use local.d.LABS_INFRA_MNEMONIC directly
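A sketch of the derivation pattern (variable names follow the commit message; defaults illustrative):

```hcl
# Per-release resource profile: a per-network YAML override (e.g. testnet's
# prod-spot) wins; otherwise derive from the target cluster type.
prover_resource_profile = try(
  local.d.PROVER_RESOURCE_PROFILE,
  local.is_kind ? "dev" : "prod",
)
```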
SPONSORED_FPC, TEST_ACCOUNTS, and REAL_VERIFIER live under env: in
network YAMLs (so pods get them), but deploy-rollup-contracts/main.tf
was reading them from var.deploy — falling back to _deploy_defaults
(all false/true) instead of the per-network value.

Result: contracts deployed without --test-accounts/--sponsored-fpc but
validators running with TEST_ACCOUNTS=true/SPONSORED_FPC=true computed
a different genesis archive root, putting the network in standby.

Fix: prefer var.env for these three flags (env wins, deploy is fallback).
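The fallback order might be expressed as (sketch; the defaults shown reflect the commit message's "all false/true" _deploy_defaults):

```hcl
# env: wins -- it is what the pods actually run with; deploy: is fallback.
sponsored_fpc = try(var.env.SPONSORED_FPC, var.deploy.SPONSORED_FPC, "false")
test_accounts = try(var.env.TEST_ACCOUNTS, var.deploy.TEST_ACCOUNTS, "false")
real_verifier = try(var.env.REAL_VERIFIER, var.deploy.REAL_VERIFIER, "true")
```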
…n.tf

b10ff73 accidentally renamed both keys to a non-existent
LABS_INFRA_MNEMONIC_START_INDEX (confusing it with the LABS_INFRA_MNEMONIC
secret env var, which has no _START_INDEX sibling). The loader still
emits the original keys.
Contributor

@alexghr alexghr left a comment


I would just replace the python scripts with nodejs for consistency and get rid of yaml's bash-style env var override in favour of overriding from the environment.

Comment thread .github/workflows/deploy-network.yml Outdated

# need to source this for CLUSTER
source "./environments/${{ inputs.network }}.env"
# Source for CLUSTER (prefers YAML loader, falls back to legacy .env).
Contributor


I think the .env files are gone?

@@ -0,0 +1,63 @@
deploy:
NAMESPACE: "${NAMESPACE:-alpha-net}"
Contributor


This scares me.

Comment thread spartan/scripts/apply_derived.py Outdated
Contributor


Hm, can this be bash or Nodejs rather than Python? IIUC it's only used to offset mnemonic indices for devnet releases.

Comment thread spartan/scripts/expand_placeholders.py Outdated
Contributor


Same, I'd make this bash/nodejs.

I would also consider replacing this script and piecemeal env overloading with top-level JS-style overloading e.g. {...valuesWeGotFromYaml, ...valuesWeGotFromEnv}

Comment thread spartan/scripts/resolve_secrets.py Outdated
Contributor


next PR: we replace this script with external secrets

Member Author


yeah, thought about doing it but I think this PR is big enough already 😅

Contributor


❤️

Contributor


🔥 !

Comment on lines +63 to +71
network = try(local.d.NETWORK, "")
store_snapshot_url = try(var.env.STORE_SNAPSHOT_URL, "")
blob_file_store_upload_url = try(var.env.BLOB_FILE_STORE_UPLOAD_URL, "")
prover_agent_image_str = try(local.d.PROVER_AGENT_DOCKER_IMAGE, "")
validator_ha_image_str = try(local.d.VALIDATOR_HA_DOCKER_IMAGE, "")
otel_endpoint = try(local.d.OTEL_COLLECTOR_ENDPOINT, "")
rpc_cloud_armor = try(local.d.RPC_CLOUD_ARMOR_POLICY_NAME, "")
rpc_session_affinity = try(local.d.RPC_INGRESS_SESSION_AFFINITY, "")
external_bootnodes = try(local.d.EXTERNAL_BOOTNODES, [])
Contributor


This is a step in the right direction but eventually I'd want this to become just a pass-through: take a YAML string and TF just pass it on to helm

spypsy added 4 commits May 13, 2026 15:11
…override model

- Replace expand_placeholders.py, apply_derived.py, resolve_secrets.py with TS equivalents
- Remove \${VAR:-default} placeholder syntax from all network YAMLs; values are now plain literals
- Add env spread in apply_derived.ts: shell env wins for any key in deploy:/env: blocks
- Add derived: YAML sections for computed values (devnet ingress/bucket paths, TX collection URLs)
- apply_derived.ts is now generic -- no network-specific logic hardcoded in TS
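The override model is equivalent to a JS-style spread {...fromYaml, ...fromEnv} restricted to the keys a block already declares. A Python sketch of those semantics (illustrative, not the actual apply_derived.ts source):

```python
import os

def overlay_shell_env(block, environ=None):
    """Shell env wins for any key already present in the deploy:/env: block;
    env vars not declared in the block are ignored rather than injected."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(key, value) for key, value in block.items()}
```

Restricting the overlay to declared keys keeps the merged output a closed set: a stray shell variable cannot silently add new config.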