refactor(spartan): unify deployment config sources #22847
Force-pushed 14ff0fc to 25c157d.
- Add `_deploy_defaults` block to `network-defaults.yml` mirroring the
static defaults that previously lived in `deploy_network.sh`'s
`${VAR:-default}` block. Loader seeds the `deploy:` baseline from it.
- Strip the duplicated bash defaults block in `deploy_network.sh`;
keep only deploy-time-derived values (RESOURCE_PROFILE cascade,
validator addresses, mnemonic ranges, tf_str helpers, namespace
required-var assertion).
- Rename `network_deploy.sh` -> `deploy_network_with_env.sh` to make
the entrypoint vs inner-script split obvious. Update callers in
`bootstrap.sh`, the `deploy-network.yml` workflow, and `test_kind.sh`
(which now uses the YAML loader instead of `.env` sourcing).
Codegen output (CLI / ethereum / slasher) is byte-identical.
The Python `resolve_secrets` helper inside `load_network_config.sh` wrote `::add-mask::SECRET` workflow commands to stdout. That stream is captured into `merged_json` and then piped to `jq`, so the mask lines prefixed the JSON and broke every downstream `jq` pipeline with a "parse error: Expected string key before ':'", leaving env-var placeholders unresolved and triggering GHA's "the word REPLACE_WITH_GCP_SECRET is invalid" errors during deploys.
- Route mask commands and diagnostics to stderr only.
- Split JSON-array secrets (`ETHEREUM_RPC_URLS`, consensus arrays) into per-element mask commands, matching the legacy setup_gcp_secrets.sh behavior.
- Cache fetched secrets to avoid repeated gcloud calls (R2 account id is now used for snapshot/blob/tx URL construction below).
- Construct STORE_SNAPSHOT_URL / BLOB_FILE_STORE_UPLOAD_URL / TX_FILE_STORE_URL from the corresponding *_BUCKET_DIRECTORY inputs + the r2-account-id secret -- another piece that was missing from the YAML loader and was breaking real deploys.
- Delete the now-orphan `setup_gcp_secrets.sh` (last consumer was the removed `.env` path; only referenced by comments now).
Cuts ~900 lines from terraform/deploy-aztec-infra/variables.tf by replacing
all ~70 legacy individual `variable "X"` declarations with three structured
inputs:
variable "deploy" { type = any } # YAML deploy: + computed
variable "env" { type = map(string) } # YAML env: pod baseline
variable "releases" { type = any } # YAML per-release blocks
main.tf now reads everything via local.d.<KEY> (alias for var.deploy) with
appropriate tonumber()/tobool()/try() coercions at the boundary -- the YAML
loader emits all values as strings.
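A minimal sketch of that boundary, with illustrative key names (the real set of keys and defaults lives in main.tf):

```hcl
locals {
  # Alias so the rest of main.tf can read config as local.d.<KEY>.
  d = var.deploy
}

locals {
  # The YAML loader emits every value as a string, so numbers and booleans
  # are coerced exactly once, here at the module boundary.
  validator_replica_count = tonumber(try(local.d.VALIDATOR_REPLICA_COUNT, "1"))
  p2p_enabled             = tobool(try(local.d.P2P_ENABLED, "false"))
  otel_endpoint           = try(local.d.OTEL_COLLECTOR_ENDPOINT, "")
}
```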
deploy_network.sh writes terraform.tfvars.json (instead of HCL tfvars) by
running load_network_config.sh --format=tfvars, then jq-merging deploy-time
computed values (cluster context, image overrides, contract addresses,
admin API key hash, mnemonic plumbing, P2P cluster gating, L1 endpoints)
into the deploy block. R2-derived URLs from resolve_secrets are promoted
from env: into deploy: where main.tf gates helm releases on them.
Plumbing:
- deploy_network.sh now takes <network> arg (passed by deploy_network_with_env.sh
and test_kind.sh) so it can re-invoke the loader for the structured JSON.
Validation:
- terraform validate passes.
- terraform plan against generated tfvars.json resolves all variables and
proceeds to k8s provider connect (failing only on cluster reachability,
as expected without a real kube context).
The previous commit's `inline_values = [yamlencode(var.releases[each.key])]`
pass-through was wrong for two reasons that surfaced as 30-min helm-release
timeouts on next-net (rpc / validators / prover never became Ready):
1. Subchart aliasing. aztec-validator and aztec-bot wrap aztec-node as a
subchart aliased `validator` / `bot`. Values must be nested under that
key for the env ConfigMap (.Values.env -> *-env-from-values) to render.
The loader emits release blocks at the top level, so passing them
through verbatim landed env on the wrapper chart instead of the
aztec-node subchart -- pods came up with NO env vars from the loader
pipeline, crashlooped, and helm timed out waiting for Ready.
2. validators-* / validator name mismatch. main.tf creates helm release
keys `validators` / `validators-ha-N` (plural + HA index), but the
loader emits a single `validator` (singular) block matching the YAML's
`_release_defaults.validator`. `var.releases.validators` was always
absent, so validator pods got nothing.
Fix: introduce `local.release_values_from_loader` that maps each helm
release name to its properly-wrapped values, then yamlencode that.
- validators* -> { validator = var.releases["validator"] }
- bot_* -> { bot = var.releases["bot_*"] }
- prover -> direct (loader output already nests node/broker/agent)
- rpc, archive, blob_sink, full_node, fisherman, p2p_bootstrap -> direct
(single aztec-node chart, top-level .Values.env is correct)
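Sketched as a Terraform local (the release-name set and the startswith matching are illustrative; the real release keys are the ones listed above):

```hcl
locals {
  # Wrap the loader's release blocks under the alias key each wrapper chart
  # expects, and map plural/HA release names back to the singular loader block.
  release_values_from_loader = {
    for name in local.helm_release_names : name => (
      startswith(name, "validators") ? { validator = var.releases["validator"] } :
      startswith(name, "bot_")       ? { bot = var.releases[name] } :
      var.releases[name] # prover already nests node/broker/agent; plain aztec-node charts read top-level env
    )
  }
}
```

Each helm_release then consumes `yamlencode(local.release_values_from_loader[each.key])` instead of the raw `yamlencode(var.releases[each.key])` pass-through.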
Verified via `helm template`: <release>-env-from-values ConfigMap now
renders for validator/prover/bot/rpc with all expected env keys (e.g.
AZTEC_SLOT_DURATION, FUNDING_PRIVATE_KEY, ...), and pod envFrom
correctly references it.
Gate the Slack/ClaudeBox failure notification on `steps.checkout-ref.outputs.ref == 'refs/heads/next'` so ad-hoc deploys from feature branches (and release tags) don't page the team.
Commit 14ff0fc's `_pod-template.yaml` simplification dropped the `{{- range $key, $value := .Values.node.env }}` loop in favour of the new generic `<release>-env-from-values` ConfigMap built from `.Values.env`. That broke every per-release deploy-time-computed env var that flows through Terraform `set` blocks under `<chart>.node.env.*`:
- validators: PUBLISHER_KEY_INDEX_START (per-replica HA offset), VALIDATOR_HA_REPLICA_INDEX, VALIDATOR_HA_SIGNING_ENABLED, VALIDATOR_HA_DATABASE_URL, VALIDATOR_HA_POOL_MAX, VALIDATOR_HA_OLD_DUTIES_MAX_AGE_H
- prover: PUBLISHER_KEY_INDEX_START
These can't live in the shared `env-from-values` ConfigMap because their values differ across `validators-ha-N` releases that share the same chart -- we'd collide on the ConfigMap name. next-net pods crashed in setup-attester-keystore.sh with `PUBLISHER_KEY_INDEX_START: unbound variable`, helm waited 30 minutes for Ready, then the deploy timed out.
Restore the inline `node.env` loop. The new `env-from-values` ConfigMap (for shared/baseline env from the YAML loader) coexists with it: the ConfigMap is mounted via envFrom while the restored loop emits inline `env` entries, so per-release deploy-time values still take effect.
The previous commit put deploy-time env vars (REGISTRY_CONTRACT_ADDRESS,
L1_CHAIN_ID, ETHEREUM_HOSTS, OTEL endpoints, ...) in a single
`common_inline_values = yamlencode({env = ...})` that was used for every
helm release. For aztec-node charts (rpc, archive, ...) this correctly
maps to .Values.env -> env-from-values ConfigMap. For wrapper charts it
does NOT:
- aztec-validator aliases aztec-node as `validator`, so env must be
at `validator.env.*` to reach the subchart's .Values.env.
- aztec-prover-stack aliases aztec-node three times (node/broker/agent),
so env must be at `node.env.*`, `broker.env.*`, `agent.env.*`.
- aztec-bot aliases aztec-node as `bot`.
Result on next-net: validators crashed with
"L1 registry address is required to start Aztec Node"
because REGISTRY_CONTRACT_ADDRESS was in the wrapper's .Values.env
(which nothing reads) rather than .Values.validator.env.
Fix: factor common_env_block as a plain map, then generate per-chart-type
inline values:
common_inline_values -> { env: ... } (aztec-node charts)
common_inline_values_validator -> { validator: { env } } (aztec-validator)
common_inline_values_prover -> { node/broker/agent: { env } } (prover-stack)
common_inline_values_bot -> { bot: { env } } (aztec-bot)
Selected per release key in the helm_release values concat.
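In Terraform terms, something like the following (the env keys shown are just the ones named above, and the referenced locals are illustrative; the full map lives in main.tf):

```hcl
locals {
  # Deploy-time env shared by every release, kept as a plain map so it can be
  # re-nested per chart type below. Right-hand sides are illustrative.
  common_env_block = {
    REGISTRY_CONTRACT_ADDRESS = local.registry_contract_address
    L1_CHAIN_ID               = local.d.L1_CHAIN_ID
    ETHEREUM_HOSTS            = local.ethereum_hosts
  }

  # aztec-node charts (rpc, archive, ...): top-level .Values.env is correct.
  common_inline_values = yamlencode({ env = local.common_env_block })

  # Wrapper charts: nest under the subchart alias so aztec-node actually sees it.
  common_inline_values_validator = yamlencode({ validator = { env = local.common_env_block } })
  common_inline_values_bot       = yamlencode({ bot = { env = local.common_env_block } })
  common_inline_values_prover = yamlencode({
    node   = { env = local.common_env_block }
    broker = { env = local.common_env_block }
    agent  = { env = local.common_env_block }
  })
}
```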
Verified with `helm template config-net-validator aztec-validator` that
REGISTRY_CONTRACT_ADDRESS and ETHEREUM_HOSTS now appear in the
env-from-values ConfigMap data section.
The blob/tx filestore client uses the AWS SDK, which reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY by default. All YAML network files and network-defaults.yml stored credentials as R2_ACCESS_KEY_ID / R2_SECRET_ACCESS_KEY, which were renamed to AWS_* only via an explicit Terraform set block for rpc/full_node/blob_sink -- validators and other charts never got the rename and crashed with CredentialsProviderError at startup.
Rename everywhere to AWS_* so no rename mapping is needed:
- All spartan/environments/networks/*.yml and network-defaults.yml
- load_network_config.sh secret_name_for mapping (key name changes; GCP secret names r2-access-key-id / r2-secret-access-key stay the same)
- main.tf local.d.R2_* references updated to local.d.AWS_*
…ty string
node.env.AWS_ACCESS_KEY_ID and node.env.AWS_SECRET_ACCESS_KEY were being set via Terraform custom_settings for rpc, full_node, and blob_sink as:
try(local.d.AWS_ACCESS_KEY_ID, "")
`local.d` is var.deploy; the AWS credentials live in var.env (the `env:` YAML block), not `deploy:`. So try() always returned "", and the explicit `set` block emitted an inline env: entry on the pod that overrides envFrom. Kubernetes inline env takes precedence over envFrom, so the credentials from the env-from-values ConfigMap (resolved from GCP secrets by the loader) were silently clobbered with empty strings, causing CredentialsProviderError.
The fix: remove the three set blocks. Credentials now flow entirely via the YAML loader's env: block -> env-from-values ConfigMap -> pod envFrom.
Force-pushed c5f59a0 to ec149cc.
…col/aztec-packages into spy/deployment-config
Force-pushed b810ad5 to ab7ed4b.
- Merge source_env_basic.sh into source_network_env.sh; old path is a shim.
- Remove BOOTSTRAP_NODES="asdf" placeholder from broker/agent defaults.
- Drop unused empty-string defaults from _deploy_defaults.
- Fail loudly if mnemonic *_START_INDEX is set under env: instead of deploy:.
Remove ~30 lines of bash pre-computation from deploy_network.sh:
- Drop the 8 *_RESOURCE_PROFILE cascade vars (kept RESOURCE_PROFILE for eth-devnet)
- Drop the P2P_NODEPORT_ENABLED/P2P_PUBLIC_IP if/else block
- Drop PROVER_REAL_PROOFS (was never consumed by Terraform)
- Collapse 4 mnemonic alias args (VALIDATOR/PROVER/BOT/FISHERMAN_MNEMONIC) down to a single LABS_INFRA_MNEMONIC passed through DEPLOY_OVERRIDES

In main.tf, derive the removed values from existing context (see the sketch below):
- p2p_nodeport_enabled / p2p_public_ip from local.is_kind
- Per-release resource profiles via try(local.d.X, is_kind ? "dev" : "prod"), so per-network YAML overrides (e.g. testnet prod-spot) still take effect
- All mnemonic references now use local.d.LABS_INFRA_MNEMONIC directly
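For example, the resource-profile fallback becomes a one-liner per release (key names illustrative; `local.is_kind` is the existing kind-cluster flag):

```hcl
locals {
  # Per-network YAML override wins (e.g. testnet's prod-spot); otherwise pick
  # a profile based on whether we are deploying to a local kind cluster.
  validator_resource_profile = try(local.d.VALIDATOR_RESOURCE_PROFILE, local.is_kind ? "dev" : "prod")
  prover_resource_profile    = try(local.d.PROVER_RESOURCE_PROFILE, local.is_kind ? "dev" : "prod")
}
```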
SPONSORED_FPC, TEST_ACCOUNTS, and REAL_VERIFIER live under env: in network YAMLs (so pods get them), but deploy-rollup-contracts/main.tf was reading them from var.deploy — falling back to _deploy_defaults (all false/true) instead of the per-network value. Result: contracts deployed without --test-accounts/--sponsored-fpc but validators running with TEST_ACCOUNTS=true/SPONSORED_FPC=true computed a different genesis archive root, putting the network in standby. Fix: prefer var.env for these three flags (env wins, deploy is fallback).
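The resulting precedence in deploy-rollup-contracts/main.tf looks roughly like this (local names and the literal fallbacks are illustrative):

```hcl
locals {
  # env: is what the pods actually run with, so it wins; deploy: (seeded from
  # _deploy_defaults) is only the fallback. Keeps the contract deploy flags and
  # the validators' genesis computation in agreement.
  test_accounts = tobool(try(var.env.TEST_ACCOUNTS, var.deploy.TEST_ACCOUNTS, "false"))
  sponsored_fpc = tobool(try(var.env.SPONSORED_FPC, var.deploy.SPONSORED_FPC, "false"))
  real_verifier = tobool(try(var.env.REAL_VERIFIER, var.deploy.REAL_VERIFIER, "true"))
}
```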
…n.tf b10ff73 accidentally renamed both keys to a non-existent LABS_INFRA_MNEMONIC_START_INDEX (confusing it with the LABS_INFRA_MNEMONIC secret env var, which has no _START_INDEX sibling). The loader still emits the original keys.
alexghr left a comment
I would just replace the python scripts with nodejs for consistency and get rid of yaml's bash-style env var override in favour of overriding from the environment.
# need to source this for CLUSTER
source "./environments/${{ inputs.network }}.env"
# Source for CLUSTER (prefers YAML loader, falls back to legacy .env).
I think the .env files are gone?
@@ -0,0 +1,63 @@
deploy:
  NAMESPACE: "${NAMESPACE:-alpha-net}"
Hm, can this be bash or Nodejs rather than Python? IIUC it's only used to offset mnemonic indices for devnet releases.
Same, I'd make this bash/nodejs.
I would also consider replacing this script and piecemeal env overloading with top-level JS-style overloading e.g. {...valuesWeGotFromYaml, ...valuesWeGotFromEnv}
next PR: we replace this script with external secrets
yeah, thought about doing it but I think this PR is big enough already 😅
network                    = try(local.d.NETWORK, "")
store_snapshot_url         = try(var.env.STORE_SNAPSHOT_URL, "")
blob_file_store_upload_url = try(var.env.BLOB_FILE_STORE_UPLOAD_URL, "")
prover_agent_image_str     = try(local.d.PROVER_AGENT_DOCKER_IMAGE, "")
validator_ha_image_str     = try(local.d.VALIDATOR_HA_DOCKER_IMAGE, "")
otel_endpoint              = try(local.d.OTEL_COLLECTOR_ENDPOINT, "")
rpc_cloud_armor            = try(local.d.RPC_CLOUD_ARMOR_POLICY_NAME, "")
rpc_session_affinity       = try(local.d.RPC_INGRESS_SESSION_AFFINITY, "")
external_bootnodes         = try(local.d.EXTERNAL_BOOTNODES, [])
This is a step in the right direction but eventually I'd want this to become just a pass-through: take a YAML string and TF just pass it on to helm
…override model
- Replace expand_placeholders.py, apply_derived.py, resolve_secrets.py with TS equivalents
- Remove `${VAR:-default}` placeholder syntax from all network YAMLs; values are now plain literals
- Add env spread in apply_derived.ts: shell env wins for any key in deploy:/env: blocks
- Add derived: YAML sections for computed values (devnet ingress/bucket paths, TX collection URLs)
- apply_derived.ts is now generic -- no network-specific logic hardcoded in TS
Why
Pre-PR, deployment config was scattered across:
- `spartan/environments/<network>.env` (one `.env` file per network, sourced as bash)
- `spartan/terraform/deploy-aztec-infra/variables.tf` (~944 lines of `variable "FOO_BAR"` declarations + defaults)
- `spartan/scripts/setup_gcp_secrets.sh` (case-by-case `REPLACE_WITH_GCP_SECRET` resolution)
- `main.tf` to translate `var.X` → Helm `set` block → pod env

Adding or changing a single setting often touched 3 of those files. There was no single place to look up "what does network X actually deploy with".
What changes
- `spartan/environments/network-defaults.yml` defines `_deploy_defaults`, `_release_defaults`, and `networks.<name>.env`. Per-network YAMLs at `spartan/environments/networks/<name>.yml` declare three blocks: `deploy:` (consumed by deploy script + Terraform `var.deploy`), `env:` (pod env baseline), and per-release blocks (mirrors Helm chart shape, e.g. `validator: { replicaCount, env }`, `prover: { node, broker, agent }`).
- `.env` files removed. The 16 `spartan/environments/*.env` files are replaced by `spartan/environments/networks/*.yml`.
- The loader (`scripts/load_network_config.sh`) deep-merges defaults + preset env baseline + per-network YAML, expands `${VAR}` placeholders, applies derived values (devnet's `MNEMONIC_INDEX_OFFSET`), resolves GCP secrets, and emits in `--format=env` / `--format=json` / `--format=tfvars`. The three transform stages live as small Python scripts under `scripts/` (`expand_placeholders.py`, `apply_derived.py`, `resolve_secrets.py`); each reads JSON on stdin, writes JSON on stdout.
- `aztec-node` gains a generic `env.configmap.yaml` built from `.Values.env` and mounted via `envFrom`. Wrapper charts (`aztec-validator`, `aztec-bot`) delegate to the subchart's env via the alias key (`validator.env`, `bot.env`); their redundant per-chart env templates are removed.
- `variables.tf` shrinks from 944 → 22 lines. The module reads three structured inputs, `var.deploy` (any), `var.env` (map), `var.releases` (any), and forwards per-release values via `yamlencode(var.releases[<release>])`. Same shape applied to `deploy-rollup-contracts/`.
- `source_network_env.sh` and `source_env_basic.sh` wrap the loader; existing callers (`bootstrap.sh`, `test_kind.sh`, GHA workflows) keep their CLI signatures.
- `setup_gcp_secrets.sh` is gone; `resolve_secrets.py` does the same job inline.
- CLI/codegen consumers now read network defaults from the `.networks.<name>.env.*` path (CLI `generate.sh`, `noir-contracts/bootstrap.sh`, `l1-contracts/scripts/load_network_defaults.sh`).

Where to look first
- `spartan/environments/network-defaults.yml`: the unified defaults file, including comments at the top describing how it's consumed.
- `spartan/scripts/load_network_config.sh`: the merge pipeline (env baseline → preset → per-network YAML → placeholder expansion → derived values → secret resolution → format emission).
- `spartan/terraform/deploy-aztec-infra/main.tf`: see how `var.deploy` / `var.env` / `var.releases` flow into `helm_release` resources.
- `spartan/environments/networks/devnet.yml` (or `testnet.yml`): examples of a real per-network override.
- `spartan/CLAUDE.md`: updated docs describing the new flow.

Risk / non-obvious behavior
- The `_prodlike` baseline was expanded for consistency. CLI consumers updated to the `.networks.<name>.env.*` path.
- `networks.<name>.env` is the joint codegen + runtime baseline. Editing values there changes both shipped TS artifacts AND deployed pod env; regenerate after changes (`yarn-project/{ethereum,slasher,cli} && yarn generate`; `cd l1-contracts && ./bootstrap.sh`).
- `apply_derived.py` fails loudly if `VALIDATOR_MNEMONIC_START_INDEX` (and friends) appear under `env:` instead of `deploy:`, since the namespace-derived `MNEMONIC_INDEX_OFFSET` shift only applies to the deploy block; silently mis-placing them would cause concurrent devnets to collide on L1 nonces.