High-Level Engineering Standards & Architecture Guide
Version: 1.0 | Last Updated: 2026-01-25 | Author: Guilherme Biff Zarelli
A production-ready, cloud-native platform engineered for enterprises that need to scale fast without sacrificing security, reliability, or developer productivity. Built for teams that refuse to choose between moving quickly and building things right.
Modern enterprises face a fundamental challenge: scaling infrastructure and teams simultaneously while maintaining governance and operational excellence. This platform addresses that with a self-service foundation where developers ship features instead of fighting infrastructure, while platform teams keep control, visibility, and compliance.
| Pillar | What You Get |
|---|---|
| 🚀 Developer Experience | Self-service deployments, automated scaffolding via Developer Portal, zero-friction onboarding. New services go from idea to production in hours, not weeks. |
| 📈 Scalability & Resilience | Multi-region architecture with automatic failover, auto-scaling workloads, and zero-downtime deployments. Built to handle traffic spikes and regional outages gracefully. |
| 🔐 Security & Compliance | Zero-trust networking with mTLS everywhere, centralized identity management, and policy-as-code. Security is built-in, not bolted-on. |
| ⚙️ Operational Excellence | GitOps-driven workflows, unified observability (metrics, logs, traces), and cost transparency per team. Everything is auditable, reproducible, and automated. |
| Aspect | Approach |
|---|---|
| Infrastructure | AWS Multi-Account (HML + Prod), Multi-Region |
| Orchestration | Amazon EKS with GitOps (Argo CD) |
| Service Mesh | Istio with mTLS (zero-trust) |
| Identity | Keycloak (OIDC/OAuth2) |
| Observability | Grafana Stack (Mimir, Loki, Tempo) |
| Developer Portal | Backstage with automated scaffolding |
| Deployment | GitOps with phased multi-region rollout |
This platform enables organizations to accelerate time-to-market while reducing operational overhead. Teams onboard in days, deploy with confidence, and operate with full visibility—allowing the business to grow without infrastructure becoming a bottleneck.
Why this matters: A clear architectural overview enables teams to understand the system's structure at a glance, facilitating faster onboarding, better decision-making, and alignment across all stakeholders on how components interact.
┌─────────────────────────────────────────────────────────────────────────────┐
│ SHARED SERVICES ACCOUNT (helpdev-org-main) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Route53 │ │ CloudFront │ │ ECR │ │ Grafana │ │
│ │ (DNS/LB) │ │ (CDN) │ │ (us-east-1) │ │ Central │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ Cross-Account Access (IAM + Resource Policies) │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌─────────────────────────┼─────────────────────────┐
▼ ▼ ▼
┌───────────────────────┐ ┌───────────────────────┐ ┌───────────────────────┐
│ AWS Account: HML │ │ AWS Account: PROD │ │ AWS Account: PROD │
│ Region: us-east-1 │ │ Region: us-east-1 │ │ Region: sa-east-1 │
│ │ │ (Primary) │ │ (Secondary) │
│ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │
│ │ EKS Cluster │ │ │ │ EKS Cluster │ │ │ │ EKS Cluster │ │
│ │ eks-hml-use1 │ │ │ │ eks-prod-use1 │ │ │ │ eks-prod-sae1 │ │
│ └─────────────────┘ │ │ └─────────────────┘ │ │ └─────────────────┘ │
│ │ │ │ │ │
│ ┌─────┐ ┌─────────┐ │ │ ┌─────┐ ┌─────────┐ │ │ ┌─────┐ ┌─────────┐ │
│ │Istio│ │Keycloak │ │ │ │Istio│ │Keycloak │ │ │ │Istio│ │Keycloak │ │
│ └─────┘ └─────────┘ │ │ └─────┘ └─────────┘ │ │ └─────┘ └─────────┘ │
│ │ │ │ │ │
│ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │
│ │ Observability │ │ │ │ Observability │ │ │ │ Observability │ │
│ │ Mimir/Loki/Tempo│ │ │ │ Mimir/Loki/Tempo│ │ │ │ Mimir/Loki/Tempo│ │
│ └─────────────────┘ │ │ └─────────────────┘ │ │ └─────────────────┘ │
└───────────────────────┘ └───────────────────────┘ └───────────────────────┘
Note: Global services reside in a dedicated Shared Services Account (helpdev-org-main) following AWS multi-account best practices. See ADR-012.
Why this matters: Multi-region architecture ensures business continuity during regional outages, reduces latency for geographically distributed users, and provides true disaster recovery capabilities without complex cross-region dependencies.
Our multi-region architecture follows the Isolated Regions pattern:
┌─────────────────────────────────────────────────────────────────────────────┐
│ ISOLATED REGIONS ARCHITECTURE │
│ │
│ Each region is self-contained. No cross-region service communication. │
│ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ us-east-1 │ │ sa-east-1 │ │
│ │ (Primary) │ ✕ │ (Secondary) │ │
│ │ │ No mesh │ │ │
│ │ ┌───────────────────┐ │ No svc │ ┌───────────────────┐ │ │
│ │ │ VPC + 3 AZs │ │ No data │ │ VPC + 3 AZs │ │ │
│ │ │ EKS Cluster │ │ │ │ EKS Cluster │ │ │
│ │ │ Keycloak + RDS │ │ │ │ Keycloak + RDS │ │ │
│ │ │ Mimir/Loki/Tempo │ │ │ │ Mimir/Loki/Tempo │ │ │
│ │ └───────────────────┘ │ │ └───────────────────┘ │ │
│ └────────────▲────────────┘ └────────────▲────────────┘ │
│ │ │ │
│ └───────────┬───────────────────────┘ │
│ │ │
│ ┌───────▼───────┐ │
│ │ Route53 │ │
│ │ Latency-based │ │
│ │ routing │ │
│ └───────────────┘ │
│ ▲ │
│ │ │
│ End Users │
└─────────────────────────────────────────────────────────────────────────────┘
Benefits:
- True disaster recovery (region failure = automatic failover)
- No cross-region latency for service calls
- Simplified architecture (no mesh federation)
- Independent scaling per region
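The routing behaviour in the diagram above can be modelled in a few lines. This is only a sketch of latency-based routing combined with health checks, not Route53's actual algorithm; the `Region` type, the latency figures, and the `pick_region` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Region:
    name: str
    latency_ms: float  # latency seen from the caller (illustrative value)
    healthy: bool      # outcome of the most recent health check


def pick_region(regions: Sequence[Region]) -> Optional[Region]:
    """Return the healthy region with the lowest latency, or None if all are down."""
    healthy = [r for r in regions if r.healthy]
    return min(healthy, key=lambda r: r.latency_ms, default=None)


# A caller located in South America: sa-east-1 wins while it is healthy.
regions = [
    Region("us-east-1", latency_ms=120.0, healthy=True),
    Region("sa-east-1", latency_ms=35.0, healthy=True),
]
print(pick_region(regions).name)  # sa-east-1
```

Because each region is self-contained, marking one region unhealthy simply removes it from the candidate list; no state has to move between regions for the remaining one to serve traffic.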
📖 Full Documentation: See PRD Section 3 - Solution Architecture for detailed specifications. Related ADRs: ADR-009, ADR-012.
Why this matters: Establishing core principles creates a shared foundation for all engineering decisions, ensuring consistency across teams and reducing technical debt by guiding choices before they become problems.
| Principle | Description |
|---|---|
| GitOps | Git is the single source of truth for all infrastructure and application state |
| Zero-Trust | All service-to-service communication requires mTLS authentication |
| Immutability | Container images are referenced by digest, not mutable tags |
| Isolation | Blast radius minimized through account, cluster, and namespace separation |
| Self-Service | Developers can create and manage services through Backstage portal |
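As an illustration of the Immutability principle, the check below tells a digest-pinned image reference apart from a tag-based one. It is a minimal sketch that assumes SHA-256 digests; the registry host `registry.example.com` and the function name are hypothetical.

```python
import re

# A digest-pinned reference ends with "@sha256:" followed by 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True when the image reference is pinned by digest rather than by tag."""
    return bool(_DIGEST_RE.search(image_ref))


pinned = "registry.example.com/payment-api@sha256:" + "a" * 64
tagged = "registry.example.com/payment-api:v1.2.3"

print(is_digest_pinned(pinned))  # True
print(is_digest_pinned(tagged))  # False
```

A check like this could run in CI against rendered manifests so that mutable tags never reach a cluster.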
Why this matters: A zero-trust security model protects against both external threats and internal lateral movement, ensuring that every request is authenticated and authorized regardless of its origin within the network.
┌─────────────────────────────────────────────────────────────────────────────┐
│ ZERO-TRUST SECURITY │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│ │ Identity │ │ Network │ │ Data │ │
│ │ │ │ │ │ │ │
│ │ • Keycloak OIDC │ │ • Istio mTLS │ │ • Secrets Manager│ │
│ │ • IRSA (AWS) │ │ • Network Policies│ │ • KMS Encryption │ │
│ │ • Service Tokens │ │ • VPC Isolation │ │ • RBAC Controls │ │
│ └──────────────────┘ └──────────────────┘ └──────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Authorization Flow │ │
│ │ │ │
│ │ Service A ──────────────────────────────────────▶ Service B │ │
│ │ │ │ │ │
│ │ │ 1. Get JWT from Keycloak │ │ │
│ │ │ 2. Include token in request │ │ │
│ │ │ 3. Istio validates mTLS + JWT │ │ │
│ │ │ 4. AuthorizationPolicy checks roles │ │ │
│ │ │ 5. Request allowed or denied │ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
📖 Full Documentation: See PRD Section 2 - Architecture Principles for detailed specifications. Related ADRs: ADR-003, ADR-004.
Why this matters: Standardizing on a curated technology stack reduces cognitive overhead, enables shared expertise across teams, simplifies troubleshooting, and ensures all components are battle-tested and well-integrated.
┌─────────────────────────────────────────────────────────────────────────────┐
│ TECHNOLOGY STACK │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Developer Experience │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Backstage │ │ GitHub │ │ GitHub │ │ TechDocs │ │ │
│ │ │ Portal │ │ Repos │ │ Actions │ │ │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ GitOps & Deployment │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Argo CD │ │ Helm │ │ Kustomize│ │ AppSets │ │ │
│ │ │ │ │ Charts │ │ │ │ │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Runtime & Service Mesh │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ EKS │ │ Istio │ │ KEDA │ │ External │ │ │
│ │ │ │ │ mTLS │ │ Autoscaler│ │ Secrets │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Observability │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Grafana │ │ Mimir │ │ Loki │ │ Tempo │ │ │
│ │ │ │ │ Metrics │ │ Logs │ │ Traces │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Identity & Security │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Keycloak │ │ AWS IAM │ │ Secrets │ │ KMS │ │ │
│ │ │ OIDC │ │ IRSA │ │ Manager │ │ │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ API Management │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Kong │ │ OpenAPI │ │ Rate │ │ JWT │ │ │
│ │ │ Gateway │ │ Specs │ │ Limiting │ │ Auth │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ └───────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Why this matters: Pinning component versions ensures reproducible environments, prevents unexpected breaking changes, and allows coordinated upgrades with proper testing and rollback procedures.
| Component | Minimum Version | Purpose |
|---|---|---|
| Kubernetes | 1.29+ | Container orchestration |
| Istio | 1.20+ | Service mesh |
| Argo CD | 2.10+ | GitOps controller |
| Keycloak | 24+ | Identity provider |
| Grafana | 10+ | Dashboards & alerting |
| KEDA | 2.12+ | Event-driven autoscaling |
| Kong | 3.6+ | API Gateway (DB-less mode) |
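The `+` suffix in the table reads as a version floor. A small sketch of how an installed version could be checked against it follows; the `MINIMUM_VERSIONS` mapping mirrors the table, while `parse_version` and `meets_minimum` are hypothetical helpers that handle plain numeric versions only (no pre-release suffixes).

```python
# Minimum versions copied from the table above, stored as integer tuples.
MINIMUM_VERSIONS = {
    "Kubernetes": (1, 29),
    "Istio": (1, 20),
    "Argo CD": (2, 10),
    "Keycloak": (24,),
    "Grafana": (10,),
    "KEDA": (2, 12),
    "Kong": (3, 6),
}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn 'v1.29.3' or '2.10' into a tuple of ints."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def meets_minimum(component: str, installed: str) -> bool:
    """Compare numerically, so that 1.9 is correctly older than 1.29."""
    return parse_version(installed) >= MINIMUM_VERSIONS[component]


print(meets_minimum("Kubernetes", "1.29.3"))  # True
print(meets_minimum("Kubernetes", "1.9.0"))   # False
print(meets_minimum("Keycloak", "24.0.1"))    # True
```

Comparing tuples of integers rather than strings matters here: as strings, "1.9" sorts after "1.29", which would let an outdated cluster pass the check.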
📖 Full Documentation: See PRD Section 4 - Technology Stack for detailed specifications.
Why this matters: A streamlined developer experience accelerates time-to-market, reduces friction in adopting platform standards, and empowers teams to focus on business logic rather than infrastructure concerns.
┌─────────────────────────────────────────────────────────────────────────────┐
│ NEW SERVICE CREATION (< 10 min) │
│ │
│ ┌─────────┐ ┌──────────────────────────────────────────────┐ │
│ │Developer│ │ Backstage Portal │ │
│ └────┬────┘ │ │ │
│ │ │ ┌──────────────────────────────────────┐ │ │
│ │ 1. Fill │ │ Service Template Form │ │ │
│ │ ─────────▶ │ │ │ │ │
│ │ │ │ Name: [payment-api ] │ │ │
│ │ │ │ Owner: [payments-team ▼ ] │ │ │
│ │ │ │ Domain: [payments ▼ ] │ │ │
│ │ │ │ Language: [Java ▼ ] │ │ │
│ │ │ │ Regions: [☑ us-east-1 ☑ sa-east-1] │ │ │
│ │ │ │ │ │ │
│ │ │ │ [ Create Service ] │ │ │
│ │ │ └──────────────────────────────────────┘ │ │
│ │ └──────────────────────────────────────────────┘ │
│ │ │ │
│ │ 2. Automated Creation │
│ │ ▼ │
│ │ ┌──────────────────────────────────────────────┐ │
│ │ │ │ │
│ │ │ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ │ payment-api │ │ payment-api- │ │ │
│ │ │ │ (code repo) │ │ infra (GitOps) │ │ │
│ │ │ └────────────────┘ └────────────────┘ │ │
│ │ │ │ │
│ │ │ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ │ Argo CD App │ │ Keycloak Client│ │ │
│ │ │ │ (multi-region) │ │ (GitOps sync) │ │ │
│ │ │ └────────────────┘ └────────────────┘ │ │
│ │ │ │ │
│ │ └──────────────────────────────────────────────┘ │
│ │ │ │
│ │ 3. Start │ │
│ ◀────┴─ coding! ◀────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Each service follows a standardized two-repository pattern (code + infra) enabling independent versioning and clear ownership boundaries.
📖 Full Documentation: See PRD Section 5 - Repository Structure for detailed specifications.
Why this matters: A GitOps-based deployment pipeline provides auditability, reproducibility, and automatic drift detection, while phased rollouts minimize blast radius and enable safe multi-region deployments.
┌─────────────────────────────────────────────────────────────────────────────┐
│ GITOPS DEPLOYMENT FLOW │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌───────────┐ │
│ │ PR Open │────▶│ Build/Test │────▶│ Label: │────▶│ HML │ │
│ │ │ │ + Scan │ │ deploy-hml │ │ Deploy │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └───────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌───────────┐ │
│ │ Merge │────▶│ Release │────▶│ Phased │────▶│ PROD │ │
│ │ to main │ │ v1.2.3 │ │ Rollout │ │ Deploy │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └───────────┘ │
│ │
│ ═══════════════════════════════════════════════════════════════════════ │
│ │
│ PHASED MULTI-REGION ROLLOUT │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ us-east-1 │ │ sa-east-1 │ │
│ │ │ │ │ │
│ │ 1. Deploy │ 30min delay │ 2. Deploy │ │
│ │ 2. Health ✓ │─────────────────────▶│ 3. Health ✓ │ │
│ │ 3. Monitor │ │ 4. Complete │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
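The phased rollout in the diagram can be sketched as a loop that deploys one region at a time and stops at the first failed health check. The function and its `deploy`, `is_healthy`, and `sleep` parameters are hypothetical; in the platform itself this logic would live in the pipeline tooling, not in application code.

```python
import time
from typing import Callable, Sequence


def phased_rollout(
    regions: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    delay_seconds: int = 30 * 60,  # the 30-minute delay from the diagram
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Deploy region by region; return the regions that deployed and passed health checks."""
    completed: list[str] = []
    for index, region in enumerate(regions):
        if index > 0:
            sleep(delay_seconds)  # observation window before touching the next region
        deploy(region)
        if not is_healthy(region):
            break  # stop here so later regions keep the previous version
        completed.append(region)
    return completed


# Dry run with stubbed side effects (no real deployment, no real waiting).
log: list[str] = []
done = phased_rollout(
    ["us-east-1", "sa-east-1"],
    deploy=log.append,
    is_healthy=lambda region: True,
    sleep=lambda seconds: None,
)
print(done)  # ['us-east-1', 'sa-east-1']
```

Injecting `sleep` keeps the sketch testable; with the default `time.sleep` the loop would really wait 30 minutes between regions.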
📖 Full Documentation: See PRD Section 5 - Repository Structure, Section 6 - Deploy Flow, and Section 4.7 - Developer Portal for detailed specifications. Related ADRs: ADR-001, ADR-007.
Why this matters: Comprehensive observability through metrics, logs, and traces enables rapid incident detection, root cause analysis, and data-driven capacity planning—essential for maintaining reliability at scale.
┌─────────────────────────────────────────────────────────────────────────────┐
│ OBSERVABILITY ARCHITECTURE │
│ │
│ ┌─────────────────┐ │
│ │ Grafana Central │ │
│ │ (prod/use1) │ │
│ └────────┬────────┘ │
│ │ │
│ ┌─────────────────────┼─────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │
│ │ HML/use1 │ │ PROD/use1 │ │ PROD/sae1 │ │
│ │ │ │ │ │ │ │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │
│ │ │ Mimir │ │ │ │ Mimir │ │ │ │ Mimir │ │ │
│ │ │ Metrics │ │ │ │ Metrics │ │ │ │ Metrics │ │ │
│ │ └──────────────┘ │ │ └──────────────┘ │ │ └──────────────┘ │ │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │
│ │ │ Loki │ │ │ │ Loki │ │ │ │ Loki │ │ │
│ │ │ Logs │ │ │ │ Logs │ │ │ │ Logs │ │ │
│ │ └──────────────┘ │ │ └──────────────┘ │ │ └──────────────┘ │ │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │
│ │ │ Tempo │ │ │ │ Tempo │ │ │ │ Tempo │ │ │
│ │ │ Traces │ │ │ │ Traces │ │ │ │ Traces │ │ │
│ │ └──────────────┘ │ │ └──────────────┘ │ │ └──────────────┘ │ │
│ └──────────────────┘ └──────────────────┘ └──────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Data Collection (per cluster) │ │
│ │ │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ OTel Collector │ │ OTel Collector │ │ OTel Collector │ │ │
│ │ │ (Gateway) │ │ (Gateway) │ │ (Gateway) │ │ │
│ │ └───────┬────────┘ └───────┬────────┘ └───────┬────────┘ │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ Services │ │ Services │ │ Services │ │ │
│ │ │ (auto-instr) │ │ (auto-instr) │ │ (auto-instr) │ │ │
│ │ └────────────────┘ └────────────────┘ └────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Why this matters: Purpose-built dashboards reduce mean time to detection (MTTD) by surfacing relevant information to the right audience, enabling faster response and better operational awareness.
| Dashboard | Purpose | Audience |
|---|---|---|
| Service Overview | Health, latency, error rate per service | Dev Teams |
| Infrastructure | Node, pod, resource utilization | Platform Team |
| Istio Mesh | Traffic flow, mTLS status, policies | Platform Team |
| Keycloak | Auth requests, token issuance, failures | Security Team |
| Business SLOs | SLI/SLO tracking, error budgets | All Teams |
📖 Full Documentation: See PRD Section 7 - Observability for detailed specifications. Related ADR: ADR-002.
Why this matters: A service mesh provides consistent security, observability, and traffic management across all services without requiring application code changes, enabling platform-wide policy enforcement.
┌─────────────────────────────────────────────────────────────────────────────┐
│ SERVICE MESH ARCHITECTURE │
│ │
│ External Traffic │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Istio │ │
│ │ Ingress │ │
│ │ Gateway │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Service Mesh │ │
│ │ │ │
│ │ ┌─────────────────────┐ ┌─────────────────────┐ │ │
│ │ │ Service A │ │ Service B │ │ │
│ │ │ ┌───────────────┐ │ mTLS │ ┌───────────────┐ │ │ │
│ │ │ │ App Pod │ │◀──────▶│ │ App Pod │ │ │ │
│ │ │ │ ┌─────────┐ │ │ │ │ ┌─────────┐ │ │ │ │
│ │ │ │ │ Envoy │ │ │ │ │ │ Envoy │ │ │ │ │
│ │ │ │ │ Sidecar │ │ │ │ │ │ Sidecar │ │ │ │ │
│ │ │ │ └─────────┘ │ │ │ │ └─────────┘ │ │ │ │
│ │ │ └───────────────┘ │ │ └───────────────┘ │ │ │
│ │ └─────────────────────┘ └─────────────────────┘ │ │
│ │ │ │
│ │ ┌───────────────────────────────────────────────────────────────┐ │ │
│ │ │ Security Policies │ │ │
│ │ │ │ │ │
│ │ │ PeerAuthentication ──▶ mTLS STRICT (no plaintext) │ │ │
│ │ │ RequestAuthentication ──▶ JWT validation via Keycloak │ │ │
│ │ │ AuthorizationPolicy ──▶ Role-based access control │ │ │
│ │ │ │ │ │
│ │ └───────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
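For reference, the "mTLS STRICT" line in the diagram corresponds to an Istio `PeerAuthentication` resource. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` also accepts. It assumes the mesh root namespace is `istio-system` (Istio's default); the function name is hypothetical, and the platform's real manifests may differ.

```python
import json


def strict_mtls_policy(namespace: str = "istio-system") -> dict:
    """Build a PeerAuthentication manifest that rejects plaintext traffic.

    A policy named "default" in the mesh root namespace applies mesh-wide;
    in any other namespace it applies to that namespace only.
    """
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": "STRICT"}},
    }


print(json.dumps(strict_mtls_policy(), indent=2))
```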
Why this matters: Explicit service-to-service authorization ensures that only permitted services can communicate, preventing unauthorized access and limiting the blast radius of compromised components.
┌─────────────────────────────────────────────────────────────────────────────┐
│ SERVICE AUTHORIZATION FLOW │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Order │ │Keycloak │ │ Istio │ │ Payment │ │
│ │ Service │ │ │ │ Sidecar │ │ Service │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │ │
│ │ 1. Get token │ │ │ │
│ │──────────────▶│ │ │ │
│ │ │ │ │ │
│ │◀──────────────│ │ │ │
│ │ JWT token │ │ │ │
│ │ │ │ │ │
│ │ 2. Call with token │ │ │
│ │───────────────────────────────▶ │ │
│ │ │ │ │
│ │ │ 3. Validate JWT │ │
│ │ │ Check roles │ │
│ │ │ Verify mTLS │ │
│ │ │◀ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ▶│ │
│ │ │ │ │ │
│ │ │ │ 4. Forward │ │
│ │ │ │───────────────▶ │
│ │ │ │ │ │
│ │ │ │◀──────────────│ │
│ │◀──────────────────────────────│ Response │ │
│ │ │ │ │ │
└─────────────────────────────────────────────────────────────────────────────┘
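Step 3 of the flow combines three checks. The sketch below models that decision only; it assumes the JWT signature has already been verified and the claims decoded, and it assumes roles are carried under `realm_access.roles`, which is where Keycloak places realm roles by default. The function name and the role `payments:write` are hypothetical.

```python
from typing import Any, Mapping


def is_request_allowed(
    peer_verified: bool,
    claims: Mapping[str, Any],
    required_role: str,
    now: float,
) -> bool:
    """Model the sidecar decision: verified mTLS peer, unexpired token, required role."""
    if not peer_verified:            # the mTLS handshake did not verify the caller
        return False
    if claims.get("exp", 0) <= now:  # token is expired or has no expiry claim
        return False
    roles = claims.get("realm_access", {}).get("roles", [])
    return required_role in roles


claims = {"exp": 2_000, "realm_access": {"roles": ["payments:write"]}}
print(is_request_allowed(True, claims, "payments:write", now=1_000))   # True
print(is_request_allowed(True, claims, "payments:write", now=3_000))   # False (expired)
print(is_request_allowed(False, claims, "payments:write", now=1_000))  # False (no mTLS)
```

In the mesh itself these checks are declarative (`RequestAuthentication` and `AuthorizationPolicy`), so application code never implements them directly.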
📖 Full Documentation: See PRD Section 8 - Service Mesh (Istio) and Section 11 - Security and Governance for detailed specifications. Related ADRs: ADR-004, ADR-005, ADR-008.
Why this matters: A centralized API management layer provides consistent authentication, rate limiting, and documentation across all services, enabling secure external API exposure while maintaining self-service capabilities for development teams.
Kong Gateway serves as the API management layer, running in DB-less mode with all configuration managed via GitOps. Kong instances are organized by business domain (e.g., ecommerce-external, payments-internal), allowing teams to manage their API configurations independently while the platform team maintains infrastructure.
┌─────────────────────────────────────────────────────────────────────────────┐
│ API MANAGEMENT FLOW │
│ │
│ External Traffic Internal Traffic │
│ │ │ │
│ ▼ │ │
│ ┌───────────┐ │ │
│ │ WAF │ │ │
│ │ (Shield) │ │ │
│ └─────┬─────┘ │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Kong External │ │ Kong Internal │ │
│ │ (LoadBalancer) │ │ (ClusterIP) │ │
│ │ │ │ │ │
│ │ • JWT Auth │ │ • Routing │ │
│ │ • Rate Limit │ │ • Load Balance │ │
│ │ • Request Trans │ │ • Plugins │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │
│ └──────────────┬───────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Istio Service Mesh (mTLS) │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ checkout-api│ │ cart-api │ │ payment-api │ ◀── Envoy │ │
│ │ │ + sidecar │ │ + sidecar │ │ + sidecar │ Sidecars │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ GitOps Configuration │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │ │
│ │ │ Backstage │───▶│platform-apis│───▶│ Kong Runtime (DB-less) │ │ │
│ │ │ (Portal) │ │ (Git Repo) │ │ Config via ConfigMap │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
| Capability | Description |
|---|---|
| External API Exposure | Public APIs with WAF integration, rate limiting, and JWT authentication via Keycloak |
| Internal API Routing | Service-to-service communication within and across domains |
| Centralized OpenAPI Specs | API documentation automatically published to Developer Portal (Backstage) |
| Self-Service Configuration | Teams manage routes and plugins via PRs to platform-apis repository |
| Type | Network | Use Case |
|---|---|---|
| Internal | ClusterIP | APIs consumed within the cluster |
| External | LoadBalancer + WAF | APIs exposed to the internet |
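To make the DB-less model concrete, the sketch below generates a minimal Kong declarative configuration for one externally exposed service. The upstream URL, route path, and rate limit are hypothetical values, and the helper function is not part of the platform; a real configuration using the `jwt` plugin would also need consumer credentials, which are omitted here.

```python
import json


def kong_service_config(
    name: str, upstream_url: str, path: str, requests_per_minute: int
) -> dict:
    """Build a declarative (DB-less) Kong config with JWT auth and rate limiting."""
    return {
        "_format_version": "3.0",
        "services": [
            {
                "name": name,
                "url": upstream_url,
                "routes": [{"name": f"{name}-route", "paths": [path]}],
                "plugins": [
                    {"name": "jwt"},
                    {
                        "name": "rate-limiting",
                        # "local" keeps counters in memory on each Kong node
                        "config": {"minute": requests_per_minute, "policy": "local"},
                    },
                ],
            }
        ],
    }


config = kong_service_config(
    name="checkout-api",
    upstream_url="http://checkout-api.checkout.svc.cluster.local:8080",  # hypothetical
    path="/checkout",
    requests_per_minute=600,
)
print(json.dumps(config, indent=2))
```

Because the whole gateway state is a single document, a route change is an ordinary pull request and a rollback is a revert.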
📚 Deep Dive: See PRD Section 10 - API Management and ADR-013: Kong API Gateway for complete details on repository structure, configuration patterns, and Backstage integration.
Why this matters: Infrastructure as Code ensures reproducible, auditable, and version-controlled infrastructure changes, eliminating configuration drift and enabling disaster recovery through complete environment reconstruction.
┌─────────────────────────────────────────────────────────────────────────────┐
│ PLATFORM REPOSITORIES │
│ │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ Managed by Platform Team │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │ │
│ │ │platform-terraform│ │platform-argocd- │ │platform-helm- │ │ │
│ │ │ │ │config │ │charts │ │ │
│ │ │ • VPC/EKS/RDS │ │ • App-of-Apps │ │ • service-base │ │ │
│ │ │ • IAM/KMS │ │ • ApplicationSets│ │ • cronjob-base │ │ │
│ │ │ • Per region │ │ • Per cluster │ │ • Shared helpers │ │ │
│ │ └──────────────────┘ └──────────────────┘ └──────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │ │
│ │ │platform- │ │platform-keycloak │ │platform-pipelines│ │ │
│ │ │observability │ │ │ │ │ │ │
│ │ │ • Grafana stack │ │ • Realm configs │ │ • CI/CD templates│ │ │
│ │ │ • Dashboards │ │ • Clients (JSON) │ │ • Actions │ │ │
│ │ │ • Alerts │ │ • Per account │ │ • Phased rollout │ │ │
│ │ └──────────────────┘ └──────────────────┘ └──────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────────────────┐ │
│ │ Managed by Application Teams │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ │ │
│ │ │ service-xyz │ │ service-xyz-infra│ (created via │ │
│ │ │ │ │ │ Backstage) │ │
│ │ │ • Source code │ │ • Helm values │ │ │
│ │ │ • Dockerfile │ │ • Istio policies │ │ │
│ │ │ • CI pipeline │ │ • Per region │ │ │
│ │ └──────────────────┘ └──────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Why this matters: A consistent environment structure across HML and production ensures parity, reduces environment-specific bugs, and enables predictable deployments with region-specific configurations.
environments/
├── hml/
│   └── us-east-1/
│       ├── values.yaml          # HML-specific config
│       └── istio/               # Istio policies
│           └── authorization-policy.yaml
└── prod/
    ├── us-east-1/               # Primary region
    │   ├── values.yaml
    │   └── istio/
    └── sa-east-1/               # Secondary region
        ├── values.yaml
        └── istio/
📖 Full Documentation: See PRD Section 4.1 - Infrastructure and Section 5 - Repository Structure for detailed specifications. Related ADR: ADR-001.
Why this matters: Defining clear KPIs and SLOs establishes measurable reliability targets, enables objective decision-making around technical investments, and creates accountability for platform health.
| Metric | Target |
|---|---|
| Deployment Frequency | ≥ 10 deployments per day per service |
| Lead Time for Changes | < 1 hour |
| Mean Time to Recovery (MTTR) | < 30 minutes |
| Change Failure Rate | < 5% |
| Platform Uptime | 99.9% |
| Region Failover Time | < 5 minutes |
Why this matters: Tiered service classification allows appropriate resource allocation, sets realistic expectations per service criticality, and prevents over-engineering of non-critical components.
| Tier | Availability | Error Rate | Latency P99 | Examples |
|---|---|---|---|---|
| Tier 1 | 99.99% | < 0.01% | < 100ms | Payment, Auth |
| Tier 2 | 99.9% | < 0.1% | < 500ms | Orders, Users |
| Tier 3 | 99.5% | < 1% | < 2s | Reports, Analytics |
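An availability target translates directly into a downtime budget. The sketch below does that arithmetic for the three tiers, assuming a 30-day measurement window (the window length is an assumption; the source does not state one).

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Downtime budget in minutes for a given availability target and window."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for tier, target in [("Tier 1", 99.99), ("Tier 2", 99.9), ("Tier 3", 99.5)]:
    print(f"{tier}: {allowed_downtime_minutes(target):.2f} min per 30 days")
# Tier 1: 4.32 min per 30 days
# Tier 2: 43.20 min per 30 days
# Tier 3: 216.00 min per 30 days
```

The Tier 1 budget of roughly four minutes a month leaves little room for manual intervention, which is why that tier depends on automated failover.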
📖 Full Documentation: See PRD Section 15 - Success Metrics for detailed specifications.
Why this matters: Designing for resilience ensures the platform can withstand and recover from failures gracefully, protecting revenue and user trust during inevitable infrastructure incidents.
| Scenario | RTO | Strategy |
|---|---|---|
| Pod crash | < 30s | Kubernetes auto-restart |
| Node failure | < 5min | Pod rescheduling + PDB |
| AZ failure | < 15min | Multi-AZ node groups |
| Region failure | < 5min | Route53 failover |
Why this matters: Explicit HA configuration ensures services remain available during maintenance, deployments, and partial failures by distributing workloads and maintaining minimum healthy replicas.
┌─────────────────────────────────────────────────────────────────────────────┐
│ HIGH AVAILABILITY DESIGN │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Per Service │ │
│ │ │ │
│ │ • replicas: 3 (minimum for production) │ │
│ │ • PodDisruptionBudget: minAvailable: 2 │ │
│ │ • TopologySpreadConstraints: spread across AZs │ │
│ │ • Liveness/Readiness probes configured │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Per Cluster/Region │ │
│ │ │ │
│ │ • 3 Availability Zones │ │
│ │ • Node groups spread across AZs │ │
│ │ • RDS Multi-AZ (for Keycloak) │ │
│ │ • S3 cross-region replication (for observability) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Global │ │
│ │ │ │
│ │ • Route53 latency-based routing │ │
│ │ • Route53 health checks with failover │ │
│ │ • ECR in primary region (cross-region pull) │ │
│ │ • Keycloak config sync via GitOps │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
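The per-service numbers in the diagram interact: with 3 replicas and `minAvailable: 2`, only one pod may be evicted voluntarily at a time. The sketch below works that out and shows an even spread across zones; both helpers are hypothetical, and the even distribution assumes a topology spread constraint with a maximum skew of 1, which the source does not specify.

```python
def voluntary_disruptions_allowed(healthy_replicas: int, min_available: int) -> int:
    """Pods that may be evicted at once without violating the PodDisruptionBudget."""
    return max(healthy_replicas - min_available, 0)


def spread_across_zones(replicas: int, zones: int) -> list[int]:
    """Pods per zone when replicas are spread as evenly as possible."""
    base, extra = divmod(replicas, zones)
    return [base + (1 if i < extra else 0) for i in range(zones)]


print(voluntary_disruptions_allowed(3, 2))  # 1
print(spread_across_zones(3, 3))            # [1, 1, 1]
print(spread_across_zones(4, 3))            # [2, 1, 1]
```

With one pod per zone, losing a single AZ leaves exactly the two pods the disruption budget requires, so a node drain in another zone would be blocked until the lost pod is rescheduled.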
📖 Full Documentation: See PRD Section 12 - Resilience and High Availability for detailed specifications. Related ADRs: ADR-006, ADR-009.
Why this matters: Self-service resource provisioning eliminates bottlenecks on platform teams, accelerates development velocity, and ensures resources are created with consistent security and compliance standards.
Application teams can provision AWS resources (SQS, RDS, S3, etc.) via self-service in Backstage.
┌─────────────────────────────────────────────────────────────────────────────┐
│ APPLICATION RESOURCE PROVISIONING │
│ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Backstage Portal │ │
│ │ │ │
│ │ Developer fills form: │ │
│ │ • Service: payment-api │ │
│ │ • Type: SQS │ │
│ │ • Name: order-events │ │
│ │ • Regions: us-east-1, sa-east-1 │ │
│ │ │ │
│ └────────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ GitHub Pull Request │ │
│ │ │ │
│ │ platform-terraform-aws-projects-resources/ │ │
│ │ └── helpdev-prod/prod/us-east-1/sqs/payment-api/order-events.tf │ │
│ │ │ │
│ │ Uses versioned module: │ │
│ │ source = "...platform-terraform-aws-modules//modules/sqs?ref=v1.2.0" │ │
│ │ │ │
│ └────────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ┌──────────────┴──────────────┐ │
│ │ Approval │ │
│ │ HML: 1 team member │ │
│ │ PROD: 2 (+ Platform Team) │ │
│ └──────────────┬──────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ GitHub Actions │ │
│ │ │ │
│ │ terraform plan → terraform apply → Store outputs in Secrets Manager │ │
│ │ │ │
│ └────────────────────────────────┬──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Service Consumption │ │
│ │ │ │
│ │ AWS Secrets Manager → External Secrets Operator → K8s Secret → Pod │ │
│ │ │ │
│ │ env: │ │
│ │ - name: SQS_ORDER_EVENTS_URL │ │
│ │ valueFrom: │ │
│ │ secretKeyRef: │ │
│ │ name: payment-api-resources │ │
│ │ key: sqs-order-events-url │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
| Resource | Module | Use Cases |
|---|---|---|
| SQS | sqs/ | Message queues, DLQ |
| SNS | sns/ | Pub/sub, notifications |
| RDS | rds/ | PostgreSQL, MySQL databases |
| S3 | s3/ | Object storage, backups |
| DynamoDB | dynamodb/ | NoSQL, sessions |
| ElastiCache | elasticache/ | Redis, Memcached |
Resources are organized in versioned Terraform modules, with per-service definitions following an account/env/region hierarchy.
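The file layout and secret naming shown in the provisioning diagram can be expressed as small helpers. This is a sketch generalized from the single `order-events` example in the diagram; the function names are hypothetical, and the `{type}-{name}-url` key convention is inferred from that one example, not a documented rule.

```python
from pathlib import PurePosixPath


def resource_file(
    account: str, env: str, region: str, resource_type: str, service: str, name: str
) -> PurePosixPath:
    """Path of a resource definition inside the resources repository."""
    return PurePosixPath(account, env, region, resource_type, service, f"{name}.tf")


def secret_key(resource_type: str, name: str, output: str = "url") -> str:
    """Key under which a Terraform output is stored in the service's secret."""
    return f"{resource_type}-{name}-{output}"


def env_var_name(key: str) -> str:
    """Environment variable name derived from a secret key."""
    return key.upper().replace("-", "_")


path = resource_file("helpdev-prod", "prod", "us-east-1", "sqs", "payment-api", "order-events")
key = secret_key("sqs", "order-events")

print(path)               # helpdev-prod/prod/us-east-1/sqs/payment-api/order-events.tf
print(key)                # sqs-order-events-url
print(env_var_name(key))  # SQS_ORDER_EVENTS_URL
```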
📖 Full Documentation: See PRD Section 5 - Repository Structure and Section 4.7 - Developer Portal for detailed specifications. Related ADR: ADR-011.
Why this matters: ADRs capture the context and rationale behind architectural choices, preventing repeated discussions, enabling informed future decisions, and preserving institutional knowledge as team members change.
Key architectural decisions are documented as ADRs:
| ADR | Decision |
|---|---|
| ADR-001 | Distributed Argo CD (per cluster) |
| ADR-002 | Federated observability with central visualization |
| ADR-003 | Container images by digest |
| ADR-004 | Istio for service mesh |
| ADR-005 | Keycloak as identity provider |
| ADR-006 | Keycloak per environment |
| ADR-007 | Backstage as developer portal |
| ADR-008 | AWS Secrets Manager + External Secrets |
| ADR-009 | Isolated multi-region architecture |
| ADR-010 | Keycloak GitOps sync strategy |
| ADR-011 | Application Resources via Self-Service |
| ADR-012 | Dedicated Shared Services Account |
| ADR-013 | Kong API Gateway |
📖 Full Documentation: See PRD Section 16 - Architecture Decisions (ADRs) for detailed specifications.
Why this matters: Consistent naming conventions and quick reference materials reduce cognitive load, enable automation, and ensure resources can be easily identified and managed across environments and teams.
| Resource | Pattern | Example |
|---|---|---|
| Cluster | eks-{env}-{region_short} | eks-prod-use1 |
| Namespace | {service-name} | payment-api |
| Repository | service-{name} / service-{name}-infra | service-payment-api |
| Secret Path | helpdev/{env}/{region}/{domain}/{service}/{type} | helpdev/prod/us-east-1/payments/checkout/database |
| AWS Region | Short Code |
|---|---|
| us-east-1 | use1 |
| sa-east-1 | sae1 |
| eu-west-1 | euw1 |
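The naming patterns and region short codes above lend themselves to small helper functions, which keeps generated names consistent across tooling. The helpers below are a hypothetical sketch that applies the documented patterns directly.

```python
# Region short codes from the table above.
REGION_SHORT_CODES = {
    "us-east-1": "use1",
    "sa-east-1": "sae1",
    "eu-west-1": "euw1",
}


def cluster_name(env: str, region: str) -> str:
    """eks-{env}-{region_short}"""
    return f"eks-{env}-{REGION_SHORT_CODES[region]}"


def repositories(name: str) -> tuple[str, str]:
    """service-{name} / service-{name}-infra"""
    return (f"service-{name}", f"service-{name}-infra")


def secret_path(env: str, region: str, domain: str, service: str, secret_type: str) -> str:
    """helpdev/{env}/{region}/{domain}/{service}/{type}"""
    return f"helpdev/{env}/{region}/{domain}/{service}/{secret_type}"


print(cluster_name("prod", "us-east-1"))  # eks-prod-use1
print(repositories("payment-api"))        # ('service-payment-api', 'service-payment-api-infra')
print(secret_path("prod", "us-east-1", "payments", "checkout", "database"))
# helpdev/prod/us-east-1/payments/checkout/database
```

An unknown region raises `KeyError`, so a new region has to be added to the short-code table before any names can be generated for it.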
Full Technical Specification: See PRD-kubernetes-multi-account-aws.md