HelpDev'Ops Enterprise Infrastructure Stack - Architecture Overview

High-Level Engineering Standards & Architecture Guide

Version: 1.0 | Last Updated: 2026-01-25 | Author: Guilherme Biff Zarelli

Executive Summary

A production-ready, cloud-native platform engineered for enterprises that need to scale fast without sacrificing security, reliability, or developer productivity. Built for teams that refuse to choose between moving quickly and building things right.

Why This Platform Exists

Modern enterprises face a fundamental challenge: scaling infrastructure and teams simultaneously while maintaining governance and operational excellence. This platform addresses that by providing a self-service foundation where developers ship features instead of fighting infrastructure, while platform teams maintain control, visibility, and compliance.

Four Pillars

Pillar	What You Get
🚀 Developer Experience	Self-service deployments, automated scaffolding via Developer Portal, zero-friction onboarding. New services go from idea to production in hours, not weeks.
📈 Scalability & Resilience	Multi-region architecture with automatic failover, auto-scaling workloads, and zero-downtime deployments. Built to handle traffic spikes and regional outages gracefully.
🔐 Security & Compliance	Zero-trust networking with mTLS everywhere, centralized identity management, and policy-as-code. Security is built-in, not bolted-on.
⚙️ Operational Excellence	GitOps-driven workflows, unified observability (metrics, logs, traces), and cost transparency per team. Everything is auditable, reproducible, and automated.

Technology Foundation

Aspect	Approach
Infrastructure	AWS Multi-Account (HML + Prod), Multi-Region
Orchestration	Amazon EKS with GitOps (Argo CD)
Service Mesh	Istio with mTLS (zero-trust)
Identity	Keycloak (OIDC/OAuth2)
Observability	Grafana Stack (Mimir, Loki, Tempo)
Developer Portal	Backstage with automated scaffolding
Deployment	GitOps with phased multi-region rollout

Business Impact

This platform enables organizations to accelerate time-to-market while reducing operational overhead. Teams onboard in days, deploy with confidence, and operate with full visibility—allowing the business to grow without infrastructure becoming a bottleneck.

1. Architecture Overview

Why this matters: A clear architectural overview enables teams to understand the system's structure at a glance, facilitating faster onboarding, better decision-making, and alignment across all stakeholders on how components interact.

1.1 High-Level Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                 SHARED SERVICES ACCOUNT (helpdev-org-main)                  │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │  Route53    │  │ CloudFront  │  │    ECR      │  │  Grafana    │         │
│  │ (DNS/LB)    │  │   (CDN)     │  │ (us-east-1) │  │  Central    │         │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘         │
│                          Cross-Account Access (IAM + Resource Policies)      │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
            ┌─────────────────────────┼─────────────────────────┐
            ▼                         ▼                         ▼
┌───────────────────────┐  ┌───────────────────────┐  ┌───────────────────────┐
│   AWS Account: HML    │  │  AWS Account: PROD    │  │  AWS Account: PROD    │
│   Region: us-east-1   │  │  Region: us-east-1    │  │  Region: sa-east-1    │
│                       │  │      (Primary)        │  │     (Secondary)       │
│  ┌─────────────────┐  │  │  ┌─────────────────┐  │  │  ┌─────────────────┐  │
│  │  EKS Cluster    │  │  │  │  EKS Cluster    │  │  │  │  EKS Cluster    │  │
│  │  eks-hml-use1   │  │  │  │  eks-prod-use1  │  │  │  │  eks-prod-sae1  │  │
│  └─────────────────┘  │  │  └─────────────────┘  │  │  └─────────────────┘  │
│                       │  │                       │  │                       │
│  ┌─────┐ ┌─────────┐  │  │  ┌─────┐ ┌─────────┐  │  │  ┌─────┐ ┌─────────┐  │
│  │Istio│ │Keycloak │  │  │  │Istio│ │Keycloak │  │  │  │Istio│ │Keycloak │  │
│  └─────┘ └─────────┘  │  │  └─────┘ └─────────┘  │  │  └─────┘ └─────────┘  │
│                       │  │                       │  │                       │
│  ┌─────────────────┐  │  │  ┌─────────────────┐  │  │  ┌─────────────────┐  │
│  │  Observability  │  │  │  │  Observability  │  │  │  │  Observability  │  │
│  │ Mimir/Loki/Tempo│  │  │  │ Mimir/Loki/Tempo│  │  │  │ Mimir/Loki/Tempo│  │
│  └─────────────────┘  │  │  └─────────────────┘  │  │  └─────────────────┘  │
└───────────────────────┘  └───────────────────────┘  └───────────────────────┘

Note: Global services reside in a dedicated Shared Services Account (helpdev-org-main) following AWS multi-account best practices. See ADR-012.

1.2 Multi-Region Strategy

Why this matters: Multi-region architecture ensures business continuity during regional outages, reduces latency for geographically distributed users, and provides true disaster recovery capabilities without complex cross-region dependencies.

Our multi-region architecture follows the Isolated Regions pattern:

┌─────────────────────────────────────────────────────────────────────────────┐
│                         ISOLATED REGIONS ARCHITECTURE                       │
│                                                                             │
│    Each region is self-contained. No cross-region service communication.    │
│                                                                             │
│   ┌─────────────────────────┐         ┌─────────────────────────┐           │
│   │     us-east-1           │         │     sa-east-1           │           │
│   │     (Primary)           │    ✕    │     (Secondary)         │           │
│   │                         │ No mesh │                         │           │
│   │  ┌───────────────────┐  │ No svc  │  ┌───────────────────┐  │           │
│   │  │ VPC + 3 AZs       │  │ No data │  │ VPC + 3 AZs       │  │           │
│   │  │ EKS Cluster       │  │         │  │ EKS Cluster       │  │           │
│   │  │ Keycloak + RDS    │  │         │  │ Keycloak + RDS    │  │           │
│   │  │ Mimir/Loki/Tempo  │  │         │  │ Mimir/Loki/Tempo  │  │           │
│   │  └───────────────────┘  │         │  └───────────────────┘  │           │
│   └────────────▲────────────┘         └────────────▲────────────┘           │
│                │                                   │                        │
│                └───────────┬───────────────────────┘                        │
│                            │                                                │
│                    ┌───────▼───────┐                                        │
│                    │   Route53     │                                        │
│                    │ Latency-based │                                        │
│                    │   routing     │                                        │
│                    └───────────────┘                                        │
│                            ▲                                                │
│                            │                                                │
│                       End Users                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Benefits:

True disaster recovery (region failure = automatic failover)
No cross-region latency for service calls
Simplified architecture (no mesh federation)
Independent scaling per region
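
The routing decision behind these benefits can be sketched as follows. This is a minimal model, not Route53 itself: it assumes each regional record has a health check attached (the diagram only names latency-based routing), and the latency figures are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegionEndpoint:
    region: str
    latency_ms: float  # latency as seen from the client (illustrative value)
    healthy: bool      # result of the most recent health check (assumed to exist)


def pick_region(endpoints: list[RegionEndpoint]) -> Optional[str]:
    """Return the lowest-latency healthy region, or None if every region is down."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.latency_ms).region


# A client close to sa-east-1 is normally served from that region.
normal = [
    RegionEndpoint("us-east-1", latency_ms=120.0, healthy=True),
    RegionEndpoint("sa-east-1", latency_ms=15.0, healthy=True),
]
# When sa-east-1 fails its health check, the same client is sent to us-east-1.
outage = [
    RegionEndpoint("us-east-1", latency_ms=120.0, healthy=True),
    RegionEndpoint("sa-east-1", latency_ms=15.0, healthy=False),
]
print(pick_region(normal))
print(pick_region(outage))
```

Because the regions share no mesh, services, or data, the fallback region serves the request entirely from its own stack.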

📖 Full Documentation: See PRD Section 3 - Solution Architecture for detailed specifications. Related ADRs: ADR-009, ADR-012.

2. Core Principles

Why this matters: Establishing core principles creates a shared foundation for all engineering decisions, ensuring consistency across teams and reducing technical debt by guiding choices before they become problems.

2.1 Engineering Principles

Principle	Description
GitOps	Git is the single source of truth for all infrastructure and application state
Zero-Trust	All service-to-service communication requires mTLS authentication
Immutability	Container images are referenced by digest, not mutable tags
Isolation	Blast radius minimized through account, cluster, and namespace separation
Self-Service	Developers can create and manage services through Backstage portal
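
As an illustration of the Immutability principle, the check below flags image references that are not pinned by digest. It is a sketch that could run in CI or a pre-merge hook; the registry host and repository names are hypothetical.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref: str) -> bool:
    """True if the image reference is pinned to an immutable sha256 digest."""
    return _DIGEST_RE.search(image_ref) is not None


def unpinned_images(image_refs: list[str]) -> list[str]:
    """Return the references that rely on a mutable tag (or no tag at all)."""
    return [ref for ref in image_refs if not is_pinned_by_digest(ref)]


# Hypothetical registry and repository names, for illustration only.
refs = [
    "registry.example.com/payment-api@sha256:" + "a" * 64,
    "registry.example.com/payment-api:latest",
    "registry.example.com/order-api:v1.2.3",
]
print(unpinned_images(refs))
```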

2.2 Security Model

Why this matters: A zero-trust security model protects against both external threats and internal lateral movement, ensuring that every request is authenticated and authorized regardless of its origin within the network.

┌─────────────────────────────────────────────────────────────────────────────┐
│                            ZERO-TRUST SECURITY                              │
│                                                                             │
│  ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐       │
│  │    Identity      │    │    Network       │    │    Data          │       │
│  │                  │    │                  │    │                  │       │
│  │ • Keycloak OIDC  │    │ • Istio mTLS     │    │ • Secrets Manager│       │
│  │ • IRSA (AWS)     │    │ • NetworkPolicies│    │ • KMS Encryption │       │
│  │ • Service Tokens │    │ • VPC Isolation  │    │ • RBAC Controls  │       │
│  └──────────────────┘    └──────────────────┘    └──────────────────┘       │
│                                                                             │
│  ┌──────────────────────────────────────────────────────────────────┐       │
│  │                    Authorization Flow                            │       │
│  │                                                                  │       │
│  │   Service A ──────────────────────────────────────▶ Service B    │       │
│  │       │                                                 │        │       │
│  │       │ 1. Get JWT from Keycloak                        │        │       │
│  │       │ 2. Include token in request                     │        │       │
│  │       │ 3. Istio validates mTLS + JWT                   │        │       │
│  │       │ 4. AuthorizationPolicy checks roles             │        │       │
│  │       │ 5. Request allowed or denied                    │        │       │
│  └──────────────────────────────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────────────────────────┘
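
The decision made in steps 3 to 5 of the authorization flow can be modelled as a short chain of checks. This is a simplified sketch, not how Istio implements it: the `Request` type, its fields, and the role name are hypothetical, and real enforcement happens in the Envoy sidecar.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Request:
    peer_mtls_verified: bool  # the caller presented a valid mesh certificate
    jwt_valid: bool           # token signature, issuer, and expiry checked out
    roles: frozenset[str] = field(default_factory=frozenset)  # roles from the JWT


def authorize(request: Request, required_role: str) -> tuple[bool, str]:
    """Apply the checks in order and return (allowed, reason)."""
    if not request.peer_mtls_verified:
        return False, "mTLS verification failed"
    if not request.jwt_valid:
        return False, "missing or invalid JWT"
    if required_role not in request.roles:
        return False, f"role '{required_role}' not granted"
    return True, "allowed"


# Hypothetical role name, for illustration only.
ok = Request(peer_mtls_verified=True, jwt_valid=True, roles=frozenset({"payments-caller"}))
print(authorize(ok, "payments-caller"))
```

Every check must pass; a valid token presented over an unauthenticated connection is still denied.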

📖 Full Documentation: See PRD Section 2 - Architecture Principles for detailed specifications. Related ADRs: ADR-003, ADR-004.

3. Technology Stack

Why this matters: Standardizing on a curated technology stack reduces cognitive overhead, enables shared expertise across teams, simplifies troubleshooting, and ensures all components are battle-tested and well-integrated.

3.1 Platform Components

┌─────────────────────────────────────────────────────────────────────────────┐
│                           TECHNOLOGY STACK                                  │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                        Developer Experience                         │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐         │    │
│  │  │ Backstage │  │  GitHub   │  │  GitHub   │  │ TechDocs  │         │    │
│  │  │  Portal   │  │   Repos   │  │  Actions  │  │           │         │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └───────────┘         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         GitOps & Deployment                         │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐         │    │
│  │  │  Argo CD  │  │   Helm    │  │  Kustomize│  │ AppSets   │         │    │
│  │  │           │  │  Charts   │  │           │  │           │         │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └───────────┘         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Runtime & Service Mesh                         │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐         │    │
│  │  │    EKS    │  │   Istio   │  │   KEDA    │  │  External │         │    │
│  │  │           │  │   mTLS    │  │ Autoscaler│  │  Secrets  │         │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └───────────┘         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         Observability                               │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐         │    │
│  │  │  Grafana  │  │   Mimir   │  │   Loki    │  │   Tempo   │         │    │
│  │  │           │  │  Metrics  │  │   Logs    │  │  Traces   │         │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └───────────┘         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Identity & Security                            │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐         │    │
│  │  │ Keycloak  │  │  AWS IAM  │  │  Secrets  │  │    KMS    │         │    │
│  │  │   OIDC    │  │   IRSA    │  │  Manager  │  │           │         │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └───────────┘         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                       API Management                                │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐         │    │
│  │  │   Kong    │  │  OpenAPI  │  │   Rate    │  │    JWT    │         │    │
│  │  │  Gateway  │  │   Specs   │  │  Limiting │  │   Auth    │         │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └───────────┘         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

3.2 Version Requirements

Why this matters: Pinning component versions ensures reproducible environments, prevents unexpected breaking changes, and allows coordinated upgrades with proper testing and rollback procedures. The versions listed below are the minimum supported releases.

Component	Minimum Version	Purpose
Kubernetes	1.29+	Container orchestration
Istio	1.20+	Service mesh
Argo CD	2.10+	GitOps controller
Keycloak	24+	Identity provider
Grafana	10+	Dashboards & alerting
KEDA	2.12+	Event-driven autoscaling
Kong	3.6+	API Gateway (DB-less mode)
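
A cluster audit against these minimums could look like the sketch below. The minimums are copied from the table; the parsing is deliberately simple and assumes plain dotted version strings such as "1.29.3" (suffixes like "-rc1" are not handled).

```python
# Minimum versions from the table above, as integer tuples.
MINIMUM_VERSIONS = {
    "Kubernetes": (1, 29),
    "Istio": (1, 20),
    "Argo CD": (2, 10),
    "Keycloak": (24,),
    "Grafana": (10,),
    "KEDA": (2, 12),
    "Kong": (3, 6),
}


def parse_version(text: str) -> tuple[int, ...]:
    """Parse 'v1.29.3' or '1.29' into a tuple of integers."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def meets_minimum(component: str, installed: str) -> bool:
    """Compare only as many version parts as the minimum specifies."""
    minimum = MINIMUM_VERSIONS[component]
    return parse_version(installed)[: len(minimum)] >= minimum


print(meets_minimum("Kubernetes", "1.29.3"))
print(meets_minimum("Istio", "1.9.0"))
```

Comparing integer tuples rather than strings matters here: as text, "1.9" sorts after "1.20", which would wrongly accept an older Istio release.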

📖 Full Documentation: See PRD Section 4 - Technology Stack for detailed specifications.

4. Developer Experience

Why this matters: A streamlined developer experience accelerates time-to-market, reduces friction in adopting platform standards, and empowers teams to focus on business logic rather than infrastructure concerns.

4.1 Service Creation Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                      NEW SERVICE CREATION (< 10 min)                         │
│                                                                              │
│   ┌─────────┐         ┌──────────────────────────────────────────────┐      │
│   │Developer│         │              Backstage Portal                │      │
│   └────┬────┘         │                                              │      │
│        │              │  ┌──────────────────────────────────────┐    │      │
│        │  1. Fill     │  │         Service Template Form        │    │      │
│        │  ─────────▶  │  │                                      │    │      │
│        │              │  │  Name:     [payment-api           ]  │    │      │
│        │              │  │  Owner:    [payments-team ▼       ]  │    │      │
│        │              │  │  Domain:   [payments ▼            ]  │    │      │
│        │              │  │  Language: [Java ▼                ]  │    │      │
│        │              │  │  Regions:  [☑ us-east-1 ☑ sa-east-1] │    │      │
│        │              │  │                                      │    │      │
│        │              │  │         [ Create Service ]           │    │      │
│        │              │  └──────────────────────────────────────┘    │      │
│        │              └──────────────────────────────────────────────┘      │
│        │                                    │                               │
│        │                         2. Automated Creation                      │
│        │                                    ▼                               │
│        │              ┌──────────────────────────────────────────────┐      │
│        │              │                                              │      │
│        │              │  ┌────────────────┐  ┌────────────────┐      │      │
│        │              │  │ payment-api    │  │ payment-api-   │      │      │
│        │              │  │ (code repo)    │  │ infra (GitOps) │      │      │
│        │              │  └────────────────┘  └────────────────┘      │      │
│        │              │                                              │      │
│        │              │  ┌────────────────┐  ┌────────────────┐      │      │
│        │              │  │ Argo CD App    │  │ Keycloak Client│      │      │
│        │              │  │ (multi-region) │  │ (GitOps sync)  │      │      │
│        │              │  └────────────────┘  └────────────────┘      │      │
│        │              │                                              │      │
│        │              └──────────────────────────────────────────────┘      │
│        │                                    │                                │
│        │  3. Start                          │                                │
│   ◀────┴─ coding!  ◀────────────────────────┘                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
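
The automated creation step can be thought of as a pure function from the form input to the set of resources to create. The sketch below only mirrors the diagram: the naming scheme for Argo CD applications and the Keycloak client ID are assumptions, and the actual Backstage template may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRequest:
    name: str
    owner: str
    domain: str
    language: str
    regions: tuple[str, ...]


def plan_scaffolding(req: ServiceRequest) -> dict:
    """Derive the resources shown in the diagram from the template form input."""
    return {
        "code_repo": req.name,
        "infra_repo": f"{req.name}-infra",
        # Assumed naming: one Argo CD application per selected region.
        "argocd_apps": [f"{req.name}-{region}" for region in req.regions],
        # Assumed: the Keycloak client ID matches the service name.
        "keycloak_client": req.name,
        "labels": {"owner": req.owner, "domain": req.domain, "language": req.language},
    }


request = ServiceRequest(
    name="payment-api",
    owner="payments-team",
    domain="payments",
    language="Java",
    regions=("us-east-1", "sa-east-1"),
)
plan = plan_scaffolding(request)
print(plan["code_repo"], plan["infra_repo"], plan["argocd_apps"])
```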

4.2 Repository Structure

Each service follows a standardized two-repository pattern (code + infra) enabling independent versioning and clear ownership boundaries.

📖 Full Documentation: See PRD Section 5 - Repository Structure for detailed specifications.

4.3 Deployment Pipeline

Why this matters: A GitOps-based deployment pipeline provides auditability, reproducibility, and automatic drift detection, while phased rollouts minimize blast radius and enable safe multi-region deployments.

┌─────────────────────────────────────────────────────────────────────────────┐
│                           GITOPS DEPLOYMENT FLOW                            │
│                                                                             │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌───────────┐  │
│  │   PR Open   │────▶│ Build/Test  │────▶│  Label:     │────▶│  HML      │  │
│  │             │     │  + Scan     │     │ deploy-hml  │     │  Deploy   │  │
│  └─────────────┘     └─────────────┘     └─────────────┘     └───────────┘  │
│                                                                     │       │
│                                                                     ▼       │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌───────────┐  │
│  │   Merge     │────▶│  Release    │────▶│   Phased    │────▶│  PROD     │  │
│  │   to main   │     │   v1.2.3    │     │   Rollout   │     │  Deploy   │  │
│  └─────────────┘     └─────────────┘     └─────────────┘     └───────────┘  │
│                                                                             │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                             │
│                         PHASED MULTI-REGION ROLLOUT                         │
│                                                                             │
│      ┌─────────────────┐                      ┌─────────────────┐           │
│      │   us-east-1     │                      │   sa-east-1     │           │
│      │                 │                      │                 │           │
│      │  1. Deploy      │    30 min delay      │  4. Deploy      │           │
│      │  2. Health ✓    │─────────────────────▶│  5. Health ✓    │           │
│      │  3. Monitor     │                      │  6. Complete    │           │
│      └─────────────────┘                      └─────────────────┘           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
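
The phased rollout logic can be sketched as a small loop. This is an illustration under assumptions, not the platform's actual controller: `deploy` and `is_healthy` are hypothetical callbacks, and the monitoring window is modelled as a wait followed by a second health check.

```python
import time
from typing import Callable, Sequence


def phased_rollout(
    regions: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    delay_seconds: int = 30 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Deploy region by region; stop at the first region that is unhealthy."""
    completed: list[str] = []
    for index, region in enumerate(regions):
        deploy(region)
        if not is_healthy(region):
            raise RuntimeError(f"health check failed in {region}; rollout halted")
        is_last = index == len(regions) - 1
        if not is_last:
            # Monitoring window before touching the next region.
            sleep(delay_seconds)
            if not is_healthy(region):
                raise RuntimeError(f"{region} degraded during monitoring; rollout halted")
        completed.append(region)
    return completed


# Dry run with fake callbacks so that nothing is deployed and no time passes.
events: list[str] = []
result = phased_rollout(
    ["us-east-1", "sa-east-1"],
    deploy=lambda region: events.append(f"deploy {region}"),
    is_healthy=lambda region: True,
    sleep=lambda seconds: events.append(f"wait {seconds}s"),
)
print(events)
```

Halting on the first failure keeps the blast radius to one region: the secondary region continues to run the previous release.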

📖 Full Documentation: See PRD Section 5 - Repository Structure, Section 6 - Deploy Flow, and Section 4.7 - Developer Portal for detailed specifications. Related ADRs: ADR-001, ADR-007.

5. Observability

Why this matters: Comprehensive observability through metrics, logs, and traces enables rapid incident detection, root cause analysis, and data-driven capacity planning—essential for maintaining reliability at scale.

5.1 Three Pillars Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                         OBSERVABILITY ARCHITECTURE                           │
│                                                                              │
│                           ┌─────────────────┐                                │
│                           │ Grafana Central │                                │
│                           │  (prod/use1)    │                                │
│                           └────────┬────────┘                                │
│                                    │                                         │
│              ┌─────────────────────┼─────────────────────┐                   │
│              │                     │                     │                   │
│              ▼                     ▼                     ▼                   │
│   ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐          │
│   │    HML/use1      │  │   PROD/use1      │  │   PROD/sae1      │          │
│   │                  │  │                  │  │                  │          │
│   │ ┌──────────────┐ │  │ ┌──────────────┐ │  │ ┌──────────────┐ │          │
│   │ │    Mimir     │ │  │ │    Mimir     │ │  │ │    Mimir     │ │          │
│   │ │   Metrics    │ │  │ │   Metrics    │ │  │ │   Metrics    │ │          │
│   │ └──────────────┘ │  │ └──────────────┘ │  │ └──────────────┘ │          │
│   │ ┌──────────────┐ │  │ ┌──────────────┐ │  │ ┌──────────────┐ │          │
│   │ │    Loki      │ │  │ │    Loki      │ │  │ │    Loki      │ │          │
│   │ │    Logs      │ │  │ │    Logs      │ │  │ │    Logs      │ │          │
│   │ └──────────────┘ │  │ └──────────────┘ │  │ └──────────────┘ │          │
│   │ ┌──────────────┐ │  │ ┌──────────────┐ │  │ ┌──────────────┐ │          │
│   │ │    Tempo     │ │  │ │    Tempo     │ │  │ │    Tempo     │ │          │
│   │ │   Traces     │ │  │ │   Traces     │ │  │ │   Traces     │ │          │
│   │ └──────────────┘ │  │ └──────────────┘ │  │ └──────────────┘ │          │
│   └──────────────────┘  └──────────────────┘  └──────────────────┘          │
│                                                                              │
│   ┌──────────────────────────────────────────────────────────────────────┐  │
│   │                      Data Collection (per cluster)                    │  │
│   │                                                                       │  │
│   │    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    │  │
│   │    │ OTel Collector │    │ OTel Collector │    │ OTel Collector │    │  │
│   │    │   (Gateway)    │    │   (Gateway)    │    │   (Gateway)    │    │  │
│   │    └───────┬────────┘    └───────┬────────┘    └───────┬────────┘    │  │
│   │            │                     │                     │             │  │
│   │            ▼                     ▼                     ▼             │  │
│   │    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    │  │
│   │    │   Services     │    │   Services     │    │   Services     │    │  │
│   │    │ (auto-instr)   │    │ (auto-instr)   │    │ (auto-instr)   │    │  │
│   │    └────────────────┘    └────────────────┘    └────────────────┘    │  │
│   └──────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
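
One way the per-cluster gateway collector could route the three signals is sketched below as a Python dictionary that mirrors the OpenTelemetry Collector configuration layout. The choice of exporters and all endpoint URLs are assumptions; processors, TLS, and authentication are left out for brevity.

```python
import json


def collector_config(mimir_url: str, loki_url: str, tempo_endpoint: str) -> dict:
    """Build a minimal gateway config: OTLP in, one backend per signal out."""
    return {
        "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
        "exporters": {
            # Metrics are pushed to Mimir over Prometheus remote write.
            "prometheusremotewrite": {"endpoint": f"{mimir_url}/api/v1/push"},
            # Logs go to Loki over OTLP/HTTP (assumes a Loki release with OTLP ingestion).
            "otlphttp/loki": {"endpoint": f"{loki_url}/otlp"},
            # Traces go to Tempo over OTLP/gRPC.
            "otlp/tempo": {"endpoint": tempo_endpoint},
        },
        "service": {
            "pipelines": {
                "metrics": {"receivers": ["otlp"], "exporters": ["prometheusremotewrite"]},
                "logs": {"receivers": ["otlp"], "exporters": ["otlphttp/loki"]},
                "traces": {"receivers": ["otlp"], "exporters": ["otlp/tempo"]},
            }
        },
    }


# Placeholder in-cluster addresses, for illustration only.
config = collector_config(
    mimir_url="http://mimir.observability.svc:8080",
    loki_url="http://loki.observability.svc:3100",
    tempo_endpoint="tempo.observability.svc:4317",
)
print(json.dumps(config, indent=2))
```

Each cluster would get its own copy of this configuration pointing at its local backends, which keeps telemetry inside the region where it is produced.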

5.2 Key Dashboards

Why this matters: Purpose-built dashboards reduce mean time to detection (MTTD) by surfacing relevant information to the right audience, enabling faster response and better operational awareness.

Dashboard	Purpose	Audience
Service Overview	Health, latency, error rate per service	Dev Teams
Infrastructure	Node, pod, resource utilization	Platform Team
Istio Mesh	Traffic flow, mTLS status, policies	Platform Team
Keycloak	Auth requests, token issuance, failures	Security Team
Business SLOs	SLI/SLO tracking, error budgets	All Teams
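
The error budget shown on the Business SLOs dashboard follows from simple arithmetic: an SLO target of 99.9% allows 0.1% of requests to fail. The sketch below works through that calculation; the request counts are invented for illustration.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still available.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# 99.9% target over 1,000,000 requests allows about 1,000 failures.
# 250 failures therefore leave roughly three quarters of the budget.
remaining = error_budget_remaining(0.999, total_requests=1_000_000, failed_requests=250)
print(f"{remaining:.0%} of the error budget remains")
```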

📖 Full Documentation: See PRD Section 7 - Observability for detailed specifications. Related ADR: ADR-002.

6. Service Mesh & Security

Why this matters: A service mesh provides consistent security, observability, and traffic management across all services without requiring application code changes, enabling platform-wide policy enforcement.

6.1 Istio Traffic Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                          SERVICE MESH ARCHITECTURE                           │
│                                                                              │
│   External Traffic                                                           │
│         │                                                                    │
│         ▼                                                                    │
│   ┌─────────────┐                                                           │
│   │   Istio     │                                                           │
│   │  Ingress    │                                                           │
│   │  Gateway    │                                                           │
│   └──────┬──────┘                                                           │
│          │                                                                   │
│          ▼                                                                   │
│   ┌──────────────────────────────────────────────────────────────────────┐  │
│   │                          Service Mesh                                 │  │
│   │                                                                       │  │
│   │  ┌─────────────────────┐        ┌─────────────────────┐              │  │
│   │  │     Service A       │        │     Service B       │              │  │
│   │  │  ┌───────────────┐  │  mTLS  │  ┌───────────────┐  │              │  │
│   │  │  │   App Pod     │  │◀──────▶│  │   App Pod     │  │              │  │
│   │  │  │  ┌─────────┐  │  │        │  │  ┌─────────┐  │  │              │  │
│   │  │  │  │ Envoy   │  │  │        │  │  │ Envoy   │  │  │              │  │
│   │  │  │  │ Sidecar │  │  │        │  │  │ Sidecar │  │  │              │  │
│   │  │  │  └─────────┘  │  │        │  │  └─────────┘  │  │              │  │
│   │  │  └───────────────┘  │        │  └───────────────┘  │              │  │
│   │  └─────────────────────┘        └─────────────────────┘              │  │
│   │                                                                       │  │
│   │  ┌───────────────────────────────────────────────────────────────┐   │  │
│   │  │                      Security Policies                         │   │  │
│   │  │                                                                │   │  │
│   │  │   PeerAuthentication ──▶ mTLS STRICT (no plaintext)           │   │  │
│   │  │   RequestAuthentication ──▶ JWT validation via Keycloak       │   │  │
│   │  │   AuthorizationPolicy ──▶ Role-based access control           │   │  │
│   │  │                                                                │   │  │
│   │  └───────────────────────────────────────────────────────────────┘   │  │
│   └──────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
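
The three policy types in the diagram correspond to Istio resources in the `security.istio.io` API group. The sketch below builds one of each as Python dictionaries and prints them as JSON, which `kubectl` accepts alongside YAML. Namespaces, resource names, the Keycloak host, the realm, and the role are all hypothetical, and the claim path assumes Keycloak's default layout for realm roles.

```python
import json

# Hypothetical Keycloak location; replace with the real issuer for each environment.
ISSUER = "https://keycloak.example.com/realms/platform"

peer_authentication = {
    "apiVersion": "security.istio.io/v1beta1",
    "kind": "PeerAuthentication",
    # Placing it in the Istio root namespace makes the policy mesh-wide.
    "metadata": {"name": "default", "namespace": "istio-system"},
    "spec": {"mtls": {"mode": "STRICT"}},  # reject plaintext traffic
}

request_authentication = {
    "apiVersion": "security.istio.io/v1beta1",
    "kind": "RequestAuthentication",
    "metadata": {"name": "keycloak-jwt", "namespace": "payments"},
    "spec": {
        "jwtRules": [
            {"issuer": ISSUER, "jwksUri": f"{ISSUER}/protocol/openid-connect/certs"}
        ]
    },
}

authorization_policy = {
    "apiVersion": "security.istio.io/v1beta1",
    "kind": "AuthorizationPolicy",
    "metadata": {"name": "payment-api-callers", "namespace": "payments"},
    "spec": {
        "selector": {"matchLabels": {"app": "payment-api"}},
        "action": "ALLOW",
        "rules": [
            {
                # Require a validated JWT from the expected issuer ...
                "from": [{"source": {"requestPrincipals": [f"{ISSUER}/*"]}}],
                # ... that carries the (hypothetical) realm role.
                "when": [
                    {
                        "key": "request.auth.claims[realm_access][roles]",
                        "values": ["payments-caller"],
                    }
                ],
            }
        ],
    },
}

for manifest in (peer_authentication, request_authentication, authorization_policy):
    print(json.dumps(manifest, indent=2))
```

RequestAuthentication on its own only rejects invalid tokens; it is the AuthorizationPolicy that makes a token mandatory, which is why the two are normally deployed together.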

6.2 Service-to-Service Authorization

Why this matters: Explicit service-to-service authorization ensures that only permitted services can communicate, preventing unauthorized access and limiting the blast radius of compromised components.

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SERVICE AUTHORIZATION FLOW                               │
│                                                                             │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐                │
│  │ Order   │     │Keycloak │     │ Istio   │     │ Payment │                │
│  │ Service │     │         │     │ Sidecar │     │ Service │                │
│  └────┬────┘     └────┬────┘     └────┬────┘     └────┬────┘                │
│       │               │               │               │                     │
│       │ 1. Get token  │               │               │                     │
│       │──────────────▶│               │               │                     │
│       │               │               │               │                     │
│       │◀──────────────│               │               │                     │
│       │   JWT token   │               │               │                     │
│       │               │               │               │                     │
│       │ 2. Call with token            │               │                     │
│       │───────────────────────────────▶               │                     │
│       │                               │               │                     │
│       │               │     3. Validate JWT           │                     │
│       │               │     Check roles               │                     │
│       │               │     Verify mTLS               │                     │
│       │               │◀ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ▶│                     │
│       │               │               │               │                     │
│       │               │               │ 4. Forward    │                     │
│       │               │               │───────────────▶                     │
│       │               │               │               │                     │
│       │               │               │◀──────────────│                     │
│       │◀──────────────────────────────│  Response     │                     │
│       │               │               │               │                     │
└─────────────────────────────────────────────────────────────────────────────┘
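
Steps 1 and 2 of this flow, seen from the calling service, could look like the sketch below. It assumes the OAuth2 client credentials grant (the diagram does not say which grant is used) and only constructs the HTTP request without sending it. The host, realm, client ID, and secret are placeholders.

```python
import urllib.parse
import urllib.request


def build_token_request(
    keycloak_base_url: str, realm: str, client_id: str, client_secret: str
) -> urllib.request.Request:
    """Build (but do not send) a client-credentials token request for Keycloak."""
    url = f"{keycloak_base_url.rstrip('/')}/realms/{realm}/protocol/openid-connect/token"
    body = urllib.parse.urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        }
    ).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def bearer_headers(access_token: str) -> dict[str, str]:
    """Headers the caller attaches to the downstream call in step 2."""
    return {"Authorization": f"Bearer {access_token}"}


# Placeholder values; a real secret would come from Secrets Manager, not source code.
token_request = build_token_request(
    "https://keycloak.example.com", "platform", "order-service", "example-secret"
)
print(token_request.get_method(), token_request.full_url)
```

Sending the request with `urllib.request.urlopen(token_request)` would return a JSON document whose `access_token` field is the JWT passed to `bearer_headers`.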

📖 Full Documentation: See PRD Section 8 - Service Mesh (Istio) and Section 11 - Security and Governance for detailed specifications. Related ADRs: ADR-004, ADR-005, ADR-008.

7. API Management

Why this matters: A centralized API management layer provides consistent authentication, rate limiting, and documentation across all services, enabling secure external API exposure while maintaining self-service capabilities for development teams.

Kong Gateway serves as the API management layer. It runs in DB-less mode: all configuration is declarative, versioned in the `platform-apis` repository, and delivered to the runtime via ConfigMap through GitOps. Kong instances are organized by business domain (e.g., ecommerce-external, payments-internal), so teams manage their own API configuration independently while the platform team maintains the underlying infrastructure.

7.1 Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                           API MANAGEMENT FLOW                               │
│                                                                             │
│   External Traffic                          Internal Traffic                │
│         │                                         │                         │
│         ▼                                         │                         │
│   ┌───────────┐                                   │                         │
│   │    WAF    │                                   │                         │
│   │  (Shield) │                                   │                         │
│   └─────┬─────┘                                   │                         │
│         ▼                                         ▼                         │
│   ┌─────────────────┐                    ┌─────────────────┐                │
│   │ Kong External   │                    │ Kong Internal   │                │
│   │ (LoadBalancer)  │                    │ (ClusterIP)     │                │
│   │                 │                    │                 │                │
│   │ • JWT Auth      │                    │ • Routing       │                │
│   │ • Rate Limit    │                    │ • Load Balance  │                │
│   │ • Request Trans │                    │ • Plugins       │                │
│   └────────┬────────┘                    └────────┬────────┘                │
│            │                                      │                         │
│            └──────────────┬───────────────────────┘                         │
│                           ▼                                                 │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                    Istio Service Mesh (mTLS)                        │   │
│   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                  │   │
│   │  │ checkout-api│  │  cart-api   │  │ payment-api │  ◀── Envoy      │   │
│   │  │ + sidecar   │  │ + sidecar   │  │ + sidecar   │      Sidecars   │   │
│   │  └─────────────┘  └─────────────┘  └─────────────┘                  │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                      GitOps Configuration                           │   │
│   │                                                                     │   │
│   │  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │   │
│   │  │  Backstage  │───▶│platform-apis│───▶│ Kong Runtime (DB-less) │  │   │
│   │  │  (Portal)   │    │ (Git Repo)  │    │ Config via ConfigMap   │  │   │
│   │  └─────────────┘    └─────────────┘    └─────────────────────────┘  │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘

7.2 Key Capabilities

Capability	Description
External API Exposure	Public APIs with WAF integration, rate limiting, and JWT authentication via Keycloak
Internal API Routing	Service-to-service communication within and across domains
Centralized OpenAPI Specs	API documentation automatically published to Developer Portal (Backstage)
Self-Service Configuration	Teams manage routes and plugins via PRs to `platform-apis` repository

7.3 Instance Types

Type	Network	Use Case
Internal	ClusterIP	APIs consumed within the cluster
External	LoadBalancer + WAF	APIs exposed to the internet
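
The relationship between business domain, instance type, and network exposure can be written down as a small model. This is an illustrative sketch, not platform code: the `KongInstance` class and its fields are hypothetical, and the `{domain}-{exposure}` name format is inferred from the examples `ecommerce-external` and `payments-internal`.

```python
from dataclasses import dataclass

# Kubernetes Service type per instance type, taken from the table above.
SERVICE_TYPE_BY_EXPOSURE = {"internal": "ClusterIP", "external": "LoadBalancer"}


@dataclass(frozen=True)
class KongInstance:
    domain: str    # business domain, e.g. "ecommerce"
    exposure: str  # "internal" or "external"

    def __post_init__(self) -> None:
        if self.exposure not in SERVICE_TYPE_BY_EXPOSURE:
            raise ValueError(f"unknown exposure: {self.exposure!r}")

    @property
    def name(self) -> str:
        # Assumed format, matching "ecommerce-external" and "payments-internal".
        return f"{self.domain}-{self.exposure}"

    @property
    def service_type(self) -> str:
        return SERVICE_TYPE_BY_EXPOSURE[self.exposure]

    @property
    def behind_waf(self) -> bool:
        # Only internet-facing instances sit behind the WAF.
        return self.exposure == "external"
```

`KongInstance("ecommerce", "external")` yields the name `ecommerce-external` with a `LoadBalancer` service behind the WAF, while `KongInstance("payments", "internal")` yields `payments-internal` on `ClusterIP`.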

📚 Deep Dive: See PRD Section 10 - API Management and ADR-013: Kong API Gateway for complete details on repository structure, configuration patterns, and Backstage integration.

8. Infrastructure as Code

Why this matters: Infrastructure as Code ensures reproducible, auditable, and version-controlled infrastructure changes, eliminating configuration drift and enabling disaster recovery through complete environment reconstruction.

8.1 Repository Organization

┌─────────────────────────────────────────────────────────────────────────────┐
│                         PLATFORM REPOSITORIES                                │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                        Managed by Platform Team                         │ │
│  │                                                                         │ │
│  │  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐      │ │
│  │  │platform-terraform│  │platform-argocd-  │  │platform-helm-    │      │ │
│  │  │                  │  │config            │  │charts            │      │ │
│  │  │ • VPC/EKS/RDS    │  │ • App-of-Apps    │  │ • service-base   │      │ │
│  │  │ • IAM/KMS        │  │ • ApplicationSets│  │ • cronjob-base   │      │ │
│  │  │ • Per region     │  │ • Per cluster    │  │ • Shared helpers │      │ │
│  │  └──────────────────┘  └──────────────────┘  └──────────────────┘      │ │
│  │                                                                         │ │
│  │  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐      │ │
│  │  │platform-         │  │platform-keycloak │  │platform-pipelines│      │ │
│  │  │observability     │  │                  │  │                  │      │ │
│  │  │ • Grafana stack  │  │ • Realm configs  │  │ • CI/CD templates│      │ │
│  │  │ • Dashboards     │  │ • Clients (JSON) │  │ • Actions        │      │ │
│  │  │ • Alerts         │  │ • Per account    │  │ • Phased rollout │      │ │
│  │  └──────────────────┘  └──────────────────┘  └──────────────────┘      │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                     Managed by Application Teams                        │ │
│  │                                                                         │ │
│  │  ┌──────────────────┐  ┌──────────────────┐                            │ │
│  │  │ service-xyz      │  │ service-xyz-infra│     (created via           │ │
│  │  │                  │  │                  │      Backstage)            │ │
│  │  │ • Source code    │  │ • Helm values    │                            │ │
│  │  │ • Dockerfile     │  │ • Istio policies │                            │ │
│  │  │ • CI pipeline    │  │ • Per region     │                            │ │
│  │  └──────────────────┘  └──────────────────┘                            │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘

8.2 Environment Structure

Why this matters: A consistent environment structure across HML and production ensures parity, reduces environment-specific bugs, and enables predictable deployments with region-specific configurations.

environments/
├── hml/
│   └── us-east-1/
│       ├── values.yaml          # HML-specific config
│       └── istio/               # Istio policies
│           └── authorization-policy.yaml
└── prod/
    ├── us-east-1/               # Primary region
    │   ├── values.yaml
    │   └── istio/
    └── sa-east-1/               # Secondary region
        ├── values.yaml
        └── istio/
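
Because the layout is regular, tooling can resolve configuration paths from an environment and region pair. The sketch below assumes exactly the tree shown above (HML in us-east-1 only, production in us-east-1 and sa-east-1); the function names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import List

# Environments and their regions, as shown in the directory tree above.
REGIONS_BY_ENV = {
    "hml": ("us-east-1",),
    "prod": ("us-east-1", "sa-east-1"),
}


def values_file(env: str, region: str) -> PurePosixPath:
    """Return the Helm values path for one environment/region pair."""
    if region not in REGIONS_BY_ENV.get(env, ()):
        raise ValueError(f"no {env!r} environment in region {region!r}")
    return PurePosixPath("environments") / env / region / "values.yaml"


def istio_policy_dir(env: str, region: str) -> PurePosixPath:
    """Return the directory holding the Istio policies for the same pair."""
    return values_file(env, region).parent / "istio"


def all_values_files(env: str) -> List[PurePosixPath]:
    """List every values file for an environment, primary region first."""
    return [values_file(env, region) for region in REGIONS_BY_ENV[env]]
```

Rejecting unknown pairs, such as HML in sa-east-1, catches a misplaced configuration file before it reaches a deployment.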

📖 Full Documentation: See PRD Section 4.1 - Infrastructure and Section 5 - Repository Structure for detailed specifications. Related ADR: ADR-001.

9. Key Metrics & SLOs

Why this matters: Defining clear KPIs and SLOs establishes measurable reliability targets, enables objective decision-making around technical investments, and creates accountability for platform health.

9.1 Platform KPIs

Metric	Target
Deployment Frequency	≥ 10/day/service
Lead Time for Changes	< 1 hour
Mean Time to Recovery (MTTR)	< 30 minutes
Change Failure Rate	< 5%
Platform Uptime	99.9%
Region Failover Time	< 5 minutes

9.2 Service Tiers

Why this matters: Tiered service classification allows appropriate resource allocation, sets realistic expectations per service criticality, and prevents over-engineering of non-critical components.

Tier	Availability	Error Rate	Latency P99	Examples
Tier 1	99.99%	< 0.01%	< 100ms	Payment, Auth
Tier 2	99.9%	< 0.1%	< 500ms	Orders, Users
Tier 3	99.5%	< 1%	< 2s	Reports, Analytics
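
An availability target is easier to reason about as a downtime budget. The sketch below converts each tier's availability into allowed minutes of unavailability; the 30-day window is an assumption, since this document does not state the measurement window.

```python
from decimal import Decimal

# Assumed measurement window: 30 days, expressed in minutes.
MINUTES_PER_30_DAYS = Decimal(30 * 24 * 60)

# Availability targets from the service tier table, in percent.
TIER_AVAILABILITY = {
    "Tier 1": Decimal("99.99"),
    "Tier 2": Decimal("99.9"),
    "Tier 3": Decimal("99.5"),
}


def downtime_budget_minutes(
    availability_percent: Decimal,
    window_minutes: Decimal = MINUTES_PER_30_DAYS,
) -> Decimal:
    """Minutes of unavailability the target tolerates within the window."""
    unavailable_fraction = (Decimal(100) - availability_percent) / Decimal(100)
    return unavailable_fraction * window_minutes


if __name__ == "__main__":
    for tier, target in TIER_AVAILABILITY.items():
        print(tier, downtime_budget_minutes(target))
```

Under that assumption, Tier 1 tolerates 4.32 minutes of downtime per 30 days, Tier 2 tolerates 43.2 minutes, and Tier 3 tolerates 216 minutes. `Decimal` is used instead of `float` so the percentages are not distorted by binary rounding.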

📖 Full Documentation: See PRD Section 15 - Success Metrics for detailed specifications.

10. Resilience & HA

Why this matters: Designing for resilience ensures the platform can withstand and recover from failures gracefully, protecting revenue and user trust during inevitable infrastructure incidents.

10.1 Failure Scenarios

Scenario	RTO	Strategy
Pod crash	< 30s	Kubernetes auto-restart
Node failure	< 5min	Pod rescheduling + PDB
AZ failure	< 15min	Multi-AZ node groups
Region failure	< 5min	Route53 failover

10.2 HA Configuration

Why this matters: Explicit HA configuration ensures services remain available during maintenance, deployments, and partial failures by distributing workloads and maintaining minimum healthy replicas.

┌─────────────────────────────────────────────────────────────────────────────┐
│                         HIGH AVAILABILITY DESIGN                             │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                          Per Service                                 │   │
│   │                                                                      │   │
│   │   • replicas: 3 (minimum for production)                            │   │
│   │   • PodDisruptionBudget: minAvailable: 2                            │   │
│   │   • TopologySpreadConstraints: spread across AZs                    │   │
│   │   • Liveness/Readiness probes configured                            │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                        Per Cluster/Region                            │   │
│   │                                                                      │   │
│   │   • 3 Availability Zones                                            │   │
│   │   • Node groups spread across AZs                                   │   │
│   │   • RDS Multi-AZ (for Keycloak)                                     │   │
│   │   • S3 cross-region replication (for observability)                 │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                           Global                                     │   │
│   │                                                                      │   │
│   │   • Route53 latency-based routing                                   │   │
│   │   • Route53 health checks with failover                             │   │
│   │   • ECR in primary region (cross-region pull)                       │   │
│   │   • Keycloak config sync via GitOps                                 │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
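
The per-service numbers fit together arithmetically: three replicas spread over three AZs with `minAvailable: 2` leave room for one voluntary disruption and still meet the floor after losing a whole zone. The sketch below checks that reasoning under the assumption of an even spread. Note that a PodDisruptionBudget only governs voluntary disruptions such as node drains; `min_available` is reused here as a capacity floor for the zone-loss case.

```python
from typing import List


def spread_across_zones(replicas: int, zones: int) -> List[int]:
    """Distribute replicas as evenly as possible (maximum skew of 1)."""
    base, extra = divmod(replicas, zones)
    return [base + 1 if i < extra else base for i in range(zones)]


def allowed_voluntary_disruptions(replicas: int, min_available: int) -> int:
    """Pods that may be evicted at once without violating the PDB."""
    return max(replicas - min_available, 0)


def survives_zone_loss(replicas: int, zones: int, min_available: int) -> bool:
    """True if losing the most populated zone still leaves min_available pods."""
    per_zone = spread_across_zones(replicas, zones)
    return replicas - max(per_zone) >= min_available
```

With the values from the diagram, `spread_across_zones(3, 3)` gives one pod per zone, `allowed_voluntary_disruptions(3, 2)` is 1, and `survives_zone_loss(3, 3, 2)` holds. With only two replicas the same floor would be breached by a single zone failure.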

📖 Full Documentation: See PRD Section 12 - Resilience and High Availability for detailed specifications. Related ADRs: ADR-006, ADR-009.

11. Application Resources (Self-Service)

Why this matters: Self-service resource provisioning eliminates bottlenecks on platform teams, accelerates development velocity, and ensures resources are created with consistent security and compliance standards.

Application teams can provision AWS resources (SQS, RDS, S3, etc.) via self-service in Backstage.

Resource Provisioning Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                     APPLICATION RESOURCE PROVISIONING                        │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                         Backstage Portal                               │  │
│  │                                                                        │  │
│  │   Developer fills form:                                                │  │
│  │   • Service: payment-api                                               │  │
│  │   • Type: SQS                                                          │  │
│  │   • Name: order-events                                                 │  │
│  │   • Regions: us-east-1, sa-east-1                                     │  │
│  │                                                                        │  │
│  └────────────────────────────────┬──────────────────────────────────────┘  │
│                                   │                                          │
│                                   ▼                                          │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                      GitHub Pull Request                               │  │
│  │                                                                        │  │
│  │   platform-terraform-aws-projects-resources/                           │  │
│  │   └── helpdev-prod/prod/us-east-1/sqs/payment-api/order-events.tf     │  │
│  │                                                                        │  │
│  │   Uses versioned module:                                               │  │
│  │   source = "...platform-terraform-aws-modules//modules/sqs?ref=v1.2.0" │  │
│  │                                                                        │  │
│  └────────────────────────────────┬──────────────────────────────────────┘  │
│                                   │                                          │
│                    ┌──────────────┴──────────────┐                          │
│                    │         Approval             │                          │
│                    │  HML: 1 team member          │                          │
│                    │  PROD: 2 (+ Platform Team)   │                          │
│                    └──────────────┬──────────────┘                          │
│                                   │                                          │
│                                   ▼                                          │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                     GitHub Actions                                     │  │
│  │                                                                        │  │
│  │   terraform plan → terraform apply → Store outputs in Secrets Manager │  │
│  │                                                                        │  │
│  └────────────────────────────────┬──────────────────────────────────────┘  │
│                                   │                                          │
│                                   ▼                                          │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                     Service Consumption                                │  │
│  │                                                                        │  │
│  │   AWS Secrets Manager → External Secrets Operator → K8s Secret → Pod │  │
│  │                                                                        │  │
│  │   env:                                                                 │  │
│  │     - name: SQS_ORDER_EVENTS_URL                                       │  │
│  │       valueFrom:                                                       │  │
│  │         secretKeyRef:                                                  │  │
│  │           name: payment-api-resources                                  │  │
│  │           key: sqs-order-events-url                                    │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
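
Two conventions in this flow are mechanical enough to express as code: where the generated Terraform file lands, and how a secret key maps to the environment variable a pod sees. The sketch below is illustrative; the function names are hypothetical, and generating one file per selected region is an assumption based on the form's `Regions` field.

```python
from typing import Iterable, List


def resource_definition_path(
    account: str, env: str, region: str,
    resource_type: str, service: str, name: str,
) -> str:
    """Path of a resource definition inside the resources repository."""
    return f"{account}/{env}/{region}/{resource_type}/{service}/{name}.tf"


def paths_for_regions(
    account: str, env: str, regions: Iterable[str],
    resource_type: str, service: str, name: str,
) -> List[str]:
    """Assumption: the pull request carries one definition per region."""
    return [
        resource_definition_path(account, env, region, resource_type, service, name)
        for region in regions
    ]


def env_var_for_secret_key(key: str) -> str:
    """Map a secret key such as 'sqs-order-events-url' to an env var name."""
    return key.replace("-", "_").upper()
```

For the form shown above, `resource_definition_path("helpdev-prod", "prod", "us-east-1", "sqs", "payment-api", "order-events")` reproduces the path in the pull request, and `env_var_for_secret_key("sqs-order-events-url")` gives `SQS_ORDER_EVENTS_URL`.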

Available Resource Types

Resource	Module	Use Cases
SQS	`sqs/`	Message queues, DLQ
SNS	`sns/`	Pub/sub, notifications
RDS	`rds/`	PostgreSQL, MySQL databases
S3	`s3/`	Object storage, backups
DynamoDB	`dynamodb/`	NoSQL, sessions
ElastiCache	`elasticache/`	Redis, Memcached

Repository Structure

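Resource definitions live in `platform-terraform-aws-projects-resources`, one file per resource, under an account/env/region/resource-type/service hierarchy (for example `helpdev-prod/prod/us-east-1/sqs/payment-api/order-events.tf`). Each definition references a versioned module from `platform-terraform-aws-modules`, pinned with `?ref=`.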

📖 Full Documentation: See PRD Section 5 - Repository Structure and Section 4.7 - Developer Portal for detailed specifications. Related ADR: ADR-011.

12. Architecture Decision Records

Why this matters: ADRs capture the context and rationale behind architectural choices, preventing repeated discussions, enabling informed future decisions, and preserving institutional knowledge as team members change.

Key architectural decisions are documented as ADRs:

ADR	Decision
ADR-001	Distributed Argo CD (per cluster)
ADR-002	Federated observability with central visualization
ADR-003	Container images by digest
ADR-004	Istio for service mesh
ADR-005	Keycloak as identity provider
ADR-006	Keycloak per environment
ADR-007	Backstage as developer portal
ADR-008	AWS Secrets Manager + External Secrets
ADR-009	Isolated multi-region architecture
ADR-010	Keycloak GitOps sync strategy
ADR-011	Application Resources via Self-Service
ADR-012	Dedicated Shared Services Account
ADR-013	Kong API Gateway

📖 Full Documentation: See PRD Section 16 - Architecture Decisions (ADRs) for detailed specifications.

Quick Reference

Why this matters: Consistent naming conventions and quick reference materials reduce cognitive load, enable automation, and ensure resources can be easily identified and managed across environments and teams.

Naming Conventions

Resource	Pattern	Example
Cluster	`eks-{env}-{region_short}`	`eks-prod-use1`
Namespace	`{service-name}`	`payment-api`
Repository	`service-{name}` / `service-{name}-infra`	`service-payment-api` / `service-payment-api-infra`
Secret Path	`helpdev/{env}/{region}/{domain}/{service}/{type}`	`helpdev/prod/us-east-1/payments/checkout/database`

Region Codes

AWS Region	Short Code
us-east-1	use1
sa-east-1	sae1
eu-west-1	euw1
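
The conventions above are regular enough to generate rather than type by hand. This is a minimal sketch with hypothetical helper names; the short codes are a lookup of the three regions listed in this document, not a general derivation rule.

```python
from typing import Tuple

# Short codes for the regions listed in the Region Codes table.
REGION_SHORT_CODES = {
    "us-east-1": "use1",
    "sa-east-1": "sae1",
    "eu-west-1": "euw1",
}


def cluster_name(env: str, region: str) -> str:
    """eks-{env}-{region_short}; raises KeyError for an unlisted region."""
    return f"eks-{env}-{REGION_SHORT_CODES[region]}"


def repository_names(name: str) -> Tuple[str, str]:
    """Source repository and its companion infra repository."""
    return (f"service-{name}", f"service-{name}-infra")


def secret_path(
    env: str, region: str, domain: str, service: str, secret_type: str
) -> str:
    """helpdev/{env}/{region}/{domain}/{service}/{type}"""
    return f"helpdev/{env}/{region}/{domain}/{service}/{secret_type}"
```

These reproduce the examples in the table: `cluster_name("prod", "us-east-1")` gives `eks-prod-use1`, and `secret_path("prod", "us-east-1", "payments", "checkout", "database")` gives `helpdev/prod/us-east-1/payments/checkout/database`.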

Full Technical Specification: See PRD-kubernetes-multi-account-aws.md
