helpdeveloper/helpdev-ops-stack

HelpDev'Ops Enterprise Infrastructure Stack - Architecture Overview

High-Level Engineering Standards & Architecture Guide

Version: 1.0 | Last Updated: 2026-01-25 | Author: Guilherme Biff Zarelli


Executive Summary

A production-ready, cloud-native platform engineered for enterprises that need to scale fast without sacrificing security, reliability, or developer productivity. Built for teams that refuse to choose between moving quickly and building things right.

Why This Platform Exists

Modern enterprises face a fundamental challenge: scaling infrastructure and teams simultaneously while maintaining governance and operational excellence. This platform solves that by providing a self-service foundation where developers ship features—not fight infrastructure—while platform teams maintain control, visibility, and compliance.

Four Pillars

| Pillar | What You Get |
| --- | --- |
| 🚀 Developer Experience | Self-service deployments, automated scaffolding via Developer Portal, zero-friction onboarding. New services go from idea to production in hours, not weeks. |
| 📈 Scalability & Resilience | Multi-region architecture with automatic failover, auto-scaling workloads, and zero-downtime deployments. Built to handle traffic spikes and regional outages gracefully. |
| 🔐 Security & Compliance | Zero-trust networking with mTLS everywhere, centralized identity management, and policy-as-code. Security is built-in, not bolted-on. |
| ⚙️ Operational Excellence | GitOps-driven workflows, unified observability (metrics, logs, traces), and cost transparency per team. Everything is auditable, reproducible, and automated. |

Technology Foundation

| Aspect | Approach |
| --- | --- |
| Infrastructure | AWS Multi-Account (HML + Prod), Multi-Region |
| Orchestration | Amazon EKS with GitOps (Argo CD) |
| Service Mesh | Istio with mTLS (zero-trust) |
| Identity | Keycloak (OIDC/OAuth2) |
| Observability | Grafana Stack (Mimir, Loki, Tempo) |
| Developer Portal | Backstage with automated scaffolding |
| Deployment | GitOps with phased multi-region rollout |

Business Impact

This platform enables organizations to accelerate time-to-market while reducing operational overhead. Teams onboard in days, deploy with confidence, and operate with full visibility—allowing the business to grow without infrastructure becoming a bottleneck.


1. Architecture Overview

Why this matters: A clear architectural overview enables teams to understand the system's structure at a glance, facilitating faster onboarding, better decision-making, and alignment across all stakeholders on how components interact.

1.1 High-Level Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                 SHARED SERVICES ACCOUNT (helpdev-org-main)                  │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │  Route53    │  │ CloudFront  │  │    ECR      │  │  Grafana    │         │
│  │ (DNS/LB)    │  │   (CDN)     │  │ (us-east-1) │  │  Central    │         │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘         │
│                          Cross-Account Access (IAM + Resource Policies)      │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
            ┌─────────────────────────┼─────────────────────────┐
            ▼                         ▼                         ▼
┌───────────────────────┐  ┌───────────────────────┐  ┌───────────────────────┐
│   AWS Account: HML    │  │  AWS Account: PROD    │  │  AWS Account: PROD    │
│   Region: us-east-1   │  │  Region: us-east-1    │  │  Region: sa-east-1    │
│                       │  │      (Primary)        │  │     (Secondary)       │
│  ┌─────────────────┐  │  │  ┌─────────────────┐  │  │  ┌─────────────────┐  │
│  │  EKS Cluster    │  │  │  │  EKS Cluster    │  │  │  │  EKS Cluster    │  │
│  │  eks-hml-use1   │  │  │  │  eks-prod-use1  │  │  │  │  eks-prod-sae1  │  │
│  └─────────────────┘  │  │  └─────────────────┘  │  │  └─────────────────┘  │
│                       │  │                       │  │                       │
│  ┌─────┐ ┌─────────┐  │  │  ┌─────┐ ┌─────────┐  │  │  ┌─────┐ ┌─────────┐  │
│  │Istio│ │Keycloak │  │  │  │Istio│ │Keycloak │  │  │  │Istio│ │Keycloak │  │
│  └─────┘ └─────────┘  │  │  └─────┘ └─────────┘  │  │  └─────┘ └─────────┘  │
│                       │  │                       │  │                       │
│  ┌─────────────────┐  │  │  ┌─────────────────┐  │  │  ┌─────────────────┐  │
│  │  Observability  │  │  │  │  Observability  │  │  │  │  Observability  │  │
│  │ Mimir/Loki/Tempo│  │  │  │ Mimir/Loki/Tempo│  │  │  │ Mimir/Loki/Tempo│  │
│  └─────────────────┘  │  │  └─────────────────┘  │  │  └─────────────────┘  │
└───────────────────────┘  └───────────────────────┘  └───────────────────────┘

Note: Global services reside in a dedicated Shared Services Account (helpdev-org-main) following AWS multi-account best practices. See ADR-012.

1.2 Multi-Region Strategy

Why this matters: Multi-region architecture ensures business continuity during regional outages, reduces latency for geographically distributed users, and provides true disaster recovery capabilities without complex cross-region dependencies.

Our multi-region architecture follows the Isolated Regions pattern:

┌─────────────────────────────────────────────────────────────────────────────┐
│                         ISOLATED REGIONS ARCHITECTURE                       │
│                                                                             │
│    Each region is self-contained. No cross-region service communication.    │
│                                                                             │
│   ┌─────────────────────────┐         ┌─────────────────────────┐           │
│   │     us-east-1           │         │     sa-east-1           │           │
│   │     (Primary)           │    ✕    │     (Secondary)         │           │
│   │                         │ No mesh │                         │           │
│   │  ┌───────────────────┐  │ No svc  │  ┌───────────────────┐  │           │
│   │  │ VPC + 3 AZs       │  │ No data │  │ VPC + 3 AZs       │  │           │
│   │  │ EKS Cluster       │  │         │  │ EKS Cluster       │  │           │
│   │  │ Keycloak + RDS    │  │         │  │ Keycloak + RDS    │  │           │
│   │  │ Mimir/Loki/Tempo  │  │         │  │ Mimir/Loki/Tempo  │  │           │
│   │  └───────────────────┘  │         │  └───────────────────┘  │           │
│   └────────────▲────────────┘         └────────────▲────────────┘           │
│                │                                   │                        │
│                └───────────┬───────────────────────┘                        │
│                            │                                                │
│                    ┌───────▼───────┐                                        │
│                    │   Route53     │                                        │
│                    │ Latency-based │                                        │
│                    │   routing     │                                        │
│                    └───────────────┘                                        │
│                            ▲                                                │
│                            │                                                │
│                       End Users                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Benefits:

  • True disaster recovery (region failure = automatic failover)
  • No cross-region latency for service calls
  • Simplified architecture (no mesh federation)
  • Independent scaling per region

📖 Full Documentation: See PRD Section 3 - Solution Architecture for detailed specifications. Related ADRs: ADR-009, ADR-012.
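
The routing behaviour in the diagram can be illustrated with a small model. This is only a sketch: it assumes health checks are attached to the latency-based records (the source does not say how failover is triggered), and the `Region` type, latency figures, and `pick_region` function are hypothetical names used for illustration, not part of the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    latency_ms: float  # latency as seen from the client (illustrative value)
    healthy: bool      # result of a health check on the regional endpoint


def pick_region(regions: list[Region]) -> str:
    """Return the healthy region with the lowest latency."""
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=lambda r: r.latency_ms).name


# A client close to sa-east-1 is normally routed there...
normal = [Region("us-east-1", 120.0, True), Region("sa-east-1", 15.0, True)]
# ...and is routed to us-east-1 once sa-east-1 fails its health check.
outage = [Region("us-east-1", 120.0, True), Region("sa-east-1", 15.0, False)]

print(pick_region(normal))
print(pick_region(outage))
```

Because the regions share no mesh, services, or data, the failover decision needs nothing beyond per-region health, which is what keeps the model this small.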


2. Core Principles

Why this matters: Establishing core principles creates a shared foundation for all engineering decisions, ensuring consistency across teams and reducing technical debt by guiding choices before they become problems.

2.1 Engineering Principles

| Principle | Description |
| --- | --- |
| GitOps | Git is the single source of truth for all infrastructure and application state |
| Zero-Trust | All service-to-service communication requires mTLS authentication |
| Immutability | Container images are referenced by digest, not mutable tags |
| Isolation | Blast radius minimized through account, cluster, and namespace separation |
| Self-Service | Developers can create and manage services through the Backstage portal |
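
The Immutability principle lends itself to an automated check. The sketch below is one possible way to test whether an image reference is pinned by digest; the function name and the registry host are hypothetical, and where such a check would run (CI, admission control) is not specified by the source.

```python
import re

# Digest form: <repository>@sha256:<64 lowercase hex characters>
_DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref: str) -> bool:
    """True when the reference ends in a sha256 digest rather than a tag."""
    return _DIGEST_REF.match(image_ref) is not None


digest = "a" * 64  # placeholder digest, not a real image
pinned = f"registry.example.com/payment-api@sha256:{digest}"
tagged = "registry.example.com/payment-api:v1.2.3"

print(is_pinned_by_digest(pinned))
print(is_pinned_by_digest(tagged))
```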

2.2 Security Model

Why this matters: A zero-trust security model protects against both external threats and internal lateral movement, ensuring that every request is authenticated and authorized regardless of its origin within the network.

┌─────────────────────────────────────────────────────────────────────────────┐
│                            ZERO-TRUST SECURITY                              │
│                                                                             │
│  ┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐       │
│  │    Identity      │    │    Network       │    │    Data          │       │
│  │                  │    │                  │    │                  │       │
│  │ • Keycloak OIDC  │    │ • Istio mTLS     │    │ • Secrets Manager│       │
│  │ • IRSA (AWS)     │    │ • Network Policies│   │ • KMS Encryption │       │
│  │ • Service Tokens │    │ • VPC Isolation  │    │ • RBAC Controls  │       │
│  └──────────────────┘    └──────────────────┘    └──────────────────┘       │
│                                                                             │
│  ┌──────────────────────────────────────────────────────────────────┐       │
│  │                    Authorization Flow                            │       │
│  │                                                                  │       │
│  │   Service A ──────────────────────────────────────▶ Service B    │       │
│  │       │                                                 │        │       │
│  │       │ 1. Get JWT from Keycloak                        │        │       │
│  │       │ 2. Include token in request                     │        │       │
│  │       │ 3. Istio validates mTLS + JWT                   │        │       │
│  │       │ 4. AuthorizationPolicy checks roles             │        │       │
│  │       │ 5. Request allowed or denied                    │        │       │
│  └──────────────────────────────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────────────────────────┘

📖 Full Documentation: See PRD Section 2 - Architecture Principles for detailed specifications. Related ADRs: ADR-003, ADR-004.
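
The five-step authorization flow above reduces to a short decision function. This is a simplified model, not how Istio is implemented: in the platform the checks are performed by the mesh sidecar, and the `Request` type and the role name are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Request:
    mtls_verified: bool  # peer certificate validated by the sidecar
    jwt_valid: bool      # token signature, issuer, and expiry checked
    roles: frozenset[str] = field(default_factory=frozenset)


def authorize(request: Request, required_role: str) -> bool:
    """Allow only when mTLS, the JWT, and the role check all pass."""
    if not request.mtls_verified:
        return False
    if not request.jwt_valid:
        return False
    return required_role in request.roles


caller = Request(True, True, frozenset({"payments-caller"}))
print(authorize(caller, "payments-caller"))
```

Each check is a hard gate: a valid token presented over a connection that failed mTLS is still denied.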


3. Technology Stack

Why this matters: Standardizing on a curated technology stack reduces cognitive overhead, enables shared expertise across teams, simplifies troubleshooting, and ensures all components are battle-tested and well-integrated.

3.1 Platform Components

┌─────────────────────────────────────────────────────────────────────────────┐
│                           TECHNOLOGY STACK                                  │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                        Developer Experience                         │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐         │    │
│  │  │ Backstage │  │  GitHub   │  │  GitHub   │  │ TechDocs  │         │    │
│  │  │  Portal   │  │   Repos   │  │  Actions  │  │           │         │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └───────────┘         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         GitOps & Deployment                         │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐         │    │
│  │  │  Argo CD  │  │   Helm    │  │  Kustomize│  │ AppSets   │         │    │
│  │  │           │  │  Charts   │  │           │  │           │         │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └───────────┘         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Runtime & Service Mesh                         │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐         │    │
│  │  │    EKS    │  │   Istio   │  │   KEDA    │  │  External │         │    │
│  │  │           │  │   mTLS    │  │ Autoscaler│  │  Secrets  │         │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └───────────┘         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         Observability                               │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐         │    │
│  │  │  Grafana  │  │   Mimir   │  │   Loki    │  │   Tempo   │         │    │
│  │  │           │  │  Metrics  │  │   Logs    │  │  Traces   │         │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └───────────┘         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Identity & Security                            │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐         │    │
│  │  │ Keycloak  │  │  AWS IAM  │  │  Secrets  │  │    KMS    │         │    │
│  │  │   OIDC    │  │   IRSA    │  │  Manager  │  │           │         │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └───────────┘         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                       API Management                                │    │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐         │    │
│  │  │   Kong    │  │  OpenAPI  │  │   Rate    │  │    JWT    │         │    │
│  │  │  Gateway  │  │   Specs   │  │  Limiting │  │   Auth    │         │    │
│  │  └───────────┘  └───────────┘  └───────────┘  └───────────┘         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

3.2 Version Requirements

Why this matters: Declaring a minimum supported version for each component gives every environment a shared baseline, prevents unexpected breaking changes, and allows coordinated upgrades with proper testing and rollback procedures.

| Component | Version | Purpose |
| --- | --- | --- |
| Kubernetes | 1.29+ | Container orchestration |
| Istio | 1.20+ | Service mesh |
| Argo CD | 2.10+ | GitOps controller |
| Keycloak | 24+ | Identity provider |
| Grafana | 10+ | Dashboards & alerting |
| KEDA | 2.12+ | Event-driven autoscaling |
| Kong | 3.6+ | API Gateway (DB-less mode) |

📖 Full Documentation: See PRD Section 4 - Technology Stack for detailed specifications.
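
The minimum versions in the table can be checked mechanically. The sketch below assumes plain numeric version strings such as `1.29.3` (suffixes like `-eks` are not handled), and the function names are hypothetical.

```python
# Minimum versions copied from the table above.
MINIMUMS = {
    "Kubernetes": (1, 29),
    "Istio": (1, 20),
    "Argo CD": (2, 10),
    "Keycloak": (24,),
    "Grafana": (10,),
    "KEDA": (2, 12),
    "Kong": (3, 6),
}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '1.29.3' or 'v2.10' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def meets_minimum(component: str, installed: str) -> bool:
    """Compare only as many version parts as the minimum specifies."""
    minimum = MINIMUMS[component]
    return parse_version(installed)[: len(minimum)] >= minimum


print(meets_minimum("Istio", "1.9.5"))
print(meets_minimum("Keycloak", "24.0.1"))
```

Comparing integer tuples rather than strings matters here: as text, `1.9` sorts after `1.20`, which would wrongly accept an older Istio.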


4. Developer Experience

Why this matters: A streamlined developer experience accelerates time-to-market, reduces friction in adopting platform standards, and empowers teams to focus on business logic rather than infrastructure concerns.

4.1 Service Creation Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                      NEW SERVICE CREATION (< 10 min)                         │
│                                                                              │
│   ┌─────────┐         ┌──────────────────────────────────────────────┐      │
│   │Developer│         │              Backstage Portal                │      │
│   └────┬────┘         │                                              │      │
│        │              │  ┌──────────────────────────────────────┐    │      │
│        │  1. Fill     │  │         Service Template Form        │    │      │
│        │  ─────────▶  │  │                                      │    │      │
│        │              │  │  Name:     [payment-api           ]  │    │      │
│        │              │  │  Owner:    [payments-team ▼       ]  │    │      │
│        │              │  │  Domain:   [payments ▼            ]  │    │      │
│        │              │  │  Language: [Java ▼                ]  │    │      │
│        │              │  │  Regions:  [☑ us-east-1 ☑ sa-east-1] │    │      │
│        │              │  │                                      │    │      │
│        │              │  │         [ Create Service ]           │    │      │
│        │              │  └──────────────────────────────────────┘    │      │
│        │              └──────────────────────────────────────────────┘      │
│        │                                    │                               │
│        │                         2. Automated Creation                      │
│        │                                    ▼                               │
│        │              ┌──────────────────────────────────────────────┐      │
│        │              │                                              │      │
│        │              │  ┌────────────────┐  ┌────────────────┐      │      │
│        │              │  │ payment-api    │  │ payment-api-   │      │      │
│        │              │  │ (code repo)    │  │ infra (GitOps) │      │      │
│        │              │  └────────────────┘  └────────────────┘      │      │
│        │              │                                              │      │
│        │              │  ┌────────────────┐  ┌────────────────┐      │      │
│        │              │  │ Argo CD App    │  │ Keycloak Client│      │      │
│        │              │  │ (multi-region) │  │ (GitOps sync)  │      │      │
│        │              │  └────────────────┘  └────────────────┘      │      │
│        │              │                                              │      │
│        │              └──────────────────────────────────────────────┘      │
│        │                                    │                                │
│        │  3. Start                          │                                │
│   ◀────┴─ coding!  ◀────────────────────────┘                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
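
The automated creation step can be pictured as a function from the form fields to the resources shown in the diagram. This is a sketch only: the repository names follow the `payment-api` / `payment-api-infra` pattern from the diagram, while the Argo CD application naming and the `ServiceRequest` type are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRequest:
    name: str
    owner: str
    domain: str
    language: str
    regions: tuple[str, ...]


def scaffold_plan(req: ServiceRequest) -> dict:
    """List the resources the template would create for one service."""
    return {
        "code_repo": req.name,
        "infra_repo": f"{req.name}-infra",
        # One Argo CD application per selected region (naming is assumed).
        "argocd_apps": [f"{req.name}-{region}" for region in req.regions],
        "keycloak_client": req.name,
    }


request = ServiceRequest(
    name="payment-api",
    owner="payments-team",
    domain="payments",
    language="Java",
    regions=("us-east-1", "sa-east-1"),
)
print(scaffold_plan(request))
```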

4.2 Repository Structure

Each service follows a standardized two-repository pattern (code + infra) enabling independent versioning and clear ownership boundaries.

📖 Full Documentation: See PRD Section 5 - Repository Structure for detailed specifications.

4.3 Deployment Pipeline

Why this matters: A GitOps-based deployment pipeline provides auditability, reproducibility, and automatic drift detection, while phased rollouts minimize blast radius and enable safe multi-region deployments.

┌─────────────────────────────────────────────────────────────────────────────┐
│                           GITOPS DEPLOYMENT FLOW                            │
│                                                                             │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌───────────┐  │
│  │   PR Open   │────▶│ Build/Test  │────▶│  Label:     │────▶│  HML      │  │
│  │             │     │  + Scan     │     │ deploy-hml  │     │  Deploy   │  │
│  └─────────────┘     └─────────────┘     └─────────────┘     └───────────┘  │
│                                                                     │       │
│                                                                     ▼       │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌───────────┐  │
│  │   Merge     │────▶│  Release    │────▶│   Phased    │────▶│  PROD     │  │
│  │   to main   │     │   v1.2.3    │     │   Rollout   │     │  Deploy   │  │
│  └─────────────┘     └─────────────┘     └─────────────┘     └───────────┘  │
│                                                                             │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                             │
│                         PHASED MULTI-REGION ROLLOUT                         │
│                                                                             │
│      ┌─────────────────┐                      ┌─────────────────┐           │
│      │   us-east-1     │                      │   sa-east-1     │           │
│      │                 │                      │                 │           │
│      │  1. Deploy      │    30 min delay      │  4. Deploy      │           │
│      │  2. Health ✓    │─────────────────────▶│  5. Health ✓    │           │
│      │  3. Monitor     │                      │  6. Complete    │           │
│      └─────────────────┘                      └─────────────────┘           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

📖 Full Documentation: See PRD Section 5 - Repository Structure, Section 6 - Deploy Flow, and Section 4.7 - Developer Portal for detailed specifications. Related ADRs: ADR-001, ADR-007.
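
The phased rollout can be sketched as a loop that deploys one region, checks health, and waits before moving on. The region order and the 30-minute delay come from the diagram; stopping the rollout when a health check fails is an assumption, and the `deploy`, `is_healthy`, and `wait` callables are hypothetical hooks rather than real platform APIs.

```python
from typing import Callable

REGION_ORDER = ["us-east-1", "sa-east-1"]
DELAY_SECONDS = 30 * 60  # monitoring window between regions


def phased_rollout(
    version: str,
    deploy: Callable[[str, str], None],
    is_healthy: Callable[[str], bool],
    wait: Callable[[float], None],
) -> list[str]:
    """Deploy region by region; return the regions that completed."""
    completed: list[str] = []
    for index, region in enumerate(REGION_ORDER):
        if index > 0:
            wait(DELAY_SECONDS)  # observe the previous region first
        deploy(region, version)
        if not is_healthy(region):
            break  # halt: later regions keep the previous version
        completed.append(region)
    return completed


# Dry run with stub hooks; a real caller might pass time.sleep as `wait`.
result = phased_rollout(
    "v1.2.3",
    deploy=lambda region, version: None,
    is_healthy=lambda region: True,
    wait=lambda seconds: None,
)
print(result)
```

Passing the hooks in as parameters keeps the ordering logic testable without touching a cluster or actually sleeping for 30 minutes.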


5. Observability

Why this matters: Comprehensive observability through metrics, logs, and traces enables rapid incident detection, root cause analysis, and data-driven capacity planning—essential for maintaining reliability at scale.

5.1 Three Pillars Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                         OBSERVABILITY ARCHITECTURE                           │
│                                                                              │
│                           ┌─────────────────┐                                │
│                           │ Grafana Central │                                │
│                           │  (prod/use1)    │                                │
│                           └────────┬────────┘                                │
│                                    │                                         │
│              ┌─────────────────────┼─────────────────────┐                   │
│              │                     │                     │                   │
│              ▼                     ▼                     ▼                   │
│   ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐          │
│   │    HML/use1      │  │   PROD/use1      │  │   PROD/sae1      │          │
│   │                  │  │                  │  │                  │          │
│   │ ┌──────────────┐ │  │ ┌──────────────┐ │  │ ┌──────────────┐ │          │
│   │ │    Mimir     │ │  │ │    Mimir     │ │  │ │    Mimir     │ │          │
│   │ │   Metrics    │ │  │ │   Metrics    │ │  │ │   Metrics    │ │          │
│   │ └──────────────┘ │  │ └──────────────┘ │  │ └──────────────┘ │          │
│   │ ┌──────────────┐ │  │ ┌──────────────┐ │  │ ┌──────────────┐ │          │
│   │ │    Loki      │ │  │ │    Loki      │ │  │ │    Loki      │ │          │
│   │ │    Logs      │ │  │ │    Logs      │ │  │ │    Logs      │ │          │
│   │ └──────────────┘ │  │ └──────────────┘ │  │ └──────────────┘ │          │
│   │ ┌──────────────┐ │  │ ┌──────────────┐ │  │ ┌──────────────┐ │          │
│   │ │    Tempo     │ │  │ │    Tempo     │ │  │ │    Tempo     │ │          │
│   │ │   Traces     │ │  │ │   Traces     │ │  │ │   Traces     │ │          │
│   │ └──────────────┘ │  │ └──────────────┘ │  │ └──────────────┘ │          │
│   └──────────────────┘  └──────────────────┘  └──────────────────┘          │
│                                                                              │
│   ┌──────────────────────────────────────────────────────────────────────┐  │
│   │                      Data Collection (per cluster)                    │  │
│   │                                                                       │  │
│   │    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    │  │
│   │    │ OTel Collector │    │ OTel Collector │    │ OTel Collector │    │  │
│   │    │   (Gateway)    │    │   (Gateway)    │    │   (Gateway)    │    │  │
│   │    └───────┬────────┘    └───────┬────────┘    └───────┬────────┘    │  │
│   │            │                     │                     │             │  │
│   │            ▼                     ▼                     ▼             │  │
│   │    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    │  │
│   │    │   Services     │    │   Services     │    │   Services     │    │  │
│   │    │ (auto-instr)   │    │ (auto-instr)   │    │ (auto-instr)   │    │  │
│   │    └────────────────┘    └────────────────┘    └────────────────┘    │  │
│   └──────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
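
One way to picture the central Grafana instance is as a set of data sources, one per backend per cluster. The sketch below generates such a list. The cluster labels come from the diagram and the `prometheus`, `loki`, and `tempo` data source types are standard Grafana types (Mimir is queried through the Prometheus type), but the URLs and naming scheme are hypothetical placeholders.

```python
CLUSTERS = ["hml-use1", "prod-use1", "prod-sae1"]

# Backend name -> Grafana data source type.
BACKENDS = {"mimir": "prometheus", "loki": "loki", "tempo": "tempo"}


def datasources() -> list[dict]:
    """Build one data source entry per backend per cluster."""
    return [
        {
            "name": f"{backend}-{cluster}",
            "type": ds_type,
            # Placeholder URL; real endpoints are not given in the source.
            "url": f"https://{backend}.{cluster}.example.internal",
            "access": "proxy",
        }
        for cluster in CLUSTERS
        for backend, ds_type in BACKENDS.items()
    ]


for entry in datasources():
    print(entry["name"], entry["type"])
```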

5.2 Key Dashboards

Why this matters: Purpose-built dashboards reduce mean time to detection (MTTD) by surfacing relevant information to the right audience, enabling faster response and better operational awareness.

| Dashboard | Purpose | Audience |
| --- | --- | --- |
| Service Overview | Health, latency, error rate per service | Dev Teams |
| Infrastructure | Node, pod, resource utilization | Platform Team |
| Istio Mesh | Traffic flow, mTLS status, policies | Platform Team |
| Keycloak | Auth requests, token issuance, failures | Security Team |
| Business SLOs | SLI/SLO tracking, error budgets | All Teams |

📖 Full Documentation: See PRD Section 7 - Observability for detailed specifications. Related ADR: ADR-002.
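
The error budgets tracked on the Business SLOs dashboard follow from a simple calculation. The sketch below uses a request-based availability SLO; the 99.9% target and the request counts are illustrative values, not figures from the platform.

```python
def error_budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the error budget left (negative once it is exhausted)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be between 0 and 1, exclusive")
    if total <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo) * total
    return 1.0 - failed / allowed_failures


# With a 99.9% SLO, 1,000,000 requests allow about 1,000 failures;
# 250 failures therefore leave roughly three quarters of the budget.
remaining = error_budget_remaining(slo=0.999, total=1_000_000, failed=250)
print(f"{remaining:.0%} of the error budget remains")
```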


6. Service Mesh & Security

Why this matters: A service mesh provides consistent security, observability, and traffic management across all services without requiring application code changes, enabling platform-wide policy enforcement.

6.1 Istio Traffic Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                          SERVICE MESH ARCHITECTURE                           │
│                                                                              │
│   External Traffic                                                           │
│         │                                                                    │
│         ▼                                                                    │
│   ┌─────────────┐                                                           │
│   │   Istio     │                                                           │
│   │  Ingress    │                                                           │
│   │  Gateway    │                                                           │
│   └──────┬──────┘                                                           │
│          │                                                                   │
│          ▼                                                                   │
│   ┌──────────────────────────────────────────────────────────────────────┐  │
│   │                          Service Mesh                                 │  │
│   │                                                                       │  │
│   │  ┌─────────────────────┐        ┌─────────────────────┐              │  │
│   │  │     Service A       │        │     Service B       │              │  │
│   │  │  ┌───────────────┐  │  mTLS  │  ┌───────────────┐  │              │  │
│   │  │  │   App Pod     │  │◀──────▶│  │   App Pod     │  │              │  │
│   │  │  │  ┌─────────┐  │  │        │  │  ┌─────────┐  │  │              │  │
│   │  │  │  │ Envoy   │  │  │        │  │  │ Envoy   │  │  │              │  │
│   │  │  │  │ Sidecar │  │  │        │  │  │ Sidecar │  │  │              │  │
│   │  │  │  └─────────┘  │  │        │  │  └─────────┘  │  │              │  │
│   │  │  └───────────────┘  │        │  └───────────────┘  │              │  │
│   │  └─────────────────────┘        └─────────────────────┘              │  │
│   │                                                                       │  │
│   │  ┌───────────────────────────────────────────────────────────────┐   │  │
│   │  │                      Security Policies                         │   │  │
│   │  │                                                                │   │  │
│   │  │   PeerAuthentication ──▶ mTLS STRICT (no plaintext)           │   │  │
│   │  │   RequestAuthentication ──▶ JWT validation via Keycloak       │   │  │
│   │  │   AuthorizationPolicy ──▶ Role-based access control           │   │  │
│   │  │                                                                │   │  │
│   │  └───────────────────────────────────────────────────────────────┘   │  │
│   └──────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
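
The three security policies in the diagram correspond to Istio resources that can be expressed as plain manifests. The sketch below builds them as Python dictionaries and prints them as JSON. The resource kinds and fields are Istio's; the namespace, realm URL, workload label, and role name are hypothetical, and the platform's actual policies may differ.

```python
import json

NAMESPACE = "payments"  # hypothetical namespace
ISSUER = "https://keycloak.example.com/realms/platform"  # hypothetical realm

# Reject plaintext traffic to workloads in the namespace.
peer_authentication = {
    "apiVersion": "security.istio.io/v1beta1",
    "kind": "PeerAuthentication",
    "metadata": {"name": "default", "namespace": NAMESPACE},
    "spec": {"mtls": {"mode": "STRICT"}},
}

# Validate JWTs against the Keycloak realm's signing keys.
request_authentication = {
    "apiVersion": "security.istio.io/v1beta1",
    "kind": "RequestAuthentication",
    "metadata": {"name": "keycloak-jwt", "namespace": NAMESPACE},
    "spec": {
        "jwtRules": [
            {"issuer": ISSUER, "jwksUri": f"{ISSUER}/protocol/openid-connect/certs"}
        ]
    },
}

# Allow only callers whose token carries the required realm role.
authorization_policy = {
    "apiVersion": "security.istio.io/v1beta1",
    "kind": "AuthorizationPolicy",
    "metadata": {"name": "payment-api-callers", "namespace": NAMESPACE},
    "spec": {
        "selector": {"matchLabels": {"app": "payment-api"}},
        "action": "ALLOW",
        "rules": [
            {
                "from": [{"source": {"requestPrincipals": ["*"]}}],
                "when": [
                    {
                        "key": "request.auth.claims[realm_access][roles]",
                        "values": ["payments-caller"],
                    }
                ],
            }
        ],
    },
}

for manifest in (peer_authentication, request_authentication, authorization_policy):
    print(json.dumps(manifest, indent=2))
```

`RequestAuthentication` alone only rejects invalid tokens; requests with no token still pass it, which is why the `AuthorizationPolicy` requires a request principal.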

6.2 Service-to-Service Authorization

Why this matters: Explicit service-to-service authorization ensures that only permitted services can communicate, preventing unauthorized access and limiting the blast radius of compromised components.

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SERVICE AUTHORIZATION FLOW                               │
│                                                                             │
│  ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐                │
│  │ Order   │     │Keycloak │     │ Istio   │     │ Payment │                │
│  │ Service │     │         │     │ Sidecar │     │ Service │                │
│  └────┬────┘     └────┬────┘     └────┬────┘     └────┬────┘                │
│       │               │               │               │                     │
│       │ 1. Get token  │               │               │                     │
│       │──────────────▶│               │               │                     │
│       │               │               │               │                     │
│       │◀──────────────│               │               │                     │
│       │   JWT token   │               │               │                     │
│       │               │               │               │                     │
│       │ 2. Call with token            │               │                     │
│       │───────────────────────────────▶               │                     │
│       │                               │               │                     │
│       │               │     3. Validate JWT           │                     │
│       │               │     Check roles               │                     │
│       │               │     Verify mTLS               │                     │
│       │               │◀ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ▶│                     │
│       │               │               │               │                     │
│       │               │               │ 4. Forward    │                     │
│       │               │               │───────────────▶                     │
│       │               │               │               │                     │
│       │               │               │◀──────────────│                     │
│       │◀──────────────────────────────│  Response     │                     │
│       │               │               │               │                     │
└─────────────────────────────────────────────────────────────────────────────┘
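
The checks in step 3 can be modeled as a single decision function. This is only an illustrative sketch, not Istio's implementation: it assumes the JWT signature has already been verified, that roles arrive in a Keycloak-style `realm_access.roles` claim, and that the caller's mTLS identity is available as a principal string. The `Rule` type, the role name `payment:write`, and the principal value are hypothetical.

```python
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class Rule:
    """Hypothetical allow rule: which caller may reach a service, with which role."""
    allowed_principal: str  # mTLS identity of the caller (assumed format)
    required_role: str      # role that must appear in the JWT


def is_allowed(rule, peer_principal, claims, now=None):
    """Return True only if mTLS identity, token expiry, and role all check out.

    `claims` is the already signature-verified JWT payload as a dict.
    """
    now = time.time() if now is None else now
    if peer_principal != rule.allowed_principal:
        return False  # caller's workload identity is not permitted
    if claims.get("exp", 0) <= now:
        return False  # token expired or carries no expiry
    roles = claims.get("realm_access", {}).get("roles", [])
    return rule.required_role in roles


ORDER_PRINCIPAL = "cluster.local/ns/order-service/sa/order-service"  # assumed
rule = Rule(allowed_principal=ORDER_PRINCIPAL, required_role="payment:write")
claims = {"exp": 2_000_000_000, "realm_access": {"roles": ["payment:write"]}}
```

A request is forwarded (step 4) only when every check passes; a failure in any one of them is enough to reject the call, which is what keeps the blast radius of a compromised service small.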

📖 Full Documentation: See PRD Section 8 - Service Mesh (Istio) and Section 11 - Security and Governance for detailed specifications. Related ADRs: ADR-004, ADR-005, ADR-008.


7. API Management

Why this matters: A centralized API management layer provides consistent authentication, rate limiting, and documentation across all services, enabling secure external API exposure while maintaining self-service capabilities for development teams.

Kong Gateway is the API management layer. It runs in DB-less mode, with all configuration managed via GitOps. Kong instances are organized by business domain (e.g., ecommerce-external, payments-internal), so teams manage their own API configuration independently while the platform team maintains the underlying infrastructure.

7.1 Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                           API MANAGEMENT FLOW                               │
│                                                                             │
│   External Traffic                          Internal Traffic                │
│         │                                         │                         │
│         ▼                                         │                         │
│   ┌───────────┐                                   │                         │
│   │    WAF    │                                   │                         │
│   │  (Shield) │                                   │                         │
│   └─────┬─────┘                                   │                         │
│         ▼                                         ▼                         │
│   ┌─────────────────┐                    ┌─────────────────┐                │
│   │ Kong External   │                    │ Kong Internal   │                │
│   │ (LoadBalancer)  │                    │ (ClusterIP)     │                │
│   │                 │                    │                 │                │
│   │ • JWT Auth      │                    │ • Routing       │                │
│   │ • Rate Limit    │                    │ • Load Balance  │                │
│   │ • Request Trans │                    │ • Plugins       │                │
│   └────────┬────────┘                    └────────┬────────┘                │
│            │                                      │                         │
│            └──────────────┬───────────────────────┘                         │
│                           ▼                                                 │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                    Istio Service Mesh (mTLS)                        │   │
│   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                  │   │
│   │  │ checkout-api│  │  cart-api   │  │ payment-api │  ◀── Envoy      │   │
│   │  │ + sidecar   │  │ + sidecar   │  │ + sidecar   │      Sidecars   │   │
│   │  └─────────────┘  └─────────────┘  └─────────────┘                  │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                      GitOps Configuration                           │   │
│   │                                                                     │   │
│   │  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │   │
│   │  │  Backstage  │───▶│platform-apis│───▶│ Kong Runtime (DB-less) │  │   │
│   │  │  (Portal)   │    │ (Git Repo)  │    │ Config via ConfigMap   │  │   │
│   │  └─────────────┘    └─────────────┘    └─────────────────────────┘  │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘

7.2 Key Capabilities

| Capability | Description |
| --- | --- |
| External API Exposure | Public APIs with WAF integration, rate limiting, and JWT authentication via Keycloak |
| Internal API Routing | Service-to-service communication within and across domains |
| Centralized OpenAPI Specs | API documentation automatically published to Developer Portal (Backstage) |
| Self-Service Configuration | Teams manage routes and plugins via PRs to platform-apis repository |

7.3 Instance Types

| Type | Network | Use Case |
| --- | --- | --- |
| Internal | ClusterIP | APIs consumed within the cluster |
| External | LoadBalancer + WAF | APIs exposed to the internet |
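
To make the difference between the two instance types concrete, here is a rough sketch that builds one service entry for a Kong declarative (DB-less) config. It is not taken from the platform-apis repository: the helper `kong_service`, the upstream URL and port, the route path, and the 60 requests/minute limit are all placeholder assumptions. The sketch emits JSON, which Kong's declarative format accepts alongside YAML.

```python
import json


def kong_service(name, upstream_url, path, external):
    """Build one service entry for a Kong declarative (DB-less) config.

    External instances get JWT auth and rate limiting attached, mirroring the
    architecture diagram; the plugin settings are placeholders.
    """
    plugins = []
    if external:
        plugins = [
            {"name": "jwt"},
            {"name": "rate-limiting", "config": {"minute": 60, "policy": "local"}},
        ]
    return {
        "name": name,
        "url": upstream_url,
        "routes": [{"name": f"{name}-route", "paths": [path]}],
        "plugins": plugins,
    }


config = {
    "_format_version": "3.0",
    "services": [
        kong_service(
            "checkout-api",
            "http://checkout-api.checkout-api.svc.cluster.local:8080",  # assumed
            "/checkout",
            external=True,
        )
    ],
}
rendered = json.dumps(config, indent=2)
```

Generating the file from a small function like this keeps the external-only plugins from being forgotten when a team adds a route through a PR.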

📚 Deep Dive: See PRD Section 10 - API Management and ADR-013: Kong API Gateway for complete details on repository structure, configuration patterns, and Backstage integration.


8. Infrastructure as Code

Why this matters: Infrastructure as Code ensures reproducible, auditable, and version-controlled infrastructure changes, eliminating configuration drift and enabling disaster recovery through complete environment reconstruction.

8.1 Repository Organization

┌─────────────────────────────────────────────────────────────────────────────┐
│                         PLATFORM REPOSITORIES                                │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                        Managed by Platform Team                         │ │
│  │                                                                         │ │
│  │  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐      │ │
│  │  │platform-terraform│  │platform-argocd-  │  │platform-helm-    │      │ │
│  │  │                  │  │config            │  │charts            │      │ │
│  │  │ • VPC/EKS/RDS    │  │ • App-of-Apps    │  │ • service-base   │      │ │
│  │  │ • IAM/KMS        │  │ • ApplicationSets│  │ • cronjob-base   │      │ │
│  │  │ • Per region     │  │ • Per cluster    │  │ • Shared helpers │      │ │
│  │  └──────────────────┘  └──────────────────┘  └──────────────────┘      │ │
│  │                                                                         │ │
│  │  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐      │ │
│  │  │platform-         │  │platform-keycloak │  │platform-pipelines│      │ │
│  │  │observability     │  │                  │  │                  │      │ │
│  │  │ • Grafana stack  │  │ • Realm configs  │  │ • CI/CD templates│      │ │
│  │  │ • Dashboards     │  │ • Clients (JSON) │  │ • Actions        │      │ │
│  │  │ • Alerts         │  │ • Per account    │  │ • Phased rollout │      │ │
│  │  └──────────────────┘  └──────────────────┘  └──────────────────┘      │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │                     Managed by Application Teams                        │ │
│  │                                                                         │ │
│  │  ┌──────────────────┐  ┌──────────────────┐                            │ │
│  │  │ service-xyz      │  │ service-xyz-infra│     (created via           │ │
│  │  │                  │  │                  │      Backstage)            │ │
│  │  │ • Source code    │  │ • Helm values    │                            │ │
│  │  │ • Dockerfile     │  │ • Istio policies │                            │ │
│  │  │ • CI pipeline    │  │ • Per region     │                            │ │
│  │  └──────────────────┘  └──────────────────┘                            │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘

8.2 Environment Structure

Why this matters: A consistent environment structure across HML (homologation, the pre-production environment) and production ensures parity, reduces environment-specific bugs, and enables predictable deployments with region-specific configurations.

environments/
├── hml/
│   └── us-east-1/
│       ├── values.yaml          # HML-specific config
│       └── istio/               # Istio policies
│           └── authorization-policy.yaml
└── prod/
    ├── us-east-1/               # Primary region
    │   ├── values.yaml
    │   └── istio/
    └── sa-east-1/               # Secondary region
        ├── values.yaml
        └── istio/
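
Tooling that walks this tree can derive every path from the environment and region alone. The sketch below assumes the layout shown above is the complete set of targets; the names `LAYOUT`, `values_file`, and `all_targets` are hypothetical helpers, not part of the platform.

```python
from pathlib import PurePosixPath

# Environment/region pairs taken from the tree above.
LAYOUT = {
    "hml": ["us-east-1"],
    "prod": ["us-east-1", "sa-east-1"],
}


def values_file(env, region):
    """Return the values.yaml path for one environment/region pair."""
    if region not in LAYOUT.get(env, []):
        raise ValueError(f"no deployment target for {env}/{region}")
    return PurePosixPath("environments") / env / region / "values.yaml"


def all_targets():
    """List every (env, region) pair in the order the tree shows them."""
    return [(env, region) for env, regions in LAYOUT.items() for region in regions]
```

For example, `values_file("prod", "sa-east-1")` resolves to `environments/prod/sa-east-1/values.yaml`, while an unknown pair such as `hml`/`sa-east-1` raises an error instead of silently producing a path that does not exist.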

📖 Full Documentation: See PRD Section 4.1 - Infrastructure and Section 5 - Repository Structure for detailed specifications. Related ADR: ADR-001.


9. Key Metrics & SLOs

Why this matters: Defining clear KPIs and SLOs establishes measurable reliability targets, enables objective decision-making around technical investments, and creates accountability for platform health.

9.1 Platform KPIs

| Metric | Target |
| --- | --- |
| Deployment Frequency | ≥ 10/day/service |
| Lead Time for Changes | < 1 hour |
| Mean Time to Recovery (MTTR) | < 30 minutes |
| Change Failure Rate | < 5% |
| Platform Uptime | 99.9% |
| Region Failover Time | < 5 minutes |

9.2 Service Tiers

Why this matters: Tiered service classification allows appropriate resource allocation, sets realistic expectations per service criticality, and prevents over-engineering of non-critical components.

| Tier | Availability | Error Rate | Latency P99 | Examples |
| --- | --- | --- | --- | --- |
| Tier 1 | 99.99% | < 0.01% | < 100ms | Payment, Auth |
| Tier 2 | 99.9% | < 0.1% | < 500ms | Orders, Users |
| Tier 3 | 99.5% | < 1% | < 2s | Reports, Analytics |
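
The availability targets are easier to reason about as a downtime budget. The sketch below converts each tier's target into allowed minutes of downtime; the 30-day measurement window is an assumption, since the window is not specified here.

```python
from decimal import Decimal

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 (assumed measurement window)


def downtime_budget_minutes(availability_percent):
    """Allowed downtime in minutes over a 30-day window for a given target."""
    unavailable = (Decimal("100") - Decimal(availability_percent)) / Decimal("100")
    return float(unavailable * MINUTES_PER_30_DAYS)


TIER_TARGETS = {"Tier 1": "99.99", "Tier 2": "99.9", "Tier 3": "99.5"}
budgets = {tier: downtime_budget_minutes(slo) for tier, slo in TIER_TARGETS.items()}
# budgets == {"Tier 1": 4.32, "Tier 2": 43.2, "Tier 3": 216.0}
```

Under that assumption a Tier 1 service has about 4 minutes of downtime to spend per window, against roughly 3.6 hours for Tier 3, which is why the tiers justify very different levels of engineering investment.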

📖 Full Documentation: See PRD Section 15 - Success Metrics for detailed specifications.


10. Resilience & HA

Why this matters: Designing for resilience ensures the platform can withstand and recover from failures gracefully, protecting revenue and user trust during inevitable infrastructure incidents.

10.1 Failure Scenarios

| Scenario | RTO | Strategy |
| --- | --- | --- |
| Pod crash | < 30s | Kubernetes auto-restart |
| Node failure | < 5min | Pod rescheduling + PDB |
| AZ failure | < 15min | Multi-AZ node groups |
| Region failure | < 5min | Route53 failover |

10.2 HA Configuration

Why this matters: Explicit HA configuration ensures services remain available during maintenance, deployments, and partial failures by distributing workloads and maintaining minimum healthy replicas.

┌─────────────────────────────────────────────────────────────────────────────┐
│                         HIGH AVAILABILITY DESIGN                             │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                          Per Service                                 │   │
│   │                                                                      │   │
│   │   • replicas: 3 (minimum for production)                            │   │
│   │   • PodDisruptionBudget: minAvailable: 2                            │   │
│   │   • TopologySpreadConstraints: spread across AZs                    │   │
│   │   • Liveness/Readiness probes configured                            │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                        Per Cluster/Region                            │   │
│   │                                                                      │   │
│   │   • 3 Availability Zones                                            │   │
│   │   • Node groups spread across AZs                                   │   │
│   │   • RDS Multi-AZ (for Keycloak)                                     │   │
│   │   • S3 cross-region replication (for observability)                 │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                           Global                                     │   │
│   │                                                                      │   │
│   │   • Route53 latency-based routing                                   │   │
│   │   • Route53 health checks with failover                             │   │
│   │   • ECR in primary region (cross-region pull)                       │   │
│   │   • Keycloak config sync via GitOps                                 │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────┘
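
The per-service numbers interact in a way worth spelling out. With `minAvailable`, a PodDisruptionBudget permits voluntary evictions only while the healthy pod count exceeds the minimum. The sketch below works through the 3-replica, `minAvailable: 2` case; the zone names and the even-spread helper are illustrative assumptions rather than scheduler behavior.

```python
def allowed_disruptions(healthy_pods, min_available):
    """Voluntary evictions a PodDisruptionBudget with minAvailable permits now."""
    return max(0, healthy_pods - min_available)


def spread_across_zones(replicas, zones):
    """Even spread of pods per zone, the outcome topology spreading aims for."""
    base, extra = divmod(replicas, len(zones))
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]  # assumed AZ names
placement = spread_across_zones(3, ZONES)

# Normal operation: one pod may be evicted at a time (e.g. during a node drain).
normal = allowed_disruptions(sum(placement.values()), min_available=2)

# One AZ lost: two pods remain, so no further voluntary eviction is allowed.
surviving = sum(count for zone, count in placement.items() if zone != "us-east-1a")
during_az_failure = allowed_disruptions(surviving, min_available=2)
```

In this model `normal` is 1 and `during_az_failure` is 0: the service stays at its minimum through an AZ outage, but node drains are blocked until the lost replica is rescheduled.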

📖 Full Documentation: See PRD Section 12 - Resilience and High Availability for detailed specifications. Related ADRs: ADR-006, ADR-009.


11. Application Resources (Self-Service)

Why this matters: Self-service resource provisioning eliminates bottlenecks on platform teams, accelerates development velocity, and ensures resources are created with consistent security and compliance standards.

Application teams can provision AWS resources (SQS, RDS, S3, etc.) via self-service in Backstage.

Resource Provisioning Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                     APPLICATION RESOURCE PROVISIONING                        │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                         Backstage Portal                               │  │
│  │                                                                        │  │
│  │   Developer fills form:                                                │  │
│  │   • Service: payment-api                                               │  │
│  │   • Type: SQS                                                          │  │
│  │   • Name: order-events                                                 │  │
│  │   • Regions: us-east-1, sa-east-1                                     │  │
│  │                                                                        │  │
│  └────────────────────────────────┬──────────────────────────────────────┘  │
│                                   │                                          │
│                                   ▼                                          │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                      GitHub Pull Request                               │  │
│  │                                                                        │  │
│  │   platform-terraform-aws-projects-resources/                           │  │
│  │   └── helpdev-prod/prod/us-east-1/sqs/payment-api/order-events.tf     │  │
│  │                                                                        │  │
│  │   Uses versioned module:                                               │  │
│  │   source = "...platform-terraform-aws-modules//modules/sqs?ref=v1.2.0" │  │
│  │                                                                        │  │
│  └────────────────────────────────┬──────────────────────────────────────┘  │
│                                   │                                          │
│                    ┌──────────────┴──────────────┐                          │
│                    │         Approval             │                          │
│                    │  HML: 1 team member          │                          │
│                    │  PROD: 2 (+ Platform Team)   │                          │
│                    └──────────────┬──────────────┘                          │
│                                   │                                          │
│                                   ▼                                          │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                     GitHub Actions                                     │  │
│  │                                                                        │  │
│  │   terraform plan → terraform apply → Store outputs in Secrets Manager │  │
│  │                                                                        │  │
│  └────────────────────────────────┬──────────────────────────────────────┘  │
│                                   │                                          │
│                                   ▼                                          │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                     Service Consumption                                │  │
│  │                                                                        │  │
│  │   AWS Secrets Manager → External Secrets Operator → K8s Secret → Pod │  │
│  │                                                                        │  │
│  │   env:                                                                 │  │
│  │     - name: SQS_ORDER_EVENTS_URL                                       │  │
│  │       valueFrom:                                                       │  │
│  │         secretKeyRef:                                                  │  │
│  │           name: payment-api-resources                                  │  │
│  │           key: sqs-order-events-url                                    │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
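
Every name in the flow above follows from the fields of the Backstage form. The sketch below derives them for the example request; the `ResourceRequest` class and its method names are hypothetical, and the patterns are inferred from the single example shown in the diagram rather than from a documented rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRequest:
    """One resource requested through the portal form (hypothetical model)."""
    service: str        # e.g. "payment-api"
    resource_type: str  # e.g. "sqs"
    name: str           # e.g. "order-events"
    account: str        # e.g. "helpdev-prod"
    env: str            # e.g. "prod"
    region: str         # e.g. "us-east-1"

    def terraform_path(self):
        """Path of the generated .tf file inside the resources repository."""
        return (f"{self.account}/{self.env}/{self.region}/"
                f"{self.resource_type}/{self.service}/{self.name}.tf")

    def secret_key(self, output="url"):
        """Key inside the Kubernetes Secret that holds one Terraform output."""
        return f"{self.resource_type}-{self.name}-{output}"

    def env_var(self, output="url"):
        """Environment variable name exposed to the pod."""
        return self.secret_key(output).replace("-", "_").upper()

    def k8s_secret_name(self):
        """Name of the Kubernetes Secret synced for the service."""
        return f"{self.service}-resources"


request = ResourceRequest(
    service="payment-api", resource_type="sqs", name="order-events",
    account="helpdev-prod", env="prod", region="us-east-1",
)
```

For this request the helpers reproduce the values in the diagram: the file `helpdev-prod/prod/us-east-1/sqs/payment-api/order-events.tf`, the secret `payment-api-resources`, the key `sqs-order-events-url`, and the variable `SQS_ORDER_EVENTS_URL`.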

Available Resource Types

| Resource | Module | Use Cases |
| --- | --- | --- |
| SQS | sqs/ | Message queues, DLQ |
| SNS | sns/ | Pub/sub, notifications |
| RDS | rds/ | PostgreSQL, MySQL databases |
| S3 | s3/ | Object storage, backups |
| DynamoDB | dynamodb/ | NoSQL, sessions |
| ElastiCache | elasticache/ | Redis, Memcached |

Repository Structure

Resources are defined per service using versioned Terraform modules, and the definitions follow an account/env/region hierarchy.

📖 Full Documentation: See PRD Section 5 - Repository Structure and Section 4.7 - Developer Portal for detailed specifications. Related ADR: ADR-011.


12. Architecture Decision Records

Why this matters: ADRs capture the context and rationale behind architectural choices, preventing repeated discussions, enabling informed future decisions, and preserving institutional knowledge as team members change.

Key architectural decisions are documented as ADRs:

| ADR | Decision |
| --- | --- |
| ADR-001 | Distributed Argo CD (per cluster) |
| ADR-002 | Federated observability with central visualization |
| ADR-003 | Container images by digest |
| ADR-004 | Istio for service mesh |
| ADR-005 | Keycloak as identity provider |
| ADR-006 | Keycloak per environment |
| ADR-007 | Backstage as developer portal |
| ADR-008 | AWS Secrets Manager + External Secrets |
| ADR-009 | Isolated multi-region architecture |
| ADR-010 | Keycloak GitOps sync strategy |
| ADR-011 | Application Resources via Self-Service |
| ADR-012 | Dedicated Shared Services Account |
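
As one example of how a decision can be enforced mechanically, ADR-003 (container images by digest) lends itself to a simple lint check. The sketch below is an assumption about how such a check might look, not the platform's actual tooling; the registry host in the usage note is a placeholder.

```python
import re

# An image reference pinned by digest ends in "@sha256:" plus 64 hex characters.
DIGEST_REF = re.compile(r"[^@\s]+@sha256:[0-9a-f]{64}")


def is_pinned_by_digest(image_ref):
    """Return True if the image reference is pinned to an immutable digest."""
    return DIGEST_REF.fullmatch(image_ref) is not None
```

A reference such as `registry.example.com/payment-api:1.4.2` fails the check because a tag can be moved to a different image, whereas a digest always identifies exactly one image.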

📖 Full Documentation: See PRD Section 16 - Architecture Decisions (ADRs) for detailed specifications.


Quick Reference

Why this matters: Consistent naming conventions and quick reference materials reduce cognitive load, enable automation, and ensure resources can be easily identified and managed across environments and teams.

Naming Conventions

| Resource | Pattern | Example |
| --- | --- | --- |
| Cluster | eks-{env}-{region_short} | eks-prod-use1 |
| Namespace | {service-name} | payment-api |
| Repository | service-{name} / service-{name}-infra | service-payment-api |
| Secret Path | helpdev/{env}/{region}/{domain}/{service}/{type} | helpdev/prod/us-east-1/payments/checkout/database |

Region Codes

| AWS Region | Short Code |
| --- | --- |
| us-east-1 | use1 |
| sa-east-1 | sae1 |
| eu-west-1 | euw1 |
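
Because the conventions are plain string patterns, they are straightforward to encode once and reuse in automation. The helpers below are a minimal sketch of the naming patterns and region codes in this section; the function names are hypothetical.

```python
REGION_CODES = {"us-east-1": "use1", "sa-east-1": "sae1", "eu-west-1": "euw1"}


def cluster_name(env, region):
    """eks-{env}-{region_short}"""
    return f"eks-{env}-{REGION_CODES[region]}"


def secret_path(env, region, domain, service, secret_type):
    """helpdev/{env}/{region}/{domain}/{service}/{type}"""
    return f"helpdev/{env}/{region}/{domain}/{service}/{secret_type}"


def repositories(name):
    """service-{name} and service-{name}-infra"""
    return (f"service-{name}", f"service-{name}-infra")
```

For instance, `cluster_name("prod", "us-east-1")` gives `eks-prod-use1`, and `secret_path("prod", "us-east-1", "payments", "checkout", "database")` gives `helpdev/prod/us-east-1/payments/checkout/database`, matching the examples in the naming table.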

Full Technical Specification: See PRD-kubernetes-multi-account-aws.md
