diff --git a/docs/superpowers/specs/2026-05-05-environment-setup-automation-design.md b/docs/superpowers/specs/2026-05-05-environment-setup-automation-design.md
new file mode 100644
index 0000000000000..2741b9beafdc4
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-05-environment-setup-automation-design.md
@@ -0,0 +1,547 @@
# Environment Setup Automation Design

**Date:** 2026-05-05
**Last revised:** 2026-05-12
**Status:** Draft
**Scope:** `ddev lab` — agentic framework for provisioning, seeding, and load-testing Datadog integration environments on remote cloud infrastructure (AWS and GCP)

---

## 1. Problem and Goals

Setting up environments for Datadog integrations is one of the highest-friction steps in integration development. The pain concentrates in three areas:

1. **Infrastructure complexity** — Some integrations (Oracle DB, IBM MQ, Kafka + ZooKeeper, Lustre clusters) require real VMs, licenses, or multi-node topologies that cannot run locally and require significant manual effort.
2. **Data quality** — Test datasets are minimal or absent; developers run checks against empty systems that don't reflect production behavior, masking metric collection gaps.
3. **Portability** — Environment setup knowledge is tribal. When a developer leaves or switches integrations, the environment has to be reconstructed from scratch.

### Goals

| Priority | Goal |
|----------|------|
| Primary | Provision a fully running integration environment on remote cloud infrastructure with a single command |
| Primary | Seed the environment with realistic-looking data |
| Primary | Generate continuous background load that exercises the integration's metric surface |
| Primary | Include a configured Datadog Agent in every lab, ready to run the integration check |
| Optional | Produce a human-readable record of what was set up and why |

### Non-goals

- Replacing `ddev env` for local unit/integration testing — `ddev lab` is for E2E and exploratory work against live infrastructure
- Supporting every integration on day one — start with the hardest ones (Oracle, IBM MQ, Kafka, Cassandra, Lustre)
- Running any part of the environment on a developer's local machine — all labs are remote

---

## 2. Architecture Overview

The system has two distinct layers:

```
┌──────────────────────────────────────────────────────────────────┐
│ AI Research Phase (runs once per integration version)            │
│                                                                  │
│ Sources: vendor docs, Docker Hub, metadata.csv, manifest.json    │
│ (Does NOT require existing tests or compose files)               │
│                                                                  │
│ Produces complete tests/lab/ subtree:                            │
│   lab.yaml                  ← machine-readable manifest          │
│   tests/lab/RESEARCH.md     ← human-readable narrative           │
│   tests/lab/compose/        ← service topology (Docker)          │
│   tests/lab/seed/           ← numbered, idempotent scripts       │
│   tests/lab/load/           ← Locust or k6 script                │
│   tests/lab/agent/          ← Datadog Agent integration config   │
│   tests/lab/provision/      ← Ansible playbook (bare-metal only) │
│   [starter Pulumi program]  ← reviewed + merged to infra/        │
└────────────────────────┬─────────────────────────────────────────┘
                         │ artifacts reviewed + committed
                         ▼
┌──────────────────────────────────────────────────────────────────┐
│ Deterministic Execution Layer (ddev lab CLI)                     │
│                                                                  │
│  create:  pulumi up → wait SSH → healthchecks →                  │
│           seed → load → start Agent →
│           register                                               │
│                                                                  │
│  stop:    stop load → docker compose down → update registry      │
│                                                                  │
│  destroy: stop load → stop services → pulumi destroy →           │
│           deregister                                             │
└──────────────────────────────────────────────────────────────────┘
```

No AI runs at execution time. The AI phase is a one-time investment per integration version, repeatable when the technology version changes. The execution layer is a simple deterministic pipeline any team member can run.

---

## 3. The Research Phase

### Trigger

```bash
ddev lab research <integration>                          # generate all artifacts fresh
ddev lab research <integration> --update --version 3.8   # update for a new tech version
```

### Information sources

The research agent has access to the following — **and nothing else**. The integration may be brand new with no existing tests or compose files, so the agent cannot rely on them.

| Source | What it provides |
|--------|-----------------|
| `<integration>/metadata.csv` | The full set of metrics the integration collects — tells the agent what data must exist to make metric values non-zero |
| `<integration>/manifest.json` | Integration display name, tags, categories — context for documentation searches |
| Vendor documentation (WebFetch/WebSearch) | Service topology requirements, configuration, resource minimums |
| Docker Hub (WebFetch) | Canonical image names, available tags, recommended versions |

### What the agent produces

The agent writes a complete `tests/lab/` subtree under the integration directory:

```
<integration>/
  lab.yaml                   ← machine-readable manifest (see Section 4)
  tests/
    lab/
      RESEARCH.md            ← human-readable narrative (see below)
      compose/
        docker-compose.yml   ← service(s) + Datadog Agent as Docker services
      seed/
        01_<name>.sh         ← numbered, idempotent, ordered
        02_<name>.py
        ...
      load/
        locustfile.py        ← or k6_script.js
      agent/
        conf.yaml            ← Datadog Agent integration config (instances, logs)
      healthcheck/
        check_<name>.sh      ← custom healthcheck scripts (script type only)
      provision/             ← bare-metal and cluster runtimes only
        install_<integration>.yml    ← Ansible playbook
        teardown_<integration>.yml
```

Additionally, the agent produces a **starter Pulumi program** (`infra/labs/<integration>/`) that a team member reviews and merges alongside the integrations-core artifacts (see Section 7).

### RESEARCH.md — human-readable narrative

`lab.yaml` is clean YAML with no inline comments. The agent's reasoning lives in `tests/lab/RESEARCH.md` instead. This file explains:

- What this technology is and how it is typically deployed in production
- Why specific choices were made — instance type, service topology, partition counts, network configuration
- Which metrics each seed script targets (referenced from `metadata.csv`)
- Operational gotchas identified in the vendor documentation
- Links to the specific documentation pages and Docker Hub tags consulted

The human reviewer reads `RESEARCH.md` to audit the agent's reasoning without needing to repeat the research. When the lab is updated for a new tech version, the agent rewrites the affected sections of `RESEARCH.md` alongside the affected artifacts.

### Re-generation

`ddev lab research --update kafka --version 3.8` diffs against the existing artifacts. The agent fetches the new version's changelog and release notes, identifies changed APIs or configuration keys, and updates the affected files. Unchanged files are left untouched.

---

## 4. The `lab.yaml` Schema

`lab.yaml` is the machine-readable manifest. It declares infrastructure, healthchecks, execution order, and Agent configuration. All narrative explaining the choices lives in `RESEARCH.md`.
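As a sketch of how the execution layer might consume this manifest, the helper below validates the top-level structure once the YAML has been parsed into a dict (e.g. with PyYAML). The function name and error strings are illustrative assumptions, not the actual `ddev` code:

```python
# Hypothetical sketch: structural checks the CLI could run on a parsed
# lab.yaml before provisioning anything. Operates on an already-parsed
# dict so the sketch stays stdlib-only; names here are illustrative.

REQUIRED_SECTIONS = ("metadata", "infrastructure", "healthchecks", "seed", "agent")
RUNTIME_TYPES = {"compose", "bare-metal", "cluster"}


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in manifest:
            problems.append(f"missing required section: {section}")
    runtime = manifest.get("infrastructure", {}).get("runtime", {})
    if runtime.get("type") not in RUNTIME_TYPES:
        problems.append(f"unknown runtime type: {runtime.get('type')!r}")
    # Cluster runtimes must declare node roles so Pulumi knows what to launch.
    if runtime.get("type") == "cluster" and not runtime.get("nodes"):
        problems.append("cluster runtime declared without a nodes list")
    return problems
```

Running validation before `pulumi up` keeps manifest errors cheap: a malformed `lab.yaml` fails in milliseconds instead of after minutes of cloud provisioning.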
```yaml
metadata:
  integration: kafka
  tech_version: "3.7"
  generated_at: "2026-05-05"

infrastructure:
  cloud: aws
  runtime:
    type: compose
    file: tests/lab/compose/docker-compose.yml

healthchecks:
  - name: kafka-broker
    type: tcp
    host: localhost
    port: 9092
    timeout: 120s
    interval: 5s
  - name: zookeeper
    type: tcp
    host: localhost
    port: 2181
    timeout: 60s
    interval: 5s

seed:
  - type: script
    path: tests/lab/seed/01_create_topics.sh
  - type: script
    path: tests/lab/seed/02_produce_sample_events.py

load:
  driver: locust
  script: tests/lab/load/locustfile.py
  target_rps: 20

agent:
  image: datadog/agent:latest
  config: tests/lab/agent/conf.yaml
  api_key_secret: agent-integrations-dev/datadog-api-key
```

### Bare-metal runtime example (Oracle)

```yaml
infrastructure:
  cloud: aws
  runtime:
    type: bare-metal
    provisioner: tests/lab/provision/install_oracle.yml
    teardown: tests/lab/provision/teardown_oracle.yml

agent:
  image: datadog/agent:latest
  config: tests/lab/agent/conf.yaml
  api_key_secret: agent-integrations-dev/datadog-api-key
```

### Cluster runtime example (Lustre)

```yaml
metadata:
  integration: lustre
  tech_version: "2.15"

infrastructure:
  cloud: aws
  runtime:
    type: cluster
    nodes:
      - role: mgs_mds
        count: 1
        instance_type: r6i.xlarge
      - role: oss
        count: 2
        instance_type: r6i.xlarge
      - role: client
        count: 1
        instance_type: t3.large
    network:
      lnet_port: 988
    provisioner: tests/lab/provision/install_lustre_cluster.yml
    teardown: tests/lab/provision/teardown_lustre_cluster.yml

healthchecks:
  - name: mgs
    type: script
    script: tests/lab/healthcheck/check_mgs.sh
    timeout: 180s
    interval: 10s
  - name: mds
    type: script
    script: tests/lab/healthcheck/check_mds.sh
    timeout: 180s
    interval: 10s
  - name: oss
    type: script
    script: tests/lab/healthcheck/check_oss.sh
    timeout: 180s
    interval: 10s
```

---

## 5. Healthcheck Mechanism

### Stage 1 — SSH readiness

Before any healthcheck from `lab.yaml` runs, the CLI polls for SSH availability on port 22 of each provisioned instance. This catches instance boot failures, user-data script crashes, and provisioning delays. Timeout is fixed at 5 minutes per instance; if SSH isn't available by then, `ddev lab create` aborts and prints the cloud console log for diagnosis.

### Stage 2 — Service readiness

Each entry in `healthchecks` is polled independently until it passes or times out. All entries must pass before seed scripts begin. For cluster runtimes, healthchecks run against their respective nodes using the IP map produced by Pulumi (see Section 7).

Supported healthcheck types:

| Type | Mechanism | When to use |
|------|-----------|-------------|
| `tcp` | TCP connection to `host:port` | Databases, message brokers, simple servers |
| `http` | HTTP GET, expect `200` (or configurable status) | REST APIs, management UIs |
| `script` | SSH + run script, expect exit code 0 | Complex readiness checks (cluster quorum, replication lag, Lustre mount state) |

Healthcheck scripts live in `tests/lab/healthcheck/` and are part of the research phase output.

### Failure behavior

If any healthcheck times out, `ddev lab create` prints which check failed, shows the last N lines of the relevant service log (via SSH), and exits non-zero. Instances are left running for manual inspection. `ddev lab destroy <integration>` still works to clean up.

---

## 6. The Datadog Agent in the Lab

Every lab provisions a Datadog Agent configured to run the integration check continuously. This is the primary artifact of the lab — the goal is to see real metrics flowing into Datadog from a live service.
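The Stage 2 `tcp` healthcheck described in Section 5 reduces to a small polling loop. A minimal stdlib sketch (illustrative, the function name `wait_for_tcp` is not the actual `ddev` implementation):

```python
# Minimal sketch of the `tcp` healthcheck poller from Section 5,
# using only the standard library. Illustrative, not the ddev code.
import socket
import time


def wait_for_tcp(host: str, port: int, timeout: float, interval: float = 5.0) -> bool:
    """Poll host:port until a TCP connection succeeds or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True  # service is accepting connections
        except OSError:
            time.sleep(interval)  # not ready yet; retry after `interval`
    return False
```

The `timeout` and `interval` values map directly onto the fields of each `healthchecks` entry in `lab.yaml`; the CLI would run one such loop per entry, in parallel.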
### Compose runtime

The Agent runs as a service in `tests/lab/compose/docker-compose.yml`, generated by the research phase:

```yaml
services:
  kafka:
    image: confluentinc/cp-kafka:7.7.0   # Confluent Platform 7.7 ships Apache Kafka 3.7
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    healthcheck:                         # required for the service_healthy condition below
      test: ["CMD-SHELL", "kafka-broker-api-versions --bootstrap-server localhost:9092"]
      interval: 5s
      timeout: 10s
      retries: 24

  zookeeper:
    image: confluentinc/cp-zookeeper:7.7.0

  datadog-agent:
    image: ${DD_AGENT_IMAGE:-datadog/agent:latest}
    environment:
      DD_API_KEY: ${DD_API_KEY}
      DD_SITE: datadoghq.com
    volumes:
      - ../agent/conf.yaml:/etc/datadog-agent/conf.d/kafka.d/conf.yaml:ro
    depends_on:
      kafka:
        condition: service_healthy
```

The `DD_AGENT_IMAGE` environment variable defaults to `datadog/agent:latest`, allowing version overrides without editing the compose file.

### Bare-metal and cluster runtimes

The Agent runs as a Docker container on one of the EC2 instances (the client node for cluster runtimes). The Ansible provisioner installs Docker if not present, pulls the Agent image, and starts it with the same volume mount pattern.

### Updating the Agent

```bash
ddev lab upgrade <integration> --agent 7.57.0
# Pulls datadog/agent:7.57.0 on the instance, restarts the container, verifies the check runs.

ddev lab upgrade <integration> --integration 16.2.0
# Updates the integration package inside the Agent container to the specified version.
```

`ddev lab upgrade` does not reprovision instances or re-seed data. It only restarts the Agent container with the new image or package.

---

## 7. Infrastructure — Pulumi Automation API

### Why Pulumi

The infrastructure provisioning layer must be multi-cloud (AWS today, GCP in future sprints) and must execute on demand — not auto-applied on merge.
The [Pulumi Automation API](https://www.pulumi.com/docs/using-pulumi/automation-api/) satisfies both:

- **Multi-cloud**: the same Python program targets AWS or GCP by selecting a provider; switching is a `cloud:` field in `lab.yaml`
- **On-demand**: `ddev lab create` calls `pulumi.up()` directly via the Python SDK — no external CI/CD trigger, no separate repo with auto-apply
- **In-repo**: Pulumi programs live in `ddev/src/ddev/cli/lab/infra/` (Python), eliminating the cloud-inventory dependency
- **State in object storage**: stack state stored in S3 (`aws`) or GCS (`gcp`) under `agent-integrations-dev-lab-state/pulumi/<integration>/`

### Pulumi program structure

```
ddev/src/ddev/cli/lab/
  infra/
    __init__.py
    base.py      ← abstract LabInfra class (provision, destroy, outputs)
    aws.py       ← AWS provider implementation
    gcp.py       ← GCP provider implementation (Phase 2+)
    cluster.py   ← multi-node cluster logic (node roles, security groups, IP map)
```

The research phase generates a starter Pulumi program and places it in `infra/labs/<integration>/` inside the same PR as the `tests/lab/` artifacts. A team member reviews it before merging — same review gate as any other generated artifact.
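The `base.py` abstraction in the tree above might look like the following sketch. Only the method names (provision, destroy, outputs) come from the tree; the constructor signature and docstrings are assumptions, not the actual `ddev` code:

```python
# Illustrative sketch of the abstract LabInfra class from base.py.
# Method names match the tree above; everything else is an assumption.
from abc import ABC, abstractmethod


class LabInfra(ABC):
    """One concrete subclass per cloud provider (aws.py, gcp.py)."""

    def __init__(self, integration: str, manifest: dict):
        self.integration = integration
        self.manifest = manifest  # the parsed lab.yaml

    @abstractmethod
    def provision(self) -> None:
        """Bring up instances, networking, and IAM (wraps pulumi up)."""

    @abstractmethod
    def destroy(self) -> None:
        """Tear everything down (wraps pulumi destroy)."""

    @abstractmethod
    def outputs(self) -> dict:
        """Stack outputs, e.g. the {role: [ip, ...]} map for cluster runtimes."""
```

Keeping the cloud-specific logic behind this interface is what makes the `cloud:` field in `lab.yaml` a one-line switch: the CLI selects the subclass, and the rest of the pipeline only sees `outputs()`.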
### What the Pulumi program provisions

**Single-instance (compose and bare-metal runtimes):**
- VM instance (EC2 or GCE) + security group (SSH + service ports, ingress from team VPN CIDR only)
- IAM instance profile / service account for Secrets Manager access
- Object storage bucket for seed artifacts and license files
- Cost-attribution tags: `team = agent-integrations`, `env = lab`

**Cluster runtime (Lustre, Kafka, Cassandra):**
- VPC + subnet shared by all nodes
- Security group with `lnet_port` (or equivalent) open between all instances in the group
- Instances launched per role (from `lab.yaml` nodes list)
- Outputs a `{role → [ip, ...]}` map used by subsequent steps

### Cluster provisioning flow (Lustre example)

Because LNET configuration requires each node to know the actual IP addresses of other nodes, provisioning follows a two-phase approach:

```
1. pulumi up → launch all nodes, emit {role → [ip, ...]} map
2. Render Ansible inventory + vars from IP map
3. ansible-playbook install_lustre_cluster.yml (all nodes in parallel where possible)
   Stage A: install kernel modules + lnet tools on all nodes
   Stage B: configure lnet.conf with actual NIDs on all nodes
   Stage C: start MGS → start MDS → start OSS → mount client
4. Healthchecks run against per-role nodes
5. Seed scripts run on client node
6. Agent container starts on client node
```

The Ansible playbook receives the IP map via `--extra-vars`. Startup order (MGS before OSS before client mount) is encoded in the playbook's play sequence.

### Registry entry

```json
{
  "kafka": {
    "owner": "david.kirov@datadoghq.com",
    "cloud": "aws",
    "instance_ids": ["i-0abc123"],
    "instance_ips": {"primary": "10.0.1.42"},
    "pulumi_stack": "agent-integrations/kafka",
    "started_at": "2026-05-05T10:30:00Z",
    "last_active_at": "2026-05-07T09:15:00Z",
    "tech_version": "3.7",
    "agent_version": "7.56.2",
    "status": "running"
  }
}
```

---

## 8. CLI — `ddev lab`

### Full command surface

```bash
# Lab lifecycle
ddev lab create <integration>    # provision → healthcheck → seed → load → Agent
ddev lab stop <integration>      # stop load + services, leave instances running
ddev lab start <integration>     # restart services + load on a stopped lab
ddev lab destroy <integration>   # full teardown: services → pulumi destroy → deregister
ddev lab reload <integration>    # re-run seed scripts without reprovisioning

# Visibility
ddev lab list                            # all labs (all owners), with status
ddev lab status <integration>            # instance state, service health, Agent check status
ddev lab logs <integration> [service]    # tail logs from a service or the Agent
ddev lab ssh <integration>               # open SSH session

# Updates
ddev lab upgrade <integration> --agent <version>         # update Agent image
ddev lab upgrade <integration> --integration <version>   # update integration package

# Research phase
ddev lab research <integration>                                # generate all artifacts
ddev lab research <integration> --update --version <version>   # update for a new tech version
```

### `ddev lab create` flow

```
1. Read <integration>/lab.yaml
2. pulumi up (Pulumi Automation API, state in S3/GCS)
3. Poll port 22 on each instance until SSH is available (5 min max)
4. For bare-metal / cluster runtime: ansible-playbook install_<integration>.yml
   (cluster: with rendered IP map as extra-vars, plays run in role order)
5. For each healthcheck in parallel: poll until pass or timeout
6. For each seed script in order: ssh "bash -s" < tests/lab/seed/