From c558e4636888165e42a30a5e5a3f593231aea1df Mon Sep 17 00:00:00 2001
From: dkirov-dd
Date: Tue, 12 May 2026 16:50:46 +0000
Subject: [PATCH 1/2] docs: add WIP environment setup automation design spec
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Design for `ddev lab` — an agentic framework that provisions Datadog
integration environments on remote EC2 with a single command.

Co-Authored-By: Claude Sonnet 4.6
---
 ...-05-environment-setup-automation-design.md | 559 ++++++++++++++++++
 1 file changed, 559 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-05-05-environment-setup-automation-design.md

diff --git a/docs/superpowers/specs/2026-05-05-environment-setup-automation-design.md b/docs/superpowers/specs/2026-05-05-environment-setup-automation-design.md
new file mode 100644
index 0000000000000..373a4159654ce
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-05-environment-setup-automation-design.md
@@ -0,0 +1,559 @@

# Environment Setup Automation Design

**Date:** 2026-05-05
**Last revised:** 2026-05-07
**Status:** Draft
**Scope:** `ddev lab` — agentic framework for provisioning, seeding, and load-testing Datadog integration environments on remote EC2 infrastructure

---

## 1. Problem and Goals

Setting up environments for Datadog integrations is one of the highest-friction steps in integration development. The pain concentrates in three areas:

1. **Infrastructure complexity** — Some integrations (Oracle DB, IBM MQ, Kafka + ZooKeeper, Lustre clusters) require real VMs, licenses, or multi-node topologies that cannot run locally and demand significant manual setup effort.
2. **Data quality** — Test datasets are minimal or absent; developers run checks against empty systems that don't reflect production behavior, masking metric collection gaps.
3. **Portability** — Environment setup knowledge is tribal. When a developer leaves or switches integrations, the environment has to be reconstructed from scratch.

### Goals

| Priority | Goal |
|----------|------|
| Primary | Provision a fully running integration environment on EC2 with a single command |
| Primary | Seed the environment with realistic-looking data |
| Primary | Generate continuous background load that exercises the integration's metric surface |
| Primary | Include a configured Datadog Agent in every lab, ready to run the integration check |
| Optional | Produce a human-readable record of what was set up and why |

### Non-goals

- Replacing `ddev env` for local unit/integration testing — `ddev lab` is for E2E and exploratory work against live infrastructure
- Supporting every integration on day one — start with the hardest ones (Oracle, IBM MQ, Kafka, Cassandra, Lustre)
- Running any part of the environment on a developer's local machine — all labs are remote

---

## 2. Architecture Overview

The system has two distinct layers:

```
┌──────────────────────────────────────────────────────────────────┐
│ AI Research Phase (runs once per integration version)            │
│                                                                  │
│ Sources: vendor docs, Docker Hub, metadata.csv, manifest.json    │
│ (Does NOT require existing tests or compose files)               │
│                                                                  │
│ Produces complete tests/lab/ subtree:                            │
│   lab.yaml             ← manifest + narrative                    │
│   tests/lab/compose/   ← service topology (Docker)               │
│   tests/lab/seed/      ← numbered, idempotent scripts            │
│   tests/lab/load/      ← Locust or k6 script                     │
│   tests/lab/agent/     ← Datadog Agent integration config        │
│   tests/lab/provision/ ← Ansible playbook (bare-metal only)      │
│   [starter main.tf]    ← handed to cloud-inventory repo          │
└────────────────────────┬─────────────────────────────────────────┘
                         │ artifacts reviewed + committed
                         ▼
┌──────────────────────────────────────────────────────────────────┐
│ Deterministic Execution Layer (ddev lab CLI)                     │
│                                                                  │
│ create:  terraform apply → wait SSH → healthchecks →             │
│          seed → load → start Agent → register                    │
│                                                                  │
│ stop:    stop load → docker compose down → update registry       │
│                                                                  │
│ destroy: stop load → stop services → terraform destroy →         │
│          deregister                                              │
└──────────────────────────────────────────────────────────────────┘
```

No AI runs at execution time. The AI phase is a one-time investment per integration version, repeatable when the technology version changes. The execution layer is a simple deterministic pipeline any team member can run.
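To make the layer boundary concrete, the `create` pipeline can be read as a short, ordered script. The sketch below is illustrative only — every helper is a stub standing in for a real terraform or SSH invocation, not actual CLI code:

```python
# Illustrative sketch of the execution layer's `create` pipeline.
# Every helper is a stub; the real CLI shells out to terraform, ssh, etc.
def run(step: str, command: str) -> None:
    print(f"[{step}] {command}")  # stand-in for a subprocess / SSH call


def create(manifest: dict, ip: str) -> None:
    tf_source = manifest["infrastructure"]["terraform"]["source"]
    run("provision", f"terraform -chdir={tf_source} apply")
    run("wait", f"poll {ip}:22 until SSH answers (5 min cap)")  # Stage 1, Section 5
    for check in manifest["healthchecks"]:                      # Stage 2, Section 5
        run("healthcheck", f"poll {check['name']} until pass or timeout")
    for seed in manifest["seed"]:                               # numbered order matters
        run("seed", f"ssh {ip} 'bash -s' < {seed['path']}")
    run("load", f"start {manifest['load']['driver']} at {manifest['load']['target_rps']} RPS")
    run("agent", "start the Datadog Agent with the generated conf.yaml")
    run("register", "record the lab in the shared registry")
```

Because the steps are plain, ordered function calls with no model in the loop, a failed run is reproducible and debuggable like any other CLI tool.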

---

## 3. The Research Phase

### Trigger

```bash
ddev lab research <integration>                         # generate all artifacts fresh
ddev lab research <integration> --update --version 3.8  # update for a new tech version
```

### Information sources

The research agent has access to the following — **and nothing else**. The integration may be brand new with no existing tests or compose files, so the agent cannot rely on them.

| Source | What it provides |
|--------|------------------|
| `<integration>/metadata.csv` | The full set of metrics the integration collects — tells the agent what data must exist to make metric values non-zero |
| `<integration>/manifest.json` | Integration display name, tags, categories — context for documentation searches |
| Vendor documentation (WebFetch/WebSearch) | Service topology requirements, configuration, resource minimums |
| Docker Hub (WebFetch) | Canonical image names, available tags, recommended versions |
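As an illustration of the first source: `metadata.csv` is plain CSV, so enumerating the metric surface a lab must light up takes a few lines. A minimal sketch, assuming the standard `metric_name` header column that integrations-core metadata files carry:

```python
# Sketch: enumerate the metrics the seed + load phases must make non-zero.
# Assumes metadata.csv has a header row with a `metric_name` column.
import csv
from pathlib import Path


def target_metrics(integration_dir: str) -> list[str]:
    with (Path(integration_dir) / "metadata.csv").open(newline="") as f:
        return [row["metric_name"] for row in csv.DictReader(f)]
```

A reviewer can diff this list against the metrics the generated seed scripts claim to target in their comments.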

### What the agent produces

The agent writes a complete `tests/lab/` subtree under the integration directory:

```
<integration>/
  lab.yaml                   ← manifest (see Section 4)
  tests/
    lab/
      compose/
        docker-compose.yml   ← service(s) + Datadog Agent as Docker services
      seed/
        01_<desc>.sh         ← numbered, idempotent, ordered
        02_<desc>.py
        ...
      load/
        locustfile.py        ← or k6_script.js
      agent/
        conf.yaml            ← Datadog Agent integration config (instances, logs)
      provision/             ← bare-metal runtimes only
        install_<tech>.yml   ← Ansible playbook
```

Additionally, the agent produces a **starter `main.tf`** for the cloud-inventory repo (printed to stdout or saved to a staging path) that a cloud-infrastructure team member reviews and commits to `cloud-inventory/aws/agent-integrations-dev/labs/<integration>/`.

### YAML comments as narrative

Every non-obvious decision in `lab.yaml` and the generated scripts carries a YAML or shell comment explaining what the agent found in the documentation and why it made each choice (e.g., why a specific instance type, why a particular partition count, what metrics a given seed script targets). The human reviewer reads the comments to audit the agent's reasoning without needing to repeat the research.

### Re-generation

`ddev lab research kafka --update --version 3.8` diffs against the existing artifacts. The agent fetches the new version's changelog and release notes, identifies changed APIs or configuration keys, and updates the affected files. Unchanged files are left untouched.

---

## 4. The `lab.yaml` Schema

`lab.yaml` is the manifest. It declares infrastructure, healthchecks, execution order, and Agent configuration. The generated scripts it references are the actual implementation.

```yaml
# lab.yaml — generated by `ddev lab research`, reviewed by a human before merging.
# Comments in this file and in tests/lab/ explain every non-obvious decision.

metadata:
  integration: kafka
  tech_version: "3.7"
  generated_at: "2026-05-05"

infrastructure:
  # All labs run on EC2 — keeps environments shareable and off developer machines.
  # Terraform source lives in cloud-inventory; see Section 7.
  terraform:
    source: cloud-inventory/aws/agent-integrations-dev/labs/kafka
    region: us-east-1
    # t3.large: Kafka broker + ZooKeeper combined require ~4 GB RAM under load.
    # Upgrade to r5.xlarge if broker heap exceeds 2 GB.
    instance_type: t3.large

  runtime:
    # Docker Compose runs inside the EC2 instance.
    # For licensed software that can't be containerized, use type: bare-metal.
    type: compose
    file: tests/lab/compose/docker-compose.yml

# Each service in the topology gets its own healthcheck entry.
# Healthchecks run FROM the EC2 instance (via SSH), so "localhost" is the instance.
# Seeds and load only start after all healthchecks pass.
healthchecks:
  - name: kafka-broker
    type: tcp
    host: localhost
    port: 9092
    timeout: 120s
    interval: 5s
  - name: zookeeper
    type: tcp
    host: localhost
    port: 2181
    timeout: 60s
    interval: 5s

seed:
  # Seed scripts run in numbered order over SSH after all healthchecks pass.
  # Scripts must be idempotent — re-running ddev lab create must not fail.
  - type: script
    path: tests/lab/seed/01_create_topics.sh
    # Creates 10 topics with 3 partitions each to exercise kafka.partition.* metrics.
  - type: script
    path: tests/lab/seed/02_produce_sample_events.py
    # Produces 50k events across all topics to populate consumer group lag metrics.

load:
  # Continuous background load keeps metric values non-zero during Agent check runs.
  driver: locust  # locust | k6
  script: tests/lab/load/locustfile.py
  # 20 RPS generates stable consumer lag without saturating a t3.large broker.
  target_rps: 20

agent:
  # Datadog Agent runs as a service in docker-compose.yml (compose runtime) or as
  # a standalone Docker container on the EC2 instance (bare-metal runtime).
  # The image tag is intentionally mutable — use `ddev lab upgrade` to update.
  image: datadog/agent:latest
  config: tests/lab/agent/conf.yaml
  # API key fetched from Secrets Manager at runtime; never stored in this file.
  api_key_secret: agent-integrations-dev/datadog-api-key
```
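Since no AI runs at execution time, the CLI can validate the manifest statically before touching any infrastructure. A minimal sketch using PyYAML — the required-key list mirrors the example above and is an assumption, not a frozen schema:

```python
# Sketch: load lab.yaml and fail fast on an obviously malformed manifest.
# The key list mirrors the kafka example above; the real schema may differ.
import yaml

REQUIRED = ("metadata", "infrastructure", "healthchecks", "seed", "load", "agent")


def load_manifest(path: str) -> dict:
    with open(path) as f:
        manifest = yaml.safe_load(f)
    missing = [key for key in REQUIRED if key not in manifest]
    if missing:
        raise ValueError(f"{path}: missing sections {missing}")
    runtime = manifest["infrastructure"]["runtime"]["type"]
    if runtime not in ("compose", "bare-metal"):
        raise ValueError(f"{path}: unknown runtime type {runtime!r}")
    return manifest
```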

### Bare-metal runtime example (Oracle)

```yaml
infrastructure:
  terraform:
    source: cloud-inventory/aws/agent-integrations-dev/labs/oracle
    region: us-east-1
    instance_type: r5.xlarge  # Oracle minimum: 16 GB RAM
  runtime:
    type: bare-metal
    # Ansible installs Oracle from the license-gated S3 bucket.
    provisioner: tests/lab/provision/install_oracle.yml
    # The teardown playbook deregisters Oracle before terraform destroy runs.
    teardown: tests/lab/provision/teardown_oracle.yml

agent:
  image: datadog/agent:latest
  config: tests/lab/agent/conf.yaml
  api_key_secret: agent-integrations-dev/datadog-api-key
```

### Cluster runtime example (Lustre)

```yaml
metadata:
  integration: lustre
  tech_version: "2.15"

infrastructure:
  terraform:
    source: cloud-inventory/aws/agent-integrations-dev/labs/lustre
    region: us-east-1
    # Lustre requires a minimum 3-node cluster: MGS, MDS, and OSS.
    # Instance sizing from the Lustre hardware guide: r6i.xlarge for MDS/OSS under test load.
    instance_type: r6i.xlarge

  runtime:
    type: bare-metal
    provisioner: tests/lab/provision/install_lustre_cluster.yml
    teardown: tests/lab/provision/teardown_lustre_cluster.yml

healthchecks:
  - name: mgs
    type: script
    script: tests/lab/healthcheck/check_mgs.sh
    timeout: 180s
    interval: 10s
  - name: mds
    type: script
    script: tests/lab/healthcheck/check_mds.sh
    timeout: 180s
    interval: 10s
  - name: oss
    type: script
    script: tests/lab/healthcheck/check_oss.sh
    timeout: 180s
    interval: 10s
```

---

## 5. Healthcheck Mechanism

### Stage 1 — EC2 SSH readiness

Before any healthcheck from `lab.yaml` runs, the CLI polls for SSH availability on port 22. This catches instance boot failures, user-data script crashes, and AMI provisioning delays. The timeout is fixed at 5 minutes; if SSH isn't available by then, `ddev lab create` aborts and prints the EC2 console log for diagnosis.

### Stage 2 — Service readiness

Each entry in `healthchecks` is polled independently until it passes or times out. All entries must pass before seed scripts begin.

Supported healthcheck types:

| Type | Mechanism | When to use |
|------|-----------|-------------|
| `tcp` | TCP connection to `host:port` | Databases, message brokers, simple servers |
| `http` | HTTP GET, expect `200` (or a configurable status) | REST APIs, management UIs |
| `script` | SSH + run script, expect exit code 0 | Complex readiness checks (cluster quorum, replication lag, Lustre mount state) |

Healthcheck scripts live in `tests/lab/healthcheck/` and are part of the research phase output.
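For the `tcp` type the mechanism is just a bounded connect-retry loop driven by the `timeout` and `interval` fields. A runnable sketch (the `http` and `script` types follow the same loop with a different probe):

```python
# Sketch: poll a TCP endpoint until it accepts a connection or time runs out.
# `timeout_s` and `interval_s` map to the timeout/interval fields in lab.yaml.
import socket
import time


def wait_for_tcp(host: str, port: int, timeout_s: float, interval_s: float) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True  # service is accepting connections
        except OSError:
            time.sleep(interval_s)  # not up yet — wait one interval and retry
    return False  # caller reports the failed check and dumps service logs


# e.g. the kafka-broker entry above: wait_for_tcp("localhost", 9092, 120, 5)
```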

### Failure behavior

If any healthcheck times out, `ddev lab create` prints which check failed, shows the last N lines of the relevant service log (via SSH), and exits non-zero. The EC2 instance is left running for manual inspection. Running `ddev lab destroy <name>` still works to clean up.

---

## 6. The Datadog Agent in the Lab

Every lab provisions a Datadog Agent configured to run the integration check continuously. This is the primary artifact of the lab — the goal is to see real metrics flowing into Datadog from a live service.

### Compose runtime

The Agent runs as a service in `tests/lab/compose/docker-compose.yml`, generated by the research phase:

```yaml
services:
  kafka:
    image: confluentinc/cp-kafka:3.7.0
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # ... generated from vendor docs

  zookeeper:
    image: confluentinc/cp-zookeeper:3.7.0
    # ...

  datadog-agent:
    image: ${DD_AGENT_IMAGE:-datadog/agent:latest}
    environment:
      DD_API_KEY: ${DD_API_KEY}
      DD_SITE: datadoghq.com
    volumes:
      - ./agent/conf.yaml:/etc/datadog-agent/conf.d/kafka.d/conf.yaml:ro
    depends_on:
      kafka:
        condition: service_healthy
```

The `DD_AGENT_IMAGE` environment variable defaults to `datadog/agent:latest`, allowing version overrides without editing the compose file.

### Bare-metal runtime

The Agent runs as a Docker container on the EC2 instance alongside the bare-metal service. The Ansible provisioner installs Docker (if not present), pulls the Agent image, and starts it with the same volume-mount pattern.

### Updating the Agent

```bash
ddev lab upgrade <name> --agent 7.57.0
# Pulls datadog/agent:7.57.0 on the EC2 instance, restarts the container, verifies the check runs.

ddev lab upgrade <name> --integration 16.2.0
# Updates the integration package inside the Agent container to the specified version.
```

`ddev lab upgrade` does not reprovision the EC2 instance or re-seed data. It only restarts the Agent container with the new image or package.
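A sketch of what the `--agent` path could do over SSH for the compose runtime. Everything here is an assumption for illustration — the `datadog-agent` service name, the bare `ssh` calls, and the verification step are not the CLI's actual implementation:

```python
# Sketch of `ddev lab upgrade <name> --agent <version>` (compose runtime).
# Assumptions: the Agent runs as the `datadog-agent` compose service, and the
# remote shell lands in the compose project directory; the real CLI may differ.
import subprocess


def ssh_run(host: str, command: str) -> None:
    # Run a single command on the lab instance over SSH.
    subprocess.run(["ssh", host, command], check=True)


def upgrade_agent(host: str, version: str) -> None:
    image = f"datadog/agent:{version}"
    ssh_run(host, f"docker pull {image}")
    # Recreate only the Agent service; services and seeded data are untouched.
    ssh_run(host, f"DD_AGENT_IMAGE={image} docker compose up -d datadog-agent")
    # Confirm the integration check is scheduled before reporting success.
    ssh_run(host, "docker compose exec datadog-agent agent status")
```

The same shape works for `--integration`, swapping the image pull for a package install inside the running container.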

---

## 7. Infrastructure — Terraform in cloud-inventory

### Prerequisites

The CLI requires both repos to be configured in ddev:

```bash
ddev config set repos.core /path/to/integrations-core
ddev config set repos.cloud-inventory /path/to/cloud-inventory
```

`ddev lab` resolves Terraform source paths from `lab.yaml` relative to the `cloud-inventory` repo root.

### Directory layout

```
cloud-inventory/
  terraform-modules/
    integration-lab-ec2/           # recipe: single-instance lab
      main.tf
      variables.tf
      outputs.tf
    integration-lab-ec2-cluster/   # recipe: multi-node lab (Kafka, Lustre, Cassandra)
      main.tf
      variables.tf
      outputs.tf
  aws/
    agent-integrations-dev/
      labs/
        kafka/
          main.tf                  # calls integration-lab-ec2-cluster
          terraform.tfvars
        oracle/
          main.tf
          terraform.tfvars
        lustre/
          main.tf
          terraform.tfvars
```

### What the recipe modules handle

- EC2 instance(s) + security group (SSH + service ports, ingress from the team VPN CIDR only)
- IAM instance profile for SSM access (fallback if the SSH key is lost)
- S3 bucket for seed artifacts, load scripts, and license files
- CloudWatch log group for EC2 system logs
- `team = agent-integrations` and `env = lab` tags for cost attribution

### Integration-specific configs

Each `labs/<integration>/main.tf` calls the appropriate recipe module and sets:

- Instance type and count from `lab.yaml`
- AMI ID (Ubuntu 22.04 base, maintained by the platform team)
- Service-specific ports for the security group
- License-file S3 paths (bare-metal runtimes only)

The research phase generates a starter `main.tf`. A cloud-infrastructure team member reviews and merges it into cloud-inventory separately from the integrations-core artifacts.

---

## 8. CLI — `ddev lab`

### Full command surface

```bash
# Lab lifecycle
ddev lab create <integration>   # provision → healthcheck → seed → load → Agent
ddev lab stop <name>            # stop load + services, leave EC2 running
ddev lab start <name>           # restart services + load on a stopped lab
ddev lab destroy <name>         # full teardown: services → terraform destroy → deregister
ddev lab reload <name>          # re-run seed scripts without reprovisioning

# Visibility
ddev lab list                   # all labs (all owners), with status
ddev lab status <name>          # EC2 state, service health, Agent check status
ddev lab logs <name> [service]  # tail logs from a service or the Agent
ddev lab ssh <name>             # open SSH session

# Updates
ddev lab upgrade <name> --agent <version>        # update Agent image
ddev lab upgrade <name> --integration <version>  # update integration package

# Research phase
ddev lab research <integration>                             # generate all artifacts
ddev lab research <integration> --update --version <ver>    # update for a new tech version
```

### `ddev lab create` flow

```
1. Read <integration>/lab.yaml
2. terraform apply in cloud-inventory/aws/agent-integrations-dev/labs/<integration>/
3. Poll port 22 until SSH is available (5 min max)
4. For bare-metal runtime: ansible-playbook tests/lab/provision/install_<tech>.yml
5. For each healthcheck in parallel: poll until pass or timeout
6. For each seed script in order: ssh "bash -s" < tests/lab/seed/