adding PMM to loadtest environment#45567
Conversation
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
There was a problem hiding this comment.
Pull request overview
Adds a new Terraform module that provisions a Percona Monitoring and Management (PMM) server in the loadtesting environment, exposes the loadtest RDS connection details so the PMM module can consume them, and bootstraps the PMM container via EC2 user-data to register the Aurora cluster for monitoring.
Changes:
- New
pmm/Terraform module: EC2 instance (Amazon Linux 2023) running PMM 2 in Docker, security group restricted to VPC + VPN on 443, IAM role with SSM, Route53 vanity record. infra/outputs.tfexports RDS endpoint, username, database name, and SG id (consumed via remote state by the new module).user-data.shinstalls Docker, runs PMM, and registers the RDS MySQL/Aurora instance withpmm-admin.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| infrastructure/loadtesting/terraform/pmm/main.tf | New PMM EC2 module: provider/backend, remote-state lookups, IAM, SG, instance, DNS. |
| infrastructure/loadtesting/terraform/pmm/user-data.sh | Bootstrap script: installs Docker, runs PMM container, registers RDS for monitoring. |
| infrastructure/loadtesting/terraform/pmm/outputs.tf | Exposes PMM private IP, instance ID, and vanity URL. |
| infrastructure/loadtesting/terraform/pmm/variables.tf | Defines aws_region variable (default us-east-2). |
| infrastructure/loadtesting/terraform/infra/outputs.tf | Adds RDS endpoint/username/db/SG outputs consumed by the PMM module. |
Comments suppressed due to low confidence (1)
infrastructure/loadtesting/terraform/pmm/user-data.sh:25
- The script runs
set -euo pipefail, so if PMM never becomes ready within 60 attempts (5 min), the loop exits without error but the subsequentpmm-admin add mysqlwill likely fail against an unready server, and there is no explicit failure if the readiness check never succeeded. Consider failing fast (e.g.,exit 1) when the readiness loop exhausts all retries instead of silently continuing.
# Wait for PMM server to be ready
echo "Waiting for PMM server to become ready..."
for i in $(seq 1 60); do
if docker exec pmm-server curl -sSf -k https://localhost:443/v1/readyz > /dev/null 2>&1; then
echo "PMM server is ready"
break
fi
echo "Attempt $i/60 - PMM server not ready yet..."
sleep 5
done
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
WalkthroughThis PR refactors PMM load-testing infrastructure from EC2-based to ECS Fargate-based deployment. It exports ALB listener ARN, hosted zone ID, and RDS connection details (endpoints, master username, database name, security group) from the shared infra module. The PMM module then provisions a containerized Fargate service with workspace-scoped DNS, secrets management via Secrets Manager, IAM task execution permissions, CloudWatch logging, and automatic RDS monitoring registration via pmm-admin CLI during container startup. The service attaches to an internal ALB with workspace-scoped host-header routing and health checks. 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (3)
infrastructure/loadtesting/terraform/pmm/user-data.sh (2)
16-25: ⚡ Quick winAdd error handling for PMM readiness timeout.
If the PMM server fails to become ready after 60 attempts (~5 minutes), the script silently continues to the
pmm-admin addcommand, which will fail. This can make debugging difficult.
⚠️ Proposed fix to add timeout error handling# Wait for PMM server to be ready echo "Waiting for PMM server to become ready..." +ready=false for i in $(seq 1 60); do if docker exec pmm-server curl -sSf -k https://localhost:443/v1/readyz > /dev/null 2>&1; then echo "PMM server is ready" + ready=true break fi echo "Attempt $i/60 - PMM server not ready yet..." sleep 5 done + +if [ "$ready" = false ]; then + echo "ERROR: PMM server failed to become ready after 5 minutes" + docker logs pmm-server + exit 1 +fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infrastructure/loadtesting/terraform/pmm/user-data.sh` around lines 16 - 25, The readiness loop that calls "docker exec pmm-server curl -sSf -k https://localhost:443/v1/readyz" may time out silently after 60 attempts; update the script to detect when the loop finishes without a successful readiness check and fail fast: after the for/seq loop (or by checking a flag like a broken-out variable), log an explicit error message including attempt count and optionally the last curl output, and exit with a non-zero status so subsequent commands like "pmm-admin add" do not run on a non-ready PMM server; reference the existing readiness check and the downstream "pmm-admin add" invocation when implementing this change.
27-36: ⚡ Quick winVerify RDS registration success.
The
pmm-admin add mysqlcommand can fail for various reasons (network connectivity, authentication, permissions), but there's no verification of success. Failed registration means no metrics will be collected, defeating the purpose of the PMM deployment.✅ Proposed fix to verify registration
# Add the RDS MySQL instance to PMM monitoring -docker exec pmm-server pmm-admin add mysql \ +if docker exec pmm-server pmm-admin add mysql \ --server-url=https://admin:admin@localhost:443 \ --server-insecure-tls \ --username='${rds_username}' \ --password='${rds_password}' \ --host='${rds_endpoint}' \ --port=3306 \ --query-source=perfschema \ - "fleet-mysql" + "fleet-mysql"; then + echo "Successfully registered RDS instance with PMM" +else + echo "ERROR: Failed to register RDS instance with PMM" + docker exec pmm-server pmm-admin list + exit 1 +fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infrastructure/loadtesting/terraform/pmm/user-data.sh` around lines 27 - 36, The pmm-admin add mysql invocation (run via "docker exec pmm-server pmm-admin add mysql" for target "fleet-mysql") is not checked for success; update the user-data script to verify the command succeeded by inspecting its exit status and/or parsing pmm-admin output and then confirming registration with a follow-up "docker exec pmm-server pmm-admin list" (or equivalent) to ensure "fleet-mysql" appears; if the add fails, log the error output, retry a few times with backoff or exit non‑zero so Terraform/user-data can surface the failure instead of silently continuing.infrastructure/loadtesting/terraform/pmm/main.tf (1)
176-198: ⚡ Quick winEnable detailed monitoring for production observability.
Detailed monitoring provides 1-minute CloudWatch metrics instead of 5-minute intervals, enabling faster detection of performance issues and anomalies in the PMM monitoring infrastructure.
📊 Proposed fix to enable detailed monitoring
resource "aws_instance" "pmm" { ami = data.aws_ami.al2023.id instance_type = "t3.large" subnet_id = data.terraform_remote_state.shared.outputs.vpc.private_subnets[0] vpc_security_group_ids = [aws_security_group.pmm.id] iam_instance_profile = aws_iam_instance_profile.pmm.name + monitoring = true root_block_device {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infrastructure/loadtesting/terraform/pmm/main.tf` around lines 176 - 198, The aws_instance "pmm" resource lacks detailed (1-minute) CloudWatch monitoring; add the EC2 detailed monitoring flag by setting monitoring = true in the resource "aws_instance" "pmm" block (and ensure any IAM role/policy used by aws_iam_instance_profile.pmm allows CloudWatch metrics if not already configured) so the instance emits 1-minute metrics for production observability.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@infrastructure/loadtesting/terraform/pmm/main.tf`:
- Around line 176-198: The aws_instance "pmm" resource does not enforce IMDSv2
(leaving IMDSv1 allowed); update the aws_instance resource named pmm to set
metadata_options with http_tokens = "required" (and optionally
http_put_response_hop_limit = 1) so the instance requires IMDSv2; locate the
aws_instance "pmm" block and add the metadata_options stanza to enforce IMDSv2.
- Around line 189-193: The user-data template is currently being rendered with
secrets (rds_username, rds_password) which exposes credentials via EC2 metadata;
remove rds_username and rds_password from the templatefile invocation so only
non-sensitive values (e.g., rds_endpoint) are passed, and update the referenced
script user-data.sh to retrieve RDS credentials at runtime from AWS Secrets
Manager using the AWS CLI/SDK (e.g., aws secretsmanager get-secret-value)
instead of template interpolation; finally, ensure the EC2 instance profile/IAM
role attached to the instance has least-privilege SecretsManager:GetSecretValue
permission for the secret used.
In `@infrastructure/loadtesting/terraform/pmm/user-data.sh`:
- Around line 27-36: The PMM server is being registered with hardcoded
credentials "admin:admin" in user-data.sh (the docker exec pmm-server pmm-admin
add mysql --server-url=https://admin:admin@localhost:443 line); replace this by
generating a secure random password (use a random_password resource), store it
in AWS Secrets Manager (aws_secretsmanager_secret +
aws_secretsmanager_secret_version), pass the secret value into the instance
user-data as a template variable (e.g., pmm_admin_password), and update
user-data.sh to inject that password into the PMM container initialization and
into the pmm-admin --server-url (use
https://admin:${pmm_admin_password}@localhost:443) without logging the secret.
Ensure the new Terraform resources (random_password and secrets) are created and
the user-data template variable pmm_admin_password is supplied when rendering
user-data so the docker/container init uses the secure password.
---
Nitpick comments:
In `@infrastructure/loadtesting/terraform/pmm/main.tf`:
- Around line 176-198: The aws_instance "pmm" resource lacks detailed (1-minute)
CloudWatch monitoring; add the EC2 detailed monitoring flag by setting
monitoring = true in the resource "aws_instance" "pmm" block (and ensure any IAM
role/policy used by aws_iam_instance_profile.pmm allows CloudWatch metrics if
not already configured) so the instance emits 1-minute metrics for production
observability.
In `@infrastructure/loadtesting/terraform/pmm/user-data.sh`:
- Around line 16-25: The readiness loop that calls "docker exec pmm-server curl
-sSf -k https://localhost:443/v1/readyz" may time out silently after 60
attempts; update the script to detect when the loop finishes without a
successful readiness check and fail fast: after the for/seq loop (or by checking
a flag like a broken-out variable), log an explicit error message including
attempt count and optionally the last curl output, and exit with a non-zero
status so subsequent commands like "pmm-admin add" do not run on a non-ready PMM
server; reference the existing readiness check and the downstream "pmm-admin
add" invocation when implementing this change.
- Around line 27-36: The pmm-admin add mysql invocation (run via "docker exec
pmm-server pmm-admin add mysql" for target "fleet-mysql") is not checked for
success; update the user-data script to verify the command succeeded by
inspecting its exit status and/or parsing pmm-admin output and then confirming
registration with a follow-up "docker exec pmm-server pmm-admin list" (or
equivalent) to ensure "fleet-mysql" appears; if the add fails, log the error
output, retry a few times with backoff or exit non‑zero so Terraform/user-data
can surface the failure instead of silently continuing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5072db73-4c8e-4cdf-88c9-1bd0e701f040
📒 Files selected for processing (5)
infrastructure/loadtesting/terraform/infra/outputs.tfinfrastructure/loadtesting/terraform/pmm/main.tfinfrastructure/loadtesting/terraform/pmm/outputs.tfinfrastructure/loadtesting/terraform/pmm/user-data.shinfrastructure/loadtesting/terraform/pmm/variables.tf
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@infrastructure/loadtesting/terraform/pmm/main.tf`:
- Around line 88-91: The ECS task/service can start before the secret value
exists because task definitions only reference aws_secretsmanager_secret
metadata; update the ECS resource(s) that create tasks (e.g.,
aws_ecs_task_definition and/or aws_ecs_service) to add an explicit depends_on
referencing aws_secretsmanager_secret_version.pmm_admin_password (and the other
secret_version resources noted at lines 155-162 and 226-229) so Terraform will
wait for the first secret version to be created before registering or starting
tasks.
- Around line 181-213: The background subshell that performs the PATCH to
/v1/users and the pmm-admin add mysql (spawned in the container
entryPoint/command) can fail silently because it is backgrounded with "&" while
/opt/entrypoint.sh is exec'd; change the startup so the bootstrap steps run
synchronously before exec and propagate errors: run the readiness loop, run the
curl PATCH (change admin password) and the pmm-admin add mysql sequentially
without backgrounding, set the script to fail fast on any command error (e.g.,
enable errexit/pipefail behavior), log failures from the curl and pmm-admin
calls, and exit non-zero if either command fails so /opt/entrypoint.sh is not
started when bootstrap fails.
- Around line 220-233: The PMM task is using the RDS cluster master credentials
(PMM_MYSQL_USERNAME valueFrom
data.terraform_remote_state.infra.outputs.rds_cluster_master_username and
PMM_MYSQL_PASSWORD from data.aws_secretsmanager_secret.rds_password), which
gives full admin access; create a dedicated monitoring DB user with minimal
privileges required for PMM scraping, store its username/password in Secrets
Manager, and update the task env vars to reference the new secret and username
(replace PMM_MYSQL_USERNAME and PMM_MYSQL_PASSWORD references to use the new
secret ARN and username secret). Ensure you provision the DB user (via
Terraform-provisioned SQL, aws_db_instance tools, or a one-time migration) and
grant only the least privileges needed for monitoring.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2d91130a-f5de-4a01-9b74-5db442c3e1b3
📒 Files selected for processing (3)
infrastructure/loadtesting/terraform/infra/outputs.tfinfrastructure/loadtesting/terraform/pmm/main.tfinfrastructure/loadtesting/terraform/pmm/outputs.tf
🚧 Files skipped from review as they are similar to previous changes (1)
- infrastructure/loadtesting/terraform/infra/outputs.tf
| resource "aws_secretsmanager_secret_version" "pmm_admin_password" { | ||
| secret_id = aws_secretsmanager_secret.pmm_admin_password.id | ||
| secret_string = random_password.pmm_admin.result | ||
| } |
There was a problem hiding this comment.
Wait for the first admin-secret version before creating tasks.
The task definition only depends on the secret metadata, not on aws_secretsmanager_secret_version.pmm_admin_password. On a brand-new workspace ECS can start before the secret has a current value and fail to resolve PMM_ADMIN_PASSWORD.
Proposed fix
resource "aws_ecs_task_definition" "pmm" {
family = "${local.customer}-pmm"
+ depends_on = [aws_secretsmanager_secret_version.pmm_admin_password]
requires_compatibilities = ["FARGATE"]
network_mode = "awsvpc"
cpu = 2048Also applies to: 155-162, 226-229
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@infrastructure/loadtesting/terraform/pmm/main.tf` around lines 88 - 91, The
ECS task/service can start before the secret value exists because task
definitions only reference aws_secretsmanager_secret metadata; update the ECS
resource(s) that create tasks (e.g., aws_ecs_task_definition and/or
aws_ecs_service) to add an explicit depends_on referencing
aws_secretsmanager_secret_version.pmm_admin_password (and the other
secret_version resources noted at lines 155-162 and 226-229) so Terraform will
wait for the first secret version to be created before registering or starting
tasks.
| entryPoint = ["/bin/bash", "-c"] | ||
| command = [join("\n", [ | ||
| "# Background: wait for PMM server, then configure it", | ||
| "(", | ||
| " echo 'Waiting for PMM server to become ready...'", | ||
| " until curl -sSf -k https://localhost:443/v1/readyz > /dev/null 2>&1; do", | ||
| " sleep 5", | ||
| " done", | ||
| " echo 'PMM server is ready'", | ||
| "", | ||
| " # Change default admin password", | ||
| " curl -sSf -k -X PATCH https://localhost:443/v1/users \\", | ||
| " -H 'Content-Type: application/json' \\", | ||
| " -u admin:admin \\", | ||
| " -d \"{\\\"new_password\\\": \\\"$PMM_ADMIN_PASSWORD\\\"}\"", | ||
| " echo 'Admin password changed'", | ||
| "", | ||
| " # Add RDS MySQL monitoring", | ||
| " pmm-admin add mysql \\", | ||
| " --server-url=\"https://admin:$PMM_ADMIN_PASSWORD@localhost:443\" \\", | ||
| " --server-insecure-tls \\", | ||
| " --username=\"$PMM_MYSQL_USERNAME\" \\", | ||
| " --password=\"$PMM_MYSQL_PASSWORD\" \\", | ||
| " --host=\"$PMM_MYSQL_HOST\" \\", | ||
| " --port=3306 \\", | ||
| " --query-source=perfschema \\", | ||
| " fleet-mysql", | ||
| " echo 'MySQL monitoring configured'", | ||
| ") &", | ||
| "", | ||
| "# Run PMM server as PID 1 (foreground)", | ||
| "exec /opt/entrypoint.sh", | ||
| ])] |
There was a problem hiding this comment.
Don't let PMM bootstrap fail silently in the background.
The password change and pmm-admin add mysql both run in a background subshell. If either command fails, /opt/entrypoint.sh still stays healthy and /v1/readyz still passes, so the service looks healthy while database monitoring was never configured.
🧰 Tools
🪛 Betterleaks (1.2.0)
[high] 192-194: Discovered a potential basic authorization token provided in a curl command, which could compromise the curl accessed resource.
(curl-auth-user)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@infrastructure/loadtesting/terraform/pmm/main.tf` around lines 181 - 213, The
background subshell that performs the PATCH to /v1/users and the pmm-admin add
mysql (spawned in the container entryPoint/command) can fail silently because it
is backgrounded with "&" while /opt/entrypoint.sh is exec'd; change the startup
so the bootstrap steps run synchronously before exec and propagate errors: run
the readiness loop, run the curl PATCH (change admin password) and the pmm-admin
add mysql sequentially without backgrounding, set the script to fail fast on any
command error (e.g., enable errexit/pipefail behavior), log failures from the
curl and pmm-admin calls, and exit non-zero if either command fails so
/opt/entrypoint.sh is not started when bootstrap fails.
| { | ||
| name = "PMM_MYSQL_USERNAME" | ||
| value = data.terraform_remote_state.infra.outputs.rds_cluster_master_username | ||
| }, | ||
| ] | ||
|
|
||
| secrets = [ | ||
| { | ||
| name = "PMM_ADMIN_PASSWORD" | ||
| valueFrom = aws_secretsmanager_secret.pmm_admin_password.arn | ||
| }, | ||
| { | ||
| name = "PMM_MYSQL_PASSWORD" | ||
| valueFrom = "${data.aws_secretsmanager_secret.rds_password.arn}:password::" |
There was a problem hiding this comment.
Use a dedicated monitoring user instead of the cluster master account.
This task passes the cluster master username and password into PMM. That gives the monitoring container full administrative DB access when it only needs a narrower privilege set for scraping, which materially widens the blast radius if PMM is compromised.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@infrastructure/loadtesting/terraform/pmm/main.tf` around lines 220 - 233, The
PMM task is using the RDS cluster master credentials (PMM_MYSQL_USERNAME
valueFrom data.terraform_remote_state.infra.outputs.rds_cluster_master_username
and PMM_MYSQL_PASSWORD from data.aws_secretsmanager_secret.rds_password), which
gives full admin access; create a dedicated monitoring DB user with minimal
privileges required for PMM scraping, store its username/password in Secrets
Manager, and update the task env vars to reference the new secret and username
(replace PMM_MYSQL_USERNAME and PMM_MYSQL_PASSWORD references to use the new
secret ARN and username secret). Ensure you provision the DB user (via
Terraform-provisioned SQL, aws_db_instance tools, or a one-time migration) and
grant only the least privileges needed for monitoring.
There was a problem hiding this comment.
Don't care for load testing really.
rfairburn
left a comment
There was a problem hiding this comment.
If the ECS version is working for you end-to-end, this LGTM. I don't see a problem with the listener rule or anything else at the moment.
Taking local PMM setup from https://github.com/fleetdm/fleet/tree/main/tools/percona/pmm and adapting it to be an optional component to spin up for load tests.
Testing
Summary by CodeRabbit