Add Fleet and Kibana checks in create env workflow by seanrathier · Pull Request #4092 · elastic/cloudbeat

seanrathier · 2026-04-02T14:27:46Z

Summary of your changes

Adds a Kibana and Fleet readiness check to the CDR composite action, ensuring Fleet Server is fully operational before integration installation scripts run. Previously, integrations were installed immediately after Kibana reported available, causing race conditions where Fleet Server was still initializing.

Root causes addressed:

`GET /api/fleet/status` returns 404 on ESS deployments — replaced with `POST /api/fleet/setup` (triggers Fleet initialization) followed by stability polling on `GET /api/fleet/epm/packages`
Fleet Server passes a single readiness check but crashes under load — now requires 5 consecutive 200s before proceeding
`perform_api_call` only retried 5xx errors — 429 (TLS handshake timeout surfaced as rate limit) and `400 "not available with current configuration"` (Fleet mid-init) are now also retried
`get_package_version` silently returned `None` on failure, causing 16 install scripts to POST a null package version and receive a confusing `400 "expected string but got null"` — now raises immediately
`delete_env.sh` jq expression had an operator precedence bug causing `Cannot index string with string "StackName"` when filtering CloudFormation stacks

Files changed:

`.github/actions/cdr/action.yml` — new "Wait for Kibana and Fleet to be ready" step: Kibana available → `POST /api/fleet/setup` → 5× consecutive `GET /api/fleet/epm/packages` 200; adds `serverless-mode` input to skip the Fleet check on serverless deployments
`.github/workflows/test-environment.yml` — passes `serverless-mode` through to the CDR action
`tests/fleet_api/base_call_api.py` — retries 429 and the Fleet-not-ready 400; increases default `max_retries` from 3 to 8 with backoff capped at 30s
`tests/fleet_api/common_api.py` — `get_package_version` raises on failure instead of returning `None`
`deploy/test-environments/delete_env.sh` — fixes jq operator precedence bug in CloudFormation stack filtering

Screenshot/Data

Related Issues

Checklist

I have added tests that prove my fix is effective or that my feature works
I have added the necessary README/documentation (if appropriate)

Introducing a new rule?

Generate rule metadata using this script
Add relevant unit tests
Generate relevant rule templates using this script, and open a PR in elastic/packages/cloud_security_posture

mergify · 2026-04-02T14:28:28Z

This pull request does not have a backport label. Could you fix it @seanrathier? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-v./d./d./d is the label to automatically backport to the 8./d branch. /d is the digit
backport-active-all is the label that automatically backports to all active branches.
backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

The /api/fleet/status endpoint returns 404 on ESS deployments running Kibana 9.x. Switch to /api/fleet/agent_policies which is the first endpoint the integration scripts rely on and confirms Fleet is truly ready. Also log the response body on non-200 to aid future debugging, add kbn-xsrf header, and add serverless-mode input to skip the Fleet check on serverless deployments where Fleet is managed by Elastic Cloud. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

GET /api/fleet/agent_policies returning 200 only confirms Kibana's Fleet plugin is up, not that Fleet Server is ready for writes. Switch to POST /api/fleet/setup which is idempotent and only returns isInitialized:true once Fleet Server is fully configured and accepting connections. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

POST /api/fleet/setup returning isInitialized:true only confirms Fleet data is written to Elasticsearch; Fleet Server itself can still be starting up. Add a second polling stage on GET /api/fleet/epm/packages which is the first call every install script makes and requires the full Fleet Server stack to be operational before returning 200. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

A single 200 from /api/fleet/epm/packages is not enough — Fleet Server cycles through restart windows and one passing poll can be followed immediately by 502/503. Require 3 consecutive 200s (up to 60 attempts) before declaring Fleet stable. Also retry the transient 400 "not available with the current configuration" in perform_api_call — Fleet emits this during initialisation but it resolves quickly and should not kill the script on first occurrence. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Require 5 consecutive 200s (up from 3) from GET /api/fleet/epm/packages before declaring Fleet Server stable — a 3-pass window was too narrow to catch restart cycles. Increase perform_api_call max_retries default from 3 to 8 and cap exponential backoff at 30s (5, 10, 20, 30, 30, 30, 30s ≈ 2.5 min), giving scripts enough time to ride through a Fleet Server restart before giving up. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

429 responses from Fleet (e.g. TLS handshake timeout surfaced as rate limit) are transient and should be retried alongside 5xx errors. get_package_version silently returned None on failure, causing all 16 install scripts to POST a null package version and receive a confusing 400 "expected string but got null". Re-raise the exception so scripts fail immediately with a clear error instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Adds stronger Kibana/Fleet readiness gating to the create-environment workflow to avoid Fleet initialization race conditions before running integration installation scripts.

Changes:

Adds a composite-action step to wait for Kibana availability, trigger Fleet setup, and require multiple consecutive successful Fleet EPM responses (with a serverless-mode bypass).
Expands Fleet API retry behavior (429 + Fleet-mid-init 400) and increases retry count/backoff behavior.
Makes get_package_version fail fast by re-raising API call failures; fixes CloudFormation stack filtering in delete_env.sh.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
`.github/actions/cdr/action.yml`	Adds Kibana/Fleet readiness polling step and gates integration installs on its success; introduces `serverless-mode` input.
`.github/workflows/test-environment.yml`	Plumbs `serverless_mode` workflow input through to the CDR composite action.
`tests/fleet_api/base_call_api.py`	Adds retry loop/backoff and expands transient-status handling in `perform_api_call`.
`tests/fleet_api/common_api.py`	Changes `get_package_version` to re-raise `APICallException` instead of returning `None` on API failures.
`deploy/test-environments/delete_env.sh`	Updates jq filtering used for CloudFormation stack selection (operator precedence fix per PR description).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-07T06:57:42Z

tests/fleet_api/base_call_api.py

@@ -78,7 +83,24 @@ def perform_api_call(
    if ok_statuses is None:
        ok_statuses = (200,)


If max_retries is set to 0 (or negative), the retry loop won’t run and response will be undefined when checked later, causing an UnboundLocalError. Add an upfront guard (e.g., enforce max_retries >= 1) or initialize response and raise a clear error when no attempts are made.

Suggested change

ok_statuses = (200,)

ok_statuses = (200,)

if max_retries < 1:

raise ValueError("max_retries must be at least 1")

Copilot · 2026-04-07T06:57:43Z

tests/fleet_api/base_call_api.py


-    response = requests.request(method=method, url=url, headers=headers, auth=auth, **params)
+    for attempt in range(max_retries):
+        response = requests.request(method=method, url=url, headers=headers, auth=auth, **params)


requests.request(...) can raise requests.exceptions.RequestException (DNS/connection/timeouts). Those exceptions currently bypass the retry loop and fail immediately. Consider catching RequestException per-attempt and treating it as transient (with the same backoff) to make the call more resilient.

Suggested change

response = requests.request(method=method, url=url, headers=headers, auth=auth, **params)

try:

response = requests.request(method=method, url=url, headers=headers, auth=auth, **params)

except requests.exceptions.RequestException as exc:

if attempt < max_retries - 1:

delay = min(retry_backoff_sec * (2**attempt), retry_backoff_max_sec)

print(

f"perform_api_call: {method} {url} raised {type(exc).__name__} "

f"(attempt {attempt + 1}/{max_retries}), retrying in {delay:.0f}s",

)

time.sleep(delay)

continue

raise APICallException(0, str(exc)) from exc

Copilot · 2026-04-07T06:57:43Z

tests/fleet_api/base_call_api.py

+            print(
+                f"perform_api_call: {method} {url} returned {response.status_code} "
+                f"(attempt {attempt + 1}/{max_retries}), retrying in {delay:.0f}s",
+            )
+            time.sleep(delay)


The retry loop logs via print(...), which is harder to control/route consistently in CI compared to the loguru logger used elsewhere in this test tooling. Consider switching this to the shared logger (or at least writing to stderr).

Copilot · 2026-04-07T06:57:43Z

.github/actions/cdr/action.yml

+            -o /tmp/kibana_status.json \
+            -w "%{http_code}" \
+            "${KIBANA_URL}/api/status")


With set -euo pipefail, a transient curl transport error (e.g., connection refused while Kibana is booting) will exit the step immediately instead of allowing the retry loop to continue. Make the curl non-fatal (e.g., || true / default HTTP_CODE=000) and consider adding connect/max timeouts.

Suggested change

-o /tmp/kibana_status.json \

-w "%{http_code}" \

"${KIBANA_URL}/api/status")

--connect-timeout 5 \

--max-time 15 \

-o /tmp/kibana_status.json \

-w "%{http_code}" \

"${KIBANA_URL}/api/status" || echo "000")

Copilot · 2026-04-07T06:57:44Z

.github/actions/cdr/action.yml

+        for i in $(seq 1 30); do
+          HTTP_CODE=$(curl -s \
+            -X POST \
+            -u "${ES_USER}:${ES_PASSWORD}" \
+            -H "Content-Type: application/json" \


Same concern as the Kibana poll: with set -e, a transient curl failure during Fleet setup will abort the step before the loop can retry. Handle curl failures explicitly (capture a sentinel HTTP code) and consider adding timeouts so this loop behaves as intended.

Copilot · 2026-04-07T06:57:44Z

.github/actions/cdr/action.yml

+        for i in $(seq 1 60); do
+          HTTP_CODE=$(curl -s \
+            -u "${ES_USER}:${ES_PASSWORD}" \
+            -H "Content-Type: application/json" \
+            -H "kbn-xsrf: true" \


This Fleet stability poll will also exit early on any non-zero curl exit code due to set -e, which defeats the purpose of requiring consecutive 200s. Treat curl failures as a non-200 attempt (e.g., HTTP_CODE=$(curl ... || echo 000)) and consider adding timeouts.

Copilot · 2026-04-07T06:57:44Z

tests/fleet_api/common_api.py

    except APICallException as api_ex:
        logger.error(
            f"API call failed, status code {api_ex.status_code}. Response: {api_ex.response_text}",
        )
-        return None
+        raise


get_package_version now re-raises on APICallException, but the documented return contract still suggests returning None on failure. Update the docstring (and ideally the return type) to reflect that failures raise so callers/type checkers don’t treat None as an expected failure value.

opauloh · 2026-04-07T13:12:52Z

I created a follow-up PR based of this PR, with some additional fixes on the create environment:

#4284

status check

73eba62

mergify bot assigned seanrathier Apr 2, 2026

seanrathier and others added 8 commits April 2, 2026 10:52

more logs

36e8e88

Merge branch 'main' into seanrathier/cdr-kibana-readiness-check

fdcaa50

seanrathier requested a review from gurevichdmitry April 2, 2026 20:41

seanrathier changed the title ~~status check~~ Add Fleet and Kibana checks in create env workflow Apr 2, 2026

seanrathier marked this pull request as ready for review April 6, 2026 19:15

seanrathier requested a review from a team as a code owner April 6, 2026 19:15

seanrathier enabled auto-merge (squash) April 6, 2026 19:15

seanrathier disabled auto-merge April 6, 2026 19:15

seanrathier enabled auto-merge (squash) April 6, 2026 19:15

uri-weisman requested a review from Copilot April 7, 2026 06:52

Copilot started reviewing on behalf of uri-weisman April 7, 2026 06:52 View session

Copilot AI reviewed Apr 7, 2026

View reviewed changes

opauloh mentioned this pull request Apr 7, 2026

[Create Environment] Follow up fixes on the EA environment with CDR logs #4284

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Fleet and Kibana checks in create env workflow#4092

Add Fleet and Kibana checks in create env workflow#4092
seanrathier wants to merge 9 commits intomainfrom
seanrathier/cdr-kibana-readiness-check

seanrathier commented Apr 2, 2026 •

edited

Loading

Uh oh!

mergify bot commented Apr 2, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 7, 2026

Uh oh!

Copilot AI Apr 7, 2026

Uh oh!

Copilot AI Apr 7, 2026

Uh oh!

Copilot AI Apr 7, 2026

Uh oh!

Copilot AI Apr 7, 2026

Uh oh!

Copilot AI Apr 7, 2026

Uh oh!

Copilot AI Apr 7, 2026

Uh oh!

opauloh commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		@@ -78,7 +83,24 @@ def perform_api_call(
		if ok_statuses is None:
		ok_statuses = (200,)

-        response = requests.request(method=method, url=url, headers=headers, auth=auth, **params)
+        try:
+            response = requests.request(method=method, url=url, headers=headers, auth=auth, **params)
+        except requests.exceptions.RequestException as exc:
+            if attempt < max_retries - 1:
+                delay = min(retry_backoff_sec * (2**attempt), retry_backoff_max_sec)
+                print(
+                    f"perform_api_call: {method} {url} raised {type(exc).__name__} "
+                    f"(attempt {attempt + 1}/{max_retries}), retrying in {delay:.0f}s",
+                )
+                time.sleep(delay)
+                continue
+            raise APICallException(0, str(exc)) from exc

Conversation

seanrathier commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary of your changes

Screenshot/Data

Related Issues

Checklist

Introducing a new rule?

Uh oh!

mergify bot commented Apr 2, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

opauloh commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

seanrathier commented Apr 2, 2026 •

edited

Loading