37 changes: 26 additions & 11 deletions flaky-tests/detection/README.md
@@ -4,42 +4,57 @@ description: Learn how Trunk detects and labels flaky and broken tests

# Flake Detection

Flake Detection automatically identifies problematic tests in your test suite by monitoring test behavior over time. Instead of a single set of built-in detection rules, Trunk uses **monitors**, independent detectors that each watch for a specific pattern. When any monitor flags a test, it's marked as flaky or broken. When all monitors agree the test has recovered, it returns to healthy.
Flake Detection automatically identifies problematic tests in your test suite by monitoring test behavior over time. Instead of a single set of built-in detection rules, Trunk uses **monitors**, independent detectors that each watch for a specific pattern. When a monitor activates on a test, it runs the **action** you configured on the monitor — either classifying the test as flaky or broken, or [applying labels](../management/test-labels.md#automatic-labeling-from-monitors) to it.

## How Monitors Work

Each monitor independently observes your test runs and tracks two states per test: **active** (problematic behavior detected) or **inactive** (no problematic behavior). A test's overall status is determined by combining all of its monitors, with the most severe status winning:
Each monitor independently observes your test runs and tracks one of two states for each test: **active** (problematic behavior detected) or **inactive** (no problematic behavior). When a monitor transitions to active, it executes its configured action; when it resolves, it undoes that action (restoring the test's health status, or removing the labels it applied).

For monitors whose action is **Classify test status** (referred to below as _classifying monitors_), the test's overall status is determined by combining all such monitors, with the most severe status winning:

| Priority | Status | Condition |
|----------|--------|-----------|
| Highest | **Broken** | Any enabled broken-type monitor (failure rate or failure count) is active for this test |
| Middle | **Flaky** | Any enabled flaky-type monitor (failure rate, failure count, or pass-on-retry) is active |
| Lowest | **Healthy** | No active monitors |
| Lowest | **Healthy** | No active classifying monitor |

If a test triggers both a broken monitor and a flaky monitor simultaneously, it shows as **Broken**. When the broken monitor resolves (e.g., you fix the regression and the failure rate drops), the test transitions to **Flaky** if a flaky monitor is still active, or to **Healthy** if no monitors remain active.
If a test triggers both a broken monitor and a flaky monitor simultaneously, it shows as **Broken**. When the broken monitor resolves (e.g., you fix the regression and the failure rate drops), the test transitions to **Flaky** if a flaky monitor is still active, or to **Healthy** if no classifying monitors remain active.

A test stays in its detected state until every relevant monitor that flagged it has independently resolved.
A test stays in its detected state until every classifying monitor that flagged it has independently resolved. Monitors configured to apply labels do not contribute to this status calculation — they only add or remove labels.
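As a rough illustration of the precedence rule above (a minimal sketch, not Trunk's implementation; the `Status` enum and the monitor tuples are assumed shapes for this example):

```python
from enum import IntEnum

class Status(IntEnum):
    HEALTHY = 0
    FLAKY = 1
    BROKEN = 2

def overall_status(active_monitors):
    """Combine the monitors currently active for one test.

    Each entry is (action, detection_type); labeling monitors carry no
    detection type and are ignored. Among classifying monitors, the most
    severe status wins, and no active classifying monitor means healthy.
    """
    classified = [d for action, d in active_monitors if action == "classify"]
    return max(classified, default=Status.HEALTHY)

print(overall_status([("classify", Status.BROKEN), ("classify", Status.FLAKY)]))  # Status.BROKEN
print(overall_status([("classify", Status.FLAKY), ("labels", None)]))             # Status.FLAKY
print(overall_status([("labels", None)]))                                         # Status.HEALTHY
```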

### Disabling or Deleting a Monitor

When you disable or delete a monitor, it is immediately set to **resolved** for every test case in the repo. This triggers a status re-evaluation for all affected tests. If the disabled monitor was the only active monitor for a test, that test transitions to healthy. If other monitors are still active, the test remains in the most severe active state.
When you disable or delete a monitor, it is immediately set to **resolved** for every test case in the repo. For a classifying monitor, this triggers a status re-evaluation for all affected tests: if the disabled monitor was the only active classifying monitor for a test, that test transitions to healthy; if others are still active, the test remains in the most severe active state. For a labeling monitor, the labels it had applied are removed (subject to its **Remove these labels when the monitor resolves** setting).

For example, if you have a broken failure rate monitor and a flaky pass-on-retry monitor, and you disable the broken monitor, any test that was only flagged by the broken monitor will become healthy. A test flagged by both will transition from broken to flaky (because pass-on-retry is still active).

## Monitor Types

| Monitor | What it detects | Detection type | Plan availability | Default state |
| Monitor | What it detects | Available actions | Plan availability | Default state |
|---|---|---|---|---|
| [**Pass-on-Retry**](pass-on-retry-monitor.md) | A test fails then passes on the same commit (retry after failure) | Flaky | Team and above | Enabled |
| [**Failure Rate**](failure-rate-monitor.md) | Failure rate exceeds a configured percentage over a time window | Flaky or Broken | Paid plans | Disabled |
| [**Failure Count**](failure-count-monitor.md) | A test accumulates a configured number of failures in a rolling window | Flaky or Broken | Paid plans | Disabled |
| [**Pass-on-Retry**](pass-on-retry-monitor.md) | A test fails then passes on the same commit (retry after failure) | Classify (flaky) or [apply labels](../management/test-labels.md#automatic-labeling-from-monitors) | Team and above | Enabled |
| [**Failure Rate**](failure-rate-monitor.md) | Failure rate exceeds a configured percentage over a time window | Classify (flaky or broken) or [apply labels](../management/test-labels.md#automatic-labeling-from-monitors) | Paid plans | Disabled |
| [**Failure Count**](failure-count-monitor.md) | A test accumulates a configured number of failures in a rolling window | Classify (flaky or broken) or [apply labels](../management/test-labels.md#automatic-labeling-from-monitors) | Paid plans | Disabled |

You can run multiple monitors simultaneously. For example, you might use pass-on-retry to catch classic retry-based flakiness while also running failure rate monitors scoped to different branches. A common pattern is to pair a broken-type failure rate monitor (catching consistently failing tests) with a flaky-type failure rate monitor (catching intermittently failing tests). See [Failure Rate Monitor: Recommended Configurations](failure-rate-monitor.md#recommended-configurations) for details.

The [failure count monitor](failure-count-monitor.md) complements failure rate monitors by reacting to individual failures rather than failure rates. Use it on branches where any failure is a meaningful signal, like `main` or merge queue branches.

If you need to manually flag a test that automated monitors haven't caught, use [Flag as Flaky](flag-as-flaky.md) from the test detail page.

## Dry-Running with Labels

You can preview how a new classifying monitor would behave by deploying it as a labeling monitor first. Because **Apply labels** attaches labels without changing health status, you can let the monitor run on live test data, see which tests it activates on, refine the settings, and only flip it to **Classify test status** once you trust the configuration.

The flow is typically:

1. Create the monitor with **Apply labels** and a dedicated label (e.g., `would-be-flaky`).
2. Let the monitor run for a few cycles and observe which tests pick up the label.
3. Refine the settings until the labeled set matches what you want classified.
4. Switch the monitor's action to **Classify test status**.

The Preview Panel on each monitor config form shows a static snapshot at configuration time, but a label dry-run validates the monitor against live runs without committing to a status change.

## Branch-Aware Detection

Tests often behave differently depending on where they run. Failures on `main` are usually unexpected and signal flakiness. Failures on PR branches may be expected during active development. Merge queue failures are suspicious because the code has already passed PR checks.
@@ -66,7 +81,7 @@ You can mute a monitor from the test case view in the Trunk app. When muting, yo
| 7 days |
| 30 days |

While muted, the monitor is excluded from the test's status calculation. If the muted monitor was the only active monitor, the test transitions from flaky to healthy for the duration of the mute. When the mute expires, the monitor is automatically included in the next status evaluation. If it's still active, the test will be flagged as flaky again.
While muted, the monitor is excluded from the test's status calculation. If the muted monitor was the only active classifying monitor, the test transitions from flaky to healthy for the duration of the mute. When the mute expires, the monitor is automatically included in the next status evaluation. If it's still active, the test will be flagged again.
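Conceptually, a mute just excludes that monitor from evaluation until the mute expires (an assumed data shape, for illustration only):

```python
from datetime import datetime

def monitors_for_evaluation(active_monitors, mute_expiry, now):
    """Skip monitors whose mute has not yet expired. `mute_expiry` maps a
    monitor id to when its mute ends; unmuted monitors are always included,
    and an expired mute is ignored at the next evaluation."""
    return [m for m in active_monitors if mute_expiry.get(m, now) <= now]

now = datetime(2024, 5, 2, 9, 0)
active = ["pass-on-retry", "failure-rate-flaky"]
muted = {"pass-on-retry": datetime(2024, 5, 8, 9, 0)}  # muted for 7 days
print(monitors_for_evaluation(active, muted, now))      # ['failure-rate-flaky']
```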

You can also unmute a monitor early from the test case view.

40 changes: 17 additions & 23 deletions flaky-tests/detection/failure-count-monitor.md
@@ -18,18 +18,9 @@ Use the failure count monitor when you want immediate visibility into test failu

If you need to detect patterns of intermittent failure over time (e.g., a test that fails 20% of the time), use a [failure rate monitor](failure-rate-monitor.md) instead. If you want to catch tests that fail and then pass on retry within a single commit, [pass-on-retry](pass-on-retry-monitor.md) handles that automatically.

## Detection Type

Each failure count monitor has a **detection type** -- either **flaky** or **broken** -- which controls what status a test receives when the monitor flags it:

- **Flaky monitors** are appropriate when failures on the monitored branch are likely non-deterministic. A test that fails once on `main` but passes on retry is probably flaky.
- **Broken monitors** are appropriate when failures indicate a real regression. If a test fails on `main` and you expect it to keep failing until someone fixes it, broken is the right classification.

The detection type is set at creation and cannot be changed afterward. If you need to switch a monitor's type, create a new monitor with the desired type and disable the old one.

## How It Works

The monitor counts the number of test failures on configured branches within a rolling time window. When a test reaches the configured failure count, it is flagged.
The monitor counts the number of test failures on configured branches within a rolling time window. When a test reaches the configured failure count, the monitor activates and runs its configured [action](#action) — by default, flagging the test as flaky or broken.
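A minimal sketch of that activation check (the window and threshold values here are made up for the example; each monitor uses whatever you configured):

```python
from datetime import datetime, timedelta

def monitor_active(failure_times, now, window=timedelta(days=7), threshold=3):
    """True once the test has accumulated `threshold` failures on the
    monitored branches within the rolling window ending at `now`."""
    recent = [t for t in failure_times if now - t <= window]
    return len(recent) >= threshold

now = datetime(2024, 5, 1, 15, 0)
failures = [now - timedelta(days=5), now - timedelta(days=1), now - timedelta(hours=2)]
print(monitor_active(failures, now))  # True: three failures inside the 7-day window
```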

### Example

@@ -72,7 +63,7 @@ The window should be long enough to capture the failures you care about but shor

### Resolution Timeout

How long a flagged test must go without any new failures before it is automatically resolved. This is the only way a failure count monitor resolves. There is no "recovery rate" or sample-based resolution like the failure rate monitor.
How long a flagged test must go without any new failures before it is automatically resolved. This is the only way a failure count monitor resolves — there is no "recovery rate" or sample-based resolution like the failure rate monitor, and no stale timeout. If a test stops running entirely (e.g., it was deleted or renamed), it stays flagged until the resolution timeout elapses from its last observed failure.

For example, with a resolution timeout of 2 hours, a test that was flagged at 3:00 PM will resolve at 5:00 PM if no new failures occur. If a new failure arrives at 4:30 PM, the clock resets, and the test will not resolve until 6:30 PM.
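The same arithmetic in a short sketch (illustrative only; the timeout comes from the monitor's configuration):

```python
from datetime import datetime, timedelta

def resolves_at(last_failure, resolution_timeout=timedelta(hours=2)):
    """A flagged test resolves once the timeout elapses with no new failure;
    every new failure pushes the resolution time back."""
    return last_failure + resolution_timeout

print(resolves_at(datetime(2024, 5, 1, 15, 0)))   # 2024-05-01 17:00 (5:00 PM)
print(resolves_at(datetime(2024, 5, 1, 16, 30)))  # 2024-05-01 18:30 (6:30 PM after the 4:30 PM failure)
```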

@@ -86,13 +77,20 @@ Which branches the monitor evaluates. You can specify branch names or glob patte

Branch patterns work the same way as [failure rate monitor branch patterns](failure-rate-monitor.md#branch-pattern-syntax), including glob syntax and merge queue patterns. Refer to that section for pattern syntax, examples, and tips.
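For a rough feel of how glob-style branch patterns select branches (this uses Python's `fnmatch` purely for illustration; Trunk's exact pattern semantics, including merge queue patterns, are described in the linked section):

```python
from fnmatch import fnmatch

patterns = ["main", "release/*"]  # example patterns, not defaults
for branch in ["main", "release/2024.05", "feature/login"]:
    matched = any(fnmatch(branch, p) for p in patterns)
    print(f"{branch}: {'monitored' if matched else 'ignored'}")
# main: monitored, release/2024.05: monitored, feature/login: ignored
```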

## Resolution Behavior
### Action

What happens when the monitor activates on a test. You pick the action at creation and can switch it at any time.

A failure count monitor resolves in one way: **the test stops failing for long enough.**
#### Classify test status (default)

When the configured resolution timeout elapses without a new failure on any monitored branch, the test is resolved as healthy. There is no rate-based recovery and no stale timeout. If a test stops running entirely (e.g., it was deleted or renamed), it remains in its flagged state until the resolution timeout passes from the last observed failure.
The test's status is set according to the monitor's **detection type**, and restored to healthy when the monitor resolves. The detection type is either:

This time-based approach means you don't need to wait for enough passing runs to bring a failure rate down. Once the test is quiet, it resolves.
* **Flaky** — appropriate when failures on the monitored branch are likely non-deterministic. A test that fails once on `main` but passes on retry is probably flaky.
* **Broken** — appropriate when failures indicate a real regression. If a test fails on `main` and you expect it to keep failing until someone fixes it, broken is the right classification.

#### Apply labels

The configured labels are added to the test while the monitor is active. The test's health status is not changed by this monitor. See [Automatic labeling from monitors](../management/test-labels.md#automatic-labeling-from-monitors) for configuration details and expected behavior.

## Preview Panel

@@ -102,14 +100,6 @@ Once the monitor configuration produces detections, the panel shows a **Failing

You can search the list by test name or parent test name. The search is case-insensitive and filters as you type. If no tests match your search term, the list shows a "No tests match" message. When more than 100 tests are detected, only the first 100 are shown with a notice to narrow your search.

## Muting

You can temporarily mute a failure count monitor for a specific test case. See [Muting monitors](README.md#muting-monitors) for details.

## Preview Panel

When creating or editing a failure count monitor, a preview panel shows which tests the current configuration would flag based on recent data.

### Status Filter

A **status filter dropdown** in the preview panel lets you filter the test list to any combination of statuses: **Healthy**, **Flaky**, and **Broken**. By default, all statuses are shown.
@@ -126,6 +116,10 @@ If no tests match the active filter, the empty state includes a hint to clear th

For repositories with a large number of matching tests, preview results may be truncated. When this happens, an amber warning appears in the panel. The truncation applies to the list of tests shown, not to the underlying detection logic — the monitor evaluates all matching tests when active.

## Muting

You can temporarily mute a failure count monitor for a specific test case. See [Muting monitors](README.md#muting-monitors) for details.

## Choosing Between Monitors

| Scenario | Recommended monitor |