From 17cece765d7c597a0a362899c628d693fac05245 Mon Sep 17 00:00:00 2001 From: Max Cruz Date: Wed, 13 May 2026 14:49:49 -0400 Subject: [PATCH 1/4] monitor label actions --- flaky-tests/detection/README.md | 22 +++++++------ .../detection/failure-count-monitor.md | 6 +++- flaky-tests/detection/failure-rate-monitor.md | 6 +++- .../detection/pass-on-retry-monitor.md | 8 +++-- flaky-tests/management/test-labels.md | 31 +++++++++++++++---- 5 files changed, 53 insertions(+), 20 deletions(-) diff --git a/flaky-tests/detection/README.md b/flaky-tests/detection/README.md index 199772f0..536bf3f8 100644 --- a/flaky-tests/detection/README.md +++ b/flaky-tests/detection/README.md @@ -4,21 +4,23 @@ description: Learn how Trunk detects and labels flaky and broken tests # Flake Detection -Flake Detection automatically identifies problematic tests in your test suite by monitoring test behavior over time. Instead of a single set of built-in detection rules, Trunk uses **monitors**, independent detectors that each watch for a specific pattern. When any monitor flags a test, it's marked as flaky or broken. When all monitors agree the test has recovered, it returns to healthy. +Flake Detection automatically identifies problematic tests in your test suite by monitoring test behavior over time. Instead of a single set of built-in detection rules, Trunk uses **monitors**, independent detectors that each watch for a specific pattern. When a monitor activates on a test, it runs the **action** you configured on the monitor — either classifying the test as flaky or broken, or [applying labels](../management/test-labels.md#automatic-labeling-from-monitors) to it. ## How Monitors Work -Each monitor independently observes your test runs and tracks two states per test: **active** (problematic behavior detected) or **inactive** (no problematic behavior). A test's overall status is determined by combining all of its monitors, with the most severe status winning: +Each monitor independently observes your test runs and tracks two states per test: **active** (problematic behavior detected) or **inactive** (no problematic behavior). When a monitor transitions to active, it executes its configured action; when it resolves, it undoes that action (restoring health status, or removing the labels it applied). + +For monitors whose action is **Classify test status**, the test's overall status is determined by combining all such monitors, with the most severe status winning: | Priority | Status | Condition | |----------|--------|-----------| | Highest | **Broken** | Any enabled broken-type monitor (failure rate or failure count) is active for this test | | Middle | **Flaky** | Any enabled flaky-type monitor (failure rate, failure count, or pass-on-retry) is active | -| Lowest | **Healthy** | No active monitors | +| Lowest | **Healthy** | No active classifying monitor | -If a test triggers both a broken monitor and a flaky monitor simultaneously, it shows as **Broken**. When the broken monitor resolves (e.g., you fix the regression and the failure rate drops), the test transitions to **Flaky** if a flaky monitor is still active, or to **Healthy** if no monitors remain active. +If a test triggers both a broken monitor and a flaky monitor simultaneously, it shows as **Broken**. When the broken monitor resolves (e.g., you fix the regression and the failure rate drops), the test transitions to **Flaky** if a flaky monitor is still active, or to **Healthy** if no classifying monitors remain active. 
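To make the precedence concrete, here is a minimal sketch of the most-severe-wins combination, assuming a simplified model in which each classifying monitor exposes a detection type and an active flag (the names and types here are illustrative, not Trunk's actual API):

```python
from dataclasses import dataclass

@dataclass
class MonitorState:
    detection_type: str  # "flaky" or "broken"; hypothetical field names
    active: bool

def combine_status(monitors: list[MonitorState]) -> str:
    """Combine all classifying monitors for one test; most severe wins."""
    active_types = {m.detection_type for m in monitors if m.active}
    if "broken" in active_types:
        return "Broken"   # any active broken-type monitor takes priority
    if "flaky" in active_types:
        return "Flaky"
    return "Healthy"      # no active classifying monitors

# A test flagged by both kinds shows as Broken; when the broken monitor
# resolves it drops to Flaky, and to Healthy once the flaky one resolves too.
assert combine_status([MonitorState("broken", True), MonitorState("flaky", True)]) == "Broken"
assert combine_status([MonitorState("broken", False), MonitorState("flaky", True)]) == "Flaky"
assert combine_status([MonitorState("broken", False), MonitorState("flaky", False)]) == "Healthy"
```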
-A test stays in its detected state until every relevant monitor that flagged it has independently resolved.
+A test stays in its detected state until every classifying monitor that flagged it has independently resolved. Monitors configured to apply labels do not contribute to this status calculation — they only add or remove labels.

### Disabling or Deleting a Monitor

@@ -28,11 +30,11 @@ For example, if you have a broken failure rate monitor and a flaky pass-on-retry

## Monitor Types

-| Monitor | What it detects | Detection type | Plan availability | Default state |
+| Monitor | What it detects | Available actions | Plan availability | Default state |
|---|---|---|---|---|
-| [**Pass-on-Retry**](pass-on-retry-monitor.md) | A test fails then passes on the same commit (retry after failure) | Flaky | Team and above | Enabled |
-| [**Failure Rate**](failure-rate-monitor.md) | Failure rate exceeds a configured percentage over a time window | Flaky or Broken | Paid plans | Disabled |
-| [**Failure Count**](failure-count-monitor.md) | A test accumulates a configured number of failures in a rolling window | Flaky or Broken | Paid plans | Disabled |
+| [**Pass-on-Retry**](pass-on-retry-monitor.md) | A test fails then passes on the same commit (retry after failure) | Classify (flaky) or [apply labels](../management/test-labels.md#automatic-labeling-from-monitors) | Team and above | Enabled |
+| [**Failure Rate**](failure-rate-monitor.md) | Failure rate exceeds a configured percentage over a time window | Classify (flaky or broken) or [apply labels](../management/test-labels.md#automatic-labeling-from-monitors) | Paid plans | Disabled |
+| [**Failure Count**](failure-count-monitor.md) | A test accumulates a configured number of failures in a rolling window | Classify (flaky or broken) or [apply labels](../management/test-labels.md#automatic-labeling-from-monitors) | Paid plans | Disabled |

You can run multiple monitors simultaneously. For example, you might use pass-on-retry to catch classic retry-based flakiness while also running failure rate monitors scoped to different branches. A common pattern is to pair a broken-type failure rate monitor (catching consistently failing tests) with a flaky-type failure rate monitor (catching intermittently failing tests). See [Failure Rate Monitor: Recommended Configurations](failure-rate-monitor.md#recommended-configurations) for details.

@@ -66,7 +68,7 @@ You can mute a monitor from the test case view in the Trunk app. When muting, yo
| 7 days |
| 30 days |

-While muted, the monitor is excluded from the test's status calculation. If the muted monitor was the only active monitor, the test transitions from flaky to healthy for the duration of the mute. When the mute expires, the monitor is automatically included in the next status evaluation. If it's still active, the test will be flagged as flaky again.
+While muted, the monitor is excluded from the test's status calculation. If the muted monitor was the only active classifying monitor, the test transitions to healthy for the duration of the mute. When the mute expires, the monitor is automatically included in the next status evaluation. If it's still active, the test will be flagged again.

You can also unmute a monitor early from the test case view.
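Muting fits the same model: a muted monitor is simply excluded from the status combination until its mute expires. A minimal sketch under the same illustrative assumptions (mute records keyed by monitor, expiry timestamps invented for the example):

```python
from datetime import datetime, timedelta

def without_muted(states: dict[str, bool], mutes: dict[str, datetime],
                  now: datetime) -> dict[str, bool]:
    """Drop monitors whose mute has not yet expired from the status calculation."""
    return {monitor: active for monitor, active in states.items()
            if mutes.get(monitor, now) <= now}  # never muted, or mute expired

now = datetime(2026, 5, 13, 15, 0)
states = {"pass-on-retry": True}                     # monitor is still active...
mutes = {"pass-on-retry": now + timedelta(days=7)}   # ...but muted for 7 days
assert without_muted(states, mutes, now) == {}                          # excluded
assert without_muted(states, mutes, now + timedelta(days=8)) == states  # re-included
```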
diff --git a/flaky-tests/detection/failure-count-monitor.md b/flaky-tests/detection/failure-count-monitor.md index 89a17a5c..62888fa3 100644 --- a/flaky-tests/detection/failure-count-monitor.md +++ b/flaky-tests/detection/failure-count-monitor.md @@ -20,7 +20,7 @@ If you need to detect patterns of intermittent failure over time (e.g., a test t ## Detection Type -Each failure count monitor has a **detection type** -- either **flaky** or **broken** -- which controls what status a test receives when the monitor flags it: +Each failure count monitor has a **detection type** -- either **flaky** or **broken** -- which controls what status a test receives when the monitor flags it. The detection type applies when the monitor's [action](#action) is **Classify test status** (the default). If you switch the action to **Apply labels** instead, the detection type is unused. - **Flaky monitors** are appropriate when failures on the monitored branch are likely non-deterministic. A test that fails once on `main` but passes on retry is probably flaky. - **Broken monitors** are appropriate when failures indicate a real regression. If a test fails on `main` and you expect it to keep failing until someone fixes it, broken is the right classification. @@ -102,6 +102,10 @@ Once the monitor configuration produces detections, the panel shows a **Failing You can search the list by test name or parent test name. The search is case-insensitive and filters as you type. If no tests match your search term, the list shows a "No tests match" message. When more than 100 tests are detected, only the first 100 are shown with a notice to narrow your search. +## Action + +By default, an active failure count monitor classifies the test according to its [detection type](#detection-type) (flaky or broken) and restores it to healthy on resolution. You can switch the monitor's action to **Apply labels** instead — see [Automatic labeling from monitors](../management/test-labels.md#automatic-labeling-from-monitors). + ## Muting You can temporarily mute a failure count monitor for a specific test case. See [Muting monitors](README.md#muting-monitors) for details. diff --git a/flaky-tests/detection/failure-rate-monitor.md b/flaky-tests/detection/failure-rate-monitor.md index 2abac7f3..c8972102 100644 --- a/flaky-tests/detection/failure-rate-monitor.md +++ b/flaky-tests/detection/failure-rate-monitor.md @@ -10,7 +10,7 @@ You can create multiple failure rate monitors with different configurations. Thi ## Detection Type -Each failure rate monitor has a **detection type** — either **flaky** or **broken** — which controls what status a test receives when the monitor flags it: +Each failure rate monitor has a **detection type** — either **flaky** or **broken** — which controls what status a test receives when the monitor flags it. The detection type applies when the monitor's [action](#action) is **Classify test status** (the default). If you switch the action to **Apply labels** instead, the detection type is unused. - **Flaky monitors** catch tests that fail intermittently (e.g., 20–50% failure rate). These are typically caused by timing issues, shared state, or non-deterministic behavior. - **Broken monitors** catch tests that fail consistently at a high rate (e.g., 80%+ failure rate). These usually indicate a real regression — something in the code or environment is genuinely broken and needs a fix. 
@@ -192,6 +192,10 @@ A flagged test resolves in one of two ways: Tests that are still running but haven't accumulated enough runs to meet the minimum sample size remain in their current state. They won't be resolved until there's enough data to make a determination. +## Action + +By default, an active failure rate monitor classifies the test according to its [detection type](#detection-type) (flaky or broken) and restores it to healthy on resolution. You can switch the monitor's action to **Apply labels** instead — see [Automatic labeling from monitors](../management/test-labels.md#automatic-labeling-from-monitors). + ## Muting You can temporarily mute a failure rate monitor for a specific test case. See [Muting monitors](README.md#muting-monitors) for details. diff --git a/flaky-tests/detection/pass-on-retry-monitor.md b/flaky-tests/detection/pass-on-retry-monitor.md index 191ee070..1258e4ec 100644 --- a/flaky-tests/detection/pass-on-retry-monitor.md +++ b/flaky-tests/detection/pass-on-retry-monitor.md @@ -10,9 +10,9 @@ By default, this monitor evaluates test runs on all branches. You can scope it t ## How It Works -The monitor continuously scans your test runs looking for commits where a test has both a failure and a success. When it finds one, the test is flagged as flaky. +The monitor continuously scans your test runs looking for commits where a test has both a failure and a success. When it finds one, the monitor activates on that test and runs its configured [action](#action): by default, the test is flagged as flaky. -Once flagged, the test remains flaky until no pass-on-retry behavior has been observed for a configurable recovery period. This prevents tests from bouncing between flaky and healthy if they only fail intermittently. +Once active, the monitor stays active on the test until no pass-on-retry behavior has been observed for a configurable recovery period. This prevents tests from bouncing between flaky and healthy if they only fail intermittently. ### Example @@ -56,6 +56,10 @@ Branch scope uses the same glob syntax as [failure rate monitor branch patterns] Changes to branch scope take effect for newly detected events. Previously detected flaky tests are not re-evaluated. +## Action + +By default, an active pass-on-retry monitor classifies the test as **flaky** and restores it to healthy on resolution. You can switch the monitor's action to **Apply labels** instead — see [Automatic labeling from monitors](../management/test-labels.md#automatic-labeling-from-monitors). + ## When Detection Happens Pass-on-retry detection runs continuously as new test results arrive. A failure and its corresponding retry don't need to arrive at exactly the same time. diff --git a/flaky-tests/management/test-labels.md b/flaky-tests/management/test-labels.md index 022edab1..96bb08e5 100644 --- a/flaky-tests/management/test-labels.md +++ b/flaky-tests/management/test-labels.md @@ -4,7 +4,7 @@ description: Organize and categorize test cases with organization-scoped labels. # Test Labels -Test labels are organization-scoped tags you can apply to individual test cases to organize, filter, and categorize your test suite. Labels are applied manually today; see [Automatic labeling from monitors](#automatic-labeling-from-monitors) for what's coming. +Test labels are organization-scoped tags you can apply to individual test cases to organize, filter, and categorize your test suite. 
Labels can be applied [manually from the test detail page](#apply-and-remove-labels-on-a-test-case) or [automatically by a monitor](#automatic-labeling-from-monitors).
Labels applied to a test on the test detail page
@@ -13,7 +13,7 @@ Test labels are organization-scoped tags you can apply to individual test cases Labels are created, edited, and deleted at **Settings > Organization > Test Labels**. Each label has a name, an optional description, and a color used for its chip in the UI. The settings page also shows how many test cases each label is currently applied to. {% hint style="warning" %} -Deleting a label removes it from every test case it's applied to; this cannot be undone. +Deleting a label removes it from every test case it's applied to; this cannot be undone. A label that is referenced by a monitor's [label action](#automatic-labeling-from-monitors) cannot be deleted — the settings page lists the monitors that still reference it so you can clear those references first. {% endhint %}
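The deletion rule is effectively a referential-integrity check; a minimal sketch, assuming invented structures (a `monitors` list whose entries carry an `action_labels` list), not Trunk's implementation:

```python
def delete_label(label: str, monitors: list[dict]) -> None:
    """Refuse deletion while any monitor's label action still references the label."""
    referencing = [m["name"] for m in monitors
                   if label in m.get("action_labels", [])]
    if referencing:
        raise ValueError(f"{label!r} is still referenced by monitors {referencing}; "
                         "clear those references first")
    # ...otherwise delete; in the real system this also strips the label
    # from every test case, and that cannot be undone.
```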
Settings page to manage test labels
@@ -32,11 +32,30 @@ On the tests list, you can filter the table down to test cases that have a parti ### Automatic labeling from monitors -{% hint style="info" %} -**Coming soon.** Monitors will be able to automatically apply and remove labels on test cases based on test behavior. More details will be published when this is available. -{% endhint %} +The [pass-on-retry](../detection/pass-on-retry-monitor.md), [failure rate](../detection/failure-rate-monitor.md), and [failure count](../detection/failure-count-monitor.md) monitors can be configured to apply one or more labels to a test instead of classifying it as flaky or broken. Use this when you want a monitor to surface a pattern (for example, _fails on retry on PR branches_) for triage or filtering without changing the test's health status. + +#### Choose the monitor's action + +When you create or edit one of these monitors, the **Action** section asks what happens when the monitor activates: + +* **Classify test status** (the default) — marks the test as flaky or broken while the monitor is active, and restores the test to healthy when the monitor resolves. This is the original behavior. +* **Apply labels** — adds the configured labels to the test while the monitor is active. The test's health status is not changed by this monitor. + +A monitor uses one action or the other, not both. Switching an active monitor to **Apply labels** stops it from contributing to the test's health status. + +#### Configure the label action + +After selecting **Apply labels**, pick one or more labels from your organization's label set. You can create a new label inline if the one you need doesn't exist yet — the new label is added to the org-wide set in [Settings > Organization > Test Labels](#manage-labels). + +By default, the labels are removed when the monitor resolves. Turn off **Remove these labels when the monitor resolves** to keep them on the test after the monitor stops reporting. + +#### How monitor-applied labels appear + +Monitor-applied labels show up in the same places as manually applied labels: as chips on the [tests list](#filter-tests-by-label) and on the test detail page. Hovering a label tells you whether a user, one or more monitors, or a combination applied it, along with when it was first applied. + +When the same label is applied to a test by multiple sources (for example, by a user and by a monitor, or by two different monitors), the label stays on the test until every source removes it. Removing the source (such as disabling the monitor or switching its action away from **Apply labels**) clears that source's contribution on the next evaluation. 
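One way to picture the multi-source behavior: each applied label carries the set of sources currently applying it, and the label disappears only when that set empties. A sketch with hypothetical identifiers, not Trunk's data model:

```python
# Labels on one test case, each mapped to the sources currently applying it.
labels: dict[str, set[str]] = {}

def apply_label(label: str, source: str) -> None:
    labels.setdefault(label, set()).add(source)

def remove_label(label: str, source: str) -> None:
    sources = labels.get(label, set())
    sources.discard(source)
    if not sources:                 # gone only when every source has removed it
        labels.pop(label, None)

apply_label("retries-on-pr", "user:max")
apply_label("retries-on-pr", "monitor:pass-on-retry")
remove_label("retries-on-pr", "monitor:pass-on-retry")  # monitor resolves
assert "retries-on-pr" in labels                        # user still applies it
remove_label("retries-on-pr", "user:max")
assert "retries-on-pr" not in labels
```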
### Related * [Managing detected flaky tests](managing-detected-flaky-tests.md) — a step-by-step process for handling detected flaky tests -* [Flake Detection](../detection/) — monitors that classify tests as flaky or broken +* [Flake Detection](../detection/) — monitors that watch for problematic test behavior From 543bd686a47a4df206058330a7c93ed6441e8bca Mon Sep 17 00:00:00 2001 From: Max Cruz Date: Wed, 13 May 2026 14:57:41 -0400 Subject: [PATCH 2/4] pr feedback --- flaky-tests/detection/README.md | 2 +- .../detection/failure-count-monitor.md | 22 ++++++++----------- flaky-tests/management/test-labels.md | 2 +- 3 files changed, 11 insertions(+), 15 deletions(-) diff --git a/flaky-tests/detection/README.md b/flaky-tests/detection/README.md index 536bf3f8..d0cbce3b 100644 --- a/flaky-tests/detection/README.md +++ b/flaky-tests/detection/README.md @@ -10,7 +10,7 @@ Flake Detection automatically identifies problematic tests in your test suite by Each monitor independently observes your test runs and tracks two states per test: **active** (problematic behavior detected) or **inactive** (no problematic behavior). When a monitor transitions to active, it executes its configured action; when it resolves, it undoes that action (restoring health status, or removing the labels it applied). -For monitors whose action is **Classify test status**, the test's overall status is determined by combining all such monitors, with the most severe status winning: +For monitors whose action is **Classify test status** (referred to below as _classifying monitors_), the test's overall status is determined by combining all such monitors, with the most severe status winning: | Priority | Status | Condition | |----------|--------|-----------| diff --git a/flaky-tests/detection/failure-count-monitor.md b/flaky-tests/detection/failure-count-monitor.md index 62888fa3..e35f9bce 100644 --- a/flaky-tests/detection/failure-count-monitor.md +++ b/flaky-tests/detection/failure-count-monitor.md @@ -20,7 +20,7 @@ If you need to detect patterns of intermittent failure over time (e.g., a test t ## Detection Type -Each failure count monitor has a **detection type** -- either **flaky** or **broken** -- which controls what status a test receives when the monitor flags it. The detection type applies when the monitor's [action](#action) is **Classify test status** (the default). If you switch the action to **Apply labels** instead, the detection type is unused. +Each failure count monitor has a **detection type** — either **flaky** or **broken** — which controls what status a test receives when the monitor flags it. The detection type applies when the monitor's [action](#action) is **Classify test status** (the default). If you switch the action to **Apply labels** instead, the detection type is unused. - **Flaky monitors** are appropriate when failures on the monitored branch are likely non-deterministic. A test that fails once on `main` but passes on retry is probably flaky. - **Broken monitors** are appropriate when failures indicate a real regression. If a test fails on `main` and you expect it to keep failing until someone fixes it, broken is the right classification. @@ -102,18 +102,6 @@ Once the monitor configuration produces detections, the panel shows a **Failing You can search the list by test name or parent test name. The search is case-insensitive and filters as you type. If no tests match your search term, the list shows a "No tests match" message. 
When more than 100 tests are detected, only the first 100 are shown with a notice to narrow your search. -## Action - -By default, an active failure count monitor classifies the test according to its [detection type](#detection-type) (flaky or broken) and restores it to healthy on resolution. You can switch the monitor's action to **Apply labels** instead — see [Automatic labeling from monitors](../management/test-labels.md#automatic-labeling-from-monitors). - -## Muting - -You can temporarily mute a failure count monitor for a specific test case. See [Muting monitors](README.md#muting-monitors) for details. - -## Preview Panel - -When creating or editing a failure count monitor, a preview panel shows which tests the current configuration would flag based on recent data. - ### Status Filter A **status filter dropdown** in the preview panel lets you filter the test list to any combination of statuses: **Healthy**, **Flaky**, and **Broken**. By default, all statuses are shown. @@ -130,6 +118,14 @@ If no tests match the active filter, the empty state includes a hint to clear th For repositories with a large number of matching tests, preview results may be truncated. When this happens, an amber warning appears in the panel. The truncation applies to the list of tests shown, not to the underlying detection logic — the monitor evaluates all matching tests when active. +## Action + +By default, an active failure count monitor classifies the test according to its [detection type](#detection-type) (flaky or broken) and restores it to healthy on resolution. You can switch the monitor's action to **Apply labels** instead — see [Automatic labeling from monitors](../management/test-labels.md#automatic-labeling-from-monitors). + +## Muting + +You can temporarily mute a failure count monitor for a specific test case. See [Muting monitors](README.md#muting-monitors) for details. + ## Choosing Between Monitors | Scenario | Recommended monitor | diff --git a/flaky-tests/management/test-labels.md b/flaky-tests/management/test-labels.md index 96bb08e5..a683dbee 100644 --- a/flaky-tests/management/test-labels.md +++ b/flaky-tests/management/test-labels.md @@ -41,7 +41,7 @@ When you create or edit one of these monitors, the **Action** section asks what * **Classify test status** (the default) — marks the test as flaky or broken while the monitor is active, and restores the test to healthy when the monitor resolves. This is the original behavior. * **Apply labels** — adds the configured labels to the test while the monitor is active. The test's health status is not changed by this monitor. -A monitor uses one action or the other, not both. Switching an active monitor to **Apply labels** stops it from contributing to the test's health status. +A monitor uses one action or the other, not both. 
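One way to picture the either/or constraint is as a tagged union, where a monitor carries exactly one action value; the type and function names here are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class ClassifyTestStatus:
    detection_type: str             # "flaky" or "broken"

@dataclass
class ApplyLabels:
    labels: list[str]
    remove_on_resolve: bool = True  # mirrors the setting described below

Action = ClassifyTestStatus | ApplyLabels  # a monitor holds exactly one

def on_activate(action: Action, test: str) -> str:
    match action:
        case ClassifyTestStatus(detection_type=t):
            return f"mark {test} as {t}"
        case ApplyLabels(labels=ls):
            return f"label {test} with {ls}"  # health status untouched

print(on_activate(ApplyLabels(["would-be-flaky"]), "test_checkout_total"))
```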
#### Configure the label action From 607ce7ec4c90ba4d235184c475654f6e5edf8f84 Mon Sep 17 00:00:00 2001 From: Max Cruz Date: Wed, 13 May 2026 15:39:08 -0400 Subject: [PATCH 3/4] cleanup --- flaky-tests/detection/README.md | 2 +- .../detection/failure-count-monitor.md | 32 +++++-------- flaky-tests/detection/failure-rate-monitor.md | 48 +++++++------------ .../detection/pass-on-retry-monitor.md | 8 +++- 4 files changed, 37 insertions(+), 53 deletions(-) diff --git a/flaky-tests/detection/README.md b/flaky-tests/detection/README.md index d0cbce3b..344eb919 100644 --- a/flaky-tests/detection/README.md +++ b/flaky-tests/detection/README.md @@ -24,7 +24,7 @@ A test stays in its detected state until every classifying monitor that flagged ### Disabling or Deleting a Monitor -When you disable or delete a monitor, it is immediately set to **resolved** for every test case in the repo. This triggers a status re-evaluation for all affected tests. If the disabled monitor was the only active monitor for a test, that test transitions to healthy. If other monitors are still active, the test remains in the most severe active state. +When you disable or delete a monitor, it is immediately set to **resolved** for every test case in the repo. For a classifying monitor, this triggers a status re-evaluation for all affected tests: if the disabled monitor was the only active classifying monitor for a test, that test transitions to healthy; if others are still active, the test remains in the most severe active state. For a labeling monitor, the labels it had applied are removed (subject to its **Remove these labels when the monitor resolves** setting). For example, if you have a broken failure rate monitor and a flaky pass-on-retry monitor, and you disable the broken monitor, any test that was only flagged by the broken monitor will become healthy. A test flagged by both will transition from broken to flaky (because pass-on-retry is still active). diff --git a/flaky-tests/detection/failure-count-monitor.md b/flaky-tests/detection/failure-count-monitor.md index e35f9bce..db6ef251 100644 --- a/flaky-tests/detection/failure-count-monitor.md +++ b/flaky-tests/detection/failure-count-monitor.md @@ -18,18 +18,9 @@ Use the failure count monitor when you want immediate visibility into test failu If you need to detect patterns of intermittent failure over time (e.g., a test that fails 20% of the time), use a [failure rate monitor](failure-rate-monitor.md) instead. If you want to catch tests that fail and then pass on retry within a single commit, [pass-on-retry](pass-on-retry-monitor.md) handles that automatically. -## Detection Type - -Each failure count monitor has a **detection type** — either **flaky** or **broken** — which controls what status a test receives when the monitor flags it. The detection type applies when the monitor's [action](#action) is **Classify test status** (the default). If you switch the action to **Apply labels** instead, the detection type is unused. - -- **Flaky monitors** are appropriate when failures on the monitored branch are likely non-deterministic. A test that fails once on `main` but passes on retry is probably flaky. -- **Broken monitors** are appropriate when failures indicate a real regression. If a test fails on `main` and you expect it to keep failing until someone fixes it, broken is the right classification. - -The detection type is set at creation and cannot be changed afterward. 
If you need to switch a monitor's type, create a new monitor with the desired type and disable the old one. - ## How It Works -The monitor counts the number of test failures on configured branches within a rolling time window. When a test reaches the configured failure count, it is flagged. +The monitor counts the number of test failures on configured branches within a rolling time window. When a test reaches the configured failure count, the monitor activates and runs its configured [action](#action) — by default, flagging the test as flaky or broken. ### Example @@ -72,7 +63,7 @@ The window should be long enough to capture the failures you care about but shor ### Resolution Timeout -How long a flagged test must go without any new failures before it is automatically resolved. This is the only way a failure count monitor resolves. There is no "recovery rate" or sample-based resolution like the failure rate monitor. +How long a flagged test must go without any new failures before it is automatically resolved. This is the only way a failure count monitor resolves — there is no "recovery rate" or sample-based resolution like the failure rate monitor, and no stale timeout. If a test stops running entirely (e.g., it was deleted or renamed), it stays flagged until the resolution timeout elapses from its last observed failure. For example, with a resolution timeout of 2 hours, a test that was flagged at 3:00 PM will resolve at 5:00 PM if no new failures occur. If a new failure arrives at 4:30 PM, the clock resets, and the test will not resolve until 6:30 PM. @@ -86,13 +77,20 @@ Which branches the monitor evaluates. You can specify branch names or glob patte Branch patterns work the same way as [failure rate monitor branch patterns](failure-rate-monitor.md#branch-pattern-syntax), including glob syntax and merge queue patterns. Refer to that section for pattern syntax, examples, and tips. -## Resolution Behavior +### Action + +What happens when the monitor activates on a test. You pick the action at creation and can switch it at any time. -A failure count monitor resolves in one way: **the test stops failing for long enough.** +#### Classify test status (default) -When the configured resolution timeout elapses without a new failure on any monitored branch, the test is resolved as healthy. There is no rate-based recovery and no stale timeout. If a test stops running entirely (e.g., it was deleted or renamed), it remains in its flagged state until the resolution timeout passes from the last observed failure. +The test's status is set according to the monitor's **detection type**, and restored to healthy when the monitor resolves. The detection type is either: -This time-based approach means you don't need to wait for enough passing runs to bring a failure rate down. Once the test is quiet, it resolves. +* **Flaky** — appropriate when failures on the monitored branch are likely non-deterministic. A test that fails once on `main` but passes on retry is probably flaky. +* **Broken** — appropriate when failures indicate a real regression. If a test fails on `main` and you expect it to keep failing until someone fixes it, broken is the right classification. + +#### Apply labels + +The configured labels are added to the test while the monitor is active. The test's health status is not changed by this monitor. See [Automatic labeling from monitors](../management/test-labels.md#automatic-labeling-from-monitors) for how to configure and what to expect. 
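To make the windowed count and time-based resolution concrete, here is a rough sketch with simplified bookkeeping and invented names (`evaluate`, `failure_times`), not Trunk's implementation:

```python
from datetime import datetime, timedelta

def evaluate(failure_times: list[datetime], now: datetime, threshold: int,
             window: timedelta, resolution_timeout: timedelta,
             currently_active: bool) -> bool:
    """Return whether the monitor is active on this test at `now`."""
    in_window = [t for t in failure_times if now - t <= window]
    if len(in_window) >= threshold:
        return True                         # count reached within the window
    if currently_active and failure_times:
        # Resolution is purely time-based: quiet for the full timeout resolves.
        return now - max(failure_times) < resolution_timeout
    return currently_active

# 3 failures in a 1-hour window activate the monitor; it resolves once the
# test has been quiet for 2 hours (the clock resets on every new failure).
t0 = datetime(2026, 5, 13, 15, 0)
fails = [t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=20)]
assert evaluate(fails, t0 + timedelta(minutes=20), 3, timedelta(hours=1),
                timedelta(hours=2), currently_active=False)
assert not evaluate(fails, t0 + timedelta(hours=2, minutes=21), 3,
                    timedelta(hours=1), timedelta(hours=2), currently_active=True)
```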
## Preview Panel @@ -118,10 +116,6 @@ If no tests match the active filter, the empty state includes a hint to clear th For repositories with a large number of matching tests, preview results may be truncated. When this happens, an amber warning appears in the panel. The truncation applies to the list of tests shown, not to the underlying detection logic — the monitor evaluates all matching tests when active. -## Action - -By default, an active failure count monitor classifies the test according to its [detection type](#detection-type) (flaky or broken) and restores it to healthy on resolution. You can switch the monitor's action to **Apply labels** instead — see [Automatic labeling from monitors](../management/test-labels.md#automatic-labeling-from-monitors). - ## Muting You can temporarily mute a failure count monitor for a specific test case. See [Muting monitors](README.md#muting-monitors) for details. diff --git a/flaky-tests/detection/failure-rate-monitor.md b/flaky-tests/detection/failure-rate-monitor.md index c8972102..87e5bb65 100644 --- a/flaky-tests/detection/failure-rate-monitor.md +++ b/flaky-tests/detection/failure-rate-monitor.md @@ -6,22 +6,11 @@ description: Detect flaky or broken tests based on failure rate over a configura The failure rate monitor detects tests based on failure rate over a rolling time window. Unlike pass-on-retry, which looks for a specific pattern on a single commit, the failure rate monitor identifies tests that fail too often over a period of time, even if no individual failure looks like a retry. -You can create multiple failure rate monitors with different configurations. This is how you tailor detection to different branches, test volumes, sensitivity levels, and detection types. - -## Detection Type - -Each failure rate monitor has a **detection type** — either **flaky** or **broken** — which controls what status a test receives when the monitor flags it. The detection type applies when the monitor's [action](#action) is **Classify test status** (the default). If you switch the action to **Apply labels** instead, the detection type is unused. - -- **Flaky monitors** catch tests that fail intermittently (e.g., 20–50% failure rate). These are typically caused by timing issues, shared state, or non-deterministic behavior. -- **Broken monitors** catch tests that fail consistently at a high rate (e.g., 80%+ failure rate). These usually indicate a real regression — something in the code or environment is genuinely broken and needs a fix. - -The detection type is set at creation and cannot be changed afterward. If you need to switch a monitor's type, create a new monitor with the desired type and disable the old one. - -This distinction matters because the two problems call for different responses. Flaky tests might be quarantined while you investigate the root cause. Broken tests represent real failures that should be fixed, not hidden. +You can create multiple failure rate monitors with different configurations. This is how you tailor detection to different branches, test volumes, and sensitivity levels. ## How It Works -The monitor periodically calculates the failure rate for each test within a time window you define. If the rate meets or exceeds your activation threshold and the test has enough runs to be statistically meaningful, the test is flagged as flaky or broken depending on the monitor's detection type. +The monitor periodically calculates the failure rate for each test within a time window you define. 
If the rate meets or exceeds your activation threshold and the test has enough runs to be statistically meaningful, the monitor activates on the test and runs its configured [action](#action) — by default, flagging the test as flaky or broken. ### Example @@ -53,10 +42,6 @@ stale timeout, and branch scope. Capture it with realistic example values filled in (e.g., "Broken on main", Broken detection type, 80% activation, 60% resolution, 6 hour window, 50 min sample, main branch). --> -### Detection Type - -Choose **Flaky** or **Broken**. This determines the status a test receives when the monitor flags it. See [Detection Type](#detection-type) above for guidance on which to use. - ### Activation Threshold The failure rate that triggers detection, expressed as a percentage. A test is flagged when its failure rate meets or exceeds this value within the time window. @@ -163,6 +148,21 @@ Show the branch pattern input with a few patterns entered (e.g., `main` and `release/*`), ideally showing the tag/chip-style UI for each pattern. --> +### Action + +What happens when the monitor activates on a test. You pick the action at creation and can switch it at any time. + +#### Classify test status (default) + +The test's status is set according to the monitor's **detection type**, and restored to healthy when the monitor resolves. The detection type is either: + +* **Flaky** — for tests that fail intermittently (e.g., 20–50% failure rate). These are typically caused by timing issues, shared state, or non-deterministic behavior. Flaky tests are often quarantined while you investigate the root cause. +* **Broken** — for tests that fail consistently at a high rate (e.g., 80%+ failure rate). These usually indicate a real regression — something in the code or environment is genuinely broken and needs a fix. Broken tests represent real failures that should be fixed, not hidden. + +#### Apply labels + +The configured labels are added to the test while the monitor is active. The test's health status is not changed by this monitor. See [Automatic labeling from monitors](../management/test-labels.md#automatic-labeling-from-monitors) for how to configure and what to expect. + ## Preview Panel When creating or editing a failure rate monitor, a preview panel shows which tests the current configuration would flag based on recent data. The panel is split into two sections: **Current** and **Proposed**. @@ -182,20 +182,6 @@ When a filter is active, the info tooltip shows "X of Y tests" to indicate how m The status filter applies to the **Proposed** section. The not-in-window count in the Current section reflects the full unfiltered result set and is not affected by the filter. -## Resolution Behavior - -A flagged test resolves in one of two ways: - -**Healthy recovery:** The test's failure rate drops below the resolution threshold (or activation threshold, if no resolution threshold is set) and it still has enough runs to meet the minimum sample size. This means the test is actively running and has improved. - -**Stale recovery:** If a stale timeout is configured and the test has no runs on matching branches within that period, it resolves as stale. This is an automatic cleanup mechanism, not an indication that the test has improved. - -Tests that are still running but haven't accumulated enough runs to meet the minimum sample size remain in their current state. They won't be resolved until there's enough data to make a determination. 
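Putting the failure rate thresholds together (activation threshold, optional resolution threshold, minimum sample size), a rough sketch follows; the function and parameter names are invented for illustration, not Trunk's implementation:

```python
def evaluate_rate(runs: int, failures: int, currently_active: bool,
                  activation_pct: float, resolution_pct: float | None,
                  min_sample: int) -> bool:
    """Return whether a failure rate monitor is active on this test."""
    if runs < min_sample:
        return currently_active              # too little data: state unchanged
    rate = 100.0 * failures / runs
    if not currently_active:
        return rate >= activation_pct        # activation threshold
    # While active, resolution uses its own (typically lower) threshold,
    # falling back to the activation threshold when none is configured.
    return rate >= (resolution_pct if resolution_pct is not None else activation_pct)

# 80% activation, 60% resolution, minimum 50 runs in the window:
assert evaluate_rate(100, 85, False, 80, 60, 50)      # activates at 85%
assert evaluate_rate(100, 70, True, 80, 60, 50)       # 70% >= 60%: stays active
assert not evaluate_rate(100, 50, True, 80, 60, 50)   # drops below 60%: resolves
assert evaluate_rate(30, 5, True, 80, 60, 50)         # too few runs: stays flagged
```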
- -## Action - -By default, an active failure rate monitor classifies the test according to its [detection type](#detection-type) (flaky or broken) and restores it to healthy on resolution. You can switch the monitor's action to **Apply labels** instead — see [Automatic labeling from monitors](../management/test-labels.md#automatic-labeling-from-monitors). - ## Muting You can temporarily mute a failure rate monitor for a specific test case. See [Muting monitors](README.md#muting-monitors) for details. diff --git a/flaky-tests/detection/pass-on-retry-monitor.md b/flaky-tests/detection/pass-on-retry-monitor.md index 1258e4ec..c7aee0c5 100644 --- a/flaky-tests/detection/pass-on-retry-monitor.md +++ b/flaky-tests/detection/pass-on-retry-monitor.md @@ -37,6 +37,7 @@ default 7-day recovery period visible. --> | **Enabled** | Whether the monitor is active | On | | **Recovery days** | Days without pass-on-retry behavior before a test is resolved as healthy. Range: 1 to 15 days. | 7 | | **Branch scope** | Which branches the monitor evaluates. Accepts branch names and glob patterns. | All branches (`*`) | +| **Action** | What happens when the monitor activates. Either classify the test as flaky or apply labels. | Classify as flaky | ### What Recovery Days Controls @@ -56,9 +57,12 @@ Branch scope uses the same glob syntax as [failure rate monitor branch patterns] Changes to branch scope take effect for newly detected events. Previously detected flaky tests are not re-evaluated. -## Action +### Action -By default, an active pass-on-retry monitor classifies the test as **flaky** and restores it to healthy on resolution. You can switch the monitor's action to **Apply labels** instead — see [Automatic labeling from monitors](../management/test-labels.md#automatic-labeling-from-monitors). +You pick the action at creation and can switch it at any time. + +* **Classify test status** (default) — flags the test as **flaky** while the monitor is active and restores it to healthy when the monitor resolves. Pass-on-retry only classifies as flaky; there is no broken option. +* **Apply labels** — adds the configured labels to the test while the monitor is active. The test's health status is not changed by this monitor. See [Automatic labeling from monitors](../management/test-labels.md#automatic-labeling-from-monitors). ## When Detection Happens From 38f94ad9f6d82e7c5591bc6a0244f74e29068d5a Mon Sep 17 00:00:00 2001 From: Max Cruz Date: Thu, 14 May 2026 13:51:14 -0400 Subject: [PATCH 4/4] add section on dry-running --- flaky-tests/detection/README.md | 13 +++++++++++++ flaky-tests/management/test-labels.md | 2 +- 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/flaky-tests/detection/README.md b/flaky-tests/detection/README.md index 344eb919..a64e0140 100644 --- a/flaky-tests/detection/README.md +++ b/flaky-tests/detection/README.md @@ -42,6 +42,19 @@ The [failure count monitor](failure-count-monitor.md) complements failure rate m If you need to manually flag a test that automated monitors haven't caught, use [Flag as Flaky](flag-as-flaky.md) from the test detail page. +## Dry-Running with Labels + +You can preview how a new classifying monitor would behave by deploying it as a labeling monitor first. Because **Apply labels** attaches labels without changing health status, you can let the monitor run on live test data, see which tests it activates on, refine the settings, and only flip it to **Classify test status** once you trust the configuration. + +The flow is typically: + +1. 
Create the monitor with **Apply labels** and a dedicated label (e.g., `would-be-flaky`). +2. Let the monitor run for a few cycles and observe which tests pick up the label. +3. Refine the settings until the labeled set matches what you want classified. +4. Switch the monitor's action to **Classify test status**. + +The Preview Panel on each monitor config form shows a static snapshot at configuration time, but a label dry-run validates the monitor against live runs without committing to a status change. + ## Branch-Aware Detection Tests often behave differently depending on where they run. Failures on `main` are usually unexpected and signal flakiness. Failures on PR branches may be expected during active development. Merge queue failures are suspicious because the code has already passed PR checks. diff --git a/flaky-tests/management/test-labels.md b/flaky-tests/management/test-labels.md index a683dbee..b4bd11fe 100644 --- a/flaky-tests/management/test-labels.md +++ b/flaky-tests/management/test-labels.md @@ -32,7 +32,7 @@ On the tests list, you can filter the table down to test cases that have a parti ### Automatic labeling from monitors -The [pass-on-retry](../detection/pass-on-retry-monitor.md), [failure rate](../detection/failure-rate-monitor.md), and [failure count](../detection/failure-count-monitor.md) monitors can be configured to apply one or more labels to a test instead of classifying it as flaky or broken. Use this when you want a monitor to surface a pattern (for example, _fails on retry on PR branches_) for triage or filtering without changing the test's health status. +The [pass-on-retry](../detection/pass-on-retry-monitor.md), [failure rate](../detection/failure-rate-monitor.md), and [failure count](../detection/failure-count-monitor.md) monitors can be configured to apply one or more labels to a test instead of classifying it as flaky or broken. Use this when you want a monitor to surface a pattern (for example, _fails on retry on PR branches_) for triage or filtering without changing the test's health status. The same setup also works as a [dry-run](../detection/README.md#dry-running-with-labels) while you tune a new monitor before flipping it to classify. #### Choose the monitor's action