40 changes: 17 additions & 23 deletions flaky-tests/detection/failure-count-monitor.mdx
@@ -16,18 +16,9 @@ Use the failure count monitor when you want immediate visibility into test failu

If you need to detect patterns of intermittent failure over time (e.g., a test that fails 20% of the time), use a [failure rate monitor](./failure-rate-monitor) instead. If you want to catch tests that fail and then pass on retry within a single commit, [pass-on-retry](./pass-on-retry-monitor) handles that automatically.

## Detection Type

Each failure count monitor has a **detection type** — either **flaky** or **broken** — which controls what status a test receives when the monitor flags it:

- **Flaky monitors** are appropriate when failures on the monitored branch are likely non-deterministic. A test that fails once on `main` but passes on retry is probably flaky.
- **Broken monitors** are appropriate when failures indicate a real regression. If a test fails on `main` and you expect it to keep failing until someone fixes it, broken is the right classification.

The detection type is set at creation and cannot be changed afterward. If you need to switch a monitor's type, create a new monitor with the desired type and disable the old one.

## How It Works

The monitor counts the number of test failures on configured branches within a rolling time window. When a test reaches the configured failure count, it is flagged.
The monitor counts the number of test failures on configured branches within a rolling time window. When a test reaches the configured failure count, the monitor activates and runs its configured [action](#action) — by default, flagging the test as flaky or broken.
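The counting step above can be sketched as a rolling-window check. This is a minimal illustration rather than the actual implementation; the 6-hour window and threshold of 3 are invented example values:

```python
from datetime import datetime, timedelta

# Illustrative configuration values, not product defaults.
WINDOW = timedelta(hours=6)  # rolling time window
THRESHOLD = 3                # failure count that activates the monitor

def count_monitor_activates(failure_times: list[datetime], now: datetime) -> bool:
    """Return True when enough failures fall inside the rolling window."""
    recent = [t for t in failure_times if now - t <= WINDOW]
    return len(recent) >= THRESHOLD
```

For example, failures at 9:00, 10:30, and 14:00 all fall within a 6-hour window ending at 14:05, so the check returns true.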

### Example

@@ -70,7 +61,7 @@ The window should be long enough to capture the failures you care about but shor

### Resolution Timeout

How long a flagged test must go without any new failures before it is automatically resolved. This is the only way a failure count monitor resolves. There is no "recovery rate" or sample-based resolution like the failure rate monitor.
How long a flagged test must go without any new failures before it is automatically resolved. This is the only way a failure count monitor resolves — there is no "recovery rate" or sample-based resolution like the failure rate monitor, and no stale timeout. If a test stops running entirely (e.g., it was deleted or renamed), it stays flagged until the resolution timeout elapses from its last observed failure.

For example, with a resolution timeout of 2 hours, a test that was flagged at 3:00 PM will resolve at 5:00 PM if no new failures occur. If a new failure arrives at 4:30 PM, the clock resets, and the test will not resolve until 6:30 PM.
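The clock-reset behavior reduces to "resolution happens one full timeout after the most recent failure." A minimal sketch of that rule, where the 2-hour timeout is an example value:

```python
from datetime import datetime, timedelta

RESOLUTION_TIMEOUT = timedelta(hours=2)  # example value, not a default

def resolves_at(last_failure: datetime) -> datetime:
    """A new failure resets the clock, so the resolution time is always
    one full timeout after the most recent observed failure."""
    return last_failure + RESOLUTION_TIMEOUT
```

With the worked example above: `resolves_at` of a 3:00 PM failure gives 5:00 PM, and a fresh failure at 4:30 PM pushes resolution to 6:30 PM.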

@@ -84,13 +75,20 @@ Which branches the monitor evaluates. You can specify branch names or glob patte

Branch patterns work the same way as [failure rate monitor branch patterns](./failure-rate-monitor#branch-pattern-syntax), including glob syntax and merge queue patterns. Refer to that section for pattern syntax, examples, and tips.
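Glob-style branch matching of this kind can be approximated with Python's `fnmatch`. This is a sketch for intuition only; the product's actual matcher (including merge queue patterns) may differ on edge cases:

```python
from fnmatch import fnmatch

def branch_matches(branch: str, patterns: list[str]) -> bool:
    """True when the branch name matches any configured glob pattern."""
    return any(fnmatch(branch, pattern) for pattern in patterns)
```

For example, `branch_matches("release/1.2", ["main", "release/*"])` returns `True`, while a feature branch matches neither pattern.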

## Resolution Behavior
### Action

What happens when the monitor activates on a test. You pick the action at creation and can switch it at any time.

A failure count monitor resolves in one way: **the test stops failing for long enough.**
#### Classify test status (default)

When the configured resolution timeout elapses without a new failure on any monitored branch, the test is resolved as healthy. There is no rate-based recovery and no stale timeout. If a test stops running entirely (e.g., it was deleted or renamed), it remains in its flagged state until the resolution timeout passes from the last observed failure.
The test's status is set according to the monitor's **detection type**, and restored to healthy when the monitor resolves. The detection type is either:

This time-based approach means you don't need to wait for enough passing runs to bring a failure rate down. Once the test is quiet, it resolves.
* **Flaky** — appropriate when failures on the monitored branch are likely non-deterministic. A test that fails once on `main` but passes on retry is probably flaky.
* **Broken** — appropriate when failures indicate a real regression. If a test fails on `main` and you expect it to keep failing until someone fixes it, broken is the right classification.

#### Apply labels

The configured labels are added to the test while the monitor is active. This action does not change the test's health status. See [Automatic labeling from monitors](../management/test-labels#automatic-labeling-from-monitors) for configuration details and expected behavior.

## Preview Panel

@@ -100,14 +98,6 @@ Once the monitor configuration produces detections, the panel shows a **Failing

You can search the list by test name or parent test name. The search is case-insensitive and filters as you type. If no tests match your search term, the list shows a "No tests match" message. When more than 100 tests are detected, only the first 100 are shown with a notice to narrow your search.

## Muting

You can temporarily mute a failure count monitor for a specific test case. See [Muting monitors](./index#muting-monitors) for details.

## Preview Panel

When creating or editing a failure count monitor, a preview panel shows which tests the current configuration would flag based on recent data.

### Status Filter

A **status filter dropdown** in the preview panel lets you filter the test list to any combination of statuses: **Healthy**, **Flaky**, and **Broken**. By default, all statuses are shown.
@@ -124,6 +114,10 @@ If no tests match the active filter, the empty state includes a hint to clear th

For repositories with a large number of matching tests, preview results may be truncated. When this happens, an amber warning appears in the panel. The truncation applies to the list of tests shown, not to the underlying detection logic — the monitor evaluates all matching tests when active.

## Muting

You can temporarily mute a failure count monitor for a specific test case. See [Muting monitors](./index#muting-monitors) for details.

## Choosing Between Monitors

| Scenario | Recommended monitor |
44 changes: 17 additions & 27 deletions flaky-tests/detection/failure-rate-monitor.mdx
@@ -4,22 +4,11 @@ description: "Detect flaky or broken tests based on failure rate over a configur
---
The failure rate monitor detects tests based on failure rate over a rolling time window. Unlike pass-on-retry, which looks for a specific pattern on a single commit, the failure rate monitor identifies tests that fail too often over a period of time, even if no individual failure looks like a retry.

You can create multiple failure rate monitors with different configurations. This is how you tailor detection to different branches, test volumes, sensitivity levels, and detection types.

## Detection Type

Each failure rate monitor has a **detection type** — either **flaky** or **broken** — which controls what status a test receives when the monitor flags it:

- **Flaky monitors** catch tests that fail intermittently (e.g., 20–50% failure rate). These are typically caused by timing issues, shared state, or non-deterministic behavior.
- **Broken monitors** catch tests that fail consistently at a high rate (e.g., 80%+ failure rate). These usually indicate a real regression — something in the code or environment is genuinely broken and needs a fix.

The detection type is set at creation and cannot be changed afterward. If you need to switch a monitor's type, create a new monitor with the desired type and disable the old one.

This distinction matters because the two problems call for different responses. Flaky tests might be quarantined while you investigate the root cause. Broken tests represent real failures that should be fixed, not hidden.
You can create multiple failure rate monitors with different configurations. This is how you tailor detection to different branches, test volumes, and sensitivity levels.

## How It Works

The monitor periodically calculates the failure rate for each test within a time window you define. If the rate meets or exceeds your activation threshold and the test has enough runs to be statistically meaningful, the test is flagged as flaky or broken depending on the monitor's detection type.
The monitor periodically calculates the failure rate for each test within a time window you define. If the rate meets or exceeds your activation threshold and the test has enough runs to be statistically meaningful, the monitor activates on the test and runs its configured [action](#action) — by default, flagging the test as flaky or broken.
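The activation check combines two conditions: a minimum sample size and a rate threshold. A minimal sketch with invented example values (80% activation rate, 50-run minimum sample):

```python
# Illustrative thresholds, not product defaults.
ACTIVATION_RATE = 0.80  # failure rate that triggers detection
MIN_SAMPLE = 50         # minimum runs in the window to be meaningful

def rate_monitor_activates(failures: int, runs: int) -> bool:
    """Activate only when the window holds enough runs and the
    failure rate meets the activation threshold."""
    if runs < MIN_SAMPLE:
        return False  # not enough data to make a determination
    return failures / runs >= ACTIVATION_RATE
```

Note that a test with too few runs is never flagged, no matter how high its rate; this is why the minimum sample size matters for noisy, low-volume tests.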

### Example

@@ -51,10 +40,6 @@ stale timeout, and branch scope. Capture it with realistic example values filled
in (e.g., "Broken on main", Broken detection type, 80% activation, 60% resolution,
6 hour window, 50 min sample, main branch). --> */}

### Detection Type

Choose **Flaky** or **Broken**. This determines the status a test receives when the monitor flags it. See [Detection Type](#detection-type) above for guidance on which to use.

### Activation Threshold

The failure rate that triggers detection, expressed as a percentage. A test is flagged when its failure rate meets or exceeds this value within the time window.
@@ -161,6 +146,21 @@ Show the branch pattern input with a few patterns entered (e.g.,
`main` and `release/*`), ideally showing the tag/chip-style UI for
each pattern. --> */}

### Action

What happens when the monitor activates on a test. You pick the action at creation and can switch it at any time.

#### Classify test status (default)

The test's status is set according to the monitor's **detection type**, and restored to healthy when the monitor resolves. The detection type is either:

* **Flaky** — for tests that fail intermittently (e.g., 20–50% failure rate). These are typically caused by timing issues, shared state, or non-deterministic behavior. Flaky tests are often quarantined while you investigate the root cause.
* **Broken** — for tests that fail consistently at a high rate (e.g., 80%+ failure rate). These usually indicate a real regression — something in the code or environment is genuinely broken and needs a fix. Broken tests represent real failures that should be fixed, not hidden.

#### Apply labels

The configured labels are added to the test while the monitor is active. This action does not change the test's health status. See [Automatic labeling from monitors](../management/test-labels#automatic-labeling-from-monitors) for configuration details and expected behavior.

## Preview Panel

When creating or editing a failure rate monitor, a preview panel shows which tests the current configuration would flag based on recent data. The panel is split into two sections: **Current** and **Proposed**.
@@ -180,16 +180,6 @@ When a filter is active, the info tooltip shows "X of Y tests" to indicate how m

The status filter applies to the **Proposed** section. The not-in-window count in the Current section reflects the full unfiltered result set and is not affected by the filter.

## Resolution Behavior

A flagged test resolves in one of two ways:

**Healthy recovery:** The test's failure rate drops below the resolution threshold (or activation threshold, if no resolution threshold is set) and it still has enough runs to meet the minimum sample size. This means the test is actively running and has improved.

**Stale recovery:** If a stale timeout is configured and the test has no runs on matching branches within that period, it resolves as stale. This is an automatic cleanup mechanism, not an indication that the test has improved.

Tests that are still running but haven't accumulated enough runs to meet the minimum sample size remain in their current state. They won't be resolved until there's enough data to make a determination.

## Muting

You can temporarily mute a failure rate monitor for a specific test case. See [Muting monitors](./index#muting-monitors) for details.