Flaky test report: committed-code failures on 2026-05-17 #268

@andrross

Summary

4 distinct test failures were observed against committed code (Timer/Post Merge Action builds on main) in the 24 hours ending 2026-05-17. None of the failures reproduced locally with the original seed, indicating they are all timing/environment-sensitive flakes rather than deterministic bugs.

Failing Tests

1. MixedClusterClientYamlTestSuiteIT.test {p0=cluster.health/10_basic/cluster health with closed index}

| Field | Value |
| --- | --- |
| Build | 77225 |
| Trigger | Timer (main) |
| Seed | D13505BF7CE3BF5B:59613A65D21FD2A3 |
| Reproduced locally | No |
| First failure | 2024-03-25 |
| Total unique builds affected | 212 |
| Module | qa/mixed-cluster |

Error: cluster.health API returned 408 Request Timeout; cluster status was red with 51 unassigned shards during BWC rolling upgrade.

Pattern: Chronic flake since March 2024. Peaked at 58 builds in Sep 2024, then stabilized at 1-14 builds/month. Recent uptick: 10 builds in Apr 2026, 12 in May 2026 (partial month). Likely worsened by the mid-April 2026 runner migration to m7a.8xlarge (faster CPUs may cause the rolling upgrade to proceed before shards finish allocating). Trend: stable/slightly worsening.


2. SearchRestCancellationIT.testAutomaticCancellationMultiSearchDuringFetchPhase

| Field | Value |
| --- | --- |
| Build | 77182 |
| Trigger | Post Merge Action (main) |
| Seed | 220320CBD3478DB4 |
| Reproduced locally | No |
| First failure | 2024-03-26 |
| Total unique builds affected | 166 |
| Module | qa/smoke-test-http |

Error: AssertionError in ensureSearchTaskIsCancelled: assertBusy timed out waiting for the search task to be marked cancelled.

Pattern: Chronic flake since March 2024. Had a major spike in Nov 2025 (42 builds). After calming down in early 2026, it is rising again: 14 builds in Apr 2026, 22 in May 2026 (partial month). Trend: worsening.
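
To illustrate why a busy-wait assertion like this is sensitive to runner timing, here is a minimal sketch of the general pattern: retry a check with backoff until a time budget runs out, then rethrow the last failure. The class `BusyAssertSketch`, its `assertBusy` method, and the simulated "cancelled" flag are all hypothetical stand-ins written for this report; they are not the OpenSearch test framework's implementation and do not reproduce the actual test.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a busy-wait assertion; not OpenSearch's own code.
class BusyAssertSketch {

    // Re-runs `check` with capped exponential backoff until it stops throwing
    // AssertionError or the time budget is spent, then rethrows the last failure.
    static void assertBusy(Runnable check, long maxWait, TimeUnit unit) {
        final long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                check.run();
                return; // the condition finally held
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // budget exhausted: this is the "timed out" failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
    }

    // Simulates a task whose "cancelled" flag flips after delayMillis, then
    // reports whether the assertion passed within timeoutMillis.
    static boolean cancelObservedWithin(long delayMillis, long timeoutMillis) {
        AtomicBoolean cancelled = new AtomicBoolean(false);
        Thread canceller = new Thread(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException ignored) {
                return;
            }
            cancelled.set(true);
        });
        canceller.setDaemon(true);
        canceller.start();
        try {
            assertBusy(() -> {
                if (!cancelled.get()) {
                    throw new AssertionError("task not yet marked cancelled");
                }
            }, timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (AssertionError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Fast flag, generous budget: the assertion is expected to pass.
        System.out.println(cancelObservedWithin(20, 2000));
        // Slow flag, tight budget: the assertion is expected to time out.
        System.out.println(cancelObservedWithin(2000, 20));
    }
}
```

The same check passes or fails depending only on how long the background work takes relative to the budget, which is why a fixed seed does not make the outcome repeatable.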


3. WarmIndexSegmentReplicationIT.testIndexReopenClose

| Field | Value |
| --- | --- |
| Build | 77216 |
| Trigger | Post Merge Action (main) |
| Seed | 27483C47C34980DA:C0927DDC1B8C8A14 |
| Reproduced locally | No |
| First failure | 2025-03-11 |
| Total unique builds affected | 15 |
| Module | server (internalClusterTest) |

Error: Expected: a value equal to or greater than <4L> but: <0L> was less than <4L> (waitForDocs timed out; 0 docs indexed when 4 were expected after index reopen).

Pattern: Low-frequency flake since March 2025. Sporadic: 4 builds in Mar 2025, 4 in Aug 2025, 3 in Feb 2026, 2 in May 2026. Never more than 4 builds in a single month. Trend: stable (low-rate).


4. FlightClientChannelTests.testErrorInInterimBatchFromServer

| Field | Value |
| --- | --- |
| Build | 77196 |
| Trigger | Timer (main) |
| Seed | 7612269473A008A2:D4B6774589E81D0D |
| Reproduced locally | No |
| First failure | 2025-07-03 |
| Total unique builds affected | 13 |
| Module | plugins/arrow-flight-rpc |

Error: BindTransportException: Failed to bind to port 31001 (Address already in use) — hardcoded port conflict in test setup.

Pattern: Low-frequency flake since July 2025. Peaked at 6 builds in its first month, then 0-2 builds/month. Purely environmental (port conflict on CI runner). Trend: stable (low-rate).
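
One common way to avoid this class of conflict is to bind to port 0 and let the operating system assign a free port, instead of using a fixed number. The sketch below shows the idea with plain `java.net.ServerSocket`. The class `EphemeralPortSketch` and its methods are hypothetical, and the sketch makes no claim about how FlightClientChannelTests or the arrow-flight-rpc plugin configure their transport.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Hypothetical helper for illustration; not part of the OpenSearch codebase.
class EphemeralPortSketch {

    // Port 0 asks the OS for any currently free port, unlike a fixed port
    // such as 31001 that another process on the runner may already hold.
    static ServerSocket bindToFreePort() {
        try {
            return new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Binds two sockets at the same time and returns the ports they received.
    static int[] bindTwice() {
        try (ServerSocket first = bindToFreePort();
             ServerSocket second = bindToFreePort()) {
            return new int[] { first.getLocalPort(), second.getLocalPort() };
        } catch (IOException e) { // may be thrown when the sockets are closed
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] ports = bindTwice();
        System.out.println("first=" + ports[0] + " second=" + ports[1]);
    }
}
```

The socket is kept open while its port number is read. Picking a free port, closing the socket, and binding again later would reintroduce a race with other processes on the same runner.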


Summary Table

| Test | Builds Affected | First Seen | Trend | Reproduced |
| --- | --- | --- | --- | --- |
| MixedClusterClientYamlTestSuiteIT (cluster.health/closed index) | 212 | 2024-03-25 | Stable/slightly worsening | No |
| SearchRestCancellationIT.testAutomaticCancellationMultiSearchDuringFetchPhase | 166 | 2024-03-26 | Worsening | No |
| WarmIndexSegmentReplicationIT.testIndexReopenClose | 15 | 2025-03-11 | Stable (low-rate) | No |
| FlightClientChannelTests.testErrorInInterimBatchFromServer | 13 | 2025-07-03 | Stable (low-rate) | No |

Reproduction Details

All tests were run locally with their CI seeds on the current main branch. None failed, which is consistent with non-deterministic (timing/environment-dependent) failures. The seeds control randomized parameters but not thread scheduling, network timing, port availability, or shard allocation timing, all of which are the likely failure triggers here.
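
The difference between what a seed fixes and what it leaves open can be shown with a small sketch. It uses `java.util.Random` as a stand-in for the test framework's randomness, which is an assumption: the real framework derives its values from the seed string differently. The class `SeedScopeSketch` and its `draw` method are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration; java.util.Random stands in for the test
// framework's seeded randomness and does not replicate its seed handling.
class SeedScopeSketch {

    // Draws `count` pseudo-random values from a generator seeded with the
    // given 16-digit hex seed (a single seed component, without the ':' part).
    static int[] draw(String hexSeed, int count) {
        long seed = Long.parseUnsignedLong(hexSeed, 16);
        Random random = new Random(seed);
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = random.nextInt(1000);
        }
        return values;
    }

    public static void main(String[] args) {
        int[] ciRun = draw("220320CBD3478DB4", 5);
        int[] localRun = draw("220320CBD3478DB4", 5);
        // Same seed, same randomized inputs on both machines.
        System.out.println(Arrays.equals(ciRun, localRun)); // prints "true"
        // Nothing here governs thread scheduling, port availability, or
        // network latency, so those can still differ between the two runs.
    }
}
```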

Notes

  • The April 2026 CI runner migration from m5.8xlarge to m7a.8xlarge may be amplifying timing-sensitive failures (particularly the MixedClusterClientYamlTestSuiteIT and SearchRestCancellationIT tests).
  • The FlightClientChannelTests failure is purely a port-binding conflict (hardcoded port 31001) and is unrelated to test logic.
