Flaky test report: committed-code failures on 2026-05-18 #269

@andrross

Summary

10 distinct tests failed against committed code (Timer and Post Merge Action builds) in the 24-hour window ending 2026-05-18T10:00 UTC. None of the 9 that could be run locally reproduced with its original seed on a dev machine, which is consistent with timing-dependent flakes rather than deterministic regressions.

Reproduction Attempts

Tests were run locally on the current main branch with the exact seed from the failing CI build. 0 of the 9 attempted tests reproduced. The remaining test (MixedClusterClientYamlTestSuiteIT) could not be attempted locally because it requires a multi-version cluster.
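A minimal sketch of how a seed-pinned rerun command can be assembled, written in Python for brevity. `-Dtests.seed` and `-Dtests.iters` are the randomized-testing properties and `--tests` is Gradle's standard test filter. The Gradle task path, the wildcard filter, and the iteration count below are assumptions for illustration; take the real values from the failing build's log.

```python
import shlex


def repro_command(task, test_filter, seed, iters=None):
    """Build a Gradle command that reruns one test with a pinned seed."""
    cmd = ["./gradlew", task, "--tests", test_filter, f"-Dtests.seed={seed}"]
    if iters is not None:
        # Repeating the test gives a timing-dependent failure more chances to occur.
        cmd.append(f"-Dtests.iters={iters}")
    return cmd


cmd = repro_command(
    task=":server:internalClusterTest",  # assumed task path, not taken from the report
    test_filter="*FullRollingRestartIT.testFullRollingRestart",
    seed="A76DB14919A7F493",  # seed of build 77274 from this report
    iters=20,  # arbitrary repeat count
)
print(shlex.join(cmd))
```

A single seeded run that passes locally does not rule out a flake, so repeating the run is usually more informative than one attempt.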

Failing Tests

Summary Table (sorted by total unique builds affected)

| # | Test | Builds Affected | First Seen | Pattern | Recent Build |
|---|------|-----------------|------------|---------|--------------|
| 1 | FullRollingRestartIT.testFullRollingRestart | 252 | 2024-10-11 | Worsening (spike Jul 2025, resurgence Feb 2026+) | 77274 |
| 2 | SearchRestCancellationIT.testAutomaticCancellationDuringFetchPhase | 199 | 2024-04-04 | Worsening (chronic, major spike Nov 2025, Apr-May 2026) | 77265 |
| 3 | MixedClusterClientYamlTestSuiteIT.test {cluster.health/10_basic/...} | 182 | 2024-03-25 | Stable/chronic (massive spike Sep 2024, steady low rate since, uptick Apr 2026) | 77225 |
| 4 | ConcurrentSeqNoVersioningIT.testSeqNoCASLinearizability | 116 | 2024-10-03 | Worsening (major spike Apr 2026: 35 builds, likely CPU-speed amplification from m7a.8xlarge migration) | 77306 |
| 5 | FlightMetricsTests.testComprehensiveMetrics | 74 | 2025-07-25 | Stable/chronic (~5-11 builds/month consistently since introduction) | 77275 |
| 6 | ClusterShardLimitIT.testOpenIndexOverLimit | 46 | 2025-10-15 | Stable (~5-9 builds/month since first appearance) | 77278 |
| 7 | WarmIndexSegmentReplicationIT.testShardPathDeletionWhenWarmIndexRelocate | 18 | 2025-06-23 | Worsening (spike in Apr 2026: 7 builds, up from 0-2/month) | 77235 |
| 8 | WarmIndexSegmentReplicationIT.testIndexReopenClose | 15 | 2025-03-11 | Intermittent (clusters in Mar 2025, Aug 2025, Feb 2026) | 77216 |
| 9 | FlightClientChannelTests.testSetMessageListenerTwice | 12 | 2025-08-21 | Stable/low (~1-3 builds/month) | 77276 |
| 10 | ShardIndexingPressureSettingsIT.testShardIndexingPressureNodeLimitUpdateSetting | 5 | 2024-06-03 | Rare (long dormant period, reappeared Feb 2026) | 77232 |

Detailed Findings

1. FullRollingRestartIT.testFullRollingRestart

  • Build: 77274 (Timer, main)
  • Error: java.lang.AssertionError: replica shards haven't caught up with primary expected:<18> but was:<13>
  • Seed: A76DB14919A7F493
  • Reproduced locally: No
  • First seen: 2024-10-11
  • Total builds affected: 252
  • Pattern: Worsening. Was dormant Nov 2024–Jun 2025, then exploded in Jul 2025 (105 builds). Quieted Sep 2025–Jan 2026, then resurgent Feb 2026 onward (19 builds in May 2026 alone). The segment replication variant is particularly sensitive to timing.
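The failing assertion compares replica and primary document counts, so its outcome depends on when the check runs relative to replication. The sketch below shows a bounded polling check of that kind, written in Python for brevity. `wait_until`, `SimulatedReplica`, and the timeout values are hypothetical names and numbers for illustration; they do not reproduce the test's actual helper code. The counts 13 and 18 are the ones from the error message.

```python
import time


def wait_until(predicate, timeout_s=10.0, interval_s=0.05):
    """Poll predicate until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


class SimulatedReplica:
    """Stand-in for a replica that receives one more document per poll."""

    def __init__(self, start, target):
        self.docs = start
        self.target = target

    def caught_up(self):
        if self.docs < self.target:
            self.docs += 1
        return self.docs == self.target


replica = SimulatedReplica(start=13, target=18)
ok = wait_until(replica.caught_up, timeout_s=2.0, interval_s=0.01)
print(ok, replica.docs)  # True 18
```

A one-shot check taken while the replica is at 13 documents would fail, whereas the bounded poll passes as long as replication finishes before the deadline. Runner speed decides which side of the deadline a given run lands on.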

2. SearchRestCancellationIT.testAutomaticCancellationDuringFetchPhase

  • Build: 77265 (Timer, main)
  • Error: java.lang.AssertionError at SearchRestCancellationIT.lambda$ensureSearchTaskIsCancelled$0
  • Seed: 398A6DBD80D20844
  • Reproduced locally: No
  • First seen: 2024-04-04
  • Total builds affected: 199
  • Pattern: Chronic and worsening. Has been failing since Apr 2024. Major spike in Nov 2025 (41 builds). Currently at 27 builds in May 2026 (18 days in). The cancellation race is highly sensitive to thread scheduling.
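To put the partial-month figure on the same footing as the Nov 2025 spike, the count can be extrapolated to a full month. This is a rough sketch that assumes failures arrive at a constant daily rate and treats the reporting window as 18 full days. `projected_month_total` is an illustrative helper, not part of any tooling referenced in this report.

```python
import calendar
from datetime import date


def projected_month_total(builds_so_far, as_of):
    """Linearly extrapolate a month-to-date count to the whole month."""
    days_in_month = calendar.monthrange(as_of.year, as_of.month)[1]
    return builds_so_far / as_of.day * days_in_month


projection = projected_month_total(27, date(2026, 5, 18))
print(projection)  # 46.5
```

At the current rate, May 2026 would end above the 41 builds recorded in Nov 2025.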

3. MixedClusterClientYamlTestSuiteIT.test {cluster.health/10_basic/cluster health with closed index}

  • Build: 77225 (Timer, main)
  • Error: expected [2xx] status code but api [cluster.health] returned [408 Request Timeout] — cluster was red with 51 unassigned shards
  • Seed: D13505BF7CE3BF5B
  • Reproduced locally: Not attempted (requires multi-version cluster with v3.6.1 nodes)
  • First seen: 2024-03-25
  • Total builds affected: 182
  • Pattern: Chronic. Massive spike in Sep 2024 (96 builds), then settled to low single digits per month. Uptick in Apr 2026 (11 builds). The failure is a cluster health timeout during mixed-version rolling upgrade.

4. ConcurrentSeqNoVersioningIT.testSeqNoCASLinearizability

  • Build: 77306 (Timer, 2.19 branch)
  • Error: java.lang.AssertionError: Must be linearizable
  • Seed: 5B36C3D32F3299E4
  • Reproduced locally: No
  • First seen: 2024-10-03
  • Total builds affected: 116
  • Pattern: Worsening. Steady at 1-6 builds/month through early 2026, then jumped to 35 builds in Apr 2026. This timing aligns with the CI runner migration to m7a.8xlarge (~Apr 15, 2026), strongly suggesting CPU-speed amplification of a latent race condition.

5. FlightMetricsTests.testComprehensiveMetrics

  • Build: 77275 (Post Merge Action)
  • Error: BindTransportException: Failed to bind to [/0:0:0:0:0:0:0:1%lo, /127.0.0.1]:PortsRange{portRange='17201'}
  • Seed: EC0F910FC35C666A
  • Reproduced locally: No
  • First seen: 2025-07-25
  • Total builds affected: 74
  • Pattern: Stable/chronic. Consistently 4-11 builds/month since introduction. The port binding failure suggests resource contention on CI runners (port not released from a prior test in time).
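The failure mode described above can be illustrated with plain sockets: binding a fixed port fails while another listener still holds it, whereas requesting port 0 lets the OS pick a free one. This is a sketch in Python using IPv4 loopback only. `bind_listener` is an illustrative helper, and whether the Flight tests can switch to OS-assigned ports is an assumption, not something this report establishes.

```python
import socket


def bind_listener(port, host="127.0.0.1"):
    """Bind and listen on host:port; port 0 asks the OS for any free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


# First "test" takes an OS-assigned port and keeps it open.
first = bind_listener(0)
held_port = first.getsockname()[1]

# Second "test" insists on the same fixed port while it is still held.
try:
    bind_listener(held_port).close()
    collided = False
except OSError:
    collided = True

# Asking for port 0 instead cannot collide with the port that is still held.
second = bind_listener(0)
free_port = second.getsockname()[1]

print(collided, free_port != held_port)
first.close()
second.close()
```

On a typical Linux host the fixed-port bind raises `OSError` (address already in use), which corresponds to the `BindTransportException` seen in CI when a single-port range such as `17201` is still occupied.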

6. ClusterShardLimitIT.testOpenIndexOverLimit

  • Build: 77278 (Timer, main)
  • Error: IllegalStateException: Some shards are still open after the threadpool terminated. Something is leaking index readers or store references.
  • Seed: 972067C756961F1C
  • Reproduced locally: No
  • First seen: 2025-10-15
  • Total builds affected: 46
  • Pattern: Stable. Consistent 5-9 builds/month since first appearance. The shard leak during teardown suggests an async close path that doesn't complete within the shutdown timeout.

7. WarmIndexSegmentReplicationIT.testShardPathDeletionWhenWarmIndexRelocate

  • Build: 77235 (Timer, main)
  • Error: IOException: failed to read (path in temp directory)
  • Seed: 41BCBBC845070506
  • Reproduced locally: No
  • First seen: 2025-06-23
  • Total builds affected: 18
  • Pattern: Worsening. Was 0-2 builds/month, then jumped to 7 in Apr 2026. Likely CPU-speed amplification causing a race between shard relocation and path deletion.

8. WarmIndexSegmentReplicationIT.testIndexReopenClose

  • Build: 77216 (Post Merge Action)
  • Error: AssertionError: Expected: a value equal to or greater than <4L> but: <0L> was less than <4L>
  • Seed: 27483C47C34980DA
  • Reproduced locally: No
  • First seen: 2025-03-11
  • Total builds affected: 15
  • Pattern: Intermittent. Appears in clusters (Mar 2025, Aug 2025, Feb 2026, May 2026) then goes quiet. Likely a race in segment replication state after index close/reopen.

9. FlightClientChannelTests.testSetMessageListenerTwice

  • Build: 77276 (Timer, main)
  • Error: BindTransportException: Failed to bind to [/0:0:0:0:0:0:0:1%lo, /127.0.0.1]:PortsRange{portRange='27501'}
  • Seed: 49CAD98AABBE8C7A
  • Reproduced locally: No
  • First seen: 2025-08-21
  • Total builds affected: 12
  • Pattern: Stable/low. 1-3 builds/month. Same port binding issue as FlightMetricsTests — likely the same root cause (port contention on CI).

10. ShardIndexingPressureSettingsIT.testShardIndexingPressureNodeLimitUpdateSetting

  • Build: 77232 (Timer, main)
  • Error: AssertionError: expected:<23576> but was:<47152> (value is exactly 2x expected)
  • Seed: 8DFF60F5474F9992
  • Reproduced locally: No
  • First seen: 2024-06-03
  • Total builds affected: 5
  • Pattern: Rare. Only 5 builds total over 2 years. Was dormant Jun 2024–Jan 2026, reappeared Feb 2026. The 2x value suggests a double-application of a setting update, likely a race in cluster state application.
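The arithmetic behind the 2x observation, plus a sketch of why the doubling narrows the hypothesis: delivering the same absolute update twice leaves the value unchanged, while delivering an additive update twice doubles it. `Tracker` and its methods are hypothetical and do not model the actual shard indexing pressure code.

```python
EXPECTED = 23576
OBSERVED = 47152
assert OBSERVED == 2 * EXPECTED  # the observed value is exactly double


class Tracker:
    """Hypothetical holder for a value changed by update events."""

    def __init__(self):
        self.value = 0

    def apply_absolute(self, new_value):
        # Idempotent: repeating the same update leaves the value unchanged.
        self.value = new_value

    def apply_additive(self, delta):
        # Not idempotent: repeating the same update accumulates.
        self.value += delta


absolute = Tracker()
additive = Tracker()
for _ in range(2):  # the same update delivered twice
    absolute.apply_absolute(EXPECTED)
    additive.apply_additive(EXPECTED)

print(absolute.value, additive.value)  # 23576 47152
```

If the double-application hypothesis holds, the affected path is more likely one that accumulates than one that overwrites.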

Environmental Notes
