
feat: xdist #110184

Draft
joshuarli wants to merge 14 commits into master from feat-parallel-pytest

@joshuarli
Member

wip

joshuarli and others added 6 commits March 6, 2026 17:07
Each pytest process now auto-acquires an exclusive slot via file locks,
giving it isolated PostgreSQL databases, Redis DBs, and Kafka topics.
This enables safe concurrent test execution without shared-state
collisions.

Key changes:
- New isolation module with slot allocation and per-resource helpers
- Session-scoped ClickHouse reset (reset_snuba) instead of per-test
- Snowflake ID preservation across Redis flushdb to prevent ID reuse
- Unique snowflake_id per worker for non-colliding model creation
- Worker-aware Kafka topics and Relay container names

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
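
The slot allocation described above can be sketched roughly as follows. This is a minimal illustration, assuming POSIX `fcntl.flock` and one lock file per slot; the names `acquire_slot`, `resources_for`, `MAX_SLOTS`, and the resource naming scheme are hypothetical and are not taken from the PR's isolation module.

```python
import fcntl
import os
import tempfile

MAX_SLOTS = 16  # hypothetical upper bound on concurrent pytest processes


def acquire_slot(lock_dir: str, max_slots: int = MAX_SLOTS) -> tuple[int, int]:
    """Return (slot, fd) for the first slot whose lock file is free.

    The caller keeps fd open for the life of the process; the kernel
    releases the lock when the fd is closed or the process exits.
    """
    os.makedirs(lock_dir, exist_ok=True)
    for slot in range(max_slots):
        path = os.path.join(lock_dir, f"slot-{slot}.lock")
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
        try:
            # Non-blocking exclusive lock: fails at once if another
            # open file description already holds this slot.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)
            continue
        return slot, fd
    raise RuntimeError(f"all {max_slots} test slots are in use")


def resources_for(slot: int) -> dict[str, object]:
    """Derive per-slot resource names (the naming scheme is an assumption)."""
    return {
        "postgres_suffix": f"_{slot}",
        "redis_db": slot,
        "kafka_topic_prefix": f"test-{slot}-",
    }


if __name__ == "__main__":
    slot, _fd = acquire_slot(os.path.join(tempfile.gettempdir(), "pytest-slots"))
    print(slot, resources_for(slot))
```

Because the lock is tied to the open file descriptor, a crashed process frees its slot without any cleanup step, which is one plausible reason to prefer file locks over a counter file.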
Wire xdist's PYTEST_XDIST_WORKER env var ("gw0", "gw1", ...) into the
isolation module so each xdist worker gets its own DB/Redis/Kafka slot.

- isolation.py: parse xdist gateway ID to numeric worker slot
- sentry.py: add pytest_xdist_setupnodes for ClickHouse reset and
  DJANGO_SETTINGS_MODULE stripping before workers spawn
- fixtures.py: skip session-scoped reset_snuba on xdist workers
- env.py: recognize PYTEST_XDIST_WORKER for in_test_environment()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
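
Parsing the gateway ID could look something like the sketch below. `PYTEST_XDIST_WORKER` is the environment variable xdist sets in each worker; the function name `xdist_worker_slot` and the choice to raise on unexpected values are assumptions, not the PR's actual code.

```python
import os
import re
from typing import Mapping, Optional

_GATEWAY_RE = re.compile(r"gw(\d+)")


def xdist_worker_slot(environ: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the numeric worker index, or None outside an xdist worker."""
    worker_id = environ.get("PYTEST_XDIST_WORKER")
    if worker_id is None:
        return None  # plain pytest run or the xdist controller
    match = _GATEWAY_RE.fullmatch(worker_id)
    if match is None:
        raise ValueError(f"unexpected xdist worker id: {worker_id!r}")
    return int(match.group(1))
```

For example, `xdist_worker_slot({"PYTEST_XDIST_WORKER": "gw3"})` yields `3`, and an empty mapping yields `None`.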
Scope test assertions to their own org/project IDs so concurrent
workers with shared ClickHouse tables don't see each other's data.
Add unique insert IDs to prevent ClickHouse deduplication across tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
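
To illustrate why scoping matters, here is a toy model in which a plain list stands in for a ClickHouse table shared by all workers. Everything in it (`shared_table`, `insert_event`, `events_for`) is hypothetical, and ClickHouse's own deduplication behavior is not modelled; the sketch only shows that an assertion filtered by project ID is unaffected by rows another worker wrote.

```python
from uuid import uuid4

# Stand-in for a table that every worker writes to.
shared_table: list[dict] = []


def insert_event(project_id: int, message: str) -> dict:
    """Insert a row tagged with a unique insert ID and its owning project."""
    row = {"insert_id": uuid4().hex, "project_id": project_id, "message": message}
    shared_table.append(row)
    return row


def events_for(project_id: int) -> list[dict]:
    """Return only the rows that belong to the given project."""
    return [row for row in shared_table if row["project_id"] == project_id]
```

A test that asserts `len(events_for(my_project_id)) == 1` keeps passing when a concurrent worker inserts rows for a different project, whereas `len(shared_table) == 1` would not.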
@github-actions github-actions bot added the Scope: Frontend and Scope: Backend labels Mar 8, 2026
@github-actions
Contributor

github-actions bot commented Mar 8, 2026

🚨 Warning: This pull request contains Frontend and Backend changes!

Making changes to Sentry's Frontend and Backend in a single pull request is discouraged, because the Frontend and Backend are not deployed atomically. If the changes depend on each other, they must be split into two pull requests and made forward or backward compatible, so that the Backend or Frontend can be safely deployed independently.

Have questions? Please ask in the #discuss-dev-infra channel.

Remove SENTRY_PYTEST_SERIAL and SENTRY_TEST_WORKER_ID env vars.
Every pytest process now unconditionally acquires a file-lock slot,
giving it an isolated Redis DB, PostgreSQL suffix, and Kafka topics.

No configuration needed — works automatically across xdist workers,
plain pytest invocations, and concurrent runs in separate worktrees.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace xdist's default load-balancing scheduler with a deterministic
one that assigns test files to workers via round-robin and preserves
collection order within each file. This prevents test pollution and
flakiness caused by shuffled execution order.

Key design: the xdist worker protocol requires a "shutdown" command
to trigger execution of the last queued test, so the scheduler sends
all work upfront then immediately shuts down each node.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
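
The assignment step can be sketched independently of xdist. The function below is a hypothetical illustration (`assign_by_file` is not a name from the PR, and it does not touch xdist's scheduler API): it groups pytest node IDs by file, hands files to workers round-robin, and keeps collection order within each file.

```python
def assign_by_file(node_ids: list[str], num_workers: int) -> dict[int, list[str]]:
    """Assign whole test files to workers round-robin, preserving order."""
    # Group by the file part of the node ID ("path/test_x.py::test_name").
    # Dicts keep insertion order, so files stay in first-seen order.
    by_file: dict[str, list[str]] = {}
    for node_id in node_ids:
        file_path = node_id.split("::", 1)[0]
        by_file.setdefault(file_path, []).append(node_id)

    assignment: dict[int, list[str]] = {w: [] for w in range(num_workers)}
    for i, tests in enumerate(by_file.values()):
        assignment[i % num_workers].extend(tests)
    return assignment
```

Given the same collection and worker count, this always produces the same split, which is the property the commit relies on to avoid order-dependent flakiness.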
Dynamic silo test classes (e.g. MyTest__InControlMode, MyTest__InRegionMode)
were created by iterating a frozenset, whose order varies across Python
processes due to hash randomization. This caused xdist workers to collect
tests in different orders, aborting the run with a collection diff error.

Sort the silo modes before creating dynamic classes so all workers see
identical collection order.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
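
A small sketch of the fix, using a hypothetical `SiloMode` enum and `make_silo_classes` helper rather than Sentry's real ones: sorting the modes by name before generating the classes makes the creation order independent of set iteration order.

```python
from enum import Enum


class SiloMode(Enum):  # hypothetical stand-in for the real silo mode type
    CONTROL = "CONTROL"
    REGION = "REGION"


def make_silo_classes(base: type, modes: frozenset) -> list[type]:
    """Create one subclass per silo mode in a stable, sorted order."""
    created = []
    # Iterating the frozenset directly may yield a different order in each
    # Python process; sorting by name gives every worker the same order.
    for mode in sorted(modes, key=lambda m: m.name):
        name = f"{base.__name__}__In{mode.name.capitalize()}Mode"
        created.append(type(name, (base,), {"silo_mode": mode}))
    return created


class MyTest:
    pass


classes = make_silo_classes(MyTest, frozenset({SiloMode.REGION, SiloMode.CONTROL}))
```

Here `classes` always holds `MyTest__InControlMode` followed by `MyTest__InRegionMode`, regardless of `PYTHONHASHSEED`.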
Workers using set/dict-based pytest.mark.parametrize produce collections
in different orders due to hash randomization across Python processes.
The scheduler now compares sorted collections (same tests, any order)
instead of requiring identical ordering.

Also builds a canonical sorted collection for deterministic round-robin
assignment, and uses O(1) index lookup per worker instead of list.index.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
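
The comparison and lookup described here might look like the following. The helper names are hypothetical; the sketch only shows the two ideas from the commit message: compare collections as sorted lists, and build a dict once per worker so each position lookup is O(1) instead of a `list.index` scan.

```python
def same_tests(a: list[str], b: list[str]) -> bool:
    """True if both workers collected the same tests, in any order."""
    return sorted(a) == sorted(b)


def worker_indices(canonical: list[str], worker_collection: list[str]) -> list[int]:
    """Translate canonical node IDs into positions in one worker's collection."""
    # Built once per worker: node ID -> index in that worker's own ordering.
    position = {node_id: i for i, node_id in enumerate(worker_collection)}
    return [position[node_id] for node_id in canonical]


gw0 = ["t.py::test[b]", "t.py::test[a]", "t.py::test[c]"]
gw1 = ["t.py::test[c]", "t.py::test[a]", "t.py::test[b]"]
canonical = sorted(gw0)  # one canonical order shared by all workers
```

With these inputs, `same_tests(gw0, gw1)` holds, and `worker_indices(canonical, gw1)` gives the indices to send to `gw1` so that it runs the tests in canonical order.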
joshuarli and others added 2 commits March 8, 2026 18:44
The rerunfailures plugin uses socket-based IPC between controller and
workers that hangs during worker startup, causing 60-minute CI timeouts.
Our deterministic scheduler sends all work upfront and shuts down nodes
immediately, which is incompatible with the plugin's connection model.

Disable the plugin when -n is specified. Reruns are not meaningful with
the deterministic scheduler anyway since work cannot be re-distributed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pytest-rerunfailures <=16.1 has a bug in SocketDB._sock_recv where
recv(1) returning b"" on a closed connection never matches the newline
delimiter, causing an infinite loop. The server's run_connection
threads also crash on TimeoutError, which isn't caught by
suppress(ConnectionError).

Monkey-patch _sock_recv to raise ConnectionError on both EOF and
timeout so the existing suppression handles both cases cleanly.
This replaces the previous workaround of disabling rerunfailures
entirely under xdist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
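
The failure mode is generic to any byte-at-a-time read loop, so it can be shown without the plugin. The function below is an illustrative stand-in, not the plugin's `_sock_recv` and not the PR's patch: it reads until a newline and treats both EOF and a timeout as a `ConnectionError`.

```python
import socket


def recv_line(sock: socket.socket) -> bytes:
    """Read bytes up to (not including) a newline, failing loudly on EOF."""
    buf = b""
    while True:
        try:
            chunk = sock.recv(1)
        except socket.timeout as exc:  # alias of TimeoutError on Python 3.10+
            raise ConnectionError("timed out waiting for data") from exc
        if chunk == b"":
            # recv() returns b"" once the peer has closed the connection.
            # Without this check the loop never sees b"\n" and spins forever.
            raise ConnectionError("peer closed the connection")
        if chunk == b"\n":
            return buf
        buf += chunk
```

Converting both conditions to `ConnectionError` means a caller that already wraps the read in `contextlib.suppress(ConnectionError)` needs no further changes.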
The previous fix handled server-side socket EOF but ConnectionError
still propagated from the client side in xdist workers, crashing the
worker process. Patch ClientStatusDB._get/_set to catch connection
errors and fall back to StatusDB no-op behavior (return 0 / no-op).

This is safe because our DeterministicScheduling doesn't support
mark_test_pending (crash reruns), and normal reruns are self-contained
within each worker's pytest_runtest_protocol loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the complex _sock_recv monkey-patch with a simple one-liner
that disables the socket mechanism entirely by setting
HAS_PYTEST_HANDLECRASHITEM = False at import time in conftest.py.

This prevents ServerStatusDB/ClientStatusDB from ever being created.
Normal within-worker reruns still work (self-contained in
pytest_runtest_protocol). Only crash-item reruns are disabled, which
our DeterministicScheduling doesn't support anyway (mark_test_pending
raises NotImplementedError).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
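
In conftest.py, the override described above might be written along these lines. The flag name comes from the commit message and `pytest_rerunfailures` is the module the plugin installs; the wrapper function `disable_crash_item_reruns` and its import guard are assumptions added so the sketch also runs where the plugin is absent.

```python
import importlib


def disable_crash_item_reruns(module_name: str = "pytest_rerunfailures") -> bool:
    """Set HAS_PYTEST_HANDLECRASHITEM to False on the plugin module.

    Returns True if the module was found and patched, False if it is
    not installed. Intended to run at conftest import time, before the
    plugin decides whether to set up its socket-based status DB.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    module.HAS_PYTEST_HANDLECRASHITEM = False
    return True


disable_crash_item_reruns()
```

Whether the override takes effect depends on the plugin reading the flag after conftest.py has been imported, which the commit message reports to be the case for the version in use.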
@github-actions
Contributor

github-actions bot commented Mar 9, 2026

Backend Test Failures

Failures on 0d2f8db in this run:

tests/sentry/spans/test_buffer.py::test_compression_functionality[0]
[gw2] linux -- Python 3.13.1 /home/runner/work/sentry/sentry/.venv/bin/python3
tests/sentry/spans/test_buffer.py:828: in test_compression_functionality
    assert_clean(buffer.client)
tests/sentry/spans/test_buffer.py:126: in assert_clean
    assert not [x for x in client.keys("*") if b":hrs:" not in x]
E   AssertionError: assert not [b'snowflakeid:project_snowflake_key:826491364', b'snowflakeid:organization_snowflake_key:826491365', b'snowflakeid:te...y:826491261', b'snowflakeid:project_snowflake_key:826491235', b'snowflakeid:organization_snowflake_key:826491275', ...]
