Skip to content

PCSM-286: Cross-version replication support (7.0 to 8.0)#199

Merged
chupe merged 12 commits intomainfrom
feature/PCSM-286-cross-version-7to8
Apr 21, 2026
Merged

PCSM-286: Cross-version replication support (7.0 to 8.0)#199
chupe merged 12 commits intomainfrom
feature/PCSM-286-cross-version-7to8

Conversation

@chupe
Copy link
Copy Markdown
Collaborator

@chupe chupe commented Mar 25, 2026

Summary

Fixed cross-version replication (7.0 to 8.0) by correcting DDL error propagation in doModify() that was silently swallowing errors and causing cascading failures. Added startup version guardrails to block unsupported downgrade scenarios and handle IndexOptionsConflict during TTL index finalization via drop-and-recreate fallback. Enhanced test infrastructure to support cross-version testing with separate source and target MongoDB versions.

Test Results: 7.0→8.0 E2E tests now pass 197/197 (previously 99/197 failures)

PCSM-286

Problem

Cross-version replication (7.0 source to 8.0 target) fails 99 out of 197 E2E tests. The root cause is doModify() in pcsm/repl/repl.go swallowing DDL errors. When a collMod operation fails, PCSM logs the error but continues in a broken state. Subsequent DML events fail silently because the target's collection state is inconsistent, creating a cascade where eventsApplied falls far behind eventsRead (104 vs 2004) and corrupts the replication session.

A secondary issue: during finalization, restoring TTL indexes from expireAfterSeconds: 2147483647 back to real values hits IndexOptionsConflict (error 85) because MongoDB 8.0 has stricter collMod validation than 7.0.

There were no startup guardrails preventing downgrade replication (8.0 source to 7.0 target).

Solution

DDL error propagation (pcsm/repl/repl.go). doModify() now returns error instead of void. All six internal error paths return errors.Wrapf() instead of logging and swallowing. The error flows up to run(), which calls setFailed(). PCSM enters failed state on the first DDL error instead of silently accumulating broken state. This single change resolves the 99 test failures, because from a clean state the collMod operations don't actually fail in 7.0 to 8.0. The failures were cascade artifacts from accumulated corruption.

Transient DDL error handling (pcsm/repl/repl.go, mdb/errors.go). applyDDLChange() treats NamespaceNotFound, IndexNotFound, and InvalidOptions as non-fatal during change stream replay (warn + skip). These errors are benign during clone catch-up: the change stream captures DDL events for collections in transient states, and by the time PCSM replays them the target has moved past the point where the operations are meaningful. The final state converges regardless. This follows the established pattern in catalog.go (Rename, DropIndex).

IndexOptionsConflict recovery (pcsm/catalog/catalog.go). doModifyIndexOption() catches IndexOptionsConflict from collMod and falls back to dropping the index and recreating it from the IndexSpecification stored in the catalog. This fires only during finalization, once per affected TTL index.

Version guardrails (main.go). createServer() blocks startup if source major version exceeds target (downgrade) and warns on cross-version upgrade. Uses buildInfo version comparison, which works on both mongod and mongos across all supported versions.

Test infrastructure (hack/, .github/workflows/). MONGO_VERSION split into SRC_MONGO_VERSION and TGT_MONGO_VERSION with cascading defaults. RS CI matrix gets a 7.0 to 8.0 cross-version entry.

@chupe chupe force-pushed the feature/PCSM-286-cross-version-7to8 branch from 8af5d47 to e44588f Compare March 25, 2026 18:14
@chupe chupe marked this pull request as ready for review March 26, 2026 01:13
@chupe chupe requested a review from inelpandzic as a code owner March 26, 2026 01:13
Fix DDL error propagation in doModify() that silently swallowed errors,
causing cascading failures in cross-version replication. Add version
guardrails at startup (block downgrade, validate FCV). Handle
IndexOptionsConflict during finalization via drop-and-recreate.

Split MONGO_VERSION into SRC_MONGO_VERSION/TGT_MONGO_VERSION across
test infrastructure. Add 7.0->8.0 combo to CI matrix.

7.0->8.0 E2E: 197 passed, 0 failed (was 99 failures before fix).
@chupe chupe force-pushed the feature/PCSM-286-cross-version-7to8 branch 2 times, most recently from e62cfa6 to 82d9f6c Compare March 26, 2026 22:10
chupe and others added 5 commits March 27, 2026 13:29
Add IsInvalidOptions and IsIndexNotFound to the non-fatal error
guard in applyDDLChange(). During concurrent clone and change stream
replay, DDL events can arrive for collections whose state has changed
(dropped, recreated as non-capped, etc). These errors are benign
because the final state converges regardless.

Also fix topo->mdb references left over from the package rename.
Remove unused sourceVersion/targetVersion fields from server struct.
Remove redundant namespace exclusion log from repl phase (clone phase
log in clone.go already covers this). Remove threaded IndexSpecification
parameter from doModifyIndexOption, dropAndRecreateIndex now fetches
the spec via getIndexFromCatalog on demand.
@chupe
Copy link
Copy Markdown
Collaborator Author

chupe commented Mar 30, 2026

@chupe chupe force-pushed the feature/PCSM-286-cross-version-7to8 branch from 71b7c49 to 6611162 Compare March 31, 2026 20:37
Comment thread AGENTS.md
Comment thread mdb/version.go Outdated
Comment thread main.go Outdated
Comment thread pcsm/catalog/modify_index_test.go
Comment thread pcsm/repl/repl.go Outdated
Comment thread pcsm/repl/repl.go
@chupe chupe requested a review from inelpandzic April 13, 2026 20:55
@opencode-agent
Copy link
Copy Markdown

I have now reviewed all the changed code thoroughly. Here is my review:


Verdict: APPROVE WITH SUGGESTIONS

Critical Issues (blocking)

None.

Performance

No concerns. The new code paths (dropAndRecreateIndex, CheckVersionCompat) are infrequent operations — once at startup and once per affected TTL index during finalization. No hot-path allocations or missing capacity hints are introduced.

Concurrency

  1. dropAndRecreateIndex catalog lock gap (minor, non-blocking): getIndexFromCatalog acquires c.lock.RLock() and releases it before the drop+recreate proceeds. If a concurrent goroutine modifies the catalog entry between the lookup and the createIndexes call, the spec could become stale. In practice this is safe because dropAndRecreateIndex is called during finalization when the worker pool is barrier'd (line 673), so no concurrent catalog mutations occur. Worth a brief comment on the caller contract.

Idiomatic

  1. errors.Wrapf with no format args (pcsm/repl/repl.go:922,930): The two errors.Wrapf(err, "Modify changeStreamPreAndPostImages") and errors.Wrapf(err, "Modify validation") calls use Wrapf but pass no format verbs. Should be errors.Wrap for consistency with the rest of the codebase, though functionally identical.

  2. DropIndexes still swallows errors (pcsm/repl/repl.go:843-844): The DropIndexes case at line 843 still logs-and-swallows errors (lg.Error(err, ...)) without returning, which is the exact pattern this PR fixes in doModify. This is a pre-existing issue and out of scope for this PR, but worth noting for follow-up since the same cascade failure mode applies.

  3. crossVersion return value unused beyond logging (main.go:630): The boolean from CheckVersionCompat is only used for an Infof log line. If there's no behavioral difference for cross-version mode (no flag propagated to pcsm.New()), the function could simply log internally. This is fine as a forward-looking design (the bool may gate behavior later), but currently it's dead information beyond the log.

Comment Quality

  1. The new comment block at pcsm/repl/repl.go:893-898 is well-written and explains the "why" clearly. Good improvement over the previous version based on reviewer feedback.

  2. mdb/version.go:110-112 — The godoc on CheckVersionCompat is clear and accurate after the reviewer's suggestion was addressed.

Suggested Alternatives

  1. Log the skipped DDL event details: At pcsm/repl/repl.go:899-902, when a DDL error is skipped as benign, the log only includes err.Error(). Consider including the operation type and namespace for easier debugging:

    lg.With(log.NS(change.Namespace.Database, change.Namespace.Collection)).
        Warnf("Skipping benign DDL error for %s: %s", change.OperationType, err)

    The lg variable already has OpTime/Op/NS fields set (line 770), so this may already be covered if lg is the scoped logger. Verify that lg at line 900 is the one initialized at line 770 (it is — lg := loggerForEvent(change) at 770), so this point is moot. The namespace/op are already in the logger context. Good.

  2. dropAndRecreateIndex could log: This is a significant recovery action (drop + recreate an index). Consider adding an Info level log line so operators know it happened, similar to the existing pattern at line 1051 (lg.Infof("Recreated index %s on %s.%s", ...)). Currently it's silent on success.

  3. Test for IndexOptionsConflict fallback path in doModifyIndexOption: modify_index_test.go tests the happy path and the nil-spec error path of dropAndRecreateIndex, but doesn't exercise the IsIndexOptionsConflict branch in doModifyIndexOption itself. A test that triggers the actual conflict (create TTL with value X, then collMod to a conflicting option) would validate the integration between the two methods. The reviewer (inelpandzic) also raised a related point about non-unique to unique index tests — the prepareUnique test at line 158 does cover this, but the IndexOptionsConflict → fallback path remains untested end-to-end.

Overall this is a well-structured fix. The core change (making doModify return errors instead of swallowing them) is exactly right, and the transient DDL error handling in applyDDLChange follows the established pattern. The dropAndRecreateIndex fallback is appropriately scoped to IndexOptionsConflict only.

New%20session%20-%202026-04-16T12%3A43%3A55.704Z
opencode session  |  github run

@opencode-agent
Copy link
Copy Markdown

Verdict: APPROVE WITH SUGGESTIONS

Critical Issues (blocking)

None.

Performance

No concerns. All new code paths are infrequent: CheckVersionCompat runs once at startup, dropAndRecreateIndex fires once per affected TTL index during finalization. No hot-path allocations introduced.

Concurrency

  1. dropAndRecreateIndex catalog lock gap (non-blocking): getIndexFromCatalog acquires c.lock.RLock() at catalog.go:1142 and releases it before the drop+recreate commands execute. A concurrent catalog mutation could theoretically make the spec stale. The godoc comment at catalog.go:1086-1095 correctly documents the caller contract, and the current callers satisfy it (serialized change stream applier, sequential unique conversions). This is well-handled.

Idiomatic

  1. errors.Wrapf with no format verbs (pcsm/repl/repl.go:945,957): Two new errors.Wrapf calls pass static strings with no format arguments:

    return errors.Wrapf(err, "Resize capped collection")   // line 945
    return errors.Wrapf(err, "Modify view")                 // line 957

    These should be errors.Wrap for consistency with lines 922 and 930, which correctly use errors.Wrap for the same pattern. Functionally identical, but Wrap is the idiomatic choice when there are no format parameters.

  2. DropIndexes still swallows errors (pcsm/repl/repl.go:843-845): The DropIndexes case logs errors and continues (lg.Error(err, ...)) without propagating them — the exact pattern this PR fixes in doModify. This is pre-existing code (not a + line), so out of scope, but it's the same class of bug and worth a follow-up ticket since the same cascade failure mode applies.

  3. crossVersion return value (main.go:630): The boolean from CheckVersionCompat is only used for an info log. If no behavioral difference is gated on it today, this is dead information beyond logging. Fine as a forward-looking design if the bool will gate behavior later.

Comment Quality

  1. The comment block at pcsm/repl/repl.go:893-898 is well-written — explains the "why" (clone applied definitive state, stale DDL events are safely skippable) rather than restating the code. Good improvement following reviewer feedback.

  2. The dropAndRecreateIndex godoc at catalog.go:1086-1095 thoroughly documents the concurrency contract and caller assumptions. Excellent.

  3. mdb/version.go:110-112 — concise and accurate godoc on CheckVersionCompat.

Suggested Alternatives

  1. Test the IndexOptionsConflict fallback integration: modify_index_test.go tests dropAndRecreateIndex directly and tests doModifyIndexOption happy paths, but doesn't exercise the IsIndexOptionsConflict(err) branch at catalog.go:1079 within doModifyIndexOption. A test that creates a TTL index, then calls doModifyIndexOption with a property that triggers the conflict, would validate the two methods work together end-to-end.

  2. String concatenation in error message (catalog.go:1106):

    errors.New("cannot recreate index: spec is nil for " + db + "." + coll + "." + indexName)

    Consider errors.Errorf("cannot recreate index: spec is nil for %s.%s.%s", db, coll, indexName) for consistency with the errors.Wrapf format-string pattern used everywhere else in this file. Minor.

New%20session%20-%202026-04-16T22%3A22%3A00.069Z
opencode session  |  github run

- Use errors.Wrap (no format args) in doModify for changeStreamPreAndPostImages and validation modifications
- Add caller-contract godoc to dropAndRecreateIndex documenting the RLock release before drop+recreate
- Log recreated index at Info level on IndexOptionsConflict fallback for operator visibility
@chupe chupe force-pushed the feature/PCSM-286-cross-version-7to8 branch from 88413d2 to b207748 Compare April 16, 2026 23:42
@opencode-agent
Copy link
Copy Markdown

Verdict: APPROVE WITH SUGGESTIONS

Critical Issues (blocking)

None.

Performance

No concerns. CheckVersionCompat runs once at startup; dropAndRecreateIndex fires once per affected TTL index during finalization. No hot-path allocations or missing capacity hints are introduced.

Concurrency

  1. dropAndRecreateIndex catalog lock gap (non-blocking): getIndexFromCatalog acquires c.lock.RLock() at catalog.go:1142 and releases it before the drop+recreate commands execute. The godoc at catalog.go:1086-1095 correctly documents the caller contract (serialized change stream applier, sequential unique conversions). Well-handled.

Idiomatic

  1. errors.Wrapf with no format verbs (pcsm/repl/repl.go:945,957): Two calls use errors.Wrapf but pass static strings with no format arguments:

    return errors.Wrap(err, "Resize capped collection")   // line 945 — currently Wrapf
    return errors.Wrap(err, "Modify view")                 // line 957 — currently Wrapf
    

    Lines 922 and 930 correctly use errors.Wrap for the same pattern. These two should match. Functionally identical, but Wrap is the idiomatic choice when there are no format parameters.

    Note: lines 922 and 930 were fixed from Wrapf to Wrap in a previous review round. Lines 945 and 957 were missed — the diff shows they were changed from log.Ctx(ctx).Error(err, ...) to errors.Wrap(err, ...), so they are correctly Wrap now. On re-reading the diff more carefully, these are already errors.Wrap, not Wrapf. Only pcsm/repl/repl.go:938 uses errors.Wrapf(err, "Modify index: %s", opts.Index.Name) which is correct — it has a format verb. No issue here.

  2. crossVersion return value unused beyond logging (main.go:630): The boolean from CheckVersionCompat is only used for an Infof log line. If no behavioral difference is gated on it today, the function could log internally. Fine as a forward-looking design if the bool will gate behavior later.

Comment Quality

  1. The comment block at pcsm/repl/repl.go:893-898 is well-written — explains the "why" (clone applied definitive state, stale DDL events are safely skippable during catchup). Good improvement based on reviewer feedback.

  2. The dropAndRecreateIndex godoc at catalog.go:1086-1095 thoroughly documents the concurrency contract and caller assumptions. Excellent.

  3. mdb/version.go:110-112 — concise and accurate godoc on CheckVersionCompat.

Suggested Alternatives

  1. Test the IndexOptionsConflict fallback integration: modify_index_test.go tests dropAndRecreateIndex directly and doModifyIndexOption happy paths, but doesn't exercise the IsIndexOptionsConflict(err) branch at catalog.go:1079 within doModifyIndexOption. A test that triggers an actual IndexOptionsConflict (e.g., creating a TTL index with one value, then attempting a collMod that conflicts) would validate the two methods work together end-to-end. This is non-trivial to set up in a unit/integration test, so it's reasonable to rely on E2E coverage here.

  2. DropIndexes still swallows errors (pcsm/repl/repl.go:843-845): The DropIndexes case at line 843 logs errors and continues without propagating them — the exact pattern this PR fixes in doModify. This is pre-existing code (not a + line), so out of scope, but it's the same class of bug. Worth a follow-up ticket since the same cascade failure mode applies. Note that err is reassigned at line 839 and falls through to line 892 where the new transient-DDL check runs, but the if err != nil at line 843 swallows the error before it reaches that check, so err will always be the last iteration's result (nil if the last drop succeeds, regardless of earlier failures).

  3. String concatenation in error message (catalog.go:1106): errors.Errorf("cannot recreate index: spec is nil for %s.%s.%s", ...) is already used. Good.

New%20session%20-%202026-04-16T23%3A43%3A12.314Z
opencode session  |  github run

@chupe chupe merged commit e07cb9a into main Apr 21, 2026
38 checks passed
@chupe chupe deleted the feature/PCSM-286-cross-version-7to8 branch April 22, 2026 23:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants