PCSM-286: Cross-version replication support (7.0 to 8.0) #199
8af5d47 to
e44588f
Fix DDL error propagation in doModify() that silently swallowed errors, causing cascading failures in cross-version replication. Add version guardrails at startup (block downgrade, validate FCV). Handle IndexOptionsConflict during finalization via drop-and-recreate. Split MONGO_VERSION into SRC_MONGO_VERSION/TGT_MONGO_VERSION across test infrastructure. Add 7.0->8.0 combo to CI matrix. 7.0->8.0 E2E: 197 passed, 0 failed (was 99 failures before fix).
e62cfa6 to
82d9f6c
Add IsInvalidOptions and IsIndexNotFound to the non-fatal error guard in applyDDLChange(). During concurrent clone and change stream replay, DDL events can arrive for collections whose state has changed (dropped, recreated as non-capped, etc). These errors are benign because the final state converges regardless. Also fix topo->mdb references left over from the package rename.
Remove unused sourceVersion/targetVersion fields from server struct. Remove redundant namespace exclusion log from repl phase (clone phase log in clone.go already covers this). Remove threaded IndexSpecification parameter from doModifyIndexOption, dropAndRecreateIndex now fetches the spec via getIndexFromCatalog on demand.
71b7c49 to
6611162
I have now reviewed all the changed code thoroughly. Here is my review:
Verdict: APPROVE WITH SUGGESTIONS
Critical Issues (blocking)
None.
Performance
No concerns. The new code paths (
Concurrency
Idiomatic
Comment Quality
Suggested Alternatives
Overall this is a well-structured fix. The core change (making
Verdict: APPROVE WITH SUGGESTIONS
Critical Issues (blocking)
None.
Performance
No concerns. All new code paths are infrequent:
Concurrency
Idiomatic
Comment Quality
Suggested Alternatives
- Use errors.Wrap (no format args) in doModify for changeStreamPreAndPostImages and validation modifications
- Add caller-contract godoc to dropAndRecreateIndex documenting the RLock release before drop+recreate
- Log recreated index at Info level on IndexOptionsConflict fallback for operator visibility
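The IndexOptionsConflict fallback these commits refer to might look roughly like the sketch below. It is an illustration under assumptions: `indexSpec`, `indexStore`, `restoreTTL`, and `fakeStore` are hypothetical names, the locking described in the godoc (releasing the RLock before drop and recreate) is left out entirely, and only error code 85 (IndexOptionsConflict) is taken from the PR description.

```go
package main

import (
	"errors"
	"fmt"
)

const codeIndexOptionsConflict = 85 // MongoDB server error code

// serverError is a hypothetical stand-in for the driver's command error type.
type serverError struct {
	Code int
	Msg  string
}

func (e *serverError) Error() string { return fmt.Sprintf("(%d) %s", e.Code, e.Msg) }

// indexSpec is a hypothetical, trimmed-down index description.
type indexSpec struct {
	Name               string
	ExpireAfterSeconds int32
}

// indexStore is a hypothetical abstraction over the target collection.
type indexStore interface {
	ModifyTTL(name string, seconds int32) error // stands in for collMod
	Drop(name string) error
	Create(spec indexSpec) error
}

// restoreTTL tries the in-place modification first and falls back to
// drop-and-recreate only when the server reports IndexOptionsConflict.
func restoreTTL(store indexStore, spec indexSpec) error {
	err := store.ModifyTTL(spec.Name, spec.ExpireAfterSeconds)
	if err == nil {
		return nil
	}
	var se *serverError
	if !errors.As(err, &se) || se.Code != codeIndexOptionsConflict {
		return fmt.Errorf("modify index %q: %w", spec.Name, err)
	}
	if err := store.Drop(spec.Name); err != nil {
		return fmt.Errorf("drop index %q: %w", spec.Name, err)
	}
	if err := store.Create(spec); err != nil {
		return fmt.Errorf("recreate index %q: %w", spec.Name, err)
	}
	fmt.Printf("info: recreated index %q after IndexOptionsConflict\n", spec.Name)
	return nil
}

// fakeStore rejects every modification with code 85 and records the calls it sees.
type fakeStore struct{ calls []string }

func (f *fakeStore) ModifyTTL(string, int32) error {
	f.calls = append(f.calls, "modify")
	return &serverError{Code: codeIndexOptionsConflict, Msg: "IndexOptionsConflict"}
}

func (f *fakeStore) Drop(string) error {
	f.calls = append(f.calls, "drop")
	return nil
}

func (f *fakeStore) Create(indexSpec) error {
	f.calls = append(f.calls, "create")
	return nil
}

func main() {
	store := &fakeStore{}
	err := restoreTTL(store, indexSpec{Name: "ttl_1", ExpireAfterSeconds: 3600})
	fmt.Println(store.calls, err)
}
```

Any error other than code 85 is returned unchanged in meaning (only wrapped), so the fallback does not mask unrelated collMod failures.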
88413d2 to
b207748
Verdict: APPROVE WITH SUGGESTIONS
Critical Issues (blocking)
None.
Performance
No concerns.
Concurrency
Idiomatic
Comment Quality
Suggested Alternatives



Summary
Fixed cross-version replication (7.0 to 8.0) by correcting DDL error propagation in doModify(), which was silently swallowing errors and causing cascading failures. Added startup version guardrails to block unsupported downgrade scenarios, and handled IndexOptionsConflict during TTL index finalization via a drop-and-recreate fallback. Enhanced test infrastructure to support cross-version testing with separate source and target MongoDB versions.
Test Results: 7.0→8.0 E2E tests now pass 197/197 (previously 99/197 failures)
PCSM-286
Problem
Cross-version replication (7.0 source to 8.0 target) fails 99 out of 197 E2E tests. The root cause is doModify() in pcsm/repl/repl.go swallowing DDL errors. When a collMod operation fails, PCSM logs the error but continues in a broken state. Subsequent DML events fail silently because the target's collection state is inconsistent, creating a cascade where eventsApplied falls far behind eventsRead (104 vs 2004) and corrupts the replication session.
A secondary issue: during finalization, restoring TTL indexes from expireAfterSeconds: 2147483647 back to real values hits IndexOptionsConflict (error 85) because MongoDB 8.0 has stricter collMod validation than 7.0.
There were no startup guardrails preventing downgrade replication (8.0 source to 7.0 target).
Solution
DDL error propagation (pcsm/repl/repl.go). doModify() now returns error instead of void. All six internal error paths return errors.Wrapf() instead of logging and swallowing. The error flows up to run(), which calls setFailed(). PCSM enters failed state on the first DDL error instead of silently accumulating broken state. This single change resolves the 99 test failures, because from a clean state the collMod operations don't actually fail in 7.0 to 8.0. The failures were cascade artifacts from accumulated corruption.
Transient DDL error handling (pcsm/repl/repl.go, mdb/errors.go). applyDDLChange() treats NamespaceNotFound, IndexNotFound, and InvalidOptions as non-fatal during change stream replay (warn + skip). These errors are benign during clone catch-up: the change stream captures DDL events for collections in transient states, and by the time PCSM replays them the target has moved past the point where the operations are meaningful. The final state converges regardless. This follows the established pattern in catalog.go (Rename, DropIndex).
IndexOptionsConflict recovery (pcsm/catalog/catalog.go). doModifyIndexOption() catches IndexOptionsConflict from collMod and falls back to dropping the index and recreating it from the IndexSpecification stored in the catalog. This fires only during finalization, once per affected TTL index.
Version guardrails (main.go). createServer() blocks startup if source major version exceeds target (downgrade) and warns on cross-version upgrade. Uses buildInfo version comparison, which works on both mongod and mongos across all supported versions.
Test infrastructure (hack/, .github/workflows/). MONGO_VERSION split into SRC_MONGO_VERSION and TGT_MONGO_VERSION with cascading defaults. RS CI matrix gets a 7.0 to 8.0 cross-version entry.