schemastore: retry bootstrap on GC-stale snapshot to avoid panic#4404
schemastore: retry bootstrap on GC-stale snapshot to avoid panic#4404
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a potential panic during the schema store's initialization process by implementing a robust retry mechanism. It ensures that if the initial attempt to load schema data from KV storage fails due to transient issues like stale snapshots or GC lifetime errors, the system will gracefully retry the operation with an updated GC safe point, preventing application crashes and improving system resilience. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
📝 WalkthroughWalkthroughReworks SchemaStore bootstrap to classify initialization errors, add retry logic when KV snapshots are lost due to GC, and propagate errors from Changes
Sequence Diagram(s)sequenceDiagram
participant SchemaStore
participant PD as PD (gc safe point)
participant KV as TiKV / KV Snapshot
participant LocalDB as Local DB
SchemaStore->>PD: getAndEnsureGcSafePoint(ctx)
PD-->>SchemaStore: gcSafePoint
SchemaStore->>LocalDB: open DB (dbPath)
SchemaStore->>KV: load snapshot at gcSafePoint
alt snapshot load success
KV-->>SchemaStore: snapshot
SchemaStore->>LocalDB: initialize from snapshot
LocalDB-->>SchemaStore: initialized
else snapshot lost / GC error
KV-->>SchemaStore: ErrSnapshotLostByGC / GC lifetime error
SchemaStore->>SchemaStore: isRetryableInitializeFromKVStorageError(err)
alt retryable
SchemaStore->>PD: getAndEnsureGcSafePoint(ctx) (refresh)
SchemaStore->>KV: retry load snapshot
else non-retryable
SchemaStore->>LocalDB: close DB
SchemaStore-->>Caller: return error
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request aims to fix a panic in the schemastore by replacing log.Fatal with proper error handling and introducing a retry mechanism for initialization from KV storage. This is a good improvement for robustness. However, the new retry logic contains a critical bug that could lead to an infinite loop under certain failure conditions. I've provided a suggestion to fix this.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
logservice/schemastore/persist_storage_test.go (1)
2898-2908: Consider adding test cases for edge scenarios.The current tests cover the main paths well. Consider adding coverage for:
nilerror input (the function explicitly handles this at line 317-319)- Wrapped
ErrSnapshotLostByGCto verify theRFCCodeunwrapping behavior works correctly♻️ Suggested additional test cases
func TestIsRetryableInitializeFromKVStorageError(t *testing.T) { + require.False(t, isRetryableInitializeFromKVStorageError(nil)) require.True(t, isRetryableInitializeFromKVStorageError( fmt.Errorf("snapshot is lost because GC life time is shorter than transaction duration"), )) require.True(t, isRetryableInitializeFromKVStorageError( cerror.ErrSnapshotLostByGC.GenWithStackByArgs(100, 200), )) + // Test wrapped ErrSnapshotLostByGC + require.True(t, isRetryableInitializeFromKVStorageError( + fmt.Errorf("wrapped: %w", cerror.ErrSnapshotLostByGC.GenWithStackByArgs(100, 200)), + )) require.False(t, isRetryableInitializeFromKVStorageError( fmt.Errorf("non retryable error"), )) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@logservice/schemastore/persist_storage_test.go` around lines 2898 - 2908, Add two edge-case assertions to TestIsRetryableInitializeFromKVStorageError: one that passes a nil error and asserts false (to exercise the nil guard in isRetryableInitializeFromKVStorageError), and one that passes an error which wraps cerror.ErrSnapshotLostByGC (e.g., fmt.Errorf("context: %w", cerror.ErrSnapshotLostByGC.GenWithStackByArgs(...))) and asserts true to verify the function correctly unwraps by RFCCode; locate and update TestIsRetryableInitializeFromKVStorageError and reference isRetryableInitializeFromKVStorageError and cerror.ErrSnapshotLostByGC when adding these cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@logservice/schemastore/persist_storage.go`:
- Around line 259-278: Log messages claim "will retry in 1s" but the code
continues immediately (the actual 1s sleep occurs elsewhere); update the
messages in the p.getGcSafePoint and gc.EnsureChangefeedStartTsSafety error
paths to either remove the "in 1s" phrase (e.g., "will retry") or perform a
time.Sleep(1 * time.Second) before the continue; additionally, when
getGcSafePoint fails in the getGcSafePoint error branch, clear or reset
gcSafePoint (e.g., set gcSafePoint = 0) before continue to avoid using a stale
gcSafePoint later; make these edits around p.getGcSafePoint, gcSafePoint, and
gc.EnsureChangefeedStartTsSafety calls.
---
Nitpick comments:
In `@logservice/schemastore/persist_storage_test.go`:
- Around line 2898-2908: Add two edge-case assertions to
TestIsRetryableInitializeFromKVStorageError: one that passes a nil error and
asserts false (to exercise the nil guard in
isRetryableInitializeFromKVStorageError), and one that passes an error which
wraps cerror.ErrSnapshotLostByGC (e.g., fmt.Errorf("context: %w",
cerror.ErrSnapshotLostByGC.GenWithStackByArgs(...))) and asserts true to verify
the function correctly unwraps by RFCCode; locate and update
TestIsRetryableInitializeFromKVStorageError and reference
isRetryableInitializeFromKVStorageError and cerror.ErrSnapshotLostByGC when
adding these cases.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3db1da94-5c83-45fd-834b-6bad63453d63
📒 Files selected for processing (2)
logservice/schemastore/persist_storage.gologservice/schemastore/persist_storage_test.go
|
/test all |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: 3AceShowHand, tenfyzhong The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
|
/hold |
What problem does this PR solve?
Issue Number: close #4407
What is changed and how it works?
This pull request addresses a potential panic during the schema store's initialization process by implementing a robust retry mechanism. It ensures that if the initial attempt to load schema data from KV storage fails due to transient issues like stale snapshots or GC lifetime errors, the system will gracefully retry the operation with an updated GC safe point, preventing application crashes and improving system resilience.
Highlights
initializeFromKVStorageby changinglog.Fatalcalls to return errors, allowing for graceful error handling instead of application crashes.initializeFromKVStoragewithin theinitializemethod. This mechanism specifically handles transient issues like stale snapshots or GC lifetime errors by retrying with an updated GC safe point.initializeFromKVStoragefunction signature to return anerror, enabling proper error propagation and more controlled handling of initialization failures.isRetryableInitializeFromKVStorageError, a new helper function to accurately determine if an initialization error is transient and warrants a retry.Check List
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note
Summary by CodeRabbit
Bug Fixes
Tests