schemastore: retry bootstrap on GC-stale snapshot to avoid panic #4404

Open
lidezhu wants to merge 3 commits into master from ldz/fix-schema-store0309

Conversation

@lidezhu
Collaborator

@lidezhu lidezhu commented Mar 9, 2026

What problem does this PR solve?

Issue Number: close #4407

What is changed and how it works?

This pull request prevents the schema store from crashing during initialization. If the initial attempt to load schema data from KV storage fails because of a transient issue such as a stale snapshot or a GC lifetime error, initialization now retries with an updated GC safe point instead of terminating the process.

Highlights

  • Panic Fix: Resolved a potential panic in initializeFromKVStorage by changing log.Fatal calls to return errors, allowing for graceful error handling instead of application crashes.
  • Retry Mechanism: Implemented a robust retry loop for initializeFromKVStorage within the initialize method. This mechanism specifically handles transient issues like stale snapshots or GC lifetime errors by retrying with an updated GC safe point.
  • Error Handling Improvement: Modified the initializeFromKVStorage function signature to return an error, enabling proper error propagation and more controlled handling of initialization failures.
  • New Helper Function: Introduced isRetryableInitializeFromKVStorageError, a new helper function to accurately determine if an initialization error is transient and warrants a retry.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

  • Bug Fixes

    • Improved initialization resilience with automatic retries for recoverable startup errors and clearer retry logging.
    • Enhanced error handling and resource cleanup to avoid leaking resources when initialization fails.
  • Tests

    • Added unit test coverage verifying classification of retryable vs non-retryable initialization errors.

@ti-chi-bot ti-chi-bot bot added the do-not-merge/needs-linked-issue, release-note (denotes a PR that will be considered when it comes time to generate release notes), and size/M (denotes a PR that changes 30-99 lines, ignoring generated files) labels Mar 9, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Changelog
  • logservice/schemastore/persist_storage.go
    • Modified initializeFromKVStorage to return an error instead of calling log.Fatal.
    • Wrapped the call to initializeFromKVStorage in initialize with a for loop to retry on specific errors.
    • Added a new function isRetryableInitializeFromKVStorageError to check for retryable errors.
    • Included logic to update the GC safe point and ensure changefeed start TS safety during retries.
    • Added db.Close() call on error in initializeFromKVStorage.
  • logservice/schemastore/persist_storage_test.go
    • Imported cerror package.
    • Added TestIsRetryableInitializeFromKVStorageError to test the new error classification logic.
Activity
  • No human activity (comments, reviews, etc.) has occurred on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Contributor

coderabbitai bot commented Mar 9, 2026

Walkthrough

Reworks SchemaStore bootstrap to classify initialization errors, add retry logic when KV snapshots are lost due to GC, and propagate errors from initializeFromKVStorage while ensuring DB cleanup and refreshed gcSafePoint acquisition.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| SchemaStore persist storage (logservice/schemastore/persist_storage.go) | Added getAndEnsureGcSafePoint(ctx, ...) and isRetryableInitializeFromKVStorageError(err). initializeFromKVStorage now returns error. Initialize path gained retry loop that refreshes GC safepoint on retryable errors and ensures DB is closed on failure; removed fatal exits. |
| Unit tests (logservice/schemastore/persist_storage_test.go) | Added TestIsRetryableInitializeFromKVStorageError to verify classification of GC-snapshot and generic errors as retryable/non-retryable. |
| Module file (go.mod) | Touched (small manifest update present in diff). |

Sequence Diagram(s)

sequenceDiagram
  participant SchemaStore
  participant PD as PD (gc safe point)
  participant KV as TiKV / KV Snapshot
  participant LocalDB as Local DB

  SchemaStore->>PD: getAndEnsureGcSafePoint(ctx)
  PD-->>SchemaStore: gcSafePoint
  SchemaStore->>LocalDB: open DB (dbPath)
  SchemaStore->>KV: load snapshot at gcSafePoint
  alt snapshot load success
    KV-->>SchemaStore: snapshot
    SchemaStore->>LocalDB: initialize from snapshot
    LocalDB-->>SchemaStore: initialized
  else snapshot lost / GC error
    KV-->>SchemaStore: ErrSnapshotLostByGC / GC lifetime error
    SchemaStore->>SchemaStore: isRetryableInitializeFromKVStorageError(err)
    alt retryable
      SchemaStore->>PD: getAndEnsureGcSafePoint(ctx) (refresh)
      SchemaStore->>KV: retry load snapshot
    else non-retryable
      SchemaStore->>LocalDB: close DB
      SchemaStore-->>Caller: return error
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

lgtm, approved

Suggested reviewers

  • 3AceShowHand
  • tenfyzhong

Poem

🐰 I hopped through logs, found safepoints anew,
When snapshots slipped, I tried once more or two,
DB doors closed gentle if errors stay,
I refreshed the GC and bounced on my way —
A little retry, and startup's okay. 🥕

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The PR description provides helpful context in the "What is changed and how it works?" section with implementation highlights, but several required template sections lack specific information. | Complete the unanswered questions (performance regression, documentation updates) and fill the release note section. Clarify which test types are included in the checklist. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: implementing a retry mechanism for schema store bootstrap when GC snapshots become stale, replacing the previous behavior that would panic. |
| Linked Issues check | ✅ Passed | The PR successfully implements the core objective from #4407: it converts fatal errors (log.Fatal) to recoverable errors and implements a retry mechanism with refreshed GC safe points, directly addressing the ErrSnapshotLostByGC handling requirement. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the objective: new helper function isRetryableInitializeFromKVStorageError, updated method signatures for error handling, and a test function validating retry logic. No unrelated changes detected. |


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix a panic in the schemastore by replacing log.Fatal with proper error handling and introducing a retry mechanism for initialization from KV storage. This is a good improvement for robustness. However, the new retry logic contains a critical bug that could lead to an infinite loop under certain failure conditions. I've provided a suggestion to fix this.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
logservice/schemastore/persist_storage_test.go (1)

2898-2908: Consider adding test cases for edge scenarios.

The current tests cover the main paths well. Consider adding coverage for:

  1. nil error input (the function explicitly handles this at lines 317-319)
  2. Wrapped ErrSnapshotLostByGC to verify the RFCCode unwrapping behavior works correctly
♻️ Suggested additional test cases
 func TestIsRetryableInitializeFromKVStorageError(t *testing.T) {
+	require.False(t, isRetryableInitializeFromKVStorageError(nil))
 	require.True(t, isRetryableInitializeFromKVStorageError(
 		fmt.Errorf("snapshot is lost because GC life time is shorter than transaction duration"),
 	))
 	require.True(t, isRetryableInitializeFromKVStorageError(
 		cerror.ErrSnapshotLostByGC.GenWithStackByArgs(100, 200),
 	))
+	// Test wrapped ErrSnapshotLostByGC
+	require.True(t, isRetryableInitializeFromKVStorageError(
+		fmt.Errorf("wrapped: %w", cerror.ErrSnapshotLostByGC.GenWithStackByArgs(100, 200)),
+	))
 	require.False(t, isRetryableInitializeFromKVStorageError(
 		fmt.Errorf("non retryable error"),
 	))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@logservice/schemastore/persist_storage_test.go` around lines 2898 - 2908, Add
two edge-case assertions to TestIsRetryableInitializeFromKVStorageError: one
that passes a nil error and asserts false (to exercise the nil guard in
isRetryableInitializeFromKVStorageError), and one that passes an error which
wraps cerror.ErrSnapshotLostByGC (e.g., fmt.Errorf("context: %w",
cerror.ErrSnapshotLostByGC.GenWithStackByArgs(...))) and asserts true to verify
the function correctly unwraps by RFCCode; locate and update
TestIsRetryableInitializeFromKVStorageError and reference
isRetryableInitializeFromKVStorageError and cerror.ErrSnapshotLostByGC when
adding these cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@logservice/schemastore/persist_storage.go`:
- Around line 259-278: Log messages claim "will retry in 1s" but the code
continues immediately (the actual 1s sleep occurs elsewhere); update the
messages in the p.getGcSafePoint and gc.EnsureChangefeedStartTsSafety error
paths to either remove the "in 1s" phrase (e.g., "will retry") or perform a
time.Sleep(1 * time.Second) before the continue; additionally, when
getGcSafePoint fails in the getGcSafePoint error branch, clear or reset
gcSafePoint (e.g., set gcSafePoint = 0) before continue to avoid using a stale
gcSafePoint later; make these edits around p.getGcSafePoint, gcSafePoint, and
gc.EnsureChangefeedStartTsSafety calls.

📥 Commits

Reviewing files that changed from the base of the PR and between c04915e and 6525e8b.

📒 Files selected for processing (2)
  • logservice/schemastore/persist_storage.go
  • logservice/schemastore/persist_storage_test.go

@lidezhu lidezhu changed the title from "schemastore: fix panic" to "schemastore: retry bootstrap on GC-stale snapshot to avoid panic" Mar 9, 2026
@ti-chi-bot ti-chi-bot bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) and removed the size/M label Mar 9, 2026
@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm (indicates a PR needs 1 more LGTM) and approved labels Mar 10, 2026
@lidezhu
Collaborator Author

lidezhu commented Mar 10, 2026

/test all

@ti-chi-bot ti-chi-bot bot added the lgtm label Mar 10, 2026
@ti-chi-bot

ti-chi-bot bot commented Mar 10, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand, tenfyzhong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [3AceShowHand,tenfyzhong]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm label Mar 10, 2026
@ti-chi-bot

ti-chi-bot bot commented Mar 10, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-10 03:06:01.218102715 +0000 UTC m=+319392.730160386: ☑️ agreed by tenfyzhong.
  • 2026-03-10 07:04:13.961396152 +0000 UTC m=+333685.473453803: ☑️ agreed by 3AceShowHand.

@lidezhu
Collaborator Author

lidezhu commented Mar 10, 2026

/hold

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Mar 10, 2026

Labels

  • approved
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • lgtm
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SchemaStore bootstrap may exit on ErrSnapshotLostByGC

3 participants