fix: auto-purge corrupted buildkit cache on integrity check failure #101
Open
devin-ai-integration[bot] wants to merge 1 commit into
When the BoltDB integrity check fails at startup, wipe /var/lib/buildkit so buildkitd starts fresh. Previously, the code only logged an error and proceeded with the corrupted disk, creating a corruption persistence loop where the same corrupted snapshot was cloned on every subsequent build. The build will have a cache miss that one time, but the fresh state will be committed at the end, breaking the loop automatically.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Summary
When the BoltDB integrity check fails at startup, this PR automatically wipes `/var/lib/buildkit` so buildkitd starts fresh, breaking the corruption persistence loop.

The problem: Previously, when `checkBoltDbIntegrity` failed at startup (line 490), the code only logged `core.error("BoltDB integrity check failed")` and proceeded to use the corrupted disk. At commit time, a failed integrity check correctly skips the commit, but the previously committed corrupted snapshot remains the latest. The next job clones that same corrupted snapshot, creating an infinite loop that can only be broken by manual disk deletion.

The fix: On integrity check failure at startup:
- Wipe the contents of `/var/lib/buildkit` (using `find -mindepth 1 -delete` to preserve the mount point)
- Report the event via `reportBuildPushActionFailure`

The build will have a cache miss that one time, but the fresh databases will pass the post-action integrity check and be committed, restoring the cache for subsequent builds.
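The startup flow described above can be sketched in TypeScript as follows. This is a minimal, synchronous illustration and not the code in this PR: `purgeDirContents`, `purgeOnIntegrityFailure`, `StartupHooks`, and `demoPurge` are hypothetical names, the hooks only stand in for `checkBoltDbIntegrity` and `reportBuildPushActionFailure`, and the purge uses Node's `fs` module to get the same effect as `find -mindepth 1 -delete` rather than shelling out.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Delete everything inside `dir` but keep `dir` itself, which is the same
// effect as `find <dir> -mindepth 1 -delete` (the mount point survives).
export function purgeDirContents(dir: string): void {
  for (const name of fs.readdirSync(dir)) {
    fs.rmSync(path.join(dir, name), { recursive: true, force: true });
  }
}

// Hypothetical stand-ins for the action's real helpers (assumed shapes).
export interface StartupHooks {
  checkIntegrity: (dir: string) => boolean; // plays the role of checkBoltDbIntegrity
  reportFailure: (event: string) => void; // plays the role of reportBuildPushActionFailure
}

// Returns true when the cache was purged, false when it was left untouched.
export function purgeOnIntegrityFailure(dir: string, hooks: StartupHooks): boolean {
  if (hooks.checkIntegrity(dir)) {
    return false; // healthy cache: nothing to do
  }
  purgeDirContents(dir);
  hooks.reportFailure("boltdb integrity auto-purge");
  return true;
}

// Usage example against a scratch directory instead of /var/lib/buildkit.
// The file and directory names below are illustrative only.
export function demoPurge(): { purged: boolean; remaining: string[]; events: string[] } {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "buildkit-purge-"));
  fs.mkdirSync(path.join(dir, "snapshots", "nested"), { recursive: true });
  fs.writeFileSync(path.join(dir, "cache.db"), "not a real BoltDB file");

  const events: string[] = [];
  const purged = purgeOnIntegrityFailure(dir, {
    checkIntegrity: () => false, // simulate a failed integrity check
    reportFailure: (event) => events.push(event),
  });

  // readdirSync succeeding here shows the directory itself still exists.
  return { purged, remaining: fs.readdirSync(dir), events };
}
```

Passing the check and the reporter in as hooks keeps the purge decision testable without a running buildkitd or a real sticky disk.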
Customer context: tandemtechnology reported that releases were taking much longer since ~4PM ET on May 27. Root cause was a corrupted sticky disk. They had to manually delete the disk after we identified the issue.
Review & Testing Checklist for Human
- Verify that the `find -mindepth 1 -delete` command correctly removes all buildkit contents while preserving the mount point itself (the mount point `/var/lib/buildkit` must remain since it's a sticky disk mount)
- Verify that buildkitd starts correctly with an empty `/var/lib/buildkit` directory (it should create fresh database files)
- Consider whether alerting on `reportBuildPushActionFailure` with event `boltdb integrity auto-purge` would be valuable for proactive monitoring
- [...] `checkBoltDbIntegrity` treats those as "skip" (returns true), so this should be safe.

Recommended test plan: Deliberately corrupt a boltdb file in a test sticky disk, then run a build using `setup-docker-builder` and verify: [...]

Notes
- The `reportBuildPushActionFailure` call sends the event to the Blacksmith backend API at `/stickydisks/report-failed`, providing visibility into auto-purge events.

Link to Devin session: https://app.devin.ai/sessions/4fe8e10b4d484f9787153e9ef5c88053
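For the recommended test plan, one possible way to corrupt a database file on a disposable test disk is to zero out its first bytes while leaving the file size unchanged. The sketch below is an assumption-laden helper, not part of this PR: `zeroFilePrefix` and `demoCorrupt` are hypothetical names, the byte count is arbitrary, and whether `checkBoltDbIntegrity` flags this particular kind of damage should be confirmed against the real check.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Overwrite the first `byteCount` bytes of `file` with zeros without changing
// the file size. Returns the number of bytes actually overwritten.
// Assumption: damaging the start of the file is enough to fail the check.
export function zeroFilePrefix(file: string, byteCount: number): number {
  const size = fs.statSync(file).size;
  const length = Math.min(byteCount, size);
  const fd = fs.openSync(file, "r+"); // read/write, do not truncate
  try {
    fs.writeSync(fd, Buffer.alloc(length), 0, length, 0);
  } finally {
    fs.closeSync(fd);
  }
  return length;
}

// Usage example on a scratch file; never point this at a production disk.
export function demoCorrupt(): {
  sizeBefore: number;
  sizeAfter: number;
  prefixZeroed: boolean;
  tailIntact: boolean;
} {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "boltdb-corrupt-"));
  const file = path.join(dir, "test.db"); // illustrative file name
  fs.writeFileSync(file, Buffer.alloc(64, 0xff));

  const sizeBefore = fs.statSync(file).size;
  zeroFilePrefix(file, 16);
  const bytes = fs.readFileSync(file);

  return {
    sizeBefore,
    sizeAfter: bytes.length,
    prefixZeroed: bytes.subarray(0, 16).every((b) => b === 0),
    tailIntact: bytes.subarray(16).every((b) => b === 0xff),
  };
}
```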