Merged
Conversation
StopService was writing SINGLE_NODE to the state file before stopping the service. This is semantically incorrect - stopping a service should not modify the recorded cluster state. Added regression test to verify state file is not modified. Fixed typos in test descriptions while editing the file. [TNZ-67462](https://vmw-jira.broadcom.net/browse/TNZ-67462)
- Add bosh.Recreate(deployment, instance) helper - bosh.Instances now exposes ProcessState [TNZ-67462](https://vmw-jira.broadcom.net/browse/TNZ-67462)
- Minor refactoring to cleanup test setup Consolidates BeforeAll and AfterAll nodes, leveraging DeferCleanup to setup cleanup operations next to allocation operations. [TNZ-67462](https://vmw-jira.broadcom.net/browse/TNZ-67462)
Adds a more complex bootstrap scenario by forcing a cluster to lose quorum and subsequently triggering a node recreate. This emulates real production scenarios where a cluster fails and an unresponsive node is recreated. The expectation is that after the work in the current story, "bootstrap" can be trivially run to restore a working cluster. As of this commit, this is a failing test because mysql fails in pre-start and the bosh instance is left in a state that bootstrap cannot trivially manage the mysql instances. [TNZ-67462](https://vmw-jira.broadcom.net/browse/TNZ-67462)
Problem: When BOSH recreates a VM during cluster quorum loss, pre-start would fail before monit services were configured. This prevented the bootstrap errand from running (it requires galera-agent to be available), creating a deadlock where operators couldn't restore quorum. Solution: Move health validation to post-start. Now pre-start completes successfully, monit services are configured, and the bootstrap errand can run even if post-start validation fails due to lack of quorum. The post-start validation logic is identical to the previous pre-start logic: poll port 8114 until galera-init reports healthy, or fail if the BPM process dies.
Member
|
Approving:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When BOSH recreates a VM during cluster quorum loss, pre-start would fail before monit services were configured. This prevented the bootstrap errand from running (as bootstrap requires galera-agent, monit services available), creating a deadlock where operators couldn't restore quorum without manual recovery steps.
Solution
Move health validation from pre-start to post-start. This allows monit services to be configured even if MySQL can't join the cluster, enabling the bootstrap errand to run and restore quorum.
Testing