<fix>[vm]: Fix lock rollback on migration failure #3977
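Walkthrough
The migration-failure handling path now checks back the VM state on the destination host. Depending on the result, it either completes the migration on the destination (updating the DB and firing the extension points) or rolls back on the source. New integration tests cover both paths: the destination host returns Running, and the destination host returns a non-Running state.
Changes
VM migration failure handling flow refactored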
Sequence Diagram(s)
sequenceDiagram
    participant Client
    participant VmInstanceBase
    participant DestHost
    participant Database
    participant ExtEmitter
    Client->>VmInstanceBase: Issue MigrateVmAction (migrate)
    VmInstanceBase->>DestHost: Send migrate request (failure callback)
    VmInstanceBase->>DestHost: CheckVmStateOnHypervisorMsg(getVmStateOnHost)
    DestHost-->>VmInstanceBase: Reply with VM state ("Running" / "Stopped" / "Paused")
    alt State is Running
        VmInstanceBase->>DestHost: checkState
        VmInstanceBase->>Database: Update zone/cluster/lastHostUuid/hostUuid and refresh VM
        VmInstanceBase->>ExtEmitter: postMigrateVm
        ExtEmitter-->>VmInstanceBase: Return
        VmInstanceBase->>ExtEmitter: afterMigrateVm
        ExtEmitter-->>VmInstanceBase: Return
        VmInstanceBase-->>Client: completion.success()
    else State is not Running, or the query failed
        VmInstanceBase->>ExtEmitter: failedToMigrateVm
        ExtEmitter-->>VmInstanceBase: Return
        VmInstanceBase->>DestHost: (on specific error codes) checkState again on the original host
        VmInstanceBase-->>Client: completion.fail(err)
    end
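The branch in the diagram can be sketched in a few lines of Java. This is a minimal illustration only, not the code in `VmInstanceBase`: the class `MigrateFailureDecision`, its `Outcome` enum, and the use of `Optional` to model a failed state query are all hypothetical names chosen for the example.

```java
import java.util.Optional;

/** Hypothetical stand-in for the decision taken after the migration API fails. */
final class MigrateFailureDecision {
    enum Outcome { COMPLETE_ON_DESTINATION, ROLLBACK_ON_SOURCE }

    /**
     * @param destState state reported by the destination host, or empty when
     *                  the state query itself failed (assumption of this sketch)
     */
    static Outcome decide(Optional<String> destState) {
        // Only a confirmed "Running" reply switches to the success path;
        // any other state, or a failed query, keeps the rollback behavior.
        return destState.filter("Running"::equals).isPresent()
                ? Outcome.COMPLETE_ON_DESTINATION
                : Outcome.ROLLBACK_ON_SOURCE;
    }
}
```

Calling `decide(Optional.of("Paused"))` or `decide(Optional.empty())` both land on the rollback branch, which mirrors the `else` arm of the diagram.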
Estimated code review effort 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java`:
- Around line 7248-7257: The recovery-success branch skips the
VmMigratePostCallExtensionFlow.postMigrateVm() extension and only calls
completeMigrateVmOnDestination()/extEmitter.afterMigrateVm(), causing divergence
from the normal success path; update the recovery-success path to invoke the
same full post-migrate sequence as the normal success flow (i.e. run
VmMigratePostCallExtensionFlow.postMigrateVm() then
extEmitter.afterMigrateVm()), or refactor the success cleanup into a shared
helper and call that from both places (ensure postMigrateVm is executed before
afterMigrateVm); modify the code paths around completeMigrateVmOnDestination,
postMigrateVm, and extEmitter.afterMigrateVm to reuse the unified cleanup flow.
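One way to read the "shared helper" suggestion is sketched below. It is an assumption-laden model, not the ZStack code: `SharedMigrateCleanup`, its `Extensions` interface, and the `trace` helper are hypothetical, and the real extension points take arguments and run asynchronously. The only point shown is that both success paths go through one method that fixes the order `postMigrateVm` then `afterMigrateVm`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a shared success-cleanup helper; not the ZStack code. */
final class SharedMigrateCleanup {
    /** Simplified stand-in for the extension points named in the review. */
    interface Extensions {
        void postMigrateVm();
        void afterMigrateVm();
    }

    private final Extensions ext;

    SharedMigrateCleanup(Extensions ext) {
        this.ext = ext;
    }

    /** Single place that fixes the order: postMigrateVm, then afterMigrateVm. */
    private void runSuccessCleanup() {
        ext.postMigrateVm();
        ext.afterMigrateVm();
    }

    /** Normal success path. */
    void onNormalSuccess() {
        runSuccessCleanup();
    }

    /** Recovery-success path: reuses the same helper, so the two cannot diverge. */
    void onRecoveredSuccess() {
        runSuccessCleanup();
    }

    /** Records the call order of one path so the two paths can be compared. */
    static List<String> trace(boolean recovered) {
        List<String> calls = new ArrayList<>();
        SharedMigrateCleanup cleanup = new SharedMigrateCleanup(new Extensions() {
            @Override public void postMigrateVm() { calls.add("postMigrateVm"); }
            @Override public void afterMigrateVm() { calls.add("afterMigrateVm"); }
        });
        if (recovered) {
            cleanup.onRecoveredSuccess();
        } else {
            cleanup.onNormalSuccess();
        }
        return calls;
    }
}
```

Because both entry points delegate to `runSuccessCleanup()`, adding a step to the success flow later changes both paths at once.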
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
Run ID: ccf6c20e-60bc-4277-945b-ad11b7e778a1
📒 Files selected for processing (2)
compute/src/main/java/org/zstack/compute/vm/VmGlobalConfig.java
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
Comment on Comment from shan.wu:
Addressed: recovery-success cleanup now calls |
8d3ba77 to d3f3fe1
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java`:
- Around line 7209-7238: handleFailedMigrateVm only checks the destination host
once (via getVmStateOnHost) and immediately calls completeMigrateVmOnDestination
if the VM is running on destination, which can prematurely treat migration as
complete while the source is still alive; update handleFailedMigrateVm to (a)
when destination reports the VM running, also verify the source host state (use
getVmStateOnHost for lastHostUuid and isVmRunningOnHost) and only call
completeMigrateVmOnDestination if the source is confirmed not running there, or
(b) implement a retry loop with interval and timeout that rechecks destination
and source states before deciding, and fall back to rollbackFailedMigrateVm if
checks/timeouts fail; touch the methods handleFailedMigrateVm, the
ReturnValueCompletion callbacks, and reuse
rollbackFailedMigrateVm/completeMigrateVmOnDestination to enforce the correct
gating.
- Around line 205-206: The helper isVmRunningOnHost(String state) currently
treats only VmInstanceState.Running as a successful post-migration state; update
it to consider other valid landed states (at minimum VmInstanceState.Paused in
addition to Running) so that a VM which was Paused before migration and remains
Paused on the target is treated as a successful recovery rather than triggering
rollback; modify the method (and the analogous checks referenced around lines
7218-7223) to return true for VmInstanceState.Running.toString().equals(state)
|| VmInstanceState.Paused.toString().equals(state) (or an equivalent check
against an allowed-success set) and keep the method name isVmRunningOnHost
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)
Review profile: CHILL
Plan: Pro
Run ID: 3d10bac5-21f9-4fc4-92d1-88cb057aaf17
📒 Files selected for processing (1)
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
private boolean isVmRunningOnHost(String state) {
    return VmInstanceState.Running.toString().equals(state);
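The recovery-success check should not accept only Running.
The normal success path already preserves the semantics of a VM that was Paused before migration, but here "migration completed" is hard-coded to mean that the destination must report Running. As a result, when a paused VM has landed successfully on the destination and the host returns Paused, the recovery branch still takes the rollback path by mistake, which is inconsistent with the behavior of the normal migration success path.
💡 Suggested change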
- private boolean isVmRunningOnHost(String state) {
-     return VmInstanceState.Running.toString().equals(state);
+ private boolean isVmCompletedOnHost(String state, VmInstanceState originState) {
+     if (VmInstanceState.Running.toString().equals(state)) {
+         return true;
+     }
+
+     return originState == VmInstanceState.Paused
+             && VmInstanceState.Paused.toString().equals(state);
  }

- if (!isVmRunningOnHost(state)) {
+ if (!isVmCompletedOnHost(state, originState)) {
      rollbackFailedMigrateVm(originState, destHostUuid, errCode, completion);
      return;
  }

Also applies to: 7218-7223
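The suggested check can be tried in isolation with the sketch below. It follows the variant that ties Paused to the pre-migration state rather than accepting Paused unconditionally. `MigrateRecoveryCheck` and its local `VmState` enum are hypothetical stand-ins so the example compiles without ZStack on the classpath; in the PR the enum is `VmInstanceState`.

```java
/** Self-contained sketch of the suggested check; VmState is a local stand-in enum. */
final class MigrateRecoveryCheck {
    enum VmState { Running, Paused, Stopped }

    /**
     * @param state       state string reported by the destination host (may be null)
     * @param originState state of the VM before the migration started
     * @return true when the VM can be treated as landed on the destination
     */
    static boolean isVmCompletedOnHost(String state, VmState originState) {
        if (VmState.Running.toString().equals(state)) {
            return true;
        }

        // Paused counts as success only if the VM was already paused before migrating.
        return originState == VmState.Paused
                && VmState.Paused.toString().equals(state);
    }
}
```

With this shape, a VM that was Running before migration and shows up as Paused on the destination is still rolled back, while a VM that was Paused all along is treated as migrated.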
private void handleFailedMigrateVm(final VmInstanceSpec spec, final VmInstanceState originState,
                                   final String lastHostUuid, final ErrorCode errCode,
                                   final Completion completion) {
    String destHostUuid = spec.getDestHost().getUuid().equals(lastHostUuid) ? null : spec.getDestHost().getUuid();
    if (destHostUuid == null) {
        rollbackFailedMigrateVm(originState, null, errCode, completion);
        return;
    }

    getVmStateOnHost(destHostUuid, new ReturnValueCompletion<String>(completion) {
        @Override
        public void success(String state) {
            if (!isVmRunningOnHost(state)) {
                rollbackFailedMigrateVm(originState, destHostUuid, errCode, completion);
                return;
            }

            logger.warn(String.format("migrating vm[uuid:%s] failed with error[%s], but the vm is running on destination host[uuid:%s]; complete migration cleanup on destination host",
                    self.getUuid(), errCode.getDetails(), destHostUuid));
            completeMigrateVmOnDestination(spec, lastHostUuid, completion);
        }

        @Override
        public void fail(ErrorCode errorCode) {
            logger.warn(String.format("unable to check vm[uuid:%s] state on destination host[uuid:%s] after migration failure, %s",
                    self.getUuid(), destHostUuid, errorCode));
            rollbackFailedMigrateVm(originState, destHostUuid, errCode, completion);
        }
    });
}
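The destination host is checked only once here, so success can still be declared too early during the convergence window in which the VM is alive on both sides.
After the API fails, the current branch queries the destination host only once; as soon as the destination reports the VM alive, it calls completeMigrateVmOnDestination() directly. But the root cause behind this PR is precisely that, during the libvirt convergence window, the VM may be alive on the source and the destination at the same time. The code here neither goes on to verify the source host nor retries with an interval and timeout as the target design calls for, so it may still switch to the success cleanup path before the source side has exited.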
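Option (b) from this comment, rechecking both hosts with an interval and a bounded number of attempts, could look roughly like the sketch below. It is a simplified blocking model under stated assumptions: the real code is callback-based (`ReturnValueCompletion`), while `MigrateConvergencePoller`, `Decision`, and the `Supplier`-based state queries are hypothetical. A `null` state models a failed query and is never treated as confirmation.

```java
import java.util.function.Supplier;

/** Hypothetical polling sketch for the retry option in the review; not the ZStack implementation. */
final class MigrateConvergencePoller {
    enum Decision { COMPLETE_ON_DESTINATION, ROLLBACK }

    /**
     * Rechecks both hosts until the VM is confirmed on exactly one side,
     * or until maxAttempts is used up. Falls back to ROLLBACK on timeout.
     */
    static Decision poll(Supplier<String> destState, Supplier<String> srcState,
                         int maxAttempts, long intervalMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String dest = destState.get();
            String src = srcState.get();
            boolean destRunning = "Running".equals(dest);
            boolean srcRunning = "Running".equals(src);

            // Destination is up and the source is confirmed gone: migration landed.
            if (destRunning && src != null && !srcRunning) {
                return Decision.COMPLETE_ON_DESTINATION;
            }
            // Source is up and the destination is confirmed gone: nothing to wait for.
            if (srcRunning && dest != null && !destRunning) {
                return Decision.ROLLBACK;
            }

            // Both alive, or a query failed: still converging, wait and recheck.
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return Decision.ROLLBACK;
                }
            }
        }
        // Timed out without a clear answer: keep the original rollback behavior.
        return Decision.ROLLBACK;
    }
}
```

Falling back to rollback on timeout matches the review's wording; whether rollback is actually the safer default while both sides still report the VM alive is a design question the sketch does not settle.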
d3f3fe1 to aa8f1e4
Check the VM state on the destination host after the migration API fails.
If the destination host reports Running, treat the migration as completed.
Run the success completion path in that case.
Otherwise, keep the original rollback behavior.

Resolves: ZSTAC-83894
Change-Id: I8b4774a405fc3b1c05d21b6742facd26bc8d03e6
aa8f1e4 to 8a602fb
Root Cause
Migration API failure was always handled as a normal migration failure rollback. In practice, libvirt migration may have already completed or may still be converging while the API call reports failure. In that window, both source and destination hosts can report the VM as alive, and rolling back immediately can leave LV lock ownership inconsistent with the actual VM runtime side.
Solution
Test
`git diff --check` passed.
`mvn -pl compute -am -DskipTests compile` was attempted but stopped at the existing `network` module class-resolution issue before reaching compute.
sync from gitlab !9869