Skip to content

<fix>[vm]: 修复迁移失败锁回滚#3977

Open
zstack-robot-2 wants to merge 1 commit into
5.5.22from
sync/shan.wu/fix/5.5.22/ZSTAC-83894@@2
Open

<fix>[vm]: 修复迁移失败锁回滚#3977
zstack-robot-2 wants to merge 1 commit into
5.5.22from
sync/shan.wu/fix/5.5.22/ZSTAC-83894@@2

Conversation

@zstack-robot-2
Copy link
Copy Markdown
Collaborator

Root Cause

Migration API failure was always handled as a normal migration failure rollback. In practice, libvirt migration may have already completed or may still be converging while the API call reports failure. In that window, both source and destination hosts can report the VM as alive, and rolling back immediately can leave LV lock ownership inconsistent with the actual VM runtime side.

Solution

  • After migration API failure, re-check VM state on destination and source hosts.
  • If destination is alive and source is not alive, treat the migration as actually completed and run the migration-success cleanup path.
  • If destination is not alive, keep the original migration-failure rollback behavior.
  • If both source and destination are alive, re-check every 10 seconds for up to 300 seconds.
  • If both remain alive after the recheck window, roll back and include an explicit error detail that the API failed while the VM may still be migrating.
  • Add global configs for the recheck interval and timeout.

Test

sync from gitlab !9869

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 14, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: e5c09e5f-cace-4e0b-b68d-05f0604fa473

📥 Commits

Reviewing files that changed from the base of the PR and between aa8f1e4 and 8a602fb.

📒 Files selected for processing (2)
  • compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
  • test/src/test/groovy/org/zstack/test/integration/kvm/vm/migrate/MigrateVmFailureCheckTargetHostCase.groovy
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/src/test/groovy/org/zstack/test/integration/kvm/vm/migrate/MigrateVmFailureCheckTargetHostCase.groovy
  • compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java

Walkthrough

在迁移失败处理链路中增加对目的宿主的 VM 状态回查;根据回查结果在目的端完成迁移(更新 DB 并触发扩展点)或在源端回滚;并新增集成测试覆盖目标主机返回 Running 与非 Running 两条路径。

变更说明

VM 迁移失败处理流程重构

层级 / 文件 总结
宿主侧 VM 状态查询基础设施
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
新增 getVmStateOnHost 向目标宿主发送 CheckVmStateOnHypervisorMsg 并提取回包中的 VM 状态字符串;新增 isVmRunningOnHost 判定状态是否为 Running
迁移失败处理入口与协调
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
doMigrateVm 错误处理回调改为调用 handleFailedMigrateVm;该方法在目的宿主与迁移前最后宿主不同时发起状态查询,根据结果分流至完成或回滚流程。
迁移完成于目的宿主
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
新增 completeMigrateVmOnDestination 执行 checkState 确认状态后,更新数据库的 zone、cluster、lastHostUuid 和 hostUuid,并按序触发 extEmitter.postMigrateVmextEmitter.afterMigrateVm 完成目的端清理。
迁移回滚于源宿主
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java
新增 rollbackFailedMigrateVm 先触发 extEmitter.failedToMigrateVm;按错误码决定是否对原宿主执行 checkState;最终按 originState.getDrivenEvent() 恢复状态并 fail。
新增集成测试
test/src/test/groovy/org/zstack/test/integration/kvm/vm/migrate/MigrateVmFailureCheckTargetHostCase.groovy
新增测试类构建完整拓扑,模拟迁移失败并通过消息钩子设置目标主机上 VM 状态为 Running 或非 Running,分别断言迁移成功或回滚行为。

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant VmInstanceBase
  participant DestHost
  participant Database
  participant ExtEmitter

  Client->>VmInstanceBase: 发起 MigrateVmAction(迁移)
  VmInstanceBase->>DestHost: 发送 migrate 请求(失败回调)
  VmInstanceBase->>DestHost: CheckVmStateOnHypervisorMsg(getVmStateOnHost)
  DestHost-->>VmInstanceBase: 回复 VM 状态("Running" / "Stopped" / "Paused")
  alt 状态为 Running
    VmInstanceBase->>DestHost: checkState
    VmInstanceBase->>Database: 更新 zone/cluster/lastHostUuid/hostUuid 并刷新 VM
    VmInstanceBase->>ExtEmitter: postMigrateVm
    ExtEmitter-->>VmInstanceBase: 返回
    VmInstanceBase->>ExtEmitter: afterMigrateVm
    ExtEmitter-->>VmInstanceBase: 返回
    VmInstanceBase-->>Client: completion.success()
  else 状态非 Running 或 查询失败
    VmInstanceBase->>ExtEmitter: failedToMigrateVm
    ExtEmitter-->>VmInstanceBase: 返回
    VmInstanceBase->>DestHost: (若特定错误码)对原宿主再次 checkState
    VmInstanceBase-->>Client: completion.fail(err)
  end
Loading

代码审查工作量

🎯 4 (复杂) | ⏱️ ~45 分钟

诗歌

迁移失败时,小兔来回查 🐰
目标宿主答“Running”,旅程便可达
若它沉睡或暂停,回源安置家
扩展点跳动,状态更新无差
小兔跳跃庆祝,日志与测试齐发 ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed 标题格式符合要求 ([scope]: ),长度为20字符,完全相关于主要变更内容(修复迁移失败的锁回滚机制)。
Description check ✅ Passed 描述与变更集充分相关,详细说明了根本原因、解决方案和测试方法,完全覆盖迁移失败处理链路的改进。
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/shan.wu/fix/5.5.22/ZSTAC-83894@@2

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ast-grep (0.42.2)
compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java`:
- Around line 7248-7257: The recovery-success branch skips the
VmMigratePostCallExtensionFlow.postMigrateVm() extension and only calls
completeMigrateVmOnDestination()/extEmitter.afterMigrateVm(), causing divergence
from the normal success path; update the recovery-success path to invoke the
same full post-migrate sequence as the normal success flow (i.e. run
VmMigratePostCallExtensionFlow.postMigrateVm() then
extEmitter.afterMigrateVm()), or refactor the success cleanup into a shared
helper and call that from both places (ensure postMigrateVm is executed before
afterMigrateVm); modify the code paths around completeMigrateVmOnDestination,
postMigrateVm, and extEmitter.afterMigrateVm to reuse the unified cleanup flow.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: ccf6c20e-60bc-4277-945b-ad11b7e778a1

📥 Commits

Reviewing files that changed from the base of the PR and between 4cdaed7 and e50ff07.

📒 Files selected for processing (2)
  • compute/src/main/java/org/zstack/compute/vm/VmGlobalConfig.java
  • compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java

Comment thread compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java Outdated
@zstack-robot-2
Copy link
Copy Markdown
Collaborator Author

Comment on compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java:

Comment from shan.wu:

恢复成功分支漏掉 postMigrateVm() 扩展点

Addressed: recovery-success cleanup now calls postMigrateVm() before afterMigrateVm(), matching the normal migration success order.

@MatheMatrix MatheMatrix force-pushed the sync/shan.wu/fix/5.5.22/ZSTAC-83894@@2 branch 2 times, most recently from 8d3ba77 to d3f3fe1 Compare May 17, 2026 07:28
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java`:
- Around line 7209-7238: handleFailedMigrateVm only checks the destination host
once (via getVmStateOnHost) and immediately calls completeMigrateVmOnDestination
if the VM is running on destination, which can prematurely treat migration as
complete while the source is still alive; update handleFailedMigrateVm to (a)
when destination reports the VM running, also verify the source host state (use
getVmStateOnHost for lastHostUuid and isVmRunningOnHost) and only call
completeMigrateVmOnDestination if the source is confirmed not running there, or
(b) implement a retry loop with interval and timeout that rechecks destination
and source states before deciding, and fall back to rollbackFailedMigrateVm if
checks/timeouts fail; touch the methods handleFailedMigrateVm, the
ReturnValueCompletion callbacks, and reuse
rollbackFailedMigrateVm/completeMigrateVmOnDestination to enforce the correct
gating.
- Around line 205-206: The helper isVmRunningOnHost(String state) currently
treats only VmInstanceState.Running as a successful post-migration state; update
it to consider other valid landed states (at minimum VmInstanceState.Paused in
addition to Running) so that a VM which was Paused before migration and remains
Paused on the target is treated as a successful recovery rather than triggering
rollback; modify the method (and the analogous checks referenced around lines
7218-7223) to return true for VmInstanceState.Running.toString().equals(state)
|| VmInstanceState.Paused.toString().equals(state) (or an equivalent check
against an allowed-success set) and keep the method name isVmRunningOnHost
unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 3d10bac5-21f9-4fc4-92d1-88cb057aaf17

📥 Commits

Reviewing files that changed from the base of the PR and between 8d3ba77 and d3f3fe1.

📒 Files selected for processing (1)
  • compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java

Comment on lines +205 to +206
private boolean isVmRunningOnHost(String state) {
return VmInstanceState.Running.toString().equals(state);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

恢复成功判定不要只接受 Running

正常成功路径已经保留了迁移前为 Paused 的语义,但这里把“已迁移完成”硬编码成了目的端必须是 Running。这样一来,暂停态 VM 在目的端已经成功落地且宿主返回 Paused 时,恢复分支仍会误走回滚,和正常迁移成功路径的行为不一致。

💡 建议修改
-    private boolean isVmRunningOnHost(String state) {
-        return VmInstanceState.Running.toString().equals(state);
+    private boolean isVmCompletedOnHost(String state, VmInstanceState originState) {
+        if (VmInstanceState.Running.toString().equals(state)) {
+            return true;
+        }
+
+        return originState == VmInstanceState.Paused
+                && VmInstanceState.Paused.toString().equals(state);
     }
-                if (!isVmRunningOnHost(state)) {
+                if (!isVmCompletedOnHost(state, originState)) {
                     rollbackFailedMigrateVm(originState, destHostUuid, errCode, completion);
                     return;
                 }

Also applies to: 7218-7223

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java` around lines
205 - 206, The helper isVmRunningOnHost(String state) currently treats only
VmInstanceState.Running as a successful post-migration state; update it to
consider other valid landed states (at minimum VmInstanceState.Paused in
addition to Running) so that a VM which was Paused before migration and remains
Paused on the target is treated as a successful recovery rather than triggering
rollback; modify the method (and the analogous checks referenced around lines
7218-7223) to return true for VmInstanceState.Running.toString().equals(state)
|| VmInstanceState.Paused.toString().equals(state) (or an equivalent check
against an allowed-success set) and keep the method name isVmRunningOnHost
unchanged.

Comment on lines +7209 to +7238
private void handleFailedMigrateVm(final VmInstanceSpec spec, final VmInstanceState originState,
final String lastHostUuid, final ErrorCode errCode,
final Completion completion) {
String destHostUuid = spec.getDestHost().getUuid().equals(lastHostUuid) ? null : spec.getDestHost().getUuid();
if (destHostUuid == null) {
rollbackFailedMigrateVm(originState, null, errCode, completion);
return;
}

getVmStateOnHost(destHostUuid, new ReturnValueCompletion<String>(completion) {
@Override
public void success(String state) {
if (!isVmRunningOnHost(state)) {
rollbackFailedMigrateVm(originState, destHostUuid, errCode, completion);
return;
}

logger.warn(String.format("migrating vm[uuid:%s] failed with error[%s], but the vm is running on destination host[uuid:%s]; complete migration cleanup on destination host",
self.getUuid(), errCode.getDetails(), destHostUuid));
completeMigrateVmOnDestination(spec, lastHostUuid, completion);
}

@Override
public void fail(ErrorCode errorCode) {
logger.warn(String.format("unable to check vm[uuid:%s] state on destination host[uuid:%s] after migration failure, %s",
self.getUuid(), destHostUuid, errorCode));
rollbackFailedMigrateVm(originState, destHostUuid, errCode, completion);
}
});
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

这里只查目的宿主一次,仍然会在双活收敛窗口里提前判定成功。

当前分支在 API 失败后只查询一次目的宿主;只要目的端报活,就直接执行 completeMigrateVmOnDestination()。但本 PR 的根因正是 libvirt 收敛窗口里源/目的端都可能同时存活。这里没有继续核对源宿主,也没有按目标设计做 interval/timeout 重试,所以仍可能在源端尚未退出时提前切换到成功清理路径。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@compute/src/main/java/org/zstack/compute/vm/VmInstanceBase.java` around lines
7209 - 7238, handleFailedMigrateVm only checks the destination host once (via
getVmStateOnHost) and immediately calls completeMigrateVmOnDestination if the VM
is running on destination, which can prematurely treat migration as complete
while the source is still alive; update handleFailedMigrateVm to (a) when
destination reports the VM running, also verify the source host state (use
getVmStateOnHost for lastHostUuid and isVmRunningOnHost) and only call
completeMigrateVmOnDestination if the source is confirmed not running there, or
(b) implement a retry loop with interval and timeout that rechecks destination
and source states before deciding, and fall back to rollbackFailedMigrateVm if
checks/timeouts fail; touch the methods handleFailedMigrateVm, the
ReturnValueCompletion callbacks, and reuse
rollbackFailedMigrateVm/completeMigrateVmOnDestination to enforce the correct
gating.

@MatheMatrix MatheMatrix force-pushed the sync/shan.wu/fix/5.5.22/ZSTAC-83894@@2 branch from d3f3fe1 to aa8f1e4 Compare May 17, 2026 08:48
Check the VM state on the destination host after the migration API fails.

If the destination host reports Running, treat the migration as completed.
Run the success completion path in that case. Otherwise, keep the original
rollback behavior.

Resolves: ZSTAC-83894

Change-Id: I8b4774a405fc3b1c05d21b6742facd26bc8d03e6
@MatheMatrix MatheMatrix force-pushed the sync/shan.wu/fix/5.5.22/ZSTAC-83894@@2 branch from aa8f1e4 to 8a602fb Compare May 17, 2026 09:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant