Skip to content

<fix>[kvm]: extend reconnect echo timeout after libvirtd restart#3990

Open
zstack-robot-2 wants to merge 1 commit into
5.5.22from
sync/haoyu.ding/fix-84691
Open

<fix>[kvm]: extend reconnect echo timeout after libvirtd restart#3990
zstack-robot-2 wants to merge 1 commit into
5.5.22from
sync/haoyu.ding/fix-84691

Conversation

@zstack-robot-2
Copy link
Copy Markdown
Collaborator

Resolves: ZSTAC-84691

Change-Id: I6668736975707575756d7665647867686e6b6b76

sync from gitlab !9883

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 15, 2026

Review Change Stack

Walkthrough

在 KVMHost 连接/部署流程中加入判断部署是否触发 libvirtd 重启的标记,并在 echo-host 步骤使用基于主机 VM 数量计算的动态 echo 超时(或默认超时),同时新增工具常量与单元测试验证计算逻辑。

变更

Libvirtd 重启 Echo 超时自适应

Layer / File(s) Summary
超时计算工具与配置常量
plugin/kvm/src/main/java/org/zstack/kvm/KVMHostUtils.java
新增 TimeUnit 导入和三个公开常量(VM 阈值、每 VM 秒数、最大额外秒数),新增 shouldRestartLibvirtdDuringDeploy()calculateLibvirtRestartEchoTimeoutMillis()
连接流程中的超时计算集成
plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
新增私有方法 getKvmAgentEchoTimeoutAfterDeploy(),在 connectHook 中增加 restartLibvirtdDuringDeploy 字段并在部署参数准备后赋值,在 echo-host 步骤计算并使用 echoTimeout 替换原有固定超时(包括重启 kvmagent 的分支)。
单元测试:工具方法行为验证
test/src/test/java/org/zstack/test/kvm/KVMHostUtilsTest.java
新增多组测试覆盖 shouldRestartLibvirtdDuringDeploycalculateLibvirtRestartEchoTimeoutMillis 的阈值、追加和封顶行为,临时修改 CoreGlobalProperty.REST_FACADE_ECHO_TIMEOUT 以断言不同配置下结果。

Sequence Diagram(s)

sequenceDiagram
  participant KVMHost
  participant AnsibleDeploy
  participant EchoStep
  KVMHost->>KVMHost: 初始化 restartLibvirtdDuringDeploy=false
  KVMHost->>AnsibleDeploy: 组装并执行部署(deployArguments)
  AnsibleDeploy-->>KVMHost: 返回是否触发 restartLibvirtdDuringDeploy
  KVMHost->>EchoStep: 调用 getKvmAgentEchoTimeoutAfterDeploy(deployed, restartLibvirtdDuringDeploy)
  EchoStep-->>KVMHost: 返回 echoTimeout(默认或延长)
  KVMHost->>EchoStep: 调用 restf.echo(..., timeout=echoTimeout)
Loading

预计审查工作量

🎯 3 (Moderate) | ⏱️ ~20 分钟

小兔跳出新妙法,
libvirtd 要重启,
虚机数量来决策,
超时动态伸缩长,
echo 再也不心急! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 6.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed 标题遵循 [scope]: 格式,长度 64 字符,未超过 72 字符限制,清晰描述了主要变更内容:KVM 主机重连时 libvirtd 重启后的 echo 超时延长。
Description check ✅ Passed 描述包含相关元数据(Jira 编号 ZSTAC-84691、Change-Id、来源 MR 编号),与代码变更相关,涵盖了问题跟踪和同步来源信息。
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/haoyu.ding/fix-84691

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ast-grep (0.42.2)
plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java

Comment @coderabbitai help to get the list of available commands and usage tips.

@MatheMatrix
Copy link
Copy Markdown
Owner

Comment from yaohua.wu:

Review: MR !9883 — ZSTAC-84691

变更概述:在 KVM 物理机重连的 echo-host 阶段引入动态 echo 超时。仅当本轮 ansible deploy 实际执行(deployed)且 deploy 参数会触发 libvirtd 重启(restartLibvirtdDuringDeploy)时,按物理机承载 VM 数动态延长超时:VM ≤ 100 用默认 RESTFacade.echoTimeout,每超 1 台 +1s,上限 180s。

结论:APPROVED

修复方向正确、范围收敛、与 Jira 设计方案完全一致,未发现阻塞合并的问题。以下为改进建议。

🟡 Warning

1. 新增的两个纯函数缺少单元测试KVMHostUtils.java:154,158

shouldRestartLibvirtdDuringDeploycalculateLibvirtRestartEchoTimeoutMillis 是纯静态函数,边界逻辑较多但无任何测试覆盖。项目已存在 test/src/test/java/org/zstack/test/kvm/KVMHostUtilsTest.java,建议补充以下边界用例:

  • vmCount = 100 / 101(阈值边界,应分别返回默认值 / 默认值+1s)
  • vmCount 极大时结果封顶 180s
  • REST_FACADE_ECHO_TIMEOUT ≥ 180maxExtraSeconds = 0,动态延长退化为 no-op(应返回配置值本身,不应为负)
  • init / restartLibvirtdnull、大小写混合、"false" 的组合

本 issue 标记为 5.5.22「必须解」,建议补齐测试再合入。

🟢 Suggestion

2. 超时调参常量硬编码KVMHostUtils.java:37-39

LIBVIRT_RESTART_ECHO_TIMEOUT_VM_THRESHOLD(100)、_PER_VM_SECONDS(1)、_MAX_SECONDS(180) 均为编译期常量。默认 RESTFacade.echoTimeout=60s 下,VM 数超过 220 即触顶 180s。对承载数百台 VM 的大物理机,libvirtd 重启耗时可能仍超过 180s,超时问题未必完全消除。建议将上限/阈值暴露为 GlobalProperty,便于现场不打补丁调优。

3. 确认扩展点驱动的 libvirtd 重启路径KVMHost.java:5954 附近

restartLibvirtdDuringDeploy 仅依据 deployArgumentsinit / restartLibvirtd 判断。Jira 评论提到 Hygon mdev、SPICE TLS、libvirt 升级兼容等场景也会重启 libvirtd,这些场景由 KvmHostAgentDeploymentExtensionPoint#modifyDeploymentArguments 处理。restartLibvirtdDuringDeploy 的赋值位于扩展点循环之后——若这些扩展确实通过 setRestartLibvirtd("true") 表达重启意图则已覆盖;若通过其他 ansible 变量触发,则动态超时不会生效(最坏退化为现状)。建议确认这些扩展路径均落到 restartLibvirtd 参数上。

已核对项(无问题)

  • 常规路径行为一致!deployed || !restartLibvirtdDuringDeploy 时返回 toMillis(REST_FACADE_ECHO_TIMEOUT),与旧 restf.echo(path, completion) 内部默认实现 echo(url, cb, toMillis(1), toMillis(REST_FACADE_ECHO_TIMEOUT)) 完全一致,绝大多数重连场景无回归。
  • Flow 时序正确restartLibvirtdDuringDeployapply-ansible-playbookrun() 内、runner.run() 之前赋值;deployed 在 runner 回调内赋值;echo-host 为后续 flow,取值时两者均已确定。deployed=false(runner 判定无需 deploy)时正确退回默认超时。
  • 公式与 Jira 设计一致Math.max(0, MAX - default) 正确处理了用户已配置更大全局超时的情况(不覆盖,符合「不覆盖用户显式配置」的设计意图)。
  • 上游新鲜度5.5.22 自分支点以来未改动 plugin/kvmmerge_status: can_be_merged,无 rebase 冲突风险。
  • KVMHost.java:5320@Transactional(readOnly=true) 与同文件 noStorageAccessible() 等既有 private 方法一致(ZStack AspectJ 织入,对 private/自调用有效),无问题。

🤖 Robot Reviewer

Resolves: ZSTAC-84691

Change-Id: I6668736975707575756d7665647867686e6b6b76
@MatheMatrix MatheMatrix force-pushed the sync/haoyu.ding/fix-84691 branch from 24ef2e4 to 719a2aa Compare May 15, 2026 09:32
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
plugin/kvm/src/main/java/org/zstack/kvm/KVMHostUtils.java (1)

37-39: 🏗️ Heavy lift

建议将超时调参常量改为可配置项

这三个值目前是编译期常量,现场调优需要改代码发版。建议改成 GlobalProperty(保留当前默认值),便于不同规模主机按需调整而不改变默认行为。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHostUtils.java` around lines 37 -
39, Replace the three compile-time constants in KVMHostUtils
(LIBVIRT_RESTART_ECHO_TIMEOUT_VM_THRESHOLD,
LIBVIRT_RESTART_ECHO_TIMEOUT_PER_VM_SECONDS,
LIBVIRT_RESTART_ECHO_TIMEOUT_MAX_SECONDS) with GlobalProperty-backed
configuration entries (keeping the current numeric values as defaults) and
update all usages in KVMHostUtils to read the values via the GlobalProperty
accessor rather than the static constants; name the properties clearly (e.g.,
libvirt.restart.echo.timeout.vmThreshold, .perVmSeconds, .maxSeconds), provide
the existing defaults, add any necessary `@GlobalProperty` or registration so they
are injectable/readable at runtime, and ensure unit tests or callers that
referenced the static fields now obtain the values from the new GlobalProperty
API.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHostUtils.java`:
- Around line 154-156: The boolean-string check in
shouldRestartLibvirtdDuringDeploy can mis-evaluate values with surrounding
whitespace; update the method to trim both parameters (init and restartLibvirtd)
before comparing (e.g., call trim() safely after null-checking each param) and
then use equalsIgnoreCase on the trimmed strings so values like " true " or
"\ntrue\t" are treated as true.

---

Nitpick comments:
In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHostUtils.java`:
- Around line 37-39: Replace the three compile-time constants in KVMHostUtils
(LIBVIRT_RESTART_ECHO_TIMEOUT_VM_THRESHOLD,
LIBVIRT_RESTART_ECHO_TIMEOUT_PER_VM_SECONDS,
LIBVIRT_RESTART_ECHO_TIMEOUT_MAX_SECONDS) with GlobalProperty-backed
configuration entries (keeping the current numeric values as defaults) and
update all usages in KVMHostUtils to read the values via the GlobalProperty
accessor rather than the static constants; name the properties clearly (e.g.,
libvirt.restart.echo.timeout.vmThreshold, .perVmSeconds, .maxSeconds), provide
the existing defaults, add any necessary `@GlobalProperty` or registration so they
are injectable/readable at runtime, and ensure unit tests or callers that
referenced the static fields now obtain the values from the new GlobalProperty
API.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 9a9b6257-b724-40cc-ba7a-66ed88f82483

📥 Commits

Reviewing files that changed from the base of the PR and between 24ef2e4 and 719a2aa.

📒 Files selected for processing (3)
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHostUtils.java
  • test/src/test/java/org/zstack/test/kvm/KVMHostUtilsTest.java

Comment on lines +154 to +156
public static boolean shouldRestartLibvirtdDuringDeploy(String init, String restartLibvirtd) {
return "true".equalsIgnoreCase(init) || "true".equalsIgnoreCase(restartLibvirtd);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

布尔字符串判断建议先 trim,避免误判

Line 155 当前仅做 equalsIgnoreCase,若参数值包含前后空格(如 " true ")会返回 false,可能导致本应延长的 echo 超时未生效。

建议修改
 public static boolean shouldRestartLibvirtdDuringDeploy(String init, String restartLibvirtd) {
-    return "true".equalsIgnoreCase(init) || "true".equalsIgnoreCase(restartLibvirtd);
+    String normalizedInit = init == null ? null : init.trim();
+    String normalizedRestartLibvirtd = restartLibvirtd == null ? null : restartLibvirtd.trim();
+    return "true".equalsIgnoreCase(normalizedInit) || "true".equalsIgnoreCase(normalizedRestartLibvirtd);
 }

As per coding guidelines: 注意检查来自 Message 的参数是否做过 trim,用户可能在浏览器上复制粘贴的数据带有空格、换行符等。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugin/kvm/src/main/java/org/zstack/kvm/KVMHostUtils.java` around lines 154 -
156, The boolean-string check in shouldRestartLibvirtdDuringDeploy can
mis-evaluate values with surrounding whitespace; update the method to trim both
parameters (init and restartLibvirtd) before comparing (e.g., call trim() safely
after null-checking each param) and then use equalsIgnoreCase on the trimmed
strings so values like " true " or "\ntrue\t" are treated as true.


long vmCount = Q.New(VmInstanceVO.class)
.eq(VmInstanceVO_.hostUuid, self.getUuid())
.count();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment from yaohua.wu:

!=Stopped?

Assert.assertFalse(KVMHostUtils.shouldForceTlsRedeploy(true, false, false));
}

@Test
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment from yaohua.wu:

应该要有个集成测试,这个UT在PR系统里不跑的

@MatheMatrix
Copy link
Copy Markdown
Owner

Comment on plugin/kvm/src/main/java/org/zstack/kvm/KVMHost.java:

Comment from yaohua.wu:

在这里加是否发生了deploy以及libvirt重启, libvirtRestarted = deployArguments.getRestartLibvirtd()


@Override
public void run(final FlowTrigger trigger, Map data) {
final long echoTimeout = getKvmAgentEchoTimeoutAfterDeploy(deployed, restartLibvirtdDuringDeploy);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment from yaohua.wu:

final long echoTimeout = libvirtRestarted ? getLibvirtRestartedEchoTimeout() : TimeUnit.SECONDS.toMillis(CoreGlobalProperty.REST_FACADE_ECHO_TIMEOUT)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants