[Cherry-Pick][BugFix] prevent requests from entering running state without a slot (#7141) by liyonghua0910 · Pull Request #7160 · PaddlePaddle/FastDeploy

liyonghua0910 · 2026-04-02T09:43:48Z

Motivation

Cherry-pick the scheduler slot-accounting fixes from develop PR #7141 to release/2.4.

This keeps scheduler occupancy consistent on the release branch when requests are running, pending reschedule, or already preempted back into the waiting queue, and preserves the slot guard that prevents over-admission.

Modifications

Cherry-pick ea3854523f9a69b01e4194e51bc1e2d037e71d23: prevent requests from entering running state without a slot.
Cherry-pick 3e4360b2553391850467b544757022ab02f99c3c: fix abort-set counting.
Cherry-pick dc46505dbf06f3f95c565d53e57fb459e224a87b: count preempted tasks in the waiting list.
Applied cleanly on top of upstream/release/2.4 in branch release/2.4+20260402_fix_schedule.

Usage or Command

Bugfix validation command not run in this session.

Suggested reproduction:

Run a workload where requests are frequently rescheduled or preempted while the scheduler is near max_num_seqs.
Verify that no request enters running when the effective occupied slots have already reached capacity.

Accuracy Tests

This cherry-pick does not change model forward, kernel logic, or numerical computation paths.

No accuracy impact is expected.

Checklist

Add at least a tag in the PR title.
Format your code, run pre-commit before commit.
Add unit tests. No new unit tests were added in this cherry-pick.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-04-02T09:44:00Z

Thanks for your contribution!

fastdeploy-bot

🤖 AI Code Review | 2026-04-02 20:00 CST

📋 Review 摘要

PR 概述：Cherry-pick 调度器 slot 计数修复到 release/2.4 分支，防止请求在无可用 slot 时进入 running 状态
变更范围：common_engine.py（异常处理）、resource_manager_v1.py（调度逻辑）
影响面 Tag：Engine Scheduler

问题

级别	文件	概述
❓ 疑问	`common_engine.py:912`	RuntimeError 异常处理行为变更，需确认意图

总体评价

resource_manager_v1.py 的 slot 计数修复逻辑清晰正确，现在综合考虑了 running、待重调度和已抢占的请求数量。但 common_engine.py 中 RuntimeError 的处理从"优雅退出"变为"直接抛出"，这个行为变更与 PR 描述中的 slot accounting 修复目标关联不明显，建议作者确认。

fastdeploy-bot · 2026-04-02T12:00:05Z

fastdeploy/engine/common_engine.py

            except RuntimeError as e:
-                if "cannot schedule new futures after shutdown" in str(e):
-                    break
+                raise e


❓ 疑问 RuntimeError 异常处理行为变更

原代码在收到 "cannot schedule new futures after shutdown" 错误时会优雅地 break 退出循环，现在改为直接 raise e 抛出异常。

疑问点：

这个变更会导致 shutdown 时抛出异常而非静默退出，是否是预期行为？

此变更与 PR 描述中的 "slot accounting" 修复目标的关联是什么？

是否会影响服务的正常关闭流程？

请确认这个行为变更是否符合预期。

fastdeploy-bot · 2026-04-02T12:00:05Z

fastdeploy/engine/sched/resource_manager_v1.py

+                    if (
+                        len(self.running)
+                        + len(self.to_be_rescheduled_request_id_set)
+                        + sum([req.status == RequestStatus.PREEMPTED for req in self.waiting])


🟢 LGTM slot 计数逻辑修复

现在正确地将以下三类请求纳入 slot 占用计算：

self.running：正在运行的请求

self.to_be_rescheduled_request_id_set：待重新调度的请求

PREEMPTED 状态的 waiting 请求：已被抢占等待恢复的请求

同时将 == 改为 >= 是更安全的边界条件检查。

小建议（非阻塞）：可考虑将 sum([req.status == ... for req in self.waiting]) 改为生成器表达式 sum(req.status == ... for req in self.waiting) 以节省内存分配。

codecov-commenter · 2026-04-02T12:00:57Z

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.4@92abc14). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/engine/common_engine.py	0.00%	1 Missing ⚠️
fastdeploy/engine/sched/resource_manager_v1.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@              Coverage Diff               @@
##             release/2.4    #7160   +/-   ##
==============================================
  Coverage               ?   56.39%           
==============================================
  Files                  ?      333           
  Lines                  ?    42624           
  Branches               ?     6485           
==============================================
  Hits                   ?    24036           
  Misses                 ?    16704           
  Partials               ?     1884

Flag	Coverage Δ
GPU	`56.39% <0.00%> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

liyonghua0910 added 3 commits April 2, 2026 09:36

[fix] prevent requests from entering running state without a slot

892d67e

[fix] count abort set

02bdce0

[fix] count preempted task in waiting list

aa7c84b

liyonghua0910 had a problem deploying to Metax_ci April 2, 2026 09:43 — with GitHub Actions Failure

[fix] fix abort for 2.4

f4ecbee

liyonghua0910 had a problem deploying to Metax_ci April 2, 2026 09:53 — with GitHub Actions Failure

fastdeploy-bot reviewed Apr 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Cherry-Pick][BugFix] prevent requests from entering running state without a slot (#7141)#7160

[Cherry-Pick][BugFix] prevent requests from entering running state without a slot (#7141)#7160
liyonghua0910 wants to merge 4 commits intoPaddlePaddle:release/2.4from
liyonghua0910:release/2.4+20260402_fix_schedule

liyonghua0910 commented Apr 2, 2026 •

edited

Loading

Uh oh!

paddle-bot bot commented Apr 2, 2026

Uh oh!

fastdeploy-bot left a comment

Uh oh!

fastdeploy-bot Apr 2, 2026

Uh oh!

fastdeploy-bot Apr 2, 2026

Uh oh!

codecov-commenter commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

liyonghua0910 commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Usage or Command

Accuracy Tests

Checklist

Uh oh!

paddle-bot bot commented Apr 2, 2026

Uh oh!

fastdeploy-bot left a comment

Choose a reason for hiding this comment

📋 Review 摘要

问题

总体评价

Uh oh!

fastdeploy-bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

fastdeploy-bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Apr 2, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

liyonghua0910 commented Apr 2, 2026 •

edited

Loading