Skip to content

[Cherry-Pick][BugFix] prevent requests from entering running state without a slot (#7141)#7160

Open
liyonghua0910 wants to merge 4 commits intoPaddlePaddle:release/2.4from
liyonghua0910:release/2.4+20260402_fix_schedule
Open

[Cherry-Pick][BugFix] prevent requests from entering running state without a slot (#7141)#7160
liyonghua0910 wants to merge 4 commits intoPaddlePaddle:release/2.4from
liyonghua0910:release/2.4+20260402_fix_schedule

Conversation

@liyonghua0910
Copy link
Copy Markdown
Collaborator

@liyonghua0910 liyonghua0910 commented Apr 2, 2026

Motivation

Cherry-pick the scheduler slot-accounting fixes from develop PR #7141 to release/2.4.

This keeps scheduler occupancy consistent on the release branch when requests are running, pending reschedule, or already preempted back into the waiting queue, and preserves the slot guard that prevents over-admission.

Modifications

  • Cherry-pick ea3854523f9a69b01e4194e51bc1e2d037e71d23: prevent requests from entering running state without a slot.
  • Cherry-pick 3e4360b2553391850467b544757022ab02f99c3c: fix abort-set counting.
  • Cherry-pick dc46505dbf06f3f95c565d53e57fb459e224a87b: count preempted tasks in the waiting list.
  • Applied cleanly on top of upstream/release/2.4 in branch release/2.4+20260402_fix_schedule.

Usage or Command

Bugfix validation command not run in this session.

Suggested reproduction:

  • Run a workload where requests are frequently rescheduled or preempted while the scheduler is near max_num_seqs.
  • Verify that no request enters running when the effective occupied slots have already reached capacity.

Accuracy Tests

This cherry-pick does not change model forward, kernel logic, or numerical computation paths.

No accuracy impact is expected.

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. No new unit tests were added in this cherry-pick.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot
Copy link
Copy Markdown

paddle-bot bot commented Apr 2, 2026

Thanks for your contribution!

Copy link
Copy Markdown

@fastdeploy-bot fastdeploy-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review | 2026-04-02 20:00 CST

📋 Review 摘要

PR 概述:Cherry-pick 调度器 slot 计数修复到 release/2.4 分支,防止请求在无可用 slot 时进入 running 状态
变更范围common_engine.py(异常处理)、resource_manager_v1.py(调度逻辑)
影响面 TagEngine Scheduler

问题

级别 文件 概述
❓ 疑问 common_engine.py:912 RuntimeError 异常处理行为变更,需确认意图

总体评价

resource_manager_v1.py 的 slot 计数修复逻辑清晰正确,现在综合考虑了 running、待重调度和已抢占的请求数量。但 common_engine.py 中 RuntimeError 的处理从"优雅退出"变为"直接抛出",这个行为变更与 PR 描述中的 slot accounting 修复目标关联不明显,建议作者确认。

except RuntimeError as e:
if "cannot schedule new futures after shutdown" in str(e):
break
raise e
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ 疑问 RuntimeError 异常处理行为变更

原代码在收到 "cannot schedule new futures after shutdown" 错误时会优雅地 break 退出循环,现在改为直接 raise e 抛出异常。

疑问点

  1. 这个变更会导致 shutdown 时抛出异常而非静默退出,是否是预期行为?
  2. 此变更与 PR 描述中的 "slot accounting" 修复目标的关联是什么?
  3. 是否会影响服务的正常关闭流程?

请确认这个行为变更是否符合预期。

if (
len(self.running)
+ len(self.to_be_rescheduled_request_id_set)
+ sum([req.status == RequestStatus.PREEMPTED for req in self.waiting])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 LGTM slot 计数逻辑修复

现在正确地将以下三类请求纳入 slot 占用计算:

  • self.running:正在运行的请求
  • self.to_be_rescheduled_request_id_set:待重新调度的请求
  • PREEMPTED 状态的 waiting 请求:已被抢占等待恢复的请求

同时将 == 改为 >= 是更安全的边界条件检查。

小建议(非阻塞):可考虑将 sum([req.status == ... for req in self.waiting]) 改为生成器表达式 sum(req.status == ... for req in self.waiting) 以节省内存分配。

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.4@92abc14). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/engine/common_engine.py 0.00% 1 Missing ⚠️
fastdeploy/engine/sched/resource_manager_v1.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.4    #7160   +/-   ##
==============================================
  Coverage               ?   56.39%           
==============================================
  Files                  ?      333           
  Lines                  ?    42624           
  Branches               ?     6485           
==============================================
  Hits                   ?    24036           
  Misses                 ?    16704           
  Partials               ?     1884           
Flag Coverage Δ
GPU 56.39% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants