
Fix binary error truncation; add live pipeline progress from stderr #8

Merged
lmangani merged 3 commits into main from copilot/fix-dit-vae-error on Mar 7, 2026


Copilot AI commented Mar 7, 2026

The original failure reason was invisible: the 500-character error truncation was consumed entirely by ggml-metal's informational init log ("tensor API disabled for pre-M5 and pre-A19 devices", which is a normal log line, not an error), so the actual crash reason was cut off. A previous fix incorrectly treated this message as a GPU failure and added a CPU-only retry; that assumption was wrong and the retry has been removed.

Changes

server/src/services/acestep.ts

  • Error truncation 500 → 2000 chars so real failure reasons are visible
  • Removed incorrect Metal tensor API auto-retry (-ngl 0 fallback) — the ggml-metal init message is informational; the GPU works correctly on M1–M4
  • runBinary gains an onLine callback — stderr is split into lines and streamed to the caller in real time; full stderr is still accumulated for error reporting
  • makeLmProgressHandler — parses ace-qwen3 stderr into job.stage / job.progress (0–50%):
    • [Phase1] step N … tok/s → 0–28%
    • [Phase1] Decode → 30%
    • [Decode] step N … total codes … tok/s → 30–50% (budget from [Phase2] max_tokens)
  • makeDitVaeProgressHandler — parses dit-vae stderr into job.stage / job.progress (50–100%):
    • [DiT] step N/M → 50–85%
    • [DiT] Total generation → 85%
    • [VAE] Tiled decode N tiles / Tiled decode done → 85–98%
  • Named progress-budget constants (PROGRESS_LM_PHASE1_MAX, PROGRESS_LM_PHASE2_END, PROGRESS_DIT_END, PROGRESS_VAE_END) make the 0–100 allocation explicit
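
The line-streaming behaviour described for `runBinary` can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the PR: `createStderrCollector`, `MAX_ERROR_CHARS`, and the method names are hypothetical, and keeping the first 2000 characters (rather than the last) is an assumption based on the description of the init log consuming the old 500-character budget.

```typescript
// Hypothetical helper; names and shape are illustrative, not the PR's actual code.
type LineCallback = (line: string) => void;

// 2000 is the new limit named in the PR description.
const MAX_ERROR_CHARS = 2000;

function createStderrCollector(onLine?: LineCallback) {
  let full = "";    // everything seen so far, kept for error reporting
  let pending = ""; // trailing partial line still waiting for its newline

  return {
    // Feed one chunk as it arrives from the child process's stderr.
    push(chunk: string): void {
      full += chunk;
      pending += chunk;
      const parts = pending.split(/\r?\n/);
      // The last element is an incomplete line (or "" if the chunk ended on a newline).
      pending = parts.pop() ?? "";
      for (const line of parts) {
        if (line.length > 0) onLine?.(line);
      }
    },
    // Call once on process exit to flush an unterminated final line.
    end(): void {
      if (pending.length > 0) onLine?.(pending);
      pending = "";
    },
    // Text to embed in an error message (assumption: the head of stderr is kept).
    errorText(): string {
      return full.length > MAX_ERROR_CHARS ? full.slice(0, MAX_ERROR_CHARS) : full;
    },
  };
}
```

In a Node.js service, `push` would typically be called from the child's `stderr` `data` event and `end` from its `close` event. Buffering the partial line matters because chunk boundaries do not align with line boundaries.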

The frontend's existing job-status polling picks up stage and progress automatically — no frontend changes needed.
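
Because polling only reads `job.stage` and `job.progress`, a progress handler just has to write those two fields. Below is a minimal sketch of the dit-vae mapping listed above, under several assumptions: the `Job` shape, the function name `makeDitVaeProgressSketch`, the stage strings, and the regular expressions are hypothetical, and the constant values are inferred from the percentage ranges in this description rather than taken from the source.

```typescript
// Illustrative sketch only; the real handler in acestep.ts may differ.
interface Job {
  stage: string;
  progress: number;
}

// Values inferred from the ranges in the PR description (assumption).
const PROGRESS_LM_PHASE2_END = 50; // LM stages occupy 0-50%
const PROGRESS_DIT_END = 85;       // DiT steps occupy 50-85%
const PROGRESS_VAE_END = 98;       // VAE decode occupies 85-98%

function makeDitVaeProgressSketch(job: Job): (line: string) => void {
  return (line: string): void => {
    // "[DiT] step N/M" scales linearly across the DiT budget.
    const step = /\[DiT\] step (\d+)\/(\d+)/.exec(line);
    if (step) {
      const n = Number(step[1]);
      const m = Number(step[2]);
      if (m > 0) {
        const frac = Math.min(n / m, 1);
        const span = PROGRESS_DIT_END - PROGRESS_LM_PHASE2_END;
        job.stage = `DiT step ${n}/${m}`;
        job.progress = Math.round(PROGRESS_LM_PHASE2_END + frac * span);
      }
      return;
    }
    // End of diffusion: pin progress to the DiT budget's upper bound.
    if (line.includes("[DiT] Total generation")) {
      job.stage = "VAE decode";
      job.progress = PROGRESS_DIT_END;
      return;
    }
    // End of tiled VAE decode.
    if (line.includes("Tiled decode done")) {
      job.stage = "VAE decode done";
      job.progress = PROGRESS_VAE_END;
    }
  };
}
```

Lines that match none of the patterns leave the job untouched, so informational output such as the ggml-metal init log has no effect on reported progress.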

.env.example

Removed incorrect Apple Silicon Metal retry guidance; kept DIT_VAE_EXTRA_ARGS / ACE_QWEN3_EXTRA_ARGS with a corrected general-purpose description.



Co-authored-by: lmangani <1423657+lmangani@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix error in dit-vae model initialization" to "Fix dit-vae Metal tensor API crash on Apple Silicon M1–M4" on Mar 7, 2026
Co-authored-by: lmangani <1423657+lmangani@users.noreply.github.com>
Copilot AI changed the title from "Fix dit-vae Metal tensor API crash on Apple Silicon M1–M4" to "Fix binary error truncation; add live pipeline progress from stderr" on Mar 7, 2026
@lmangani lmangani marked this pull request as ready for review March 7, 2026 18:00
@lmangani lmangani merged commit df4268c into main Mar 7, 2026
4 checks passed
