fix: cap BLAS/OpenMP threads — concurrency fix (#316)#359
Merged
Set OMP_NUM_THREADS, MKL_NUM_THREADS, OPENBLAS_NUM_THREADS, and NUMEXPR_MAX_THREADS to "1" via setdefault in both model_server.py and mcp_server.py, before any model imports. Without this cap, N concurrent MCP processes each spawn N BLAS threads, causing an N-squared context-switch collapse (measured 19-minute hangs). setdefault is used so that users can still override the values if needed. Closes #316
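As a rough illustration of the approach described above, the cap might look something like the following at the top of each server module. This is only a sketch: the helper name `cap_native_threads` and the constant `THREAD_ENV_VARS` are hypothetical and not taken from the PR, and the sketch assumes `setdefault` refers to `os.environ.setdefault`. The one point that matters is ordering: the variables need to be in place before any library that reads them is imported.

```python
import os

# The four variables named in this PR. The tuple name is hypothetical.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "NUMEXPR_MAX_THREADS",
)


def cap_native_threads(limit: str = "1") -> dict:
    """Set each variable to `limit` unless it is already set.

    os.environ.setdefault returns the value that ends up in the
    environment, so the returned dict shows the effective settings.
    Hypothetical helper, not the PR's actual code.
    """
    return {name: os.environ.setdefault(name, limit) for name in THREAD_ENV_VARS}


# Must run before the model imports, because the native libraries
# typically read these variables once, when they are first loaded.
effective = cap_native_threads()

# Model imports would follow here.
```

Because `setdefault` leaves an existing value alone, a variable exported in the shell before launch takes precedence over the default of "1".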
Summary
- OMP_NUM_THREADS=1, MKL_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, NUMEXPR_MAX_THREADS=1 via setdefault in both model_server.py and mcp_server.py
- setdefault so user overrides are respected
What this fixes
N concurrent MCP processes × N BLAS threads = N² context switches. With 6 sessions: 60 threads fighting for 10 cores. Measured 200,000x slowdown with 19-minute hangs on m.add().
Test evidence
Risk
Single-threaded BLAS is slightly slower for large batch operations (tier switch vector rebuild). The cost is negligible for normal single-query inference.
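For workloads where the single-threaded default is too slow, the override path can be checked with a small experiment like the one below. It is a sketch under stated assumptions: the `CHILD` snippet is a stand-in for the top of a server module rather than the real `model_server.py`, and the helper `effective_omp_threads` is hypothetical. It launches a child interpreter with and without a user-supplied `OMP_NUM_THREADS` and reports the value the child ends up with.

```python
import os
import subprocess
import sys

# Stand-in for the first lines of a server module (assumption, not PR code).
CHILD = (
    "import os\n"
    "os.environ.setdefault('OMP_NUM_THREADS', '1')\n"
    "print(os.environ['OMP_NUM_THREADS'])\n"
)


def effective_omp_threads(user_value=None):
    """Run the stand-in module and return the OMP_NUM_THREADS it sees."""
    env = dict(os.environ)
    # Start from a clean slate so the parent's setting does not leak in.
    env.pop("OMP_NUM_THREADS", None)
    if user_value is not None:
        env["OMP_NUM_THREADS"] = user_value
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


default_value = effective_omp_threads()        # no user setting
overridden_value = effective_omp_threads("4")  # user asked for 4 threads
```

With no user setting the child falls back to the capped default, and with a user setting the child keeps the user's value, which is the behavior the summary describes for `setdefault`.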
Closes #316