feat: Merge kernels from vLLM and FlashInfer #63
Merged
drunkcoding merged 4 commits into feature/qwen on Feb 16, 2026
Conversation
drunkcoding added a commit that referenced this pull request on Feb 17, 2026
* add openai api support
* add test scripts, update readme, update api
* Fix: Undefined Symbol Compilation Error (#37)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* Refactor code for better performance (#38)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: add pre commit format ci (#40)
* ci: add pre commit format ci
* fix: add requirements for linting
* fix: format code before merge
* fix: update local clang format version
* Chore: rename organization name & optimize CI (#41)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: fix not a git repository in CI (#43)
* CI: fix missing sudo in apt install (#44)
* CI: fix missing sudo (#45)
* CI: revert os matrix in CI (#46)
* CI: add missing apt update after installing deb file (#47)
* Doc: Update README example to DeepSeek and Suppress Warning (#49)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
* update readme example to deepseek and supress warning
* format
* revert CI changes to main version
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: do not build test if document update (#52)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
* update readme example to deepseek and supress warning
* format
* revert CI changes to main version
* update readme conda env and ignore doc update in build and release
* fix wildcard
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* format and change to deepseek in example
* fix format
* remove unused files
* fix api server token id device
* feat: Introduce Local Server for OpenAI-Compatible APIs (#4)
* update table format
* improve table clarity
* init code commit
* add openai api support
* add test scripts, update readme, update api
* format and change to deepseek in example
* fix format
* remove unused files
* fix api server token id device
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>
* fix gen broken
* update readme links
* cancel concurrent job
* set dense node to device
* sparse node set cpu
* remove OS def
* use update to date clang-format
* fix setuptools version
* fix setuptools version for python 3.8
* keep single cuda version in publish
* feat: set parameter to device before serving (#56)
* update table format
* improve table clarity
* init code commit
* add openai api support
* add test scripts, update readme, update api
* format and change to deepseek in example
* fix format
* remove unused files
* fix api server token id device
* fix gen broken
* update readme links
* cancel concurrent job
* set dense node to device
* sparse node set cpu
* remove OS def
* use update to date clang-format
* fix setuptools version
* fix setuptools version for python 3.8
* keep single cuda version in publish
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>
* add max length in gen openai
* fix cache race condition
* all param init at host
* add qwen3
* ubuntu lts and build
* pre-commit ubuntu version
* router weights update overlap
* rename deepseek_v2 and reduce torch kernel launch
* fix import
* fix build and fix bug
* fix citation linebreak
* fix typo
* fix dtype size
* remove comments
* fix example
* pr update init
* remove comment and unify deepseek preroute
* feat: Merge kernels from vLLM and FlashInfer (#63)
* new allocator
* add kernel compilation
* stable topk
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
---------
Co-authored-by: Yao <fuyao3860@gmail.com>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com>
Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com>
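The merged kernel list ends with a "stable topk" item, which the log does not define. A common reading is a top-k whose tie-breaking is deterministic (ties go to the smaller index), which matters for reproducible expert routing in MoE. This is a minimal pure-Python sketch of that semantics only, hypothetical and not the CUDA kernel that was merged:

```python
def stable_topk(values, k):
    """Return (values, indices) of the k largest entries.

    Stable: ties are broken by the smaller original index, so the
    result is deterministic regardless of how equal scores happen
    to be ordered internally. (Illustrative sketch only.)
    """
    # Sort indices by descending value, then ascending index for ties.
    order = sorted(range(len(values)), key=lambda i: (-values[i], i))
    top = order[:k]
    return [values[i] for i in top], top

# Ties at 0.9 resolve to the earlier index (1 before 3).
print(stable_topk([0.1, 0.9, 0.5, 0.9], 2))  # ([0.9, 0.9], [1, 3])
```

An unstable top-k could legally return index 3 before index 1 here; pinning the order is what makes routing decisions repeatable across runs.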
drunkcoding added a commit that referenced this pull request on Mar 19, 2026
* update table format
* improve table clarity
* init code commit
* doc: add flashattention installation guide and change toc
* feat: remove libaio dependency
* remove spdlog dependency
* misc: remove unused code and dependencies
* misc: remove commented-out code and unused imports
* fix: cuda oom due to safe tensors open
* remove gcc-12 requirement
* gptq disable exllama
* fix: key error in offload set
* add forward and call (#7)
* add forward and call
* fix a bug
* feat: support grok-1 model
* update API note and install
* Feature/expert parallel (#9)
* add back expert parallel by id hash
* add grok ep
* fix mistral typo
* accom cuda copy bug
* sync after compute
* fix:sync to make sure that input is ready
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: luzhan <513964121@qq.com>
* fix tokenizer in example
* Xly/deepseek (#34)
* add override QuantLinear (#29)
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* use torch streampool
* format
* working deepspeed backend
* fix: revert apply_rotary_pos_emb in deepseek
* fix busy waiting
* fix deepseek flashattn
* add deepseek v3
* format and fix multigpu deepseek bug
* with device caching allocator
* add on-demand lock cache
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: lausannel <513964121@qq.com>
* Upstream (#72)
* Fix: Undefined Symbol Compilation Error (#37)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* Refactor code for better performance (#38)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: add pre commit format ci (#40)
* ci: add pre commit format ci
* fix: add requirements for linting
* fix: format code before merge
* fix: update local clang format version
* Chore: rename organization name & optimize CI (#41)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: fix not a git repository in CI (#43)
* CI: fix missing sudo in apt install (#44)
* CI: fix missing sudo (#45)
* CI: revert os matrix in CI (#46)
* CI: add missing apt update after installing deb file (#47)
* Doc: Update README example to DeepSeek and Suppress Warning (#49)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
* update readme example to deepseek and supress warning
* format
* revert CI changes to main version
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: do not build test if document update (#52)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
* update readme example to deepseek and supress warning
* format
* revert CI changes to main version
* update readme conda env and ignore doc update in build and release
* fix wildcard
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* feat: Introduce Local Server for OpenAI-Compatible APIs (#4)
* update table format
* improve table clarity
* init code commit
* add openai api support
* add test scripts, update readme, update api
* format and change to deepseek in example
* fix format
* remove unused files
* fix api server token id device
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>
* feat: set parameter to device before serving (#56)
* update table format
* improve table clarity
* init code commit
* add openai api support
* add test scripts, update readme, update api
* format and change to deepseek in example
* fix format
* remove unused files
* fix api server token id device
* fix gen broken
* update readme links
* cancel concurrent job
* set dense node to device
* sparse node set cpu
* remove OS def
* use update to date clang-format
* fix setuptools version
* fix setuptools version for python 3.8
* keep single cuda version in publish
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>
* Chore(deps): Bump pyarrow from 12.0.0 to 14.0.1 (#69)
Bumps [pyarrow](https://github.com/apache/arrow) from 12.0.0 to 14.0.1.
- [Release notes](https://github.com/apache/arrow/releases)
- [Commits](apache/arrow@go/v12.0.0...go/v14.0.1)
---
updated-dependencies:
- dependency-name: pyarrow
  dependency-version: 14.0.1
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com>
Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com>
Co-authored-by: Yao <fuyao3860@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Audit repository for stale code indicators (#71)
* Fix: Undefined Symbol Compilation Error (#37)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* Refactor code for better performance (#38)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: add pre commit format ci (#40)
* ci: add pre commit format ci
* fix: add requirements for linting
* fix: format code before merge
* fix: update local clang format version
* Chore: rename organization name & optimize CI (#41)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: fix not a git repository in CI (#43)
* CI: fix missing sudo in apt install (#44)
* CI: fix missing sudo (#45)
* CI: revert os matrix in CI (#46)
* CI: add missing apt update after installing deb file (#47)
* Doc: Update README example to DeepSeek and Suppress Warning (#49)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
* update readme example to deepseek and supress warning
* format
* revert CI changes to main version
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: do not build test if document update (#52)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
* update readme example to deepseek and supress warning
* format
* revert CI changes to main version
* update readme conda env and ignore doc update in build and release
* fix wildcard
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* feat: Introduce Local Server for OpenAI-Compatible APIs (#4)
* update table format
* improve table clarity
* init code commit
* add openai api support
* add test scripts, update readme, update api
* format and change to deepseek in example
* fix format
* remove unused files
* fix api server token id device
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>
* feat: set parameter to device before serving (#56)
* update table format
* improve table clarity
* init code commit
* add openai api support
* add test scripts, update readme, update api
* format and change to deepseek in example
* fix format
* remove unused files
* fix api server token id device
* fix gen broken
* update readme links
* cancel concurrent job
* set dense node to device
* sparse node set cpu
* remove OS def
* use update to date clang-format
* fix setuptools version
* fix setuptools version for python 3.8
* keep single cuda version in publish
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>
* Initial plan
* Add mypy lint hook
Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>
* Configure mypy settings
Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>
* Adjust mypy scope
Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>
* Scope mypy checks
Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>
---------
Co-authored-by: Leyang Xue <s2062808@ed.ac.uk>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com>
Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com>
Co-authored-by: Yao <fuyao3860@gmail.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>
* feat: performance improvement and Qwen3 support (#60)
* add openai api support
* add test scripts, update readme, update api
* Fix: Undefined Symbol Compilation Error (#37)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* Refactor code for better performance (#38)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: add pre commit format ci (#40)
* ci: add pre commit format ci
* fix: add requirements for linting
* fix: format code before merge
* fix: update local clang format version
* Chore: rename organization name & optimize CI (#41)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: fix not a git repository in CI (#43)
* CI: fix missing sudo in apt install (#44)
* CI: fix missing sudo (#45)
* CI: revert os matrix in CI (#46)
* CI: add missing apt update after installing deb file (#47)
* Doc: Update README example to DeepSeek and Suppress Warning (#49)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
* update readme example to deepseek and supress warning
* format
* revert CI changes to main version
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: do not build test if document update (#52)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
* update readme example to deepseek and supress warning
* format
* revert CI changes to main version
* update readme conda env and ignore doc update in build and release
* fix wildcard
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* format and change to deepseek in example
* fix format
* remove unused files
* fix api server token id device
* feat: Introduce Local Server for OpenAI-Compatible APIs (#4)
* update table format
* improve table clarity
* init code commit
* add openai api support
* add test scripts, update readme, update api
* format and change to deepseek in example
* fix format
* remove unused files
* fix api server token id device
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>
* fix gen broken
* update readme links
* cancel concurrent job
* set dense node to device
* sparse node set cpu
* remove OS def
* use update to date clang-format
* fix setuptools version
* fix setuptools version for python 3.8
* keep single cuda version in publish
* feat: set parameter to device before serving (#56)
* update table format
* improve table clarity
* init code commit
* add openai api support
* add test scripts, update readme, update api
* format and change to deepseek in example
* fix format
* remove unused files
* fix api server token id device
* fix gen broken
* update readme links
* cancel concurrent job
* set dense node to device
* sparse node set cpu
* remove OS def
* use update to date clang-format
* fix setuptools version
* fix setuptools version for python 3.8
* keep single cuda version in publish
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>
* add max length in gen openai
* fix cache race condition
* all param init at host
* add qwen3
* ubuntu lts and build
* pre-commit ubuntu version
* router weights update overlap
* rename deepseek_v2 and reduce torch kernel launch
* fix import
* fix build and fix bug
* fix citation linebreak
* fix typo
* fix dtype size
* remove comments
* fix example
* pr update init
* remove comment and unify deepseek preroute
* feat: Merge kernels from vLLM and FlashInfer (#63)
* new allocator
* add kernel compilation
* stable topk
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
---------
Co-authored-by: Yao <fuyao3860@gmail.com>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com>
Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com>
* Add Claude Code GitHub Workflow (#73)
* "Claude PR Assistant workflow"
* "Claude Code Review workflow"
* Xly/code clean (#74)
* add openai api support
* add test scripts, update readme, update api
* format and change to deepseek in example
* fix format
* remove unused files
* fix api server token id device
* fix gen broken
* update readme links
* cancel concurrent job
* set dense node to device
* sparse node set cpu
* remove OS def
* use update to date clang-format
* fix setuptools version
* fix setuptools version for python 3.8
* keep single cuda version in publish
* add max length in gen openai
* fix cache race condition
* all param init at host
* add docker and sllm style read
* wrap docker and test coverage
* test
* Clean up symlinks: remove unused op_builder, core/core, and move test_io to extensions
* Replace core/kernel directory with symlink to extensions/kernel
* seperations
* remove ops dependency
* Add CUTLASS fused MoE FFN kernel and supporting infrastructure
  - Add extensions/kernel/fused_moe_mlp.cu/h: BF16 CUTLASS 3-GEMM fused path (gate → up w/ SiLU-mul epilogue → down) with small-M and large-K tile dispatch
  - Add tests/cuda/test_fused_mlp_cutlass.cu: BF16 CUTLASS vs Torch-native benchmark
  - Integrate fused kernel into core/parallel/expert_module.cpp via ForwardHelper()
  - Update core/model/fused_mlp.cu/h and extensions/kernel/epilogue_utils.h
  - Improve core/utils: cache.h, lockfree_queue.h, simple_object_pool.h
  - Update tests/cuda/CMakeLists.txt with KERNEL_SRC pattern for CUTLASS tests
  - Update CLAUDE.md docs and setup.py build config
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Add prefill-decode collocation benchmark with throughput analysis
  Benchmarks five attention colocation strategies for serving decode and prefill requests on the same GPU time-slice:
  0 serial — sequential on default stream
  1 varlen-fused — single flash_attn_varlen_func (continuous batching)
  2 dual-stream — two CUDA streams, no SM partition
  3 green-ctx-sm — SM-partitioned green contexts (CUDA ≥ 12.4)
  4 green-ctx-sm-wq — SM + work-queue balanced scope (CUDA 13.1+)
  Throughput analysis includes:
  - Separate decode-only / prefill-only baselines with TFLOPS and tok/s
  - Ideal-overlap bound (perfect concurrency = max(dec, pre))
  - Per-mode: TFLOPS, decode tok/s, prefill tok/s, overlap efficiency
  - Generation-projection table: decode overhead and Δ vs serial per mode
  CUDA 13.1 green context API notes (driver 590.x):
  - CUdevResourceDesc is a pointer typedef (c_void_p), not a struct
  - cuGreenCtxStreamCreate requires CU_STREAM_NON_BLOCKING flag
  - CU_DEV_RESOURCE_TYPE_WORKQUEUE_CONFIG = 1000; configure sharingScope to CU_WORKQUEUE_SCOPE_GREEN_CTX_BALANCED for WQ isolation
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Bump pydantic and transformers to resolve Dependabot alerts
  - pydantic 1.10.12 → 1.10.13: fixes ReDoS in email validation (GHSA-mr82-8j83-vxmv)
  - transformers 4.51.3 → 4.53.0: fixes 14 alerts including 3 HIGH RCE (GHSA-wrfc-pvp9-mr9g, GHSA-hxxf-235m-72v3, GHSA-qxrp-vhvm-j765) and 11 MEDIUM/LOW ReDoS vulnerabilities
  - Remove torch==2.3.1 pin (managed by conda env / base image)
  - Add flash-attn to requirements
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix build
* Make readme_example.py testable via --help
  Add argparse to readme_example.py so that model-loading code runs only after parse_args(), allowing `--help` to exit 0 without a GPU or model. Replace the AST-only test_readme_example_syntax with test_readme_example_help, which mirrors the existing test_interface_example_help pattern and is verified passing in Docker.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix format CI and build-test CI
  - requirements.txt: sort flash-attn alphabetically (between fastapi and hjson) so requirements-txt-fixer pre-commit hook passes
  - build-test.yml: replace Ubuntu 20.04 CUDA container (Python 3.8, broken PyTorch wheel) with actions/setup-python Python 3.10 + CPU-only torch; switch from full wheel build to sdist-only (--no-isolation) to avoid CUTLASS dependency and 20+ min compile time
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* green ctx bench
* Fix CI: guard CUDA extensions behind cuda_available, add statics to codespell ignore
  - setup.py: only build CUDAExtension when torch.version.cuda is set; the build-test CI installs CPU-only torch and lacks CUDA_HOME, causing CUDAExtension to abort with OSError
  - .pre-commit-config.yaml: add 'statics' to codespell ignore-words-list; the term is valid C++ (module-level static variables) but was flagged as a misspelling of 'statistics'
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* tests update
---------
Co-authored-by: Yao <fuyao3860@gmail.com>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* remove claude
* format
* resolve review
* resolve reviews
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com>
Co-authored-by: lausannel <513964121@qq.com>
Co-authored-by: Yao Fu <fuyao3860@gmail.com>
Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
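One commit in the log above makes readme_example.py testable via `--help` by deferring all model-loading work until after parse_args(), so the script exits 0 on a machine with no GPU or model. A minimal sketch of that pattern follows; the argument names and the `load_model` helper here are illustrative stand-ins, not the repository's actual code:

```python
import argparse


def load_model(name):
    # Stand-in for the expensive, GPU-dependent model load.
    print(f"loading {name}")


def main():
    parser = argparse.ArgumentParser(description="MoE inference example")
    parser.add_argument("--model", default="deepseek-ai/DeepSeek-V2-Lite")
    # argparse handles --help itself and raises SystemExit(0) here,
    # before any heavy work below ever runs.
    args = parser.parse_args()

    # Only reached with real arguments, so `example.py --help`
    # succeeds on a CPU-only CI runner.
    load_model(args.model)


if __name__ == "__main__":
    main()
```

The corresponding test simply runs the script with `--help` and asserts a zero exit status, which is cheap enough for every CI run.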
seanlinmt pushed a commit to seanlinmt/MoE-Infinity that referenced this pull request on Apr 2, 2026
…cientMoE#75)
* update table format
* improve table clarity
* init code commit
* doc: add flashattention installation guide and change toc
* feat: remove libaio dependency
* remove spdlog dependency
* misc: remove unused code and dependencies
* misc: remove commented-out code and unused imports
* fix: cuda oom due to safe tensors open
* remove gcc-12 requirement
* gptq disable exllama
* fix: key error in offload set
* add forward and call (EfficientMoE#7)
* add forward and call
* fix a bug
* feat: support grok-1 model
* update API note and install
* Feature/expert parallel (EfficientMoE#9)
* add back expert parallel by id hash
* add grok ep
* fix mistral typo
* accom cuda copy bug
* sync after compute
* fix:sync to make sure that input is ready
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: luzhan <513964121@qq.com>
* fix tokenizer in example
* Xly/deepseek (EfficientMoE#34)
* add override QuantLinear (EfficientMoE#29)
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* use torch streampool
* format
* working deepspeed backend
* fix: revert apply_rotary_pos_emb in deepseek
* fix busy waiting
* fix deepseek flashattn
* add deepseek v3
* format and fix multigpu deepseek bug
* with device caching allocator
* add on-demand lock cache
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: lausannel <513964121@qq.com>
* Upstream (EfficientMoE#72)
* Fix: Undefined Symbol Compilation Error (EfficientMoE#37)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* Refactor code for better performance (EfficientMoE#38)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: add pre commit format ci (EfficientMoE#40)
* ci: add pre commit format ci
* fix: add requirements for linting
* fix: format code before merge
* fix: update local clang format version
* Chore: rename organization name & optimize CI (EfficientMoE#41)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: fix not a git repository in CI (EfficientMoE#43)
* CI: fix missing sudo in apt install (EfficientMoE#44)
* CI: fix missing sudo (EfficientMoE#45)
* CI: revert os matrix in CI (EfficientMoE#46)
* CI: add missing apt update after installing deb file (EfficientMoE#47)
* Doc: Update README example to DeepSeek and Suppress Warning (EfficientMoE#49)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
* update readme example to deepseek and supress warning
* format
* revert CI changes to main version
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: do not build test if document update (EfficientMoE#52)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
* update readme example to deepseek and supress warning
* format
* revert CI changes to main version
* update readme conda env and ignore doc update in build and release
* fix wildcard
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* feat: Introduce Local Server for OpenAI-Compatible APIs (EfficientMoE#4)
* update table format
* improve table clarity
* init code commit
* add openai api support
* add test scripts, update readme, update api
* format and change to deepseek in example
* fix format
* remove unused files
* fix api server token id device
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>
* feat: set parameter to device before serving (EfficientMoE#56)
* update table format
* improve table clarity
* init code commit
* add openai api support
* add test scripts, update readme, update api
* format and change to deepseek in example
* fix format
* remove unused files
* fix api server token id device
* fix gen broken
* update readme links
* cancel concurrent job
* set dense node to device
* sparse node set cpu
* remove OS def
* use update to date clang-format
* fix setuptools version
* fix setuptools version for python 3.8
* keep single cuda version in publish
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Yao <fuyao3860@gmail.com>
* Chore(deps): Bump pyarrow from 12.0.0 to 14.0.1 (EfficientMoE#69)
Bumps [pyarrow](https://github.com/apache/arrow) from 12.0.0 to 14.0.1.
- [Release notes](https://github.com/apache/arrow/releases)
- [Commits](apache/arrow@go/v12.0.0...go/v14.0.1)
---
updated-dependencies:
- dependency-name: pyarrow
  dependency-version: 14.0.1
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com>
Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com>
Co-authored-by: Yao <fuyao3860@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Audit repository for stale code indicators (EfficientMoE#71)
* Fix: Undefined Symbol Compilation Error (EfficientMoE#37)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* Refactor code for better performance (EfficientMoE#38)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: add pre commit format ci (EfficientMoE#40)
* ci: add pre commit format ci
* fix: add requirements for linting
* fix: format code before merge
* fix: update local clang format version
* Chore: rename organization name & optimize CI (EfficientMoE#41)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: fix not a git repository in CI (EfficientMoE#43)
* CI: fix missing sudo in apt install (EfficientMoE#44)
* CI: fix missing sudo (EfficientMoE#45)
* CI: revert os matrix in CI (EfficientMoE#46)
* CI: add missing apt update after installing deb file (EfficientMoE#47)
* Doc: Update README example to DeepSeek and Suppress Warning (EfficientMoE#49)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
* update readme example to deepseek and supress warning
* format
* revert CI changes to main version
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* CI: do not build test if document update (EfficientMoE#52)
* reformat code vllm style
* add threadsafe queues
* fix compilation error
* split files and remove queuing
* performance improvement
* remove error dependency
* add try lock return check
* fix header dependency
* fix hard coded number
* update CI using cuda docker image
* repo consistency
* pr template fix
* format doc
* delete gpu option, add --no-install-recommends
* add cuda matrix and remove cuda full package install
* remove publish container
* change team name to efficient moe
* update readme example to deepseek and supress warning
* format
* revert CI changes to main version
* update readme conda env and ignore doc update in build and release
* fix wildcard
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk>
* feat: Introduce Local Server for OpenAI-Compatible APIs (EfficientMoE#4)
* update table format
* improve table clarity
* init code commit
* add openai api support
* add test scripts, update readme, update api
* format and change to deepseek in example
* fix format
* remove unused files
* fix api server token id device
---------
Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * feat: set parameter to device before serving (EfficientMoE#56) * update table format * improve table clarity * init code commit * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device * fix gen broken * update readme links * cancel concurrent job * set dense node to device * sparse node set cpu * remove OS def * use update to date clang-format * fix setuptools version * fix setuptools version for python 3.8 * keep single cuda version in publish --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * Initial plan * Add mypy lint hook Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com> * Configure mypy settings Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com> * Adjust mypy scope Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com> * Scope mypy checks Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com> --------- Co-authored-by: Leyang Xue <s2062808@ed.ac.uk> Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com> Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com> Co-authored-by: Yao <fuyao3860@gmail.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com> * feat: performance improvement and Qwen3 support (EfficientMoE#60) * add openai api support * add test scripts, update readme, update api * Fix: Undefined Symbol Compilation Error (EfficientMoE#37) * reformat code vllm style * add threadsafe queues * fix compilation error --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * Refactor code for better performance (EfficientMoE#38) * reformat code vllm style * add 
threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: add pre commit format ci (EfficientMoE#40) * ci: add pre commit format ci * fix: add requirements for linting * fix: format code before merge * fix: update local clang format version * Chore: rename organization name & optimize CI (EfficientMoE#41) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: fix not a git repository in CI (EfficientMoE#43) * CI: fix missing sudo in apt install (EfficientMoE#44) * CI: fix missing sudo (EfficientMoE#45) * CI: revert os matrix in CI (EfficientMoE#46) * CI: add missing apt update after installing deb file (EfficientMoE#47) * Doc: Update README example to DeepSeek and Suppress Warning (EfficientMoE#49) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe * update readme example to deepseek and supress warning * format * revert CI changes to main version 
--------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * CI: do not build test if document update (EfficientMoE#52) * reformat code vllm style * add threadsafe queues * fix compilation error * split files and remove queuing * performance improvement * remove error dependency * add try lock return check * fix header dependency * fix hard coded number * update CI using cuda docker image * repo consistency * pr template fix * format doc * delete gpu option, add --no-install-recommends * add cuda matrix and remove cuda full package install * remove publish container * change team name to efficient moe * update readme example to deepseek and supress warning * format * revert CI changes to main version * update readme conda env and ignore doc update in build and release * fix wildcard --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> * format and change to deepseek in example * fix format * remove unused files * fix api server token id device * feat: Introduce Local Server for OpenAI-Compatible APIs (EfficientMoE#4) * update table format * improve table clarity * init code commit * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * fix gen broken * update readme links * cancel concurrent job * set dense node to device * sparse node set cpu * remove OS def * use update to date clang-format * fix setuptools version * fix setuptools version for python 3.8 * keep single cuda version in publish * feat: set parameter to device before serving (EfficientMoE#56) * update table format * improve table clarity * init code commit * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device * fix gen broken * update readme links * cancel 
concurrent job * set dense node to device * sparse node set cpu * remove OS def * use update to date clang-format * fix setuptools version * fix setuptools version for python 3.8 * keep single cuda version in publish --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Yao <fuyao3860@gmail.com> * add max length in gen openai * fix cache race condition * all param init at host * add qwen3 * ubuntu lts and build * pre-commit ubuntu version * router weights update overlap * rename deepseek_v2 and reduce torch kernel launch * fix import * fix build and fix bug * fix citation linebreak * fix typo * fix dtype size * remove comments * fix example * pr update init * remove comment and unify deepseek preroute * feat: Merge kernels from vLLM and FlashInfer (EfficientMoE#63) * new allocator * add kernel compilation * stable topk --------- Co-authored-by: xly <leyang.xue@ed.ac.uk> --------- Co-authored-by: Yao <fuyao3860@gmail.com> Co-authored-by: xly <leyang.xue@ed.ac.uk> Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com> Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com> * Add Claude Code GitHub Workflow (EfficientMoE#73) * "Claude PR Assistant workflow" * "Claude Code Review workflow" * Xly/code clean (EfficientMoE#74) * add openai api support * add test scripts, update readme, update api * format and change to deepseek in example * fix format * remove unused files * fix api server token id device * fix gen broken * update readme links * cancel concurrent job * set dense node to device * sparse node set cpu * remove OS def * use update to date clang-format * fix setuptools version * fix setuptools version for python 3.8 * keep single cuda version in publish * add max length in gen openai * fix cache race condition * all param init at host * add docker and sllm style read * wrap docker and test coverage * test * Clean up symlinks: remove unused op_builder, core/core, and move test_io to extensions * Replace core/kernel directory with symlink to 
extensions/kernel
* separations
* remove ops dependency
* Add CUTLASS fused MoE FFN kernel and supporting infrastructure
  - Add extensions/kernel/fused_moe_mlp.cu/h: BF16 CUTLASS 3-GEMM fused path (gate → up w/ SiLU-mul epilogue → down) with small-M and large-K tile dispatch
  - Add tests/cuda/test_fused_mlp_cutlass.cu: BF16 CUTLASS vs Torch-native benchmark
  - Integrate fused kernel into core/parallel/expert_module.cpp via ForwardHelper()
  - Update core/model/fused_mlp.cu/h and extensions/kernel/epilogue_utils.h
  - Improve core/utils: cache.h, lockfree_queue.h, simple_object_pool.h
  - Update tests/cuda/CMakeLists.txt with KERNEL_SRC pattern for CUTLASS tests
  - Update CLAUDE.md docs and setup.py build config
* Add prefill-decode collocation benchmark with throughput analysis. Benchmarks five attention collocation strategies for serving decode and prefill requests on the same GPU time-slice:
  - 0 serial: sequential on default stream
  - 1 varlen-fused: single flash_attn_varlen_func (continuous batching)
  - 2 dual-stream: two CUDA streams, no SM partition
  - 3 green-ctx-sm: SM-partitioned green contexts (CUDA ≥ 12.4)
  - 4 green-ctx-sm-wq: SM + work-queue balanced scope (CUDA 13.1+)
  Throughput analysis includes separate decode-only / prefill-only baselines with TFLOPS and tok/s, an ideal-overlap bound (perfect concurrency = max(dec, pre)), per-mode TFLOPS, decode tok/s, prefill tok/s, and overlap efficiency, and a generation-projection table with decode overhead and Δ vs serial per mode.
  CUDA 13.1 green context API notes (driver 590.x): CUdevResourceDesc is a pointer typedef (c_void_p), not a struct; cuGreenCtxStreamCreate requires the CU_STREAM_NON_BLOCKING flag; CU_DEV_RESOURCE_TYPE_WORKQUEUE_CONFIG = 1000; configure sharingScope to CU_WORKQUEUE_SCOPE_GREEN_CTX_BALANCED for WQ isolation
* Bump pydantic and transformers to resolve Dependabot alerts
  - pydantic 1.10.12 → 1.10.13: fixes ReDoS in email validation (GHSA-mr82-8j83-vxmv)
  - transformers 4.51.3 → 4.53.0: fixes 14 alerts including 3 HIGH RCE (GHSA-wrfc-pvp9-mr9g, GHSA-hxxf-235m-72v3, GHSA-qxrp-vhvm-j765) and 11 MEDIUM/LOW ReDoS vulnerabilities
  - Remove torch==2.3.1 pin (managed by conda env / base image)
  - Add flash-attn to requirements
* fix build
* Make readme_example.py testable via --help: add argparse so that model-loading code runs only after parse_args(), allowing `--help` to exit 0 without a GPU or model; replace the AST-only test_readme_example_syntax with test_readme_example_help, which mirrors the existing test_interface_example_help pattern and is verified passing in Docker
* Fix format CI and build-test CI
  - requirements.txt: sort flash-attn alphabetically (between fastapi and hjson) so the requirements-txt-fixer pre-commit hook passes
  - build-test.yml: replace the Ubuntu 20.04 CUDA container (Python 3.8, broken PyTorch wheel) with actions/setup-python Python 3.10 + CPU-only torch; switch from a full wheel build to sdist-only (--no-isolation) to avoid the CUTLASS dependency and 20+ min compile time
* green ctx bench
* Fix CI: guard CUDA extensions behind cuda_available, add statics to codespell ignore
  - setup.py: only build CUDAExtension when torch.version.cuda is set; the build-test CI installs CPU-only torch and lacks CUDA_HOME, causing CUDAExtension to abort with OSError
  - .pre-commit-config.yaml: add 'statics' to the codespell ignore-words-list; the term is valid C++ (module-level static variables) but was flagged as a misspelling of 'statistics'
* tests update
* remove claude
* format
* resolve review
* resolve reviews

---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: xly <leyang.xue@ed.ac.uk>
Co-authored-by: Zhan Lu <51200935+lausannel@users.noreply.github.com>
Co-authored-by: lausannel <513964121@qq.com>
Co-authored-by: Yao Fu <fuyao3860@gmail.com>
Co-authored-by: Yao Fu <yao.fu.aisys@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: drunkcoding <14305648+drunkcoding@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
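The commit list mentions a "stable topk" among the kernels merged in EfficientMoE#63. As a rough Python sketch of what stable top-k expert selection means (this is not the actual CUDA kernel; the function name and tie-breaking rule are assumptions for illustration), ties in router scores are broken by expert index so the same input always routes to the same experts:

```python
import numpy as np

def stable_topk(scores, k):
    """Top-k with deterministic tie-breaking by index.

    np.lexsort sorts by its last key first, so entries are ordered by
    descending score, and equal scores fall back to ascending expert
    index -- the same scores always yield the same expert set.
    """
    idx = np.arange(scores.shape[-1])
    order = np.lexsort((idx, -scores))  # primary: -score, secondary: index
    top = order[:k]
    return top, scores[top]

# Two experts tie at 0.9: the stable version always picks indices 1 and 3,
# in that order, regardless of how an unstable sort would break the tie.
ids, vals = stable_topk(np.array([0.5, 0.9, 0.5, 0.9]), k=2)
```

An unstable top-k (e.g. a plain partial sort) may return tied experts in either order depending on the backend, which makes MoE routing nondeterministic across runs.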
Description
Fuse the MoE layer kernels by merging kernel implementations from vLLM and FlashInfer: a new allocator, kernel compilation, and a stable top-k.
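For context, the fusion pattern named in the commit log (gate → up with a SiLU-mul epilogue → down) can be sketched in NumPy. This is only an illustration of the math, not the BF16 CUTLASS implementation, and the function names here are invented:

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def expert_mlp_unfused(x, w_gate, w_up, w_down):
    # Reference path: separate gate GEMM, up GEMM, activation, down GEMM.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def expert_mlp_fused_style(x, w_gate_up, w_down):
    # Fused-style path: gate and up weights concatenated into one GEMM,
    # with the SiLU-mul applied as an epilogue on the combined output.
    g, u = np.split(x @ w_gate_up, 2, axis=-1)
    return (silu(g) * u) @ w_down
```

With `w_gate_up = np.concatenate([w_gate, w_up], axis=-1)` the two paths produce the same result, but the fused-style path issues one fewer GEMM per expert, and in the CUDA version the activation rides along as an epilogue instead of a separate kernel.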
Motivation
Kernel launch overhead is too large: each MoE layer issues many small, separate kernel launches per expert, so fusing them cuts the CPU-side dispatch cost.
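To make the motivation concrete, here is a toy accounting of launch overhead; the per-launch cost and expert count are assumptions for illustration, not measurements from this PR. If each expert runs gate, up, activation, and down as separate kernels, a layer pays for four launches per expert; merging gate+up into one GEMM and folding the activation into its epilogue halves that:

```python
LAUNCH_OVERHEAD_US = 5.0  # assumed CPU-side dispatch cost per kernel launch

def layer_dispatch_cost_us(num_experts, kernels_per_expert,
                           overhead_us=LAUNCH_OVERHEAD_US):
    # CPU time spent purely on launching kernels for one MoE layer.
    return num_experts * kernels_per_expert * overhead_us

# 64 routed experts: 4 kernels each unfused vs. 2 after fusion.
unfused_us = layer_dispatch_cost_us(64, 4)  # gate, up, silu-mul, down
fused_us = layer_dispatch_cost_us(64, 2)    # gate+up (epilogue silu-mul), down
```

Under these assumptions the fused layout halves the dispatch time per layer, and the saving multiplies across every MoE layer of every decode step.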
Type of Change
Checklist