SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime
[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLMs, VLMs, and video generative models.
vLLM Qwen 3.6-27B (AWQ-INT4) + DFlash speculative decoding on AMD Strix Halo (gfx1151 iGPU, 128 GB UMA, ROCm 7.13). 24.8 t/s single-stream, vision, tool calling, 256K context, OpenAI-compatible, Docker. Matches DGX Spark FP8+DFlash+MTP at a third of the cost. No CUDA.
Native Windows build of vLLM 0.19.1 — no WSL, no Docker. Pre-built wheels + 34-file Windows patch + Multi-TurboQuant KV cache compression (6 methods, 2x cache capacity). PyTorch 2.10 + CUDA 12.6 + Triton + Flash-Attention 2.
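KV-cache quantization of the kind this build advertises can be illustrated with a minimal sketch: per-token absmax INT8 quantization halves cache memory relative to FP16 at the same context length. The function names below are illustrative, not Multi-TurboQuant's actual API.

```python
import numpy as np

def quantize_kv_int8(kv):
    # Per-token absmax INT8 quantization of a KV-cache tensor of shape
    # (seq_len, head_dim): 1 byte/value vs 2 for FP16 -> ~2x cache capacity.
    amax = np.abs(kv).max(axis=-1, keepdims=True)
    scale = amax / 127.0
    safe = np.where(scale == 0, 1.0, scale)  # guard all-zero tokens
    q = np.round(kv / safe).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    # Recover approximate FP32 values for attention computation.
    return q.astype(np.float32) * scale
```

The per-token rounding error is bounded by half the token's scale, which is why absmax schemes like this are a common baseline among KV-cache compression methods.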
Compress any LLM up to 6x in one command. Unified CLI for GGUF, GPTQ, and AWQ quantization.
Research Test: REAP expert pruning + AWQ quantization of Qwen3-Coder-Next MoE model
[ICLR2026] When Reasoning Meets Compression: Understanding the Effects of LLM Compression on Large Reasoning Models. Supports interpretation of Qwen, Llama, etc.
Quantize LLMs using AWQ.
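For context on what these AWQ repos produce: AWQ stores weights in group-wise asymmetric INT4. A minimal NumPy sketch of that storage format follows; the function names are illustrative, and real AWQ additionally applies activation-aware per-channel scaling before quantizing, which this sketch omits.

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=128):
    # Asymmetric per-group INT4 quantization: each group of weights gets
    # its own scale and zero-point, and values are stored as 0..15 codes.
    g = w.reshape(-1, group_size)
    w_min = g.min(axis=1, keepdims=True)
    w_max = g.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0            # 4-bit code range: 0..15
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(g / scale + zero), 0, 15)
    return q, scale, zero

def dequantize(q, scale, zero):
    # Reconstruct approximate FP32 weights from codes + group metadata.
    return (q - zero) * scale
```

With 4-bit codes plus per-group metadata, this is roughly a 4x size reduction versus FP16 weights, which is the baseline the AWQ tools listed here build on.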
Dockerized vLLM serving for Kimi-Linear-48B-A3B (AWQ-4bit), from 128K to 1M context.
Effortlessly quantize, benchmark, and publish Hugging Face models with cross-platform support for CPU/GPU. Reduce model size by 75% while maintaining performance.
Quantization quality analyzer - benchmark GGUF/GPTQ/AWQ quantization accuracy.
Artificial Personality is a text2text AI chatbot that can use character cards.
Self-hosted LLM chat client with streaming UI for vLLM servers. Run Mistral-24B locally on RTX 4090/3090. Privacy-focused ChatGPT alternative for homelab/gaming PCs. Python/Rich terminal UI.
White paper & reproducible benchmark suite for LLM inference optimization on AMD MI300X using ROCm 6.1
AWQ Quantization of Microsoft/Phi-4-Reasoning
Pure Gleam tensor library with quantization (INT8, NF4, AWQ), Flash Attention, and 2:4 sparsity, with up to 7.5x memory reduction.
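The 2:4 structured sparsity pattern mentioned above keeps the two largest-magnitude weights in every group of four and zeroes the rest. A minimal NumPy sketch of the pruning step (illustrative only, not the Gleam library's API):

```python
import numpy as np

def prune_2_4(w):
    # 2:4 structured sparsity: in every contiguous group of 4 weights,
    # zero the two with the smallest magnitude. The resulting pattern is
    # what sparse tensor cores can accelerate.
    g = w.reshape(-1, 4)
    drop = np.argsort(np.abs(g), axis=1)[:, :2]  # two smallest per group
    out = g.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out.reshape(w.shape)
```

Because exactly half the weights in every group are zero, the pattern halves weight storage while keeping the largest-magnitude (typically most salient) values.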