turboquant

Star

Here are 77 public repositories matching this topic...

quantumaikr / quant.cpp

Star

LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.

embeddable transformer pure-c quantization delta-compression kv-cache llm llm-inference gguf turboquant

Updated Apr 14, 2026
C

arozanov / turboquant-mlx

Star

TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.

metal quantization mlx kv-cache apple-silicon llm turboquant

Updated Apr 2, 2026
Python

Based on the implementation of Google's TurboQuant (ICLR 2026) — Quansloth brings elite KV cache compression to local LLM inference. Quansloth is a fully private, air-gapped AI server that runs massive context models natively on consumer hardware with ease

cuda turboquant quansloth vram-wall

Updated Apr 12, 2026
Python

Alberto-Codes / turboquant-vllm

Star

TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs

compression transformer triton quantization inference-optimization kv-cache llm vllm consumer-gpu turboquant

Updated Apr 10, 2026
Python

back2matching / turboquant

Star

First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.

machine-learning compression gpu transformers inference pytorch quantization vram huggingface kv-cache llm turboquant

Updated Mar 30, 2026
Python

mindtro / semafold

Star

Vector compression with TurboQuant codecs for embeddings, retrieval, and KV-cache. 10x compression, pure NumPy core — optional GPU acceleration via PyTorch (CUDA/MPS) or MLX (Metal).

retrieval quantization vector-database kv-cache llm-inference embedding-compression turboquant vector-compression qjl semafold

Updated Apr 1, 2026
Python

Firmamento-Technologies / TurboQuant

Star

Near-optimal vector quantization from Google's ICLR 2026 paper — 95% recall, 5x compression, zero preprocessing, pure Python FAISS replacement

Updated Mar 28, 2026
Python

rookiemann / multi-turboquant

Star

Unified KV cache compression for LLM inference — TurboQuant, IsoQuant, PlanarQuant, TriAttention. 10 methods, GPU-validated, multi-GPU planner. Compress KV cache 5-80x to run bigger models, longer context, more agents on your GPU.

Updated Apr 12, 2026
Python

artalis-io / bitnet.c

Star

Minimal, zero-dependency LLM inference in pure C11. CPU-first with NEON/AVX2 SIMD. Flash MoE (pread + LRU expert cache). TurboQuant 3-bit KV compression (8.9x less memory per session). 20+ GGUF quant formats. Compiles to WASM.

c neon wasm inference simd moe avx2 quantization kv-cache cpu-inference llm gguf turboquant

Updated Mar 28, 2026
C

Lucien2468 / Ollama-TurboQuant-Integration

Star

TurboQuant: Native 3-Bit Quantization for Ollama - Achieve 25-28% better compression than Q4_0 while maintaining high-speed CPU inference. Experimentally integrated into Ollama with custom GGML kernels for LLM efficiency.

llama quantization ggml ollama turboquant

Updated Apr 4, 2026
Go

jaylfc / tinyagentos

Star

Self-hosted AI agent OS for Orange Pi and consumer hardware. Desktop shell, app store, agent deployment, distributed compute cluster. Memory by taOSmd.

raspberry-pi distributed-computing self-hosted orange-pi ai-agents ai-platform agent-framework apple-silicon llm vllm local-llm llm-inference kv-cache-quantization rockchip-npu turboquant

Updated Apr 14, 2026
Python

zlaabsi / turboquant-wasm

Star

TurboQuant vector quantization for browser and edge runtimes

browser wasm quantization semantic-search webgpu rag vector-search edge-ai edge-runtime turboquant

Updated Apr 9, 2026
JavaScript

Argonaut790 / fused-turboquant

Star

Fused Triton kernels for TurboQuant KV cache compression — 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.

Updated Apr 1, 2026
Python

yzamari / turboQuantPlayground

Star

TurboQuant (ICLR 2026) ported to Apple Silicon — KV cache compression with MLX Metal kernels + PyTorch CPU

machine-learning deep-learning metal transformers inference pytorch attention quantization mlx iclr kv-cache apple-silicon llm llm-inference turboquant

Updated Mar 31, 2026
Python

carlosfundora / llama.cpp-1-bit-turbo

Star

HIP/ROCm fork optimized for AMD RDNA2 (gfx1030) with PrismML Q1_0_G128 1-bit quant support, RotorQuant, TurboQuant, EAGLE3 and P-EAGLE speculative decoding, and full Wave32 kernel optimizations.

hip quantization bonsai rocm amd-gpu llama-cpp gguf rdna2 turboquant prismml gfx1030

Updated Apr 13, 2026
C++

jimliddle / turboquant-amd-vulkan

Star

A TurboQuant implementation with Llama.cpp for AMD with Vulkan runtime

amd vulkan llms kvcache turboquant

Updated Apr 1, 2026
C++

rookiemann / vllm-windows-build

Star

Native Windows build of vLLM 0.19.0 — no WSL, no Docker. Pre-built wheels + 33-file Windows patch + Multi-TurboQuant KV cache compression (6 methods, 2x cache capacity). PyTorch 2.10 + CUDA 12.6 + Triton + Flash-Attention 2.

Updated Apr 12, 2026
Python

Sggin1 / DGX-SPARK

Star

DGX Spark research and tests - containers, benchmarks, and investigation notes for running models on GB10 (SM 12.1)

aarch64 blackwell kv-cache vllm nvfp4 dgx-spark mamba-ssm sm121 turboquant

Updated Apr 12, 2026
Python

savka777 / orbit

Star

your ai, your rules. — local AI desktop app with hardware-aware model matching, threaded conversations, and TurboQuant integration. no cloud, no subscription, no data leaving your device.

electron desktop-app macos open-source privacy ai llm local-ai ollama turboquant

Updated Mar 30, 2026
TypeScript

carlosfundora / sglang-1-bit-turbo

Star

AMD ROCm (gfx1030) inference fork with RotorQuant/TurboQuant KV compression, PHANTOM-X zero-copy draft speculation, EAGLE3 speculative decoding, 12 RDNA2 crash fixes, and PrismML Bonsai Q1_0_G128 1-bit GGUF support.

triton hip bonsai rocm amd-gpu gguf speculative-decoding sglang rdna2 eagle3 turboquant prismml gfx1030 p-eagle radix-cache

Updated Apr 13, 2026
Python

Improve this page

Add a description, image, and links to the turboquant topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the turboquant topic, visit your repo's landing page and select "manage topics."

Learn more

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

turboquant

Here are 77 public repositories matching this topic...

quantumaikr / quant.cpp

arozanov / turboquant-mlx

PacifAIst / Quansloth

Alberto-Codes / turboquant-vllm

back2matching / turboquant

mindtro / semafold

Firmamento-Technologies / TurboQuant

rookiemann / multi-turboquant

artalis-io / bitnet.c

Lucien2468 / Ollama-TurboQuant-Integration

jaylfc / tinyagentos

zlaabsi / turboquant-wasm

Argonaut790 / fused-turboquant

yzamari / turboQuantPlayground

carlosfundora / llama.cpp-1-bit-turbo

jimliddle / turboquant-amd-vulkan

rookiemann / vllm-windows-build

Sggin1 / DGX-SPARK

savka777 / orbit

carlosfundora / sglang-1-bit-turbo

Improve this page

Add this topic to your repo