Burn is a next-generation tensor library and Deep Learning Framework that doesn't compromise on flexibility, efficiency, or portability.
AMD RAD's Triton-based framework for seamless multi-GPU programming
An efficient concurrent graph processing system
GoPTX: Fine-grained GPU Kernel Fusion by PTX-level Instruction Flow Weaving
Fused Triton kernels for TurboQuant KV cache compression — 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.
LAMB go brrr
MLX + Metal implementation of mHC: Manifold-Constrained Hyper-Connections by DeepSeek-AI.
Fused Triton kernels for Transformer inference: RMSNorm+RoPE, Gated MLP, and FP8 GEMM.
Assignment 3 for the "Parallel & Distributed Systems" course (ECE, AUTh) - Fall 2024
Noeris: autonomous kernel-fusion discovery and Triton autotuning for LLM kernels, with deeper fusion of Gemma layers (speedups on A100/H100).
Compile-time kernel fusion and expression trees as an Alpaka backend for boost.odeint. This is my team project, developed in collaboration with and under the supervision of HZDR.
Zero-dependency WebGPU deep learning inference engine (~50KB vs TensorFlow.js ~2MB)
Production-grade Triton kernel fusing residual add + RMSNorm + packed QKV projection into a single GPU launch for decoder-only transformer inference (Llama-3, Mistral, Qwen2). +2.4% tok/s, -1.5 GB VRAM on A10G.
WebGPU quantum many-body simulator — statevector + MPS + kernel fusion + chemistry. 160 tests, ITensor-validated, PySCF to 7 decimals. Browser-native.
ADAS sensor fusion benchmark — 11-stage fused wgpu-native vs multi-kernel PyTorch. 12-15x faster on same GPU.
High-performance CUDA implementation of LayerNorm for PyTorch achieving 1.46x speedup through kernel fusion. Optimized for large language models (4K-8K hidden dims) with vectorized memory access, warp-level primitives, and mixed precision support. Drop-in replacement for nn.LayerNorm with 25% memory reduction.
Pushing fused WebGPU transformer kernels to max model size — int4, tiled FFN, Phi-3-mini 3.6B in Chrome
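Several of the entries above (the RMSNorm+RoPE kernels, the residual add + RMSNorm + QKV launch, the fused LayerNorm) revolve around the same pattern: folding an elementwise add and a normalization into a single GPU launch so the intermediate tensor never round-trips through global memory. The sketch below is a minimal, hypothetical Triton kernel illustrating that pattern; it is not taken from any of the listed repositories, and it assumes contiguous row-major activations with a hidden size that fits in one block.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fused_add_rmsnorm_kernel(
    x_ptr, res_ptr, w_ptr, out_ptr,
    n_cols, eps,
    BLOCK: tl.constexpr,
):
    # One program instance per token row (assumed layout: contiguous [n_rows, n_cols]).
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols

    # Load hidden state and residual once; the fused kernel never materializes
    # the intermediate (x + residual) tensor in global memory.
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    r = tl.load(res_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    h = x + r

    # RMSNorm: h / sqrt(mean(h^2) + eps) * weight
    rms = tl.sqrt(tl.sum(h * h, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = h / rms * w

    tl.store(out_ptr + row * n_cols + cols, y.to(out_ptr.dtype.element_ty), mask=mask)


def fused_add_rmsnorm(x, residual, weight, eps=1e-6):
    # x, residual: [n_rows, n_cols] CUDA tensors; weight: [n_cols]. Assumes contiguous inputs.
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK = triton.next_power_of_2(n_cols)
    fused_add_rmsnorm_kernel[(n_rows,)](x, residual, weight, out, n_cols, eps, BLOCK=BLOCK)
    return out
```

Compared with running the add and the norm as separate kernels, the fusion saves one full read and one full write of the hidden-state tensor per layer; that memory-traffic saving is the kind of win the projects above report.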