This is the official code for the Quartet II NVFP4 training paper.
Create a conda environment and install dependencies (we recommend Python 3.11):

    conda create -n env python=3.11
    conda activate env
    pip install -r requirements.txt

Reproduce Quartet II sweeps in SLURM:

    cd scripts
    sbatch quartetv2_sweep.sh

Inspect the scheme implementation at:
[quartet_2.py](./src/models/quantization/schemes/quartet_2.py)
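The kernels in `./kernels` have stricter requirements than the base environment (CUDA 12.8 or newer, PyTorch around 2.9.0, an RTX 5090), so it may help to check the environment before building them. The following is a minimal sketch, not part of the repository: `parse_version` and `check_environment` are hypothetical helpers, the PyTorch minimum of 2.9 is our reading of "close to latest", and mapping sm120a to compute capability 12.0 is likewise an assumption.

```python
import re

MIN_CUDA = (12, 8)           # stated requirement: CUDA 12.8 or newer
MIN_TORCH = (2, 9)           # assumption: the README only says "close to latest (~2.9.0)"
TARGET_CAPABILITY = (12, 0)  # assumption: sm120a corresponds to compute capability 12.0


def parse_version(text):
    """Return the leading dotted integers of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def check_environment(torch_version, cuda_version, capability):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if parse_version(torch_version)[:2] < MIN_TORCH:
        problems.append(f"PyTorch {torch_version} is older than {MIN_TORCH}")
    if cuda_version is None:
        problems.append("PyTorch was built without CUDA support")
    elif parse_version(cuda_version)[:2] < MIN_CUDA:
        problems.append(f"CUDA {cuda_version} is older than {MIN_CUDA}")
    if capability is None:
        problems.append("no CUDA device is visible")
    elif tuple(capability) != TARGET_CAPABILITY:
        problems.append(f"device capability {tuple(capability)} is not {TARGET_CAPABILITY}")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        # torch.version.cuda is the CUDA version PyTorch was built against,
        # which may differ from the toolkit used to compile the kernels.
        capability = (
            torch.cuda.get_device_capability() if torch.cuda.is_available() else None
        )
        problems = check_environment(torch.__version__, torch.version.cuda, capability)
        for line in problems or ["all checks passed"]:
            print(line)
```

A non-empty result does not necessarily mean the build will fail; it only flags a mismatch with the configuration the kernels were tuned for.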
We provide kernels tuned for RTX 5090 (sm120a) in `./kernels`. They require CUDA 12.8 or newer and a recent PyTorch release (around 2.9.0). Install them with

    cd kernels
    pip install --no-build-isolation .

You can then use the provided drop-in NVFP4 `nn.Linear` replacement as follows:

    import torch
    from quartet2.linear import Quartet_II_linear

    # in_dim and out_dim are the layer's input and output feature sizes
    linear = Quartet_II_linear(
        in_dim,
        out_dim,
        device="cuda",
        dtype=torch.bfloat16,
    )
    ...

You can further benchmark the kernels against BF16, FP8 and Quartet with

    cd test
    python bench_linear.py

To cite the paper:

    @misc{panferov2026quartetiiaccuratellm,
        title={Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation},
        author={Andrei Panferov and Erik Schultheis and Soroush Tabesh and Dan Alistarh},
        year={2026},
        eprint={2601.22813},
        archivePrefix={arXiv},
        primaryClass={cs.LG},
        url={https://arxiv.org/abs/2601.22813},
    }
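Since `Quartet_II_linear` is described as a drop-in `nn.Linear` replacement, one way to apply it to an existing model is to walk the module tree and swap matching layers. The sketch below is an illustration under assumptions, not code from this repository: `swap_modules` is a hypothetical helper, and the demo uses a plain bfloat16 `nn.Linear` as a stand-in so that it runs without the kernels. Only the constructor arguments shown in the usage example above are assumed for `Quartet_II_linear`; how it handles biases or existing weights is not covered here.

```python
def swap_modules(root, is_target, make_replacement):
    """Recursively replace matching submodules of `root` in place.

    `root` only needs a `named_children()` method, as torch.nn.Module has.
    Returns the number of modules that were replaced.
    """
    replaced = 0
    # Materialize the children first so that we do not mutate while iterating.
    for name, child in list(root.named_children()):
        if is_target(child):
            setattr(root, name, make_replacement(child))
            replaced += 1
        else:
            replaced += swap_modules(child, is_target, make_replacement)
    return replaced


if __name__ == "__main__":
    try:
        import torch
        from torch import nn
    except ImportError:
        print("PyTorch is not installed; skipping the demo")
    else:
        model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

        def make_replacement(old):
            # Stand-in for the demo. With the kernels installed, this could be
            # Quartet_II_linear(old.in_features, old.out_features,
            #                   device="cuda", dtype=torch.bfloat16)
            # Note that the new layer starts from fresh weights.
            return nn.Linear(old.in_features, old.out_features, dtype=torch.bfloat16)

        count = swap_modules(model, lambda m: isinstance(m, nn.Linear), make_replacement)
        print(f"replaced {count} linear layers")
```

Passing the predicate and the factory as arguments keeps the helper independent of any particular layer class, so specific layers can be excluded by tightening `is_target`.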