🎉 This is the official implementation of our AAAI 2026 paper:
PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement.
Three checkpoints are provided:
- `DeWavLM.tar`
- `Vocoder-L24.tar`
- `Vocoder-Dual.tar`

`DeWavLM.tar` and `Vocoder-Dual.tar` together form the PASE model.
Note that the released checkpoints were trained on a relatively small dataset:
- Speech: DNS5, LibriTTS, VCTK
- Noise: DNS5
- RIRs: OpenSLR26, OpenSLR28
The performance of the retrained version compared to the original one:
| Model | DNSMOS | UTMOS | SBS | LPS | SpkSim | WER (%) |
|---|---|---|---|---|---|---|
| Vocoder-L24 (orig.) | 3.23 | 3.40 | 0.94 | 0.97 | 0.65 | 2.86 |
| Vocoder-L24 (retrained) | 3.29 | 3.30 | 0.94 | 0.96 | 0.59 | 3.46 |
| DeWavLM (orig.) | 3.26 | 3.42 | 0.88 | 0.93 | 0.57 | 7.62 |
| DeWavLM (retrained) | 3.31 | 3.39 | 0.88 | 0.93 | 0.52 | 7.25 |
| PASE (orig.) | 3.12 | 3.09 | 0.90 | 0.93 | 0.80 | 7.49 |
| PASE (retrained) | 3.08 | 3.21 | 0.91 | 0.94 | 0.80 | 6.76 |
The retrained checkpoints thus achieve performance very close to the original ones on our simulated test set.
Note: The Vocoder-L24 (retrained) was trained for only 60 epochs (30k iterations), as we found that it tends to overfit on such a small training set.
To run inference on audio files, use:

```shell
python -m inference.inference -I <input_dir> -O <output_dir> [options]
```

| Argument | Requirement / Default | Description |
|---|---|---|
| `-I` (`--input_dir`) | required | Path to the input directory containing audio files. |
| `-O` (`--output_dir`) | required | Path to the output directory where enhanced files will be saved. |
| `-D` (`--device`) | default: `cuda:0` | Torch device to run inference on, e.g., `cuda:0`, `cuda:1`, or `cpu`. |
| `-E` (`--extension`) | default: `.wav` | Audio file extension to process. |
Audio examples are provided in ./test/audio_enh.
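For reference, the file-handling behavior implied by the CLI flags above can be sketched as follows. This is an illustrative sketch only: `collect_audio_files` and `run_inference` are hypothetical names, and the byte-copy stands in for the actual model enhancement performed by `inference.inference`.

```python
from pathlib import Path


def collect_audio_files(input_dir: str, extension: str = ".wav") -> list[Path]:
    """Gather audio files with the given extension (mirrors the -I/-E flags)."""
    return sorted(Path(input_dir).rglob(f"*{extension}"))


def run_inference(input_dir: str, output_dir: str, extension: str = ".wav") -> list[Path]:
    """Sketch of an enhance-and-save loop (mirrors the -O flag).

    The byte copy below is a placeholder for the real enhancement model.
    """
    out_root = Path(output_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    written = []
    for src in collect_audio_files(input_dir, extension):
        dst = out_root / src.name          # keep the original file name
        dst.write_bytes(src.read_bytes())  # placeholder for model enhancement
        written.append(dst)
    return written
```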
- training script: `train/train_vocoder.py`
- training configuration: `configs/cfg_train_vocoder.yaml`

```shell
python -m train.train_vocoder -C configs/cfg_train_vocoder.yaml -D 0,1,2,3
```

- inference script: `inference/infer_vocoder.py`

```shell
python -m inference.infer_vocoder -C configs/cfg_infer.yaml -D 0
```
This step aims to pre-train a vocoder using the 24th-layer WavLM representations. The pre-trained single-stream vocoder is then used in Step 2 to reconstruct waveforms, enabling the evaluation of DeWavLM’s performance.
- training script: `train/train_dewavlm.py`
- training configuration: `configs/cfg_train_dewavlm.yaml`
- inference script: `inference/infer_dewavlm.py`
(The usage is the same as in Step 1.)
This step aims to obtain a denoised WavLM (DeWavLM) via knowledge distillation, referred to in the paper as denoising representation distillation (DRD).
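The core of the distillation objective can be sketched as follows, assuming the generic form of representation distillation: a frozen teacher WavLM sees clean speech, the student sees the corresponding noisy speech, and the student is trained to match the teacher's features. The MSE form and the function name `drd_loss` are illustrative assumptions; the exact objective is defined in `train/train_dewavlm.py`.

```python
import numpy as np


def drd_loss(student_feats: np.ndarray, teacher_feats: np.ndarray) -> float:
    """Distance between student features (computed from noisy input) and
    frozen-teacher features (computed from clean input).

    MSE is used here as a generic stand-in; the paper's actual objective
    may differ.
    """
    assert student_feats.shape == teacher_feats.shape  # (frames, feat_dim)
    return float(np.mean((student_feats - teacher_feats) ** 2))
```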
- training script: `train/train_vocoder_dual.py`
- training configuration: `configs/cfg_train_vocoder_dual.yaml`
- inference script: `inference/infer_vocoder_dual.py`
(The usage is the same as in Step 1.)
This step trains the final dual-stream vocoder, which takes the acoustic (1st-layer) and phonetic (24th-layer) DeWavLM representations as inputs and produces the final enhanced waveform.
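The dual-stream input described above can be sketched as follows. Plain channel-wise concatenation and the feature dimension of 1024 (WavLM Large's hidden size) are illustrative assumptions; how the two streams are actually fused is defined inside the dual-stream vocoder (`train/train_vocoder_dual.py`).

```python
import numpy as np


def dual_stream_input(layer1_feats: np.ndarray, layer24_feats: np.ndarray) -> np.ndarray:
    """Combine the acoustic (layer-1) and phonetic (layer-24) DeWavLM
    feature streams along the feature axis.

    Concatenation is a stand-in for the vocoder's internal fusion.
    """
    assert layer1_feats.shape == layer24_feats.shape  # e.g., (frames, 1024)
    return np.concatenate([layer1_feats, layer24_feats], axis=-1)
```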
Once all training steps are completed, the corresponding checkpoints can be prepared for inference:
- `utils/create_ckpt_wavlm.py` is used to create a DeWavLM checkpoint.
- `utils/create_ckpt.py` is used to create a Vocoder-L24 or Vocoder-Dual checkpoint.
If you find this work useful, please cite our paper:
```bibtex
@article{PASE,
  title={{PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement}},
  volume={40},
  DOI={10.1609/aaai.v40i39.40562},
  number={39},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence},
  author={Rong, Xiaobin and Hu, Qinwen and Yesilbursa, Mansur and Wojcicki, Kamil and Lu, Jing},
  year={2026},
  month={Mar.},
  pages={32826--32834}
}
```

Xiaobin Rong: xiaobin.rong@smail.nju.edu.cn
Mansur Yesilbursa: myesilbu@cisco.com