🎵 Demo • 📄 Paper (Coming Soon) • 📚 Citation
This repository is the official implementation of "SegTune: Structured and Fine-Grained Control for Song Generation". It provides the training and inference scripts for the SegTune model, together with the evaluation pipelines.
Recent advances in neural song generation have enabled high-quality synthesis from lyrics and global textual prompts. However, most systems fail to model temporally varying attributes of songs, severely limiting fine-grained control over musical structure and dynamics. To address this, we propose SegTune, a Diffusion Transformer-based framework enabling structured and fine-grained controllability by allowing users or large language models (LLMs) to specify local musical descriptions aligned to song segments. These segment prompts are temporally broadcast to corresponding time windows, while global prompts ensure stylistic coherence. To support precise lyric-to-music alignment, we introduce an LLM-based duration predictor that autoregressively generates sentence-level timestamps in LyRiCs format. We further construct a large-scale data pipeline for high-quality song collection with aligned lyrics and prompts, and propose new metrics to evaluate segment alignment and vocal consistency. Experiments demonstrate that SegTune outperforms existing baselines in both musicality and controllability. Visit our Demo Page for more generated songs of SegTune.
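As a rough illustration of the temporal broadcasting idea described above, the sketch below assigns each time frame the prompt of the segment whose window covers it. It is not the SegTune implementation: the function name broadcast_segment_prompts, the frame rate, the example segments, and the use of plain strings in place of prompt embeddings are all assumptions made for this example.

```python
from typing import List, Tuple


def broadcast_segment_prompts(
    segments: List[Tuple[float, float, str]],
    total_seconds: float,
    fps: int,
    fill: str = "",
) -> List[str]:
    """Give every frame the prompt of the segment whose time window covers it.

    segments holds (start_seconds, end_seconds, prompt) tuples. Frames not
    covered by any segment keep the `fill` value.
    """
    n_frames = int(round(total_seconds * fps))
    frame_prompts = [fill] * n_frames
    for start, end, prompt in segments:
        first = max(0, int(round(start * fps)))
        last = min(n_frames, int(round(end * fps)))
        for i in range(first, last):
            frame_prompts[i] = prompt
    return frame_prompts


# Hypothetical segments and frame rate, chosen only to keep the output short.
segments = [(0.0, 2.0, "soft piano verse"), (2.0, 3.0, "energetic chorus")]
frames = broadcast_segment_prompts(segments, total_seconds=4.0, fps=2)
print(frames)
```

In a real model the broadcast values would presumably be embedding vectors combined with the global prompt, but the indexing logic would look similar.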
- 2026.04.17 🎉: We release the SegTune codebase, including the training and inference scripts and the evaluation pipelines.
Requirements: Python 3.10.
To set up the environment for SegTune:
Note: If you plan to use the LLM-based LRC Composer (src/lrc_gen/), copy config/gpt_config.json.example to config/gpt_config.json and fill in your own Azure OpenAI credentials.
# Install espeak-ng (for phonemization)
# For Debian/Ubuntu
sudo apt-get install espeak-ng
# Create conda environment
conda create -n segtune python=3.10
conda activate segtune
# Install requirements
pip install -r requirements.txt

This repository contains the following main directories:
- src/model/: SegTune core model (CFM, DiT, TemporalControlDiT, Trainer)
- src/dataset/: Dataset classes for diffusion and temporal control training
- src/dpo/: Standalone DPO (Direct Preference Optimization) module
  - dpo_cfm.py: DPOCFM model for win/loss preference training
  - dpo_dataset.py: DPO dataset handling win/loss latent pairs
- src/dpo_jam/: DPO training integrated with the JAM framework
  - dpo_cfm.py: JAM-compatible DPOCFM model
  - dpo_dataset.py: JAM-compatible DPO dataset
  - dpo_trainer.py: DPO-specific training loop with preference metrics
  - train_dpo.py: DPO training entry point
- src/lrc_prediction/: Qwen3-based LRC timestamp prediction module
  - finetuning.py: LoRA fine-tuning script
  - inference.py: LRC prediction inference
  - data_preprocessing.py: Data preprocessing utilities
  - configs/: Training configurations
- src/lrc_gen/: LLM-based Composer for generating timestamped LRC
- src/g2p/: Grapheme-to-Phoneme conversion (CN/EN multilingual)
- src/preprocess/: Feature extraction and audio preprocessing
- infer/: Inference scripts
  - pipeline.py: End-to-end SegTune pipeline (LRC → Audio)
  - termporal_control_infer.py: Global and segmental control inference
  - infer_utils.py: Inference utilities
- train/: Training scripts
  - temporal_control_train.py: Temporal control training
- config/: Model configurations
- scripts/: Shell scripts for training and inference
- thirdparty/: Bundled third-party libraries (LangSegment)
First, prepare your dataset in jsonl format with the following fields:
{
  "lrc_path": "path/to/timestamped.lrc",
  "flamingo_struct": {
    "global_analysis": "Song description...",
    "segment_analyses": ["Verse description...", "Chorus description..."]
  }
}

Then configure and train:
# Edit config
vim src/lrc_prediction/configs/config.yaml
# Single GPU training
python src/lrc_prediction/finetuning.py \
--config src/lrc_prediction/configs/config.yaml
# Multi-GPU training with accelerate
accelerate launch --config_file src/lrc_prediction/configs/accelerate_config.yaml \
src/lrc_prediction/finetuning.py \
--config src/lrc_prediction/configs/config.yaml

The fine-tuned LoRA weights will be saved to ./exps/train/fine_tuned_model/.
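Before launching fine-tuning, it can help to sanity-check the jsonl dataset against the field layout shown earlier in this section. The sketch below is one possible check, not part of the repository: the helper names validate_record and validate_jsonl_lines are hypothetical, and the type constraints (strings and a list of strings) are inferred from the example record rather than from the training code.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return the problems found in one dataset record (empty list if none)."""
    problems = []
    if not isinstance(record.get("lrc_path"), str):
        problems.append("lrc_path must be a string")
    struct = record.get("flamingo_struct")
    if not isinstance(struct, dict):
        problems.append("flamingo_struct must be an object")
        return problems
    if not isinstance(struct.get("global_analysis"), str):
        problems.append("flamingo_struct.global_analysis must be a string")
    segments = struct.get("segment_analyses")
    if not (isinstance(segments, list) and all(isinstance(s, str) for s in segments)):
        problems.append("flamingo_struct.segment_analyses must be a list of strings")
    return problems


def validate_jsonl_lines(lines: Iterable[str]) -> Iterator[Tuple[int, List[str]]]:
    """Yield (line_number, problems) for each non-empty line that has problems."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            yield number, [f"invalid JSON: {exc.msg}"]
            continue
        if not isinstance(record, dict):
            yield number, ["record must be a JSON object"]
            continue
        problems = validate_record(record)
        if problems:
            yield number, problems


# Two in-memory sample lines: the first matches the documented layout,
# the second is missing flamingo_struct.
sample_lines = [
    json.dumps({
        "lrc_path": "path/to/timestamped.lrc",
        "flamingo_struct": {
            "global_analysis": "Song description...",
            "segment_analyses": ["Verse description...", "Chorus description..."],
        },
    }),
    json.dumps({"lrc_path": "path/to/other.lrc"}),
]
report = list(validate_jsonl_lines(sample_lines))
for number, problems in report:
    print(f"line {number}: {problems}")
```

To check a real file, pass an open file handle instead of sample_lines, since iterating over a text file yields its lines.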
# Using the provided script
bash scripts/train.sh
# Or run directly with accelerate
accelerate launch --config_file config/accelerate_config.yaml \
train/temporal_control_train.py \
--model-config config/diffrhythm-1b_qwen3.json \
--exp-name segtune-exp \
--batch-size 8 \
--max-frames 3000 \
--learning-rate 7.5e-5 \
--epochs 110 \
--file-path "dataset/train.scp"

Key parameters:
- --model-config: Model architecture config (default: config/diffrhythm-1b_qwen3.json)
- --batch-size: Training batch size
- --max-frames / --min-frames: Audio frame range
- --cond-drop-prob, --style-drop-prob, --lrc-drop-prob: Dropout rates for CFG training
- --file-path: Path to training data list file
Trained checkpoints will be saved to ckpts/{exp_name}/.
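The drop-probability flags listed under the key parameters relate to classifier-free guidance (CFG) training, where conditions are randomly withheld during training so the model also learns an unconditional prediction. The sketch below shows the general idea only. The function drop_conditions, the dictionary keys, and the use of None as the "dropped" value are assumptions for illustration; how SegTune maps each flag to a model input is not described here.

```python
import random
from typing import Dict, Optional


def drop_conditions(
    conditions: Dict[str, object],
    drop_probs: Dict[str, float],
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    """Independently replace each condition with None with its own probability.

    A condition whose name is missing from drop_probs is never dropped.
    """
    rng = rng or random.Random()
    result = {}
    for name, value in conditions.items():
        p = drop_probs.get(name, 0.0)
        # rng.random() is in [0, 1), so p=0.0 never drops and p=1.0 always drops.
        result[name] = None if rng.random() < p else value
    return result


# Placeholder values standing in for real conditioning tensors.
conditions = {"cond": "audio-prompt", "style": "style-embedding", "lrc": "lyric-tokens"}
always = drop_conditions(conditions, {"cond": 1.0, "style": 1.0, "lrc": 1.0})
never = drop_conditions(conditions, {"cond": 0.0, "style": 0.0, "lrc": 0.0})
print(always)
print(never)
```

Higher drop probabilities give the model more unconditional training examples at the cost of fewer conditioned ones, so these values are usually tuned per condition.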
SegTune integrates LRC timestamp prediction (Qwen3 LoRA) and the diffusion model into a single pipeline:
Raw lyrics + Song description
↓
LRC Prediction (Qwen3 LoRA)
↓
Timestamped .lrc file
↓
SegTune (Latent Diffusion)
↓
Full-length song audio
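The intermediate timestamped .lrc file in the diagram above carries one timestamp per lyric line. As a minimal sketch, assuming the common LRC convention of [mm:ss.xx] followed by the lyric text, the helper below turns such a file into (start_seconds, lyric) pairs. The function parse_lrc and the sample lyrics are hypothetical, and the exact LRC variant that SegTune reads or writes may differ.

```python
import re
from typing import List, Tuple

# Matches lines such as "[01:03.50]Some lyric"; the fractional part is optional.
LRC_LINE = re.compile(r"^\[(\d+):(\d{2})(?:\.(\d{1,3}))?\](.*)$")


def parse_lrc(text: str) -> List[Tuple[float, str]]:
    """Parse LRC text into a list of (start_seconds, lyric) pairs."""
    entries = []
    for raw in text.splitlines():
        match = LRC_LINE.match(raw.strip())
        if match is None:
            # Blank lines and metadata tags such as [ar:Artist] are skipped.
            continue
        minutes, seconds, fraction, lyric = match.groups()
        start = int(minutes) * 60 + int(seconds)
        if fraction:
            start += int(fraction) / (10 ** len(fraction))
        entries.append((float(start), lyric.strip()))
    return entries


# Made-up example content, for illustration only.
example = "[ar:Someone]\n[00:05.20]First line\n[00:12.75]Second line\n[01:03.00]Third line"
parsed = parse_lrc(example)
for start, lyric in parsed:
    print(f"{start:7.2f}s  {lyric}")
```

The start time of each line, together with the start of the next one, gives the time window in which that lyric is expected to be sung.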
Note: --jsonl-path and --lrc-path are mutually exclusive (choose one).
# Method 1: Full pipeline (jsonl → LRC prediction → audio generation)
python infer/pipeline.py \
--jsonl-path datasets/test/test.jsonl \
--output-dir infer/example/output
# --lrc-model-name and --lrc-lora-dir have default values, override if needed
# Method 2: Skip LRC prediction, directly generate audio from existing .lrc file
python infer/pipeline.py \
--lrc-path infer/example/input.lrc

We thank the following open-source projects that make SegTune possible:
- DiffRhythm (ASLP-lab, NWPU): The foundational latent diffusion model for full-length song generation.
- Qwen3 (Alibaba): The base language model for LRC timestamp prediction.
- F5-TTS (SWivid): The Conditional Flow Matching (CFM) training and inference framework adapted for audio generation.
- Amphion (OpenMMLab): Audio processing utilities and evaluation metrics.
- ms-swift (ModelScope): Training framework for LLM fine-tuning.
- MuQ (OpenMuQ): MuQ-MuLan audio-language contrastive model used as the style/audio prompt encoder.
- LangSegment (juntaosun): Multilingual text segmentation library used in the G2P pipeline for mixed-language lyric processing.
- phonemizer (bootphon): Multilingual text-to-phoneme conversion via the eSpeak-NG backend, used in G2P processing.
If you find our work useful, please cite:
@article{segtune2026,
title={SegTune: Structured and Fine-Grained Control for Song Generation},
author={Yuejiao Wang* and Zihao Ji* and Pengfei Cai* and Xu Li† and Haorui Zheng and Zewen Song and Zhongliang Liu and Chen Zhang and Pengfei Wan},
journal={The 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)},
year={2026}
}

For questions and discussions, please contact us at yuejiaowang@link.cuhk.edu.hk or lixu15@kuaishou.com.
