This directory provides a complete finetuning workflow for MOSS-TTS-Nano:
- `prepare_data.py`: precomputes `audio_codes` for target audio and, when needed, `ref_audio_codes`
- `dataset.py`: packs fields such as `text` / `instruction` / `ambient_sound` / `ref_audio` into teacher-forcing samples
- `sft.py`: supports single-GPU, data-parallel, and multi-node training
- `verify.py`: provides basic non-streaming inference checks
- `run_train.sh`: one-click wrapper that chains preprocessing and training

Default model weight locations:
- `./models/MOSS-TTS-Nano`
- `./models/MOSS-Audio-Tokenizer-Nano`

From the repository root:
```bash
cd /path/to/MOSS-TTS-Nano
pip install -r requirements.txt
```
`requirements.txt` already includes:

```
accelerate>=1.0.0
tqdm>=4.66.0
```

The Nano finetuning pipeline mainly supports the following two data formats.
```jsonl
{"audio":"./data/utt0001.wav","text":"I realized that I am actually very good at noticing other people's emotions.","language":"en"}
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","language":"en"}
```
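A minimal way to generate and sanity-check lines in this format, using only the Python standard library (the output filename here is arbitrary, not one the pipeline requires):

```python
import json

# Required fields for the plain text-to-speech format shown above.
REQUIRED = ("audio", "text", "language")

def validate_record(rec: dict) -> None:
    """Raise if a record is missing a required field or holds a non-string value."""
    for key in REQUIRED:
        if key not in rec:
            raise ValueError(f"missing field: {key}")
        if not isinstance(rec[key], str):
            raise ValueError(f"field {key!r} must be a string")

def write_jsonl(path: str, records: list) -> None:
    """Write one JSON object per line, as the pipeline expects."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            validate_record(rec)
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

write_jsonl("train_raw_demo.jsonl", [
    {"audio": "./data/utt0002.wav",
     "text": "She said she would be here by noon.",
     "language": "en"},
])
```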
Only one reference field is supported:

- `ref_audio`: a single reference audio clip

Example:

```jsonl
{"audio":"./data/utt0001.wav","text":"I realized that I am actually very good at noticing other people's emotions.","ref_audio":"./data/ref.wav","language":"en"}
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","ref_audio":"./data/ref.wav","language":"en"}
```
If needed, you can also provide the following fields. They will be appended to the user prompt:

- `instruction`
- `tokens`
- `quality`
- `sound_event`
- `ambient_sound`
- `language`

After preprocessing, each sample additionally carries `audio_codes` and, when a reference clip is present, `ref_audio_codes`. `prepare_data.py` does two things:

- encodes `audio` into `audio_codes`
- encodes `ref_audio` into `ref_audio_codes` (by default)

```bash
python finetuning/prepare_data.py \
  --codec-path ./models/MOSS-Audio-Tokenizer-Nano \
  --input-jsonl train_raw.jsonl \
  --output-jsonl train_with_codes.jsonl \
  --batch-size 8
```
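Conceptually, the script's two steps can be sketched as follows. Here `encode_audio` is a hypothetical placeholder, not the real codec API; the actual script runs the MOSS-Audio-Tokenizer-Nano model at that point:

```python
import json

def encode_audio(path: str) -> list:
    """Hypothetical stand-in for the codec call; the real script
    encodes the waveform with MOSS-Audio-Tokenizer-Nano here."""
    return [0, 1, 2]  # dummy codes for illustration

def add_codes(in_path: str, out_path: str, skip_reference: bool = False) -> None:
    """Conceptual version of the two steps: audio -> audio_codes and,
    unless skipped, ref_audio -> ref_audio_codes."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            rec = json.loads(line)
            rec["audio_codes"] = encode_audio(rec["audio"])
            if "ref_audio" in rec and not skip_reference:
                rec["ref_audio_codes"] = encode_audio(rec["ref_audio"])
            fout.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

The `skip_reference` flag mirrors the `--skip-reference-audio-codes` option shown next.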
If you only want to encode target audio and skip reference audio:
```bash
python finetuning/prepare_data.py \
  --codec-path ./models/MOSS-Audio-Tokenizer-Nano \
  --input-jsonl train_raw.jsonl \
  --output-jsonl train_with_codes.jsonl \
  --skip-reference-audio-codes
```
`prepare_data.py` follows the standard `accelerate launch` multi-process semantics.
For example, with 2 nodes and 16 GPUs in total, the input is split into 16 shards and each rank writes its own output shard:
```bash
accelerate launch --num_processes 16 finetuning/prepare_data.py \
  --codec-path ./models/MOSS-Audio-Tokenizer-Nano \
  --input-jsonl train_raw.jsonl \
  --output-jsonl prepared/train_with_codes.jsonl
```
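A minimal sketch of per-rank splitting and shard naming. The interleaved split is an assumption for illustration (the script's exact split may differ); the naming function reproduces the shard filenames the script emits:

```python
def shard(lines: list, rank: int, world_size: int) -> list:
    """Round-robin split: rank r takes lines r, r + world_size, r + 2*world_size, ...
    (an assumed scheme; prepare_data.py's actual split may differ)."""
    return lines[rank::world_size]

def shard_name(prefix: str, rank: int, world_size: int) -> str:
    """Reproduce the rankXXXXX-of-XXXXX naming of the output shards."""
    return f"{prefix}.rank{rank:05d}-of-{world_size:05d}.jsonl"
```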
The outputs look like:

- `prepared/train_with_codes.rank00000-of-00016.jsonl`
- `prepared/train_with_codes.rank00001-of-00016.jsonl`
- ...
- `prepared/train_with_codes.rank00015-of-00016.jsonl`

During training, `sft.py` can directly read:

```
prepared/train_with_codes.rank*.jsonl
```

If your platform already injects multi-node communication environment variables, `accelerate launch` can usually reuse them directly.
```bash
accelerate launch finetuning/sft.py \
  --model-path ./models/MOSS-TTS-Nano \
  --codec-path ./models/MOSS-Audio-Tokenizer-Nano \
  --train-jsonl train_with_codes.jsonl \
  --output-dir output/moss_tts_nano_sft \
  --per-device-batch-size 1 \
  --gradient-accumulation-steps 8 \
  --learning-rate 1e-5 \
  --warmup-ratio 0.03 \
  --num-epochs 3 \
  --mixed-precision bf16 \
  --max-length 1024 \
  --channelwise-loss-weight 1,32
```
```bash
accelerate launch \
  --config_file finetuning/configs/accelerate_ddp_8gpu.yaml \
  finetuning/sft.py \
  --model-path ./models/MOSS-TTS-Nano \
  --codec-path ./models/MOSS-Audio-Tokenizer-Nano \
  --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \
  --output-dir output/moss_tts_nano_sft_ddp \
  --per-device-batch-size 1 \
  --gradient-accumulation-steps 4 \
  --learning-rate 1e-5 \
  --num-epochs 3 \
  --mixed-precision bf16 \
  --max-length 1024 \
  --channelwise-loss-weight 1,32
```
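The `1,32` value in the commands above uses the two-value form of `--channelwise-loss-weight` (text weight, total audio weight); the other accepted form lists one weight per head. A sketch of parsing such a spec, where the even spread of the audio total across VQ channels and the channel count are assumptions for illustration, not `sft.py`'s actual code:

```python
def parse_channel_weights(spec: str, num_vq: int) -> list:
    """Parse a --channelwise-loss-weight spec into per-head weights.

    Two shapes are accepted per the docs:
      * 1 + num_vq values: text_head,vq0,...,vqN (used as-is)
      * 2 values: text_weight,total_audio_weight; here the audio total is
        spread evenly over the VQ channels (an assumed policy; sft.py may
        distribute it differently).
    """
    vals = [float(x) for x in spec.split(",")]
    if len(vals) == 1 + num_vq:
        return vals
    if len(vals) == 2:
        text_w, audio_total = vals
        return [text_w] + [audio_total / num_vq] * num_vq
    raise ValueError(f"expected 2 or {1 + num_vq} comma-separated values, got {len(vals)}")
```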
Update the following fields in your config file to match your cluster:

- `num_machines`
- `num_processes`
- `machine_rank`
- `main_process_ip`
- `main_process_port`

Keep the rest of the training command unchanged.
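As a reference, a minimal multi-node variant of an accelerate DDP config could look like this. All values are placeholders for a hypothetical 2-node, 16-GPU setup; prefer adapting the repository's own `accelerate_ddp_8gpu.yaml`, which may carry additional settings:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 2           # total nodes
num_processes: 16         # total GPUs across all nodes
machine_rank: 0           # 0 on the main node, 1 on the second
main_process_ip: 10.0.0.1 # placeholder: the main node's address
main_process_port: 29500  # placeholder: any free port
```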
- `--max-length`: fixed full sequence length. Samples are truncated to this length and then padded.
- `--channelwise-loss-weight`: supports two formats:
  - `text_head,vq0,...,vqN`
  - `text_weight,total_audio_weight`
- `--save-every-epochs`: save one checkpoint every N epochs.

Single-GPU memory reference:
With `accelerate launch --num_processes 1` and `--per-device-batch-size 1 --gradient-accumulation-steps 1 --max-length 1024 --mixed-precision bf16`, the measured training-process peak memory usage is about 3.23 GiB.

Each checkpoint directory can be loaded directly by the inference code in this repository. It contains:
- `config.json`
- `finetune_config.json`

If you want a simple wrapper that chains preprocessing and training:
```bash
bash finetuning/run_train.sh
```
Common environment variables:
- `RAW_JSONL`: raw training JSONL
- `PREPARED_JSONL`: preprocessed JSONL
- `TRAIN_JSONL`: training input; if unset, it is inferred from `PREPARED_JSONL`
- `OUTPUT_DIR`: output directory
- `SKIP_PREPARE=1`: skip preprocessing and train directly
- `PREP_ACCELERATE_ARGS_STR`: extra `accelerate` args for `prepare_data.py`
- `TRAIN_ACCELERATE_ARGS_STR`: extra `accelerate launch` args for training, mainly for overriding `num_machines` / `num_processes` / `machine_rank`
- `PREP_EXTRA_ARGS_STR`: extra args passed to `prepare_data.py`
- `TRAIN_EXTRA_ARGS_STR`: extra args passed to `sft.py`
- `ACCELERATE_CONFIG_FILE`: training-time `accelerate` config file; if `TRAIN_ACCELERATE_ARGS_STR` is also provided, command-line values override the config defaults

Example:
```bash
RAW_JSONL=train_raw.jsonl \
PREPARED_JSONL=prepared/train_with_codes.jsonl \
OUTPUT_DIR=output/moss_tts_nano_sft \
PREP_ACCELERATE_ARGS_STR='--num_processes 8' \
ACCELERATE_CONFIG_FILE=finetuning/configs/accelerate_ddp_8gpu.yaml \
TRAIN_EXTRA_ARGS_STR='--per-device-batch-size 1 --gradient-accumulation-steps 4 --learning-rate 1e-5 --num-epochs 3 --mixed-precision bf16 --max-length 1024 --channelwise-loss-weight 1,32' \
bash finetuning/run_train.sh
```
For multi-node runs, the same idea applies: prepare shared encoded data first, then adjust `ACCELERATE_CONFIG_FILE` or `TRAIN_ACCELERATE_ARGS_STR` for your cluster.
`verify.py` keeps the inference path intentionally simple. It supports:
- `voice_clone`: reference audio + target text
- `continuation`: continuation mode, with two input patterns:
  - `prompt_text` + `prompt_audio_path` + `text`
  - `text` only, which degrades to plain TTS

```bash
python finetuning/verify.py \
  --checkpoint output/moss_tts_nano_sft/checkpoint-last \
  --mode voice_clone \
  --text "This is a quick validation example for a finetuned model." \
  --prompt-audio-path ./assets/audio/zh_1.wav \
  --output-audio-path output/verify_voice_clone.wav
```
If `continuation` is used with `--prompt-audio-path`, you must also provide the corresponding `--prompt-text`:
```bash
python finetuning/verify.py \
  --checkpoint output/moss_tts_nano_sft/checkpoint-last \
  --mode continuation \
  --prompt-text "This sentence has already been spoken in the prompt audio." \
  --prompt-audio-path ./assets/audio/zh_1.wav \
  --text "This next sentence continues from that prompt for a quick continuation check." \
  --output-audio-path output/verify_continuation.wav
```
If you only want plain text-to-speech without reference audio, still use `continuation`, but do not pass `--prompt-text` or `--prompt-audio-path`:
```bash
python finetuning/verify.py \
  --checkpoint output/moss_tts_nano_sft/checkpoint-last \
  --mode continuation \
  --text "This is a quick non-streaming validation example." \
  --output-audio-path output/verify_tts.wav
```
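The two rules above (prompt audio requires matching prompt text; passing neither degrades continuation to plain TTS) can be sketched as a small argument check. This is illustrative only, not `verify.py`'s actual code:

```python
def continuation_mode(text: str, prompt_text: str = None,
                      prompt_audio_path: str = None) -> str:
    """Classify a continuation-mode invocation and enforce the prompt-text rule."""
    if prompt_audio_path is not None and prompt_text is None:
        raise ValueError("--prompt-text is required when --prompt-audio-path is set")
    if prompt_audio_path is not None:
        return "continuation"
    return "plain_tts"
```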
You can also continue using the repository-level `infer.py`. Checkpoints saved by finetuning are already packaged in a format that `infer.py` can load directly.