logo
0
0
WeChat Login

MOSS-TTS-Nano Finetuning Guide

This directory provides a complete finetuning workflow for MOSS-TTS-Nano:

  • prepare_data.py: precomputes audio_codes for target audio and, when needed, ref_audio_codes
  • dataset.py: packs fields such as text / instruction / ambient_sound / ref_audio into teacher-forcing samples
  • sft.py: supports single-GPU, data parallel, and multi-node training
  • verify.py: provides basic non-streaming inference checks
  • run_train.sh: one-click wrapper that chains preprocessing and training

Default model weight locations:

  • TTS model: ./models/MOSS-TTS-Nano
  • Audio codec: ./models/MOSS-Audio-Tokenizer-Nano

1. Install Dependencies

From the repository root:

cd /path/to/MOSS-TTS-Nano pip install -r requirements.txt

requirements.txt already includes:

  • accelerate>=1.0.0
  • tqdm>=4.66.0

2. Raw JSONL Format

The Nano finetuning pipeline mainly supports the following two formats.

2.1 Plain text, speech pairs

{"audio":"./data/utt0001.wav","text":"I realized that I am actually very good at noticing other people's emotions.","language":"en"} {"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","language":"en"}

2.2 Voice Cloning / Reference-Conditioned Training

Only one reference field is supported:

  • ref_audio: a single reference audio clip

Example:

{"audio":"./data/utt0001.wav","text":"I realized that I am actually very good at noticing other people's emotions.","ref_audio":"./data/ref.wav","language":"en"} {"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","ref_audio":"./data/ref.wav","language":"en"}

2.3 Optional Fields

If needed, you can also provide the following fields. They will be appended to the user prompt:

  • instruction
  • tokens
  • quality
  • sound_event
  • ambient_sound
  • language

2.4 Path Rules

  • Relative paths in the raw JSONL are resolved relative to the JSONL file location.
  • Training expects preprocessed JSONL input, which means each record must already contain audio_codes.
  • If reference-conditioned training is used, the training JSONL must also already contain ref_audio_codes.
  • Nano finetuning currently supports only a single reference audio per sample.

3. Data Preprocessing

prepare_data.py does two things:

  1. Encodes audio into audio_codes
  2. Encodes ref_audio into ref_audio_codes by default

3.1 Single Process

python finetuning/prepare_data.py \ --codec-path ./models/MOSS-Audio-Tokenizer-Nano \ --input-jsonl train_raw.jsonl \ --output-jsonl train_with_codes.jsonl \ --batch-size 8

If you only want to encode target audio and skip reference audio:

python finetuning/prepare_data.py \ --codec-path ./models/MOSS-Audio-Tokenizer-Nano \ --input-jsonl train_raw.jsonl \ --output-jsonl train_with_codes.jsonl \ --skip-reference-audio-codes

3.2 Multi-Node / Multi-GPU Parallel Encoding

prepare_data.py follows the standard accelerate launch multi-process semantics.
For example, with 2 nodes and 16 GPUs in total, the input is split into 16 shards and each rank writes its own output shard:

accelerate launch --num_processes 16 finetuning/prepare_data.py \ --codec-path ./models/MOSS-Audio-Tokenizer-Nano \ --input-jsonl train_raw.jsonl \ --output-jsonl prepared/train_with_codes.jsonl

The outputs look like:

  • prepared/train_with_codes.rank00000-of-00016.jsonl
  • prepared/train_with_codes.rank00001-of-00016.jsonl
  • ...
  • prepared/train_with_codes.rank00015-of-00016.jsonl

During training, sft.py can directly read:

  • a single JSONL file
  • a directory
  • a glob such as prepared/train_with_codes.rank*.jsonl
  • or a comma-separated list of files

If your platform already injects multi-node communication environment variables, accelerate launch can usually reuse them directly.

4. Training

4.1 Single-GPU Baseline

accelerate launch finetuning/sft.py \ --model-path ./models/MOSS-TTS-Nano \ --codec-path ./models/MOSS-Audio-Tokenizer-Nano \ --train-jsonl train_with_codes.jsonl \ --output-dir output/moss_tts_nano_sft \ --per-device-batch-size 1 \ --gradient-accumulation-steps 8 \ --learning-rate 1e-5 \ --warmup-ratio 0.03 \ --num-epochs 3 \ --mixed-precision bf16 \ --max-length 1024 \ --channelwise-loss-weight 1,32

4.2 Single-Machine 8-GPU DDP

accelerate launch \ --config_file finetuning/configs/accelerate_ddp_8gpu.yaml \ finetuning/sft.py \ --model-path ./models/MOSS-TTS-Nano \ --codec-path ./models/MOSS-Audio-Tokenizer-Nano \ --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \ --output-dir output/moss_tts_nano_sft_ddp \ --per-device-batch-size 1 \ --gradient-accumulation-steps 4 \ --learning-rate 1e-5 \ --num-epochs 3 \ --mixed-precision bf16 \ --max-length 1024 \ --channelwise-loss-weight 1,32

4.3 Multi-Node Training

Update the following fields in your config file to match your cluster:

  • num_machines
  • num_processes
  • machine_rank
  • main_process_ip
  • main_process_port

Keep the rest of the training command unchanged.

4.4 Important Arguments

  • --max-length: fixed full sequence length. Samples are truncated to this length and then padded.
  • --channelwise-loss-weight: supports two formats
    • text_head,vq0,...,vqN
    • text_weight,total_audio_weight
  • --save-every-epochs: save one checkpoint every N epochs.

Single-GPU memory reference:

  • With accelerate launch --num_processes 1 and --per-device-batch-size 1 --gradient-accumulation-steps 1 --max-length 1024 --mixed-precision bf16, the measured training-process peak memory usage is about 3.23 GiB.

4.5 Checkpoint Contents

Each checkpoint directory can be loaded directly by the inference code in this repository. It contains:

  • model weights
  • config.json
  • tokenizer files
  • the Nano model Python source files needed for loading
  • finetune_config.json

5. One-Click Script

If you want a simple wrapper that chains preprocessing and training:

bash finetuning/run_train.sh

Common environment variables:

  • RAW_JSONL: raw training JSONL
  • PREPARED_JSONL: preprocessed JSONL
  • TRAIN_JSONL: training input; if unset, it is inferred from PREPARED_JSONL
  • OUTPUT_DIR: output directory
  • SKIP_PREPARE=1: skip preprocessing and train directly
  • PREP_ACCELERATE_ARGS_STR: extra accelerate args for prepare_data.py
  • TRAIN_ACCELERATE_ARGS_STR: extra accelerate launch args for training, mainly for overriding num_machines / num_processes / machine_rank
  • PREP_EXTRA_ARGS_STR: extra args passed to prepare_data.py
  • TRAIN_EXTRA_ARGS_STR: extra args passed to sft.py
  • ACCELERATE_CONFIG_FILE: training-time accelerate config file; if TRAIN_ACCELERATE_ARGS_STR is also provided, command-line values override the config defaults

Example:

RAW_JSONL=train_raw.jsonl \ PREPARED_JSONL=prepared/train_with_codes.jsonl \ OUTPUT_DIR=output/moss_tts_nano_sft \ PREP_ACCELERATE_ARGS_STR='--num_processes 8' \ ACCELERATE_CONFIG_FILE=finetuning/configs/accelerate_ddp_8gpu.yaml \ TRAIN_EXTRA_ARGS_STR='--per-device-batch-size 1 --gradient-accumulation-steps 4 --learning-rate 1e-5 --num-epochs 3 --mixed-precision bf16 --max-length 1024 --channelwise-loss-weight 1,32' \ bash finetuning/run_train.sh

For multi-node runs, the same idea applies: prepare shared encoded data first, then adjust ACCELERATE_CONFIG_FILE or TRAIN_ACCELERATE_ARGS_STR for your cluster.

6. Quick Verification

verify.py keeps the inference path intentionally simple. It supports:

  • voice_clone: reference audio + target text
  • continuation: continuation mode, with two input patterns
    • prompt_text + prompt_audio_path + text
    • or only text, which degrades to plain TTS

6.1 Voice Clone Verification

python finetuning/verify.py \ --checkpoint output/moss_tts_nano_sft/checkpoint-last \ --mode voice_clone \ --text "This is a quick validation example for a finetuned model." \ --prompt-audio-path ./assets/audio/zh_1.wav \ --output-audio-path output/verify_voice_clone.wav

6.2 Continuation Verification

If continuation is used with prompt-audio-path, you must also provide the corresponding prompt-text:

python finetuning/verify.py \ --checkpoint output/moss_tts_nano_sft/checkpoint-last \ --mode continuation \ --prompt-text "This sentence has already been spoken in the prompt audio." \ --prompt-audio-path ./assets/audio/zh_1.wav \ --text "This next sentence continues from that prompt for a quick continuation check." \ --output-audio-path output/verify_continuation.wav

6.3 Plain TTS Verification

If you only want plain text-to-speech without reference audio, still use continuation, but do not pass prompt-text or prompt-audio-path:

python finetuning/verify.py \ --checkpoint output/moss_tts_nano_sft/checkpoint-last \ --mode continuation \ --text "This is a quick non-streaming validation example." \ --output-audio-path output/verify_tts.wav

You can also continue using the repository-level infer.py. Checkpoints saved by finetuning are already packaged in a format that infer.py can load directly.