This directory provides a complete finetuning workflow for MOSS-TTS-Nano:
- `prepare_data.py`: precomputes `audio_codes` for target audio and, when needed, `ref_audio_codes`
- `dataset.py`: packs fields such as `text` / `instruction` / `ambient_sound` / `ref_audio` into teacher-forcing samples
- `sft.py`: supports single-GPU, data-parallel, and multi-node training
- `verify.py`: provides basic non-streaming inference checks
- `run_train.sh`: one-click wrapper that chains preprocessing and training

Default model weight locations:
- `./models/MOSS-TTS-Nano`
- `./models/MOSS-Audio-Tokenizer-Nano`

From the repository root:
```bash
cd /path/to/MOSS-TTS-Nano
pip install -r requirements.txt
```
`requirements.txt` already includes:

```
accelerate>=1.0.0
tqdm>=4.66.0
```

The Nano finetuning pipeline mainly supports the following two data formats.
```jsonl
{"audio":"./data/utt0001.wav","text":"I realized that I am actually very good at noticing other people's emotions.","language":"en"}
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","language":"en"}
```
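A minimal way to generate and sanity-check lines in this format, using only the Python standard library (the output filename here is arbitrary, not one the pipeline requires):

```python
import json

# Required fields for the plain text-to-speech format shown above.
REQUIRED = ("audio", "text", "language")

def validate_record(rec: dict) -> None:
    """Raise if a record is missing a required field or holds a non-string value."""
    for key in REQUIRED:
        if key not in rec:
            raise ValueError(f"missing field: {key}")
        if not isinstance(rec[key], str):
            raise ValueError(f"field {key!r} must be a string")

def write_jsonl(path: str, records: list) -> None:
    """Write one JSON object per line, as the pipeline expects."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            validate_record(rec)
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

write_jsonl("train_raw_demo.jsonl", [
    {"audio": "./data/utt0002.wav",
     "text": "She said she would be here by noon.",
     "language": "en"},
])
```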
Only one reference field is supported:

- `ref_audio`: a single reference audio clip

Example:

```jsonl
{"audio":"./data/utt0001.wav","text":"I realized that I am actually very good at noticing other people's emotions.","ref_audio":"./data/ref.wav","language":"en"}
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","ref_audio":"./data/ref.wav","language":"en"}
```
If needed, you can also provide the following fields. They will be appended to the user prompt:

- `instruction`
- `tokens`
- `quality`
- `sound_event`
- `ambient_sound`
- `language`

After preprocessing, each sample additionally carries `audio_codes` and, when a reference clip is present, `ref_audio_codes`. `prepare_data.py` does two things:

- encodes `audio` into `audio_codes`
- encodes `ref_audio` into `ref_audio_codes` (by default)

```bash
python finetuning/prepare_data.py \
  --codec-path ./models/MOSS-Audio-Tokenizer-Nano \
  --input-jsonl train_raw.jsonl \
  --output-jsonl train_with_codes.jsonl \
  --batch-size 8
```
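Conceptually, the script's two steps can be sketched as follows. Here `encode_audio` is a hypothetical placeholder, not the real codec API; the actual script runs the MOSS-Audio-Tokenizer-Nano model at that point:

```python
import json

def encode_audio(path: str) -> list:
    """Hypothetical stand-in for the codec call; the real script
    encodes the waveform with MOSS-Audio-Tokenizer-Nano here."""
    return [0, 1, 2]  # dummy codes for illustration

def add_codes(in_path: str, out_path: str, skip_reference: bool = False) -> None:
    """Conceptual version of the two steps: audio -> audio_codes and,
    unless skipped, ref_audio -> ref_audio_codes."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            rec = json.loads(line)
            rec["audio_codes"] = encode_audio(rec["audio"])
            if "ref_audio" in rec and not skip_reference:
                rec["ref_audio_codes"] = encode_audio(rec["ref_audio"])
            fout.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

The `skip_reference` flag mirrors the `--skip-reference-audio-codes` option shown next.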
If you only want to encode target audio and skip reference audio:
```bash
python finetuning/prepare_data.py \
  --codec-path ./models/MOSS-Audio-Tokenizer-Nano \
  --input-jsonl train_raw.jsonl \
  --output-jsonl train_with_codes.jsonl \
  --skip-reference-audio-codes
```
`prepare_data.py` follows the standard `accelerate launch` multi-process semantics.
For example, with 2 nodes and 16 GPUs in total, the input is split into 16 shards and each rank writes its own output shard:
```bash
accelerate launch --num_processes 16 finetuning/prepare_data.py \
  --codec-path ./models/MOSS-Audio-Tokenizer-Nano \
  --input-jsonl train_raw.jsonl \
  --output-jsonl prepared/train_with_codes.jsonl
```
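A minimal sketch of per-rank splitting and shard naming. The interleaved split is an assumption for illustration (the script's exact split may differ); the naming function reproduces the shard filenames the script emits:

```python
def shard(lines: list, rank: int, world_size: int) -> list:
    """Round-robin split: rank r takes lines r, r + world_size, r + 2*world_size, ...
    (an assumed scheme; prepare_data.py's actual split may differ)."""
    return lines[rank::world_size]

def shard_name(prefix: str, rank: int, world_size: int) -> str:
    """Reproduce the rankXXXXX-of-XXXXX naming of the output shards."""
    return f"{prefix}.rank{rank:05d}-of-{world_size:05d}.jsonl"
```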
The outputs look like:

- `prepared/train_with_codes.rank00000-of-00016.jsonl`
- `prepared/train_with_codes.rank00001-of-00016.jsonl`
- ...
- `prepared/train_with_codes.rank00015-of-00016.jsonl`

During training, `sft.py` can directly read:

```
prepared/train_with_codes.rank*.jsonl
```

If your platform already injects multi-node communication environment variables, `accelerate launch` can usually reuse them directly.
```bash
accelerate launch finetuning/sft.py \
  --model-path ./models/MOSS-TTS-Nano \
  --codec-path ./models/MOSS-Audio-Tokenizer-Nano \
  --train-jsonl train_with_codes.jsonl \
  --output-dir output/moss_tts_nano_sft \
  --per-device-batch-size 1 \
  --gradient-accumulation-steps 8 \
  --learning-rate 1e-5 \
  --warmup-ratio 0.03 \
  --num-epochs 3 \
  --mixed-precision bf16 \
  --max-length 1024 \
  --channelwise-loss-weight 1,32
```
```bash
accelerate launch \
  --config_file finetuning/configs/accelerate_ddp_8gpu.yaml \
  finetuning/sft.py \
  --model-path ./models/MOSS-TTS-Nano \
  --codec-path ./models/MOSS-Audio-Tokenizer-Nano \
  --train-jsonl 'prepared/train_with_codes.rank*.jsonl' \
  --output-dir output/moss_tts_nano_sft_ddp \
  --per-device-batch-size 1 \
  --gradient-accumulation-steps 4 \
  --learning-rate 1e-5 \
  --num-epochs 3 \
  --mixed-precision bf16 \
  --max-length 1024 \
  --channelwise-loss-weight 1,32
```
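The `1,32` value in the commands above uses the two-value form of `--channelwise-loss-weight` (text weight, total audio weight); the other accepted form lists one weight per head. A sketch of parsing such a spec, where the even spread of the audio total across VQ channels and the channel count are assumptions for illustration, not `sft.py`'s actual code:

```python
def parse_channel_weights(spec: str, num_vq: int) -> list:
    """Parse a --channelwise-loss-weight spec into per-head weights.

    Two shapes are accepted per the docs:
      * 1 + num_vq values: text_head,vq0,...,vqN (used as-is)
      * 2 values: text_weight,total_audio_weight; here the audio total is
        spread evenly over the VQ channels (an assumed policy; sft.py may
        distribute it differently).
    """
    vals = [float(x) for x in spec.split(",")]
    if len(vals) == 1 + num_vq:
        return vals
    if len(vals) == 2:
        text_w, audio_total = vals
        return [text_w] + [audio_total / num_vq] * num_vq
    raise ValueError(f"expected 2 or {1 + num_vq} comma-separated values, got {len(vals)}")
```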
Update the following fields in your config file to match your cluster:

- `num_machines`
- `num_processes`
- `machine_rank`
- `main_process_ip`
- `main_process_port`

Keep the rest of the training command unchanged.
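As a reference, a minimal multi-node variant of an accelerate DDP config could look like this. All values are placeholders for a hypothetical 2-node, 16-GPU setup; prefer adapting the repository's own `accelerate_ddp_8gpu.yaml`, which may carry additional settings:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 2           # total nodes
num_processes: 16         # total GPUs across all nodes
machine_rank: 0           # 0 on the main node, 1 on the second
main_process_ip: 10.0.0.1 # placeholder: the main node's address
main_process_port: 29500  # placeholder: any free port
```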
- `--max-length`: fixed full sequence length. Samples are truncated to this length and then padded.
- `--channelwise-loss-weight`: supports two formats:
  - `text_head,vq0,...,vqN`
  - `text_weight,total_audio_weight`
- `--save-every-epochs`: save one checkpoint every N epochs.

Single-GPU memory reference:
With `accelerate launch --num_processes 1` and `--per-device-batch-size 1 --gradient-accumulation-steps 1 --max-length 1024 --mixed-precision bf16`, the measured training-process peak memory usage is about 3.23 GiB.

Each checkpoint directory can be loaded directly by the inference code in this repository. It contains:
- `config.json`
- `finetune_config.json`

If you want a simple wrapper that chains preprocessing and training:
```bash
bash finetuning/run_train.sh
```
Common environment variables:
- `RAW_JSONL`: raw training JSONL
- `PREPARED_JSONL`: preprocessed JSONL
- `TRAIN_JSONL`: training input; if unset, it is inferred from `PREPARED_JSONL`
- `OUTPUT_DIR`: output directory
- `SKIP_PREPARE=1`: skip preprocessing and train directly
- `PREP_ACCELERATE_ARGS_STR`: extra `accelerate` args for `prepare_data.py`
- `TRAIN_ACCELERATE_ARGS_STR`: extra `accelerate launch` args for training, mainly for overriding `num_machines` / `num_processes` / `machine_rank`
- `PREP_EXTRA_ARGS_STR`: extra args passed to `prepare_data.py`
- `TRAIN_EXTRA_ARGS_STR`: extra args passed to `sft.py`
- `ACCELERATE_CONFIG_FILE`: training-time `accelerate` config file; if `TRAIN_ACCELERATE_ARGS_STR` is also provided, command-line values override the config defaults

Example:
```bash
RAW_JSONL=train_raw.jsonl \
PREPARED_JSONL=prepared/train_with_codes.jsonl \
OUTPUT_DIR=output/moss_tts_nano_sft \
PREP_ACCELERATE_ARGS_STR='--num_processes 8' \
ACCELERATE_CONFIG_FILE=finetuning/configs/accelerate_ddp_8gpu.yaml \
TRAIN_EXTRA_ARGS_STR='--per-device-batch-size 1 --gradient-accumulation-steps 4 --learning-rate 1e-5 --num-epochs 3 --mixed-precision bf16 --max-length 1024 --channelwise-loss-weight 1,32' \
bash finetuning/run_train.sh
```
For multi-node runs, the same idea applies: prepare shared encoded data first, then adjust `ACCELERATE_CONFIG_FILE` or `TRAIN_ACCELERATE_ARGS_STR` for your cluster.
`verify.py` keeps the inference path intentionally simple. It supports:
- `voice_clone`: reference audio + target text
- `continuation`: continuation mode, with two input patterns:
  - `prompt_text` + `prompt_audio_path` + `text`
  - `text` only, which degrades to plain TTS

```bash
python finetuning/verify.py \
  --checkpoint output/moss_tts_nano_sft/checkpoint-last \
  --mode voice_clone \
  --text "This is a quick validation example for a finetuned model." \
  --prompt-audio-path ./assets/audio/zh_1.wav \
  --output-audio-path output/verify_voice_clone.wav
```
If `continuation` is used with `--prompt-audio-path`, you must also provide the corresponding `--prompt-text`:
```bash
python finetuning/verify.py \
  --checkpoint output/moss_tts_nano_sft/checkpoint-last \
  --mode continuation \
  --prompt-text "This sentence has already been spoken in the prompt audio." \
  --prompt-audio-path ./assets/audio/zh_1.wav \
  --text "This next sentence continues from that prompt for a quick continuation check." \
  --output-audio-path output/verify_continuation.wav
```
If you only want plain text-to-speech without reference audio, still use `continuation`, but do not pass `--prompt-text` or `--prompt-audio-path`:
```bash
python finetuning/verify.py \
  --checkpoint output/moss_tts_nano_sft/checkpoint-last \
  --mode continuation \
  --text "This is a quick non-streaming validation example." \
  --output-audio-path output/verify_tts.wav
```
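The two rules above (prompt audio requires matching prompt text; passing neither degrades continuation to plain TTS) can be sketched as a small argument check. This is illustrative only, not `verify.py`'s actual code:

```python
def continuation_mode(text: str, prompt_text: str = None,
                      prompt_audio_path: str = None) -> str:
    """Classify a continuation-mode invocation and enforce the prompt-text rule."""
    if prompt_audio_path is not None and prompt_text is None:
        raise ValueError("--prompt-text is required when --prompt-audio-path is set")
    if prompt_audio_path is not None:
        return "continuation"
    return "plain_tts"
```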
You can also continue using the repository-level `infer.py`. Checkpoints saved by finetuning are already packaged in a format that `infer.py` can load directly.