🎬 Fun-CineForge: A Unified Dataset Pipeline and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes
Fun-CineForge provides an end-to-end dataset pipeline for producing large-scale dubbing datasets, together with an MLLM-based dubbing model designed for diverse cinematic scenes. Using this pipeline, we constructed CineDub-CN, the first large-scale Chinese television dubbing dataset, featuring rich annotations and diverse scenes. Across monologue, narration, dialogue, and multi-speaker scenes, our dubbing model consistently outperforms state-of-the-art methods in audio quality, lip sync, timbre transition, and instruction following.
Visit https://funcineforge.github.io/ for CineDub-CN dataset samples and demo samples.
GitHub link: https://github.com/FunAudioLLM/FunCineForge/
Modelscope link: https://www.modelscope.cn/models/FunAudioLLM/Fun-CineForge/
CineDub samples: Hugging Face | ModelScope
The Fun-CineForge dataset pipeline toolkit requires only a Python environment to run.
```sh
# Conda
git clone git@github.com:FunAudioLLM/FunCineForge.git
conda create -n FunCineForge python=3.10 -y && conda activate FunCineForge
sudo apt-get install ffmpeg

# Initial settings
python setup.py
```
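Since the pipeline shells out to ffmpeg for audio/video processing, it can be useful to confirm the required executables are on `PATH` before running anything. A minimal stdlib sketch (the `check_tool` helper is ours, not part of the toolkit):

```python
import shutil

def check_tool(name: str) -> bool:
    """Return True if an executable named `name` is found on PATH."""
    return shutil.which(name) is not None

# The dataset pipeline relies on ffmpeg; warn or abort if anything is missing.
missing = [t for t in ("ffmpeg", "ffprobe") if not check_tool(t)]
```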
If you want to produce your own data, we recommend referring to the following requirements when collecting movies or television series, then running the pipeline steps below.
```sh
# Normalize and trim intros/outros
python normalize_trim.py --root datasets/raw_zh --intro 10 --outro 10

# Speech separation
cd speech_separation
python run.py --root datasets/clean/zh --gpus 0 1 2 3

# Video clipping
cd video_clip
bash run.sh --stage 1 --stop_stage 2 --input datasets/raw_zh --output datasets/clean/zh --lang zh --device cpu

# Clean videos and SRT subtitles
python clean_video.py --root datasets/clean/zh
python clean_srt.py --root datasets/clean/zh --lang zh

# Speaker diarization
cd speaker_diarization
bash run.sh --stage 1 --stop_stage 4 --hf_access_token hf_xxx --root datasets/clean/zh --gpus "0 1 2 3"

# Chain-of-thought annotation and dataset building
python cot.py --root_dir datasets/clean/zh --lang zh --provider google --model gemini-3-pro-preview --api_key xxx --resume
python cot.py --root_dir datasets/clean/en --lang en --provider google --model gemini-3-pro-preview --api_key xxx --resume
python build_datasets.py --root_zh datasets/clean/zh --root_en datasets/clean/en --out_dir datasets/clean --save

# Extract speech tokens
python speech_tokenizer.py --root datasets/clean/zh
```
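As an illustration of the trimming step, the sketch below builds an ffmpeg command that drops a fixed intro and outro from an episode. Note this is our own illustrative helper, not the repo's `normalize_trim.py`, and the assumption that the trim lengths are in seconds should be checked against that script:

```python
def trim_command(src: str, dst: str, total: float, intro: float, outro: float) -> list[str]:
    """Build an ffmpeg command that drops `intro` seconds from the start
    and `outro` seconds from the end of a file `total` seconds long."""
    keep = total - intro - outro
    if keep <= 0:
        raise ValueError("intro + outro exceed the file duration")
    # -ss before -i seeks the input; -t limits the output duration;
    # -c copy avoids re-encoding.
    return ["ffmpeg", "-y", "-ss", str(intro), "-i", src,
            "-t", str(keep), "-c", "copy", dst]

# Example: a 45-minute episode, cutting a 90 s intro and a 60 s outro.
# Run with subprocess.run(cmd, check=True) to actually invoke ffmpeg.
cmd = trim_command("ep01.mp4", "ep01_trimmed.mp4", 2700, 90, 60)
```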
We've open-sourced the inference code and the infer.sh script, and provide some test cases in the data folder for you to try. Inference requires only a consumer-grade GPU. Run the following command:
```sh
cd exps
bash infer.sh
```
The API for multi-speaker dubbing from raw videos and SRT scripts is under development ...
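Since the pipeline and the upcoming multi-speaker API are driven by SRT subtitle files, a minimal stdlib parser for the standard SRT cue layout (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then text) may help when preparing inputs. This is a generic sketch of the SRT format, not the repo's own parser:

```python
import re
from dataclasses import dataclass

TIME = r"(\d{2}):(\d{2}):(\d{2}),(\d{3})"
CUE_RE = re.compile(TIME + r"\s*-->\s*" + TIME)

@dataclass
class Cue:
    start: float  # seconds
    end: float
    text: str

def _secs(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_srt(srt: str) -> list[Cue]:
    """Parse SRT text into a list of timed cues."""
    cues = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            m = CUE_RE.search(line)
            if m:
                g = m.groups()
                cues.append(Cue(_secs(*g[:4]), _secs(*g[4:]),
                                "\n".join(lines[i + 1:]).strip()))
                break
    return cues
```

For example, `parse_srt("1\n00:00:01,000 --> 00:00:02,500\nHello")` yields one cue from 1.0 s to 2.5 s with text `Hello`.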
If you use our dataset or code, please cite the following paper:
```bibtex
@misc{liu2026funcineforgeunifieddatasettoolkit,
  title={FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes},
  author={Jiaxuan Liu and Yang Xiang and Han Zhao and Xiangang Li and Zhenhua Ling},
  year={2026},
  eprint={2601.14777},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
}
```
We welcome discussion on the Fun-CineForge GitHub Issues page and are open to collaborative development; for any questions, please contact the developers.
This repository contains research artifacts:
⚠️ Fun-CineForge is currently not a commercial product of Tongyi Lab.
⚠️ It is released for academic research and cutting-edge exploration purposes.
⚠️ CineDub dataset samples are subject to specific license terms.