
- minimind-3o has only ~0.1B parameters: it can be trained on a consumer GPU and runs quickly on CPU, making it one of the smallest fully-functional Omni implementations publicly available.
- The project ships two configurations, mini and full: mini runs the full pipeline in about 2 hours on a single RTX 3090 and is intended for getting started, while full corresponds to the released weights.
- Note: "about 2 hours" refers to the measured time of running SFT on the mini dataset using a single NVIDIA RTX 3090.
After MiniMind (LLM) and MiniMind-V (VLM), MiniMind-O is the third stop in this series. By "Omni" we mean a model that can listen, see and speak at the same time: it takes text, speech and visual signals as inputs, and produces text together with streaming speech.
GPT-4o was probably the first system that made natural streaming voice interaction feel real. Since then, open-source projects such as Mini-Omni2, Moshi, GLM-4-Voice and Qwen3-Omni have gradually appeared. However, if the goal is not just to call ready-made checkpoints with billions of parameters, but to fully understand, train and modify a complete Omni model from scratch, the open-source community still lacks a sufficiently lightweight starting point with an end-to-end pipeline. A common way to bring speech into an Omni model is to chain ASR, LLM and TTS into a cascade: speech is first transcribed to text, the LLM processes it, and the answer is then synthesized back to speech. This is straightforward from an engineering perspective, but it adds an extra transcription step and noticeably hurts latency, prosody and emotional cues.
MiniMind-O attempts to fill this gap: speech and text are connected directly at the hidden-state level, while the trainable backbone remains only ~0.1B parameters and the end-to-end Omni pipeline is preserved. The Talker side adopts MTP (Multi-Token Prediction) to predict multiple Mimi codebook layers at once, and combines it with VAD to support real-time barge-in and near-duplex interaction—a practical engineering route for a tiny Omni model. The code, model weights, training data and technical report are all open-sourced. A single RTX 3090 can finish training on the mini dataset in about 2 hours. The goal remains the same: let everyone read the project from the first line of code, and train, from scratch, a model that can listen, see, think and speak:

😊 Enjoy building.
- Datasets and training scripts come in two tiers, mini and full: mini is meant for quick onboarding and runs the pipeline in ~2 hours on a single RTX 3090; full matches the released weights and covers Chinese speech and image tasks.
- Released models ship with transformers tokenizers and native weight formats.

| Model | Backbone params | Release |
|---|---|---|
| minimind-3o | ~0.1B | 2026.05.05 |
| minimind-3o-moe | ~0.3B-A0.1B | 2026.05.05 |
The released models are minimind-3o (115M) and minimind-3o-moe (312M-A115M).

# Clone the repository
git clone --depth 1 https://github.com/jingyaogong/minimind-o
# Install dependencies
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
# Download SenseVoice-Small audio encoder to ./model/SenseVoiceSmall
modelscope download --model gongjy/SenseVoiceSmall --local_dir ./model/SenseVoiceSmall
# Download SigLIP2 vision encoder to ./model/siglip2-base-p32-256-ve
modelscope download --model gongjy/siglip2-base-p32-256-ve --local_dir ./model/siglip2-base-p32-256-ve
# Download Mimi audio codec to ./model/mimi
modelscope download --model gongjy/mimi --local_dir ./model/mimi
# Download CAM++ speaker encoder to ./model/campplus
modelscope download --model gongjy/campplus --local_dir ./model/campplus
# Download MiniMind LLM weights to ./out (used as the language backbone for training Omni)
modelscope download --model gongjy/minimind-3o-pytorch llm_768.pth --local_dir ./out
You can also git clone the corresponding repos from the ModelScope Collection or HuggingFace Collection (LFS required); details omitted here.
After downloading, the directory should look like:
minimind-o/
├── model/
│ ├── SenseVoiceSmall/
│ ├── siglip2-base-p32-256-ve/
│ ├── mimi/
│ ├── campplus/
│ └── ...
├── out/
│ └── llm_768.pth
└── ...
# Download released weights to ./out
modelscope download --model gongjy/minimind-3o-pytorch --local_dir ./out
python eval_omni.py --load_from model --weight sft_omni
To use the Transformers-format model, download the model directory first:
git clone https://huggingface.co/jingyaogong/minimind-3o
python eval_omni.py --load_from minimind-3o
# ⚠️ Copy the Transformers-format model folder into ./scripts/. The web_demo_omni script
# automatically scans this directory for sub-folders that contain weight files; it
# raises an error if none is found.
cp -r minimind-3o ./scripts/minimind-3o
cd scripts && python web_demo_omni.py
import torch
print(torch.cuda.is_available())
If unavailable, please download the matching .whl from torch_stable and install it manually.
For a quick start, downloading only the _mini parquet files from the dataset link and placing them under ./dataset is enough.
The recommended mini training pipeline is shown below. It is meant to be run from the trainer/ directory; equivalently, run cd trainer && bash train.sh:
CUDA_VISIBLE_DEVICES=0 torchrun --master_port 29560 --nproc_per_node 1 train_sft_omni.py --learning_rate 5e-4 --data_path ../dataset/sft_t2a_mini.parquet --epochs 1 --batch_size 40 --use_compile 1 --from_weight llm --save_weight sft_zero --max_seq_len 512 --use_wandb --use_moe 0
CUDA_VISIBLE_DEVICES=0 torchrun --master_port 29560 --nproc_per_node 1 train_sft_omni.py --learning_rate 5e-4 --data_path ../dataset/sft_a2a_mini.parquet --epochs 1 --batch_size 40 --use_compile 0 --from_weight sft_zero --save_weight sft_zero --max_seq_len 640 --mode audio_proj --use_wandb --use_moe 0
CUDA_VISIBLE_DEVICES=0 torchrun --master_port 29560 --nproc_per_node 1 train_sft_omni.py --learning_rate 2e-5 --data_path ../dataset/sft_a2a_mini.parquet --epochs 1 --batch_size 16 --use_compile 0 --from_weight sft_zero --save_weight sft_zero --max_seq_len 768 --use_wandb --use_moe 0
Make sure the model *.pth to be tested is placed under ./out/.
python eval_omni.py --weight sft_omni
The language backbone of MiniMind-O comes from the sister project MiniMind. For LLM architecture and training details, please refer to that repository. Even without going into the LLM internals, you can still follow the Quick Start section above to train MiniMind-O end-to-end.

MiniMind-O consists of two paths: Thinker and Talker. Thinker is responsible for understanding text, speech and image inputs and producing a semantic-level text reply. Talker takes the semantic conditions from Thinker and uses MTP to jointly predict multi-codebook Mimi audio codes, which the audio decoder finally restores into streaming speech. The point is not to chain ASR, LLM and TTS together, but to keep text reasoning, speech generation and streaming interaction inside a single unified sequence.
Text inputs go directly into the language backbone; speech and images are first encoded by the Audio Encoder and Vision Encoder respectively, and then projected into the MiniMind hidden space. Voice information is provided either by a Speaker Encoder or by reference-audio codes; combined with VAD at inference time, this enables listen-while-speaking, real-time barge-in and near-duplex interaction. Later sections describe the projectors, sequence layout and training objectives in more detail; for code-level details, please refer to model/model_omni.py and the technical report.

The figure above shows how text tokens, speech features, image features and voice conditions are laid out in the input sequence.
Thinker receives text, speech and image information uniformly and produces a semantic-level text reply. Text tokens enter the language backbone directly, while speech and image features are injected into placeholder positions through their respective projectors, so that all modalities are eventually modeled within the same sequence.
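As a rough sketch of this placeholder-injection idea (the placeholder id, dimensions and module names below are illustrative assumptions, not the actual code in model/model_omni.py):

```python
import torch
import torch.nn as nn

AUDIO_PLACEHOLDER_ID = 32001   # hypothetical special-token id
HIDDEN = 768                   # MiniMind hidden size

# projector from audio-encoder features (512-d) into the LLM hidden space
audio_proj = nn.Sequential(nn.Linear(512, HIDDEN), nn.GELU(), nn.Linear(HIDDEN, HIDDEN))

def inject_audio(input_ids, text_embeds, audio_feats):
    """Overwrite audio-placeholder positions with projected audio features.

    input_ids:   (B, T)       token ids, some equal to AUDIO_PLACEHOLDER_ID
    text_embeds: (B, T, H)    token embeddings from the language backbone
    audio_feats: (B, Na, 512) frame features from the frozen audio encoder
    """
    projected = audio_proj(audio_feats)            # (B, Na, H)
    fused = text_embeds.clone()
    for b in range(input_ids.size(0)):
        pos = (input_ids[b] == AUDIO_PLACEHOLDER_ID).nonzero(as_tuple=True)[0]
        fused[b, pos] = projected[b, : pos.numel()].to(fused.dtype)
    return fused                                   # fed into the Thinker as inputs_embeds
```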
The representation passed from Thinker to Talker is taken from a middle layer rather than the embedding layer or the final layer. Embedding layers carry too little semantic information, while the final layer is overly shaped towards next-token prediction. A middle layer typically already fuses contextual and cross-modal information without being over-tuned by the LM head, which makes it a better conditioning source for speech generation. By default bridge_layer = num_hidden_layers // 2 - 1, and it can be adjusted through configuration at different scales.
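A minimal sketch of that bridge pickup, assuming the backbone can return all hidden states in a Hugging Face-like output (index 0 being the embedding output):

```python
def pick_bridge_states(thinker, inputs_embeds, num_hidden_layers=8):
    """Return the middle-layer hidden states used to condition the Talker.

    Assumes the model output exposes `hidden_states`, where index 0 is the
    embedding output and index i + 1 is the output of transformer block i.
    """
    bridge_layer = num_hidden_layers // 2 - 1                 # -> 3 for 8 layers
    out = thinker(inputs_embeds=inputs_embeds, output_hidden_states=True)
    return out.hidden_states[bridge_layer + 1]                # (B, T, hidden)
```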
Talker turns the semantic states from Thinker into 8 streams of Mimi codebook codes. It uses MTP to predict multiple audio codebooks simultaneously, instead of running each codebook through a separate long path. To control the additional parameter count inside a 0.1B model, the audio embedding and output head share a common backbone with lightweight per-codebook adapters. This preserves the distributional differences between codebooks while avoiding a full parameter copy for each codebook layer.
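One way such shared-backbone heads could look; this only illustrates the sharing idea, and the sizes and module names are assumptions rather than the repo's implementation:

```python
import torch
import torch.nn as nn

class MTPCodeHeads(nn.Module):
    """Predict 8 Mimi codebooks per step with one shared head plus small
    per-codebook adapters, instead of 8 full-size output heads."""
    def __init__(self, hidden=768, codebook_size=2048, num_codebooks=8, adapter_dim=64):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, adapter_dim), nn.GELU(),
                          nn.Linear(adapter_dim, hidden))
            for _ in range(num_codebooks))
        self.shared_head = nn.Linear(hidden, codebook_size)

    def forward(self, talker_states):                      # (B, T, hidden)
        logits = [self.shared_head(talker_states + adapt(talker_states))
                  for adapt in self.adapters]
        return torch.stack(logits, dim=2)                  # (B, T, 8, codebook_size)
```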

MiniMind-O places text tokens and 8 audio-code streams in the same training sample: Thinker handles the text sequence, Talker handles the audio-code sequence, and speech / image / voice conditions are injected through placeholders or reference codes. Loss on target text and target audio is computed only after the reply starts; reference and conditioning regions serve only as conditions and are not part of the reconstruction target.
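The loss masking can be sketched as follows (illustrative only; the actual training code also handles the 8 audio streams and padding):

```python
import torch
import torch.nn.functional as F

def masked_text_loss(logits, targets, reply_start):
    """Cross-entropy on the text stream, counted only from the reply onwards.

    logits:      (B, T, V) next-token logits
    targets:     (B, T)    target token ids
    reply_start: (B,)      first position of the target reply per sample
    Prompt, reference-audio and voice-condition positions are ignored.
    """
    B, T, V = logits.shape
    pos = torch.arange(T, device=targets.device).unsqueeze(0)      # (1, T)
    masked = targets.masked_fill(pos < reply_start.unsqueeze(1), -100)
    return F.cross_entropy(logits.reshape(B * T, V), masked.reshape(B * T),
                           ignore_index=-100)
```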
For streaming generation, the model emits text tokens while simultaneously filling in 8 layers of Mimi codes via MTP and a delay schedule. The Mimi decoder can incrementally reconstruct the 24 kHz waveform, so playback does not have to wait for the full reply to finish.
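A common way to realize such a schedule is the delay-pattern trick used by several multi-codebook decoders; the sketch below only illustrates the idea, and the actual per-codebook delays used by the model are an assumption here:

```python
import torch

def apply_delay_schedule(codes, delays=(0, 1, 2, 3, 4, 5, 6, 7), pad_id=0):
    """Shift each Mimi codebook stream right by its own delay so that earlier
    codebooks can be emitted (and decoded) before later ones.

    codes: (num_codebooks, T) integer codes; returns (num_codebooks, T + max_delay).
    """
    K, T = codes.shape
    out = torch.full((K, T + max(delays)), pad_id, dtype=codes.dtype)
    for k, d in enumerate(delays):
        out[k, d:d + T] = codes[k]
    return out
```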
Voice control is realized through in-context voice cloning: reference audio is first encoded into a voice prompt, and then fed to Talker as a contextual condition, instead of fine-tuning weights or rewriting the text prompt to specify a voice. The model can additionally use a speaker embedding to provide a more stable speaker constraint; switching the voice at inference time only requires changing these conditioning inputs, while the Thinker prompt and Talker weights remain unchanged.
The default release ships with 5 built-in voice prompts (dylan, eric, serena, uncle_fu, vivian), and reserves 7 unseen prompts for evaluation (arthur, chelsie, cherry, ethan, jennifer, momo, moon).
The "0.1B" referenced for MiniMind-O denotes the trainable backbone composed of Thinker, Talker and the two projectors. For the released checkpoints, minimind-3o is about 113M and minimind-3o-moe is about 315M. The Audio Encoder, Vision Encoder and Speech Codec are frozen external side modules used only for feature extraction or audio (de)coding; together they contain about 425M parameters and are not counted as active MiniMind-O parameters.
The table below counts the main module sizes per released model. Trainable counts are based on PyTorch modules, with tied embeddings deduplicated.
| Counting scope | minimind-3o | minimind-3o-moe |
|---|---|---|
| Trainable backbone | 113.13M | 314.89M |
| Frozen external modules | 424.70M | 424.70M |
| Total loaded at runtime | 537.83M | 739.59M |
| Module | Implementation | Key configuration | Status / params (~3o / ~3o-moe) |
|---|---|---|---|
| Thinker | MiniMind Transformer | 8 layers, hidden 768 | trainable, 63.91M / 198.42M |
| Talker | Standalone MiniMind blocks | 4 layers, 8 codebook heads | trainable, 47.05M / 114.30M |
| Audio projector | MMAudioProjector | 512 → 768 | trainable, 0.99M |
| Vision projector | MMVisionProjector | 768 → 768 | trainable, 1.18M |
| Audio encoder | SenseVoice-Small | 16 kHz speech features | frozen, 234.00M |
| Vision encoder | SigLIP2 base-p32-256 | 256×256 image, 64 tokens | frozen, 94.55M |
| Speech codec | Mimi | 8 codebooks, 12.5 Hz, 24 kHz | frozen, 96.15M |
| Speaker condition | CAM++ embedding | 192-d speaker vector | precomputed |
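The trainable / frozen split above boils down to simple parameter counting; a small helper along these lines (the module names in the comment are illustrative) reproduces the bookkeeping:

```python
import torch.nn as nn

def count_params(module: nn.Module, trainable_only: bool = False) -> int:
    """Sum parameter counts. nn.Module.parameters() already yields shared
    tensors (e.g. tied embeddings) only once, so no manual dedup is needed."""
    return sum(p.numel() for p in module.parameters()
               if p.requires_grad or not trainable_only)

# e.g. trainable backbone = thinker + talker + audio_proj + vision_proj
#      frozen side modules = audio encoder + vision encoder + Mimi codec
```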
Dataset download: ModelScope | HuggingFace
All speech data is stored uniformly as Mimi codes (8 codebooks, 12.5 Hz frame rate). Images are resized uniformly to 256×256 and encoded by SigLIP2 P32 into 64 patch tokens. The training data mainly comes from public Omni / speech-instruction corpora, including VoiceAssistant-400K, UltraChat-300K-SLAM-Omni and others. A large amount of multi-speaker audio is additionally synthesized with Qwen3-TTS, and CAM++ is used to extract speaker embeddings as voice conditions. The I2T data follows the same source as the visual instruction data used in MiniMind-V; please refer to that project for the original composition and citations.
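These storage choices translate into very compact token rates; a quick back-of-the-envelope check:

```python
# Token rates implied by the data format above.
frame_rate_hz = 12.5                               # Mimi frames per second
num_codebooks = 8                                  # codes per frame
codes_per_second = frame_rate_hz * num_codebooks   # 100 audio codes / s
codes_per_hour = codes_per_second * 3600           # 360,000 codes / h of speech

image_side, patch = 256, 32
vision_tokens = (image_side // patch) ** 2         # 8 x 8 = 64 patch tokens per image
print(codes_per_second, codes_per_hour, vision_tokens)
```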
The repository ships two training sets, mini and full. The mini set is filtered from full using the "English + no-vision" criterion and works with train_sft_omni.py through the default --data_path. Its goal is to verify the Thinker–Talker pipeline, Mimi (de)coding, sequence layout and voice-injection path at low cost, rather than to reproduce the Chinese speech ability of the released models. A Chinese Talker has to handle more complex grapheme-to-phoneme mapping, prosodic pauses and multi-speaker stability, which is clearly harder than English and cannot be expected to converge within ~2 hours on a single RTX 3090.
The full set corresponds to the released minimind-3o / minimind-3o-moe checkpoints and covers Chinese-English T2A / A2A as well as image-to-text. Sizes and language ratios are listed below; this is the actual training source behind the CER / voice-similarity numbers reported in the paper.
T2A means Text-to-Audio, A2A means Audio-to-Audio, and I2T means Image-to-Text.
| Dataset | Subset | Input speech | Output speech | Note |
|---|---|---|---|---|
| sft_t2a_mini | English T2A | — | ~470.14 h | mini onboarding |
| sft_a2a_mini | English A2A | ~74.64 h | ~56.60 h | mini onboarding |
| sft_t2a | zh+en T2A | — | ~1636.01 h | full training |
| sft_a2a | zh+en A2A | ~1711.97 h | ~423.40 h | full training |
| sft_i2t | Image I2T | — | — | full training |
In sft_t2a, Chinese / English / mixed samples account for 45.7% / 46.5% / 7.8% respectively; in sft_a2a the ratios are 70.8% / 21.2% / 8.0%. This distribution is directly reflected in behavior: short Chinese and short English replies are usually stable, while longer English speech is more prone to mispronunciation and word omissions. The mini subset keeps only English, so even with a tight budget on parameters and data, the within-language CER stays in a usable range.
The training entry point is train_sft_omni.py, and the recommended pipeline can be found in trainer/train.sh. Full training is not split into multiple complex pretraining stages; instead, capabilities are introduced incrementally along the data flow:

1. sft_t2a: align text with speech output first, so that Talker learns to generate Mimi codes under Thinker's semantic conditions;
2. sft_a2a: bring in speech inputs, so that the model can enter the same Thinker–Talker reply path from speech instructions;
3. sft_i2t: align the visual path last; the vision_proj mode updates only the vision projector to avoid image data overwriting language and speech abilities.

Among training modes, all updates MiniMind / Talker / projectors, while audio_proj and vision_proj are used solely to align the corresponding projector. SenseVoice-Small, SigLIP2 and Mimi are kept frozen throughout. The Dense and MoE variants share the same data ordering. The mini commands are meant only to make the pipeline runnable end-to-end and finish in ~2 hours on a single RTX 3090 by default; the released weights correspond to full training.
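A rough sketch of how the mode switch could gate trainable parameters (attribute names like audio_proj / vision_proj are assumptions; see train_sft_omni.py for the real logic):

```python
def select_trainable(model, mode="all"):
    """Freeze everything, then re-enable the groups a given mode trains.
    SenseVoice-Small, SigLIP2 and Mimi stay frozen in every mode."""
    for p in model.parameters():
        p.requires_grad = False
    groups = {"all": ("thinker", "talker", "audio_proj", "vision_proj"),
              "audio_proj": ("audio_proj",),
              "vision_proj": ("vision_proj",)}
    for name in groups[mode]:
        for p in getattr(model, name).parameters():
            p.requires_grad = True
```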
T2A and A2A loss curves during full training are shown below for reference:

sft_t2a: text-to-speech-output path

sft_a2a: loss after speech inputs are added
Early spikes caused by an incompatible weight resume have been removed from the T2A curve. The MoE variant has more total parameters but a similar number of active parameters compared with Dense, which makes it more useful as a capacity-allocation reference.
| Format | ModelScope | HuggingFace |
|---|---|---|
| PyTorch (*.pth) | minimind-3o-pytorch | minimind-3o-pytorch |
| Transformers | minimind-o collection | minimind-o collection |
The Transformers version contains both minimind-3o and minimind-3o-moe and is suitable for direct use with eval_omni.py and the WebUI. The native PyTorch weights are mainly intended for training, reproducing experiments and continued fine-tuning.
There is currently no unified evaluation protocol for Omni models: different works differ in the LLM backbone, the audio synthesizer and the system goal. Some focus on the LLM's own knowledge and reasoning and report MMLU, HumanEval and related benchmarks; some emphasize streaming latency and audio quality; some highlight speech-consistency metrics; and others focus on natural interaction or broader Omni generation. Most of these systems are continually trained from a state-of-the-art open-source LLM, while MiniMind's 0.06B backbone is clearly not competitive on complex knowledge QA, math reasoning, code generation or long open-ended replies, and the Talker's naturalness, prosody and stability are also weaker than those of full-scale systems.
Therefore, the goal here is not to chase a comprehensive leaderboard, but to focus on a few more reproducible local evaluations and use cases: a Talker hidden-size ablation, voice-cloning similarity, CER / WER comparisons under identical questions and identical ASR pipelines, and qualitative A2A, I2A and real-time interaction examples. CER / WER are mainly used to inspect text consistency, while audio quality, naturalness and human preference are left to qualitative samples and actual listening tests.
If only speech generation is considered, scaling Talker to 1024 / 2048 hidden size or stacking more layers would obviously be more stable. But MiniMind-O has to fit the entire Omni pipeline within ~0.1B parameters, and cannot afford to allocate most of the budget to the acoustic side. Once Thinker and Talker are decoupled, language understanding and cross-modal fusion are mainly carried by Thinker, while Talker only renders Mimi codes from semantic conditions; this makes a small Talker possible. The rendering here is not "predict semantic tokens and hand them off to an external acoustic model"—Talker directly produces decodable Mimi acoustic codes, so the real bottleneck is at the output side: it has to handle 8 Mimi codebooks rather than a single next-token-prediction stream.
384-d is tempting, since the dense version compresses to ~88M; 512-d is also lighter. But the table below shows that smaller does not automatically mean better allocated: short utterances remain acceptable, but medium-to-long ones are more prone to word drops, repetitions and pronunciation drift. 768-d was kept in the end because it matches the MiniMind backbone width and can be initialized from the last 4 layers of Thinker; the parameter count remains around 0.1B, training cost does not increase noticeably, and consistency is clearly more stable.
| Variant | Talker hidden | Params | Avg CER ↓ | Short ↓ | Mid / Long ↓ |
|---|---|---|---|---|---|
| Dense | 768 | 115.29M | 0.0897 | 0.1528 | 0.0874 / 0.0675 |
| Dense | 512 | 96.13M | 0.1745 | 0.2709 | 0.2455 / 0.0976 |
| Dense | 384 | 88.72M | 0.2767 | 0.3904 | 0.1865 / 0.4046 |
| MoE | 768 | 317.05M-A115.33M | 0.0900 | 0.2075 | 0.0533 / 0.0271 |
| MoE | 512 | 261.32M-A96.17M | 0.1265 | 0.0711 | 0.1490 / 0.1464 |
| MoE | 384 | 240.04M-A88.75M | 0.3280 | 0.3757 | 0.2777 / 0.4313 |
Dense and MoE CERs should not be compared directly across architectures: under the same question, the two Thinkers may produce different content with different lengths, leading to different synthesis difficulty for Talker. What matters more is the within-architecture trend: 768 clearly outperforms 512 and 384 in both cases.
Voice cloning is one of the more beta-quality features in this release. To our knowledge, most open-source Omni models support only fixed output voices, while minimind-3o tries to fit multi-voice generation into a single Talker. This goal is harder than simply "being able to talk", because the model needs not only to say the right content, but also to preserve speaker cues while generating Mimi codes.
Quality has not yet reached high-fidelity cloning: the same reference voice does not always stay consistent across questions, and longer utterances can drift because of pronunciation and rhythm issues. But basic male / female differences, intonation tendencies and parts of the prosody are distinguishable.
The CAM++ speaker-embedding cosine similarity below is only an automatic reference. Seen comes from the 5 built-in voices in voices.pt; Unseen comes from 7 voices in voices_unseen.pt that were never seen during training. Each voice uses the same set of text questions and only the voice condition is swapped.
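The numbers below are plain cosine similarities between the 192-d CAM++ embeddings of the reference clip and of the generated speech, e.g.:

```python
import torch
import torch.nn.functional as F

def speaker_similarity(ref_emb: torch.Tensor, gen_emb: torch.Tensor) -> float:
    """Cosine similarity between two 192-d CAM++ speaker embeddings."""
    return F.cosine_similarity(ref_emb.view(1, -1), gen_emb.view(1, -1)).item()
```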
Per-speaker breakdown:
| Split | Speaker | Dense ↑ | MoE ↑ |
|---|---|---|---|
| Seen | dylan | 0.6997 | 0.6837 |
| Seen | eric | 0.5289 | 0.4232 |
| Seen | serena | 0.7092 | 0.7041 |
| Seen | uncle_fu | 0.7241 | 0.7337 |
| Seen | vivian | 0.5744 | 0.5888 |
| Unseen | arthur | 0.7171 | 0.6750 |
| Unseen | chelsie | 0.6437 | 0.6240 |
| Unseen | cherry | 0.5689 | 0.5678 |
| Unseen | ethan | 0.4783 | 0.4847 |
| Unseen | jennifer | 0.4749 | 0.4003 |
| Unseen | momo | 0.6470 | 0.5720 |
| Unseen | moon | 0.4282 | 0.6673 |
Overall, minimind-3o and minimind-3o-moe land at similar averages, both slightly above the early baseline. This suggests that voice retention is not primarily determined by inactive expert capacity; the more direct factors are reference-clip quality, the separability of CAM++ embeddings, and the stability of Talker generation itself. Per-speaker, voices like uncle_fu, serena and arthur are easier to preserve, with at least one variant exceeding 0.70; outliers like eric and moon are more sensitive to generation quality. In other words, this capability already separates some speaker characteristics, but is still some distance away from a product-level "given a reference clip, faithfully reproduce its timbre" experience.
For a more direct listening test, seed=42 and temperature=0.7 are fixed, and one generated sample is shown per voice. The only variables are the reference audio codes and speaker embedding. As a control, the default output without any reference voice condition is shown first (the spoken text is identical across all samples):
https://github.com/user-attachments/assets/b31fd8f2-e3af-4fed-ba19-65424b59bec6
"Seen" means voices that appeared in training data, used to inspect how well the model preserves familiar speakers.
"Unseen" means voices not seen during training, used to inspect zero-shot transfer of a new reference voice into generated speech.
We selected 20 English questions, all constrained by "Answer briefly in one short sentence". The intent is not to evaluate open-ended English ability, but to keep response lengths within a similar range. The three models then synthesize audio, which is uniformly transcribed by Qwen3-ASR; CER / WER between the transcription and the target text are used to compare Talker-side textual consistency.
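For reference, CER / WER here follow the usual edit-distance definition; the generic implementation below is a stand-in, not necessarily the exact script used for the reported numbers:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (chars for CER, words for WER)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)
```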
| Length bucket | Mini-Omni CER/WER | Mini-Omni2 CER/WER | minimind-3o CER/WER |
|---|---|---|---|
| short (≤15 words) | 0.0195 / 0.0384 (n=8) | 0.0503 / 0.0584 (n=14) | 0.0531 / 0.0417 (n=8) |
| mid (16–30 words) | 0.0038 / 0.0052 (n=12) | 0.0062 / 0.0076 (n=6) | 0.1327 / 0.1420 (n=11) |
| long (31–60 words) | — | — | 0.0431 / 0.0508 (n=1) |
For replies of ≤15 words, minimind-3o is already close to Mini-Omni2; the gap really opens up at 16–30 words. This length is no longer a simple phrase, and Talker must keep pronunciation, rhythm and surface form consistent in a complete short sentence simultaneously. This is also the regime where the current 0.1B Talker most easily exposes its instability.
Mini-Omni does not support a VL path, so the comparison is between Mini-Omni2 (0.5B) and minimind-3o (0.1B). On 9 synthetic images, both models generate English answers, which are then uniformly transcribed and used to compute CER / WER as a vision-to-speech consistency reference.
| Model | Params | Avg CER ↓ | Avg WER ↓ |
|---|---|---|---|
| Mini-Omni2 | 0.5B | 0.7609 | 0.9756 |
| minimind-3o | 0.1B | 0.8241 | 1.0293 |
These numbers should not be read as the absolute correctness of open-ended image description. Image captioning has many equivalent expressions, and synonym choices and word order both affect CER / WER, so high absolute values are expected. Under the same automatic pipeline, minimind-3o trails behind Mini-Omni2 but stays in the same order of magnitude, with roughly 1/5 the parameters.

In speech-to-speech samples, the input is real speech, Thinker organizes the semantics, and Talker renders speech. Short replies are again the more stable regime; Chinese explanatory questions usually produce coherent answers, while English pronunciation and rhythm are relatively more stable.
https://github.com/user-attachments/assets/c85809b2-4787-4656-9c7e-55b693798494

https://github.com/user-attachments/assets/354a5eec-c147-4d18-8c7a-942bd2a0b4b0

The image-QA samples chain visual encoding, text generation and speech rendering inside the same path. The current model usually captures the main object and the rough scene, but fine-grained spatial relations, counts and attributes are still often wrong, which makes it more suitable as a reproducible baseline for tiny-model Omni pipelines.
https://github.com/user-attachments/assets/244e08b0-5b12-449e-a7a2-2a2139c5d62d

https://github.com/user-attachments/assets/3e8d0a76-282d-4a9d-9726-a954cf80198a

This is the real-time interaction interface. Once the user stops speaking, Thinker first finishes the semantic-side prefill, Talker then starts to emit audio codes incrementally, and the Mimi decoder writes the 24 kHz waveform as it receives codes. The barge-in example shows another path that is closer to a real conversation: when the user starts speaking again while the model is talking, the system interrupts the current generation and re-enters the prefill–reply flow. Interruption detection here is still based on a simple VAD threshold, not yet semantic-level barge-in; but from an engineering loop-closure perspective, the system can already fall back from speaking to listening and process the next turn.
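Purely as a sketch of that loop (every object and method name below is hypothetical; the web demo in scripts/web_demo_omni.py implements the actual version):

```python
def interaction_loop(model, vad, mic, player):
    """Prefill -> incremental reply -> VAD-based barge-in, as described above."""
    while True:
        utterance = vad.collect_until_silence(mic)     # user finishes speaking
        model.prefill(utterance)                       # Thinker: semantic prefill
        for codes in model.stream_reply():             # Talker: incremental Mimi codes
            player.play(model.mimi_decode(codes))      # 24 kHz chunks, played as produced
            if vad.speech_detected(mic):               # user starts talking again
                model.cancel_reply()                   # barge-in: stop speaking, listen
                break
```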
The current model still has clear gaps compared with large-scale Omni systems, and there is no need to gloss over them. Long-form speech naturalness, complex visual reasoning, open-ended English mid/long replies and voice stability are not its strong areas. The visual path is closer to a compact vision-to-speech link, and the MoE variant looks more like a capacity-allocation experiment than a same-FLOP optimum.
These limitations also point to several follow-ups: longer ICL contexts, finer prosody supervision, stronger vision encoders, more stable voice conditions, and systematic sweeps over the Bridge layer and the MTP codebook interface—all of which are worth continuing.
That said, the value of MiniMind-O lies exactly here. It compresses an entire Omni loop into the 0.1B regime, and ships code, weights and the main training data inside the same inspectable artifact. This means it is not just a demo, but a baseline small enough, transparent enough, and reproducible enough to rebuild from scratch and modify further. For people who want to understand Thinker–Talker decoupling, the MTP codebook interface, in-context voice cloning, and the middle-hidden bridge, it offers a set of design choices that can actually be verified by hand.
TIP
If you find MiniMind-O helpful, consider giving us a ⭐ on GitHub.
Given limited bandwidth there will inevitably be unknown bugs. Discussions, corrections and PRs in Issues are welcome.
Your support is what keeps the project moving—thank you!
If MiniMind-O helps your research or work, please cite:
% Cite the technical report when referencing the model design or experimental results.
@article{minimind-o-report,
title = {MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model},
author = {Jingyao Gong},
journal = {arXiv preprint arXiv:2605.03937},
year = {2026}
}
% Cite the GitHub repo when referencing the open-source codebase or released weights.
@misc{minimind-o,
title = {MiniMind-O: Train a Tiny Omni Model from Scratch},
author = {Jingyao Gong},
year = {2026},
url = {https://github.com/jingyaogong/minimind-o},
note = {GitHub repository, accessed 2026}
}
This repository is released under the Apache-2.0 License.