This pipeline ingests mono audio sampled at 16kHz and outputs speaker diarization.
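Since the pipeline expects mono audio at 16 kHz, stereo or higher-rate recordings must be downmixed and resampled first. Below is a naive pure-Python sketch of what that preparation means; for real use, prefer a proper resampler (e.g. `torchaudio.functional.resample`), which applies anti-aliasing:

```python
# naive illustration (assumes a clean integer rate ratio, no anti-aliasing):
# average the two stereo channels into mono, then decimate 48 kHz -> 16 kHz
# by keeping every 3rd sample.
def to_mono_16k(left, right, sr=48000, target_sr=16000):
    assert sr % target_sr == 0
    step = sr // target_sr
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    return mono[::step]

left = [0.0] * 48000   # one second of silence, left channel
right = [0.0] * 48000  # one second of silence, right channel
print(len(to_mono_16k(left, right)))  # one second at 16 kHz -> 16000 samples
```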
Community-1 brings several improvements over previous community releases, including higher accuracy across benchmarks and a new exclusive speaker diarization output.
Install `pyannote.audio` and create a Hugging Face access token from https://hf.co/settings/tokens:

```shell
pip install pyannote.audio
```

```python
# download the pipeline from Hugging Face
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token="{huggingface-token}")

# run the pipeline locally on your computer
output = pipeline("audio.wav")

# print the predicted speaker diarization
for turn, speaker in output.speaker_diarization:
    print(f"{speaker} speaks between t={turn.start:.3f}s and t={turn.end:.3f}s")
```
Out of the box, Community-1 is significantly more accurate than the legacy speaker-diarization-3.1 pipeline.
We report diarization error rates (in %) on a large collection of academic benchmarks, with fully automatic processing: no forgiveness collar and no skipping of overlapping speech.
| Benchmark (last updated in 2025-09) | legacy (3.1) | community-1 | precision-2 |
|---|---|---|---|
| AISHELL-4 | 12.2 | 11.7 | 11.4 |
| AliMeeting (channel 1) | 24.5 | 20.3 | 15.2 |
| AMI (IHM) | 18.8 | 17.0 | 12.9 |
| AMI (SDM) | 22.7 | 19.9 | 15.6 |
| AVA-AVD | 49.7 | 44.6 | 37.1 |
| CALLHOME (part 2) | 28.5 | 26.7 | 16.6 |
| DIHARD 3 (full) | 21.4 | 20.2 | 14.7 |
| Ego4D (dev.) | 51.2 | 46.8 | 39.0 |
| MSDWild | 25.4 | 22.8 | 17.3 |
| RAMC | 22.2 | 20.8 | 10.5 |
| REPERE (phase2) | 7.9 | 8.9 | 7.4 |
| VoxConverse (v0.3) | 11.2 | 11.2 | 8.5 |
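The metric reported above is the diarization error rate: the sum of false alarm, missed detection, and speaker confusion durations, divided by the total duration of reference speech. A toy illustration of the definition (the durations below are hypothetical, not taken from any benchmark):

```python
# diarization error rate =
#   (false alarm + missed detection + confusion) / total reference speech
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    return (false_alarm + missed + confusion) / total_speech

# hypothetical case: 2 s of speech attributed to the wrong speaker
# out of 20 s of reference speech
print(f"{diarization_error_rate(0.0, 0.0, 2.0, 20.0):.1%}")  # -> 10.0%
```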
The Precision-2 model is even more accurate (see table above) and can be tested with a two-line change:

```diff
  from pyannote.audio import Pipeline
  pipeline = Pipeline.from_pretrained(
-     "pyannote/speaker-diarization-community-1", token="{huggingface-token}")
+     "pyannote/speaker-diarization-precision-2", token="{pyannoteAI-api-key}")
  diarization = pipeline("audio.wav")  # runs on pyannoteAI servers
```
`pyannote.audio` pipelines run on CPU by default.
You can send them to GPU with the following lines:

```python
import torch

pipeline.to(torch.device("cuda"))
```
Pre-loading audio files in memory may result in faster processing:

```python
import torchaudio

waveform, sample_rate = torchaudio.load("audio.wav")
output = pipeline({"waveform": waveform, "sample_rate": sample_rate})
```
Hooks are available to monitor the progress of the pipeline:

```python
from pyannote.audio.pipelines.utils.hook import ProgressHook

with ProgressHook() as hook:
    output = pipeline("audio.wav", hook=hook)
```
In case the number of speakers is known in advance, one can use the `num_speakers` option:

```python
output = pipeline("audio.wav", num_speakers=2)
```
One can also provide lower and/or upper bounds on the number of speakers using the `min_speakers` and `max_speakers` options:

```python
output = pipeline("audio.wav", min_speakers=2, max_speakers=5)
```
On top of the regular speaker diarization, the Community-1 pretrained pipeline returns a new exclusive speaker diarization, available as `output.exclusive_speaker_diarization`.
This feature, backported from our latest commercial model, simplifies the reconciliation of fine-grained speaker diarization timestamps with (sometimes less precise) transcription timestamps.
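Because exclusive turns do not overlap, each transcribed word can be attributed to at most one speaker, for instance by looking up the turn that contains the word's midpoint. A hypothetical sketch (the turn and word timestamps below are made up for illustration; only the idea of non-overlapping turns comes from the pipeline):

```python
# hypothetical exclusive (non-overlapping) speaker turns: (start, end, label)
turns = [(0.0, 3.2, "SPEAKER_00"), (3.2, 7.5, "SPEAKER_01")]
# hypothetical word-level transcription timestamps: (word, start, end)
words = [("hello", 0.4, 0.8), ("world", 3.4, 3.9)]

def speaker_of(word_start, word_end, turns):
    # attribute a word to the speaker whose turn contains its midpoint;
    # since turns are exclusive, at most one turn can match
    mid = (word_start + word_end) / 2
    for start, end, label in turns:
        if start <= mid < end:
            return label
    return None

for word, start, end in words:
    print(word, speaker_of(start, end, turns))
```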
```shell
# make sure git-lfs is installed (https://git-lfs.com)
git lfs install

# create a directory on disk
mkdir /path/to/directory

# when prompted for a password, use an access token with write permissions.
# generate one from your settings: https://huggingface.co/settings/tokens
git clone https://hf.co/pyannote/speaker-diarization-community-1 /path/to/directory/pyannote-speaker-diarization-community-1
```

```python
# load the pipeline from disk (works without internet connection)
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("/path/to/directory/pyannote-speaker-diarization-community-1")

# run the pipeline locally on your computer
output = pipeline("audio.wav")
```
```bibtex
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  booktitle={Proc. INTERSPEECH 2023},
  year={2023},
}

@inproceedings{Wang2023,
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  title={{Wespeaker: A research and production oriented speaker embedding learning toolkit}},
  booktitle={ICASSP 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE},
}

@article{Landini2022,
  author={Landini, Federico and Profant, J{\'a}n and Diez, Mireia and Burget, Luk{\'a}{\v{s}}},
  title={{Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks}},
  journal={Computer Speech \& Language},
  year={2022},
}
```
Training and tuning made possible thanks to GENCI on the Jean Zay supercomputer.