
Community-1 speaker diarization

This pipeline ingests mono audio sampled at 16kHz and outputs speaker diarization.

  • stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
  • audio files sampled at a different rate are resampled to 16kHz automatically upon loading (a manual preprocessing sketch follows below).
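
For reference, the snippet below sketches what this preprocessing amounts to if you prefer to do it yourself with torchaudio before calling the pipeline. It is not required, since the pipeline performs both steps automatically; the file name and target rate are placeholders.

import torchaudio
import torchaudio.functional as F

# load an arbitrary audio file (any sample rate, any number of channels)
waveform, sample_rate = torchaudio.load("audio.wav")

# downmix to mono by averaging the channels
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# resample to the 16kHz expected by the pipeline
if sample_rate != 16000:
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=16000)
    sample_rate = 16000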

The main improvements brought by Community-1 are:

  • improved speaker assignment and counting
  • simpler reconciliation with transcription timestamps with exclusive speaker diarization
  • easy offline use (i.e. without internet connection)
  • (optionally) hosted on pyannoteAI cloud

Setup

  1. pip install pyannote.audio
  2. Accept pyannote/speaker-diarization-community-1 user conditions
  3. Create access token at hf.co/settings/tokens.

Quick start

# download the pipeline from Huggingface
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token="{huggingface-token}")

# run the pipeline locally on your computer
output = pipeline("audio.wav")

# print the predicted speaker diarization
for turn, speaker in output.speaker_diarization:
    print(f"{speaker} speaks between t={turn.start:.3f}s and t={turn.end:.3f}s")
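
If you want to keep the result around for later analysis, the speaker turns produced by the loop above can, for instance, be dumped to a CSV file. This is only one possible way to store the output, and the file name below is arbitrary.

import csv

# collect speaker turns as (start, end, speaker) rows
rows = [(turn.start, turn.end, speaker) for turn, speaker in output.speaker_diarization]

# write them to a CSV file for later use
with open("audio_diarization.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["start", "end", "speaker"])
    writer.writerows(rows)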

Benchmark

Out of the box, Community-1 is significantly more accurate than the legacy speaker-diarization-3.1 pipeline.

We report diarization error rates (in %, last updated 2025-09) on a large collection of academic benchmarks (fully automatic processing, no forgiveness collar, and no skipping of overlapping speech).

Benchmark                 legacy (3.1)   community-1   precision-2
AISHELL-4                 12.2           11.7          11.4
AliMeeting (channel 1)    24.5           20.3          15.2
AMI (IHM)                 18.8           17.0          12.9
AMI (SDM)                 22.7           19.9          15.6
AVA-AVD                   49.7           44.6          37.1
CALLHOME (part 2)         28.5           26.7          16.6
DIHARD 3 (full)           21.4           20.2          14.7
Ego4D (dev.)              51.2           46.8          39.0
MSDWild                   25.4           22.8          17.3
RAMC                      22.2           20.8          10.5
REPERE (phase 2)          7.9            8.9           7.4
VoxConverse (v0.3)        11.2           11.2          8.5
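
For reference, the same metric can be computed on your own data with pyannote.metrics (a dependency of pyannote.audio). In the sketch below, reference.rttm is a hypothetical ground-truth file, "audio" is assumed to be its recording URI, and output.speaker_diarization is assumed to be accepted as a hypothesis by the metric.

from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

# ground-truth annotation (hypothetical file; load_rttm returns one Annotation per recording URI)
reference = load_rttm("reference.rttm")["audio"]

# same settings as in the table: no forgiveness collar, overlapping speech included
metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
der = metric(reference, output.speaker_diarization)
print(f"diarization error rate = {100 * der:.1f}%")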

The Precision-2 model is even more accurate and can be tested as follows:

  1. Create an API key on pyannoteAI dashboard (free credits included)
  2. Change one line of code
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
-    'pyannote/speaker-diarization-community-1', token="{huggingface-token}")
+    'pyannote/speaker-diarization-precision-2', token="{pyannoteAI-api-key}")
diarization = pipeline("audio.wav")  # runs on pyannoteAI servers

Processing on GPU

pyannote.audio pipelines run on CPU by default. You can send them to GPU with the following lines:

import torch
pipeline.to(torch.device("cuda"))
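
To keep the same script usable on machines with or without a GPU, a common pattern (plain PyTorch, not specific to pyannote.audio) is to pick the device at run time:

import torch

# fall back to CPU when no CUDA device is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline.to(device)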

Processing from memory

Pre-loading audio files in memory may result in faster processing:

import torchaudio

waveform, sample_rate = torchaudio.load("audio.wav")
output = pipeline({"waveform": waveform, "sample_rate": sample_rate})

Monitoring progress

Hooks are available to monitor the progress of the pipeline:

from pyannote.audio.pipelines.utils.hook import ProgressHook

with ProgressHook() as hook:
    output = pipeline("audio.wav", hook=hook)

Controlling the number of speakers

In case the number of speakers is known in advance, one can use the num_speakers option:

output = pipeline("audio.wav", num_speakers=2)

One can also provide lower and/or upper bounds on the number of speakers using min_speakers and max_speakers options:

output = pipeline("audio.wav", min_speakers=2, max_speakers=5)

Exclusive speaker diarization

The Community-1 pretrained pipeline returns a new exclusive speaker diarization on top of the regular speaker diarization, available as output.exclusive_speaker_diarization.

This feature, backported from our latest commercial model, simplifies the reconciliation between fine-grained speaker diarization timestamps and (sometimes less precise) transcription timestamps.
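
As an illustration, and assuming output.exclusive_speaker_diarization iterates over (turn, speaker) pairs like the regular output and that its turns do not overlap, the sketch below assigns made-up word timestamps (as would come from a transcription system) to speakers using each word's midpoint:

# hypothetical word-level timestamps, e.g. produced by a transcription system
words = [
    {"word": "hello", "start": 0.52, "end": 0.91},
    {"word": "there", "start": 0.95, "end": 1.30},
]

# collect exclusive speaker turns (assumed to iterate like the regular output)
exclusive_turns = list(output.exclusive_speaker_diarization)

# assign each word to the speaker active at its midpoint
for word in words:
    midpoint = 0.5 * (word["start"] + word["end"])
    speaker = next(
        (spk for turn, spk in exclusive_turns if turn.start <= midpoint <= turn.end),
        None,  # no speaker active at that instant
    )
    print(f'{word["word"]}: {speaker}')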

Offline use

  1. In the terminal, copy the pipeline to disk:
# make sure git-lfs is installed (https://git-lfs.com)
git lfs install

# create a directory on disk
mkdir /path/to/directory

# when prompted for a password, use an access token with write permissions.
# generate one from your settings: https://huggingface.co/settings/tokens
git clone https://hf.co/pyannote/speaker-diarization-community-1 /path/to/directory/pyannote-speaker-diarization-community-1
  2. In Python, use the pipeline without an internet connection:
# load pipeline from disk (works without internet connection)
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained('/path/to/directory/pyannote-speaker-diarization-community-1')

# run the pipeline locally on your computer
output = pipeline("audio.wav")

Citations

  1. Speaker segmentation model
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year={2023},
  booktitle={Proc. INTERSPEECH 2023},
}
  2. Speaker embedding model
@inproceedings{Wang2023,
  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  booktitle={ICASSP 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
  3. Speaker clustering
@article{Landini2022,
  author={Landini, Federico and Profant, J{\'a}n and Diez, Mireia and Burget, Luk{\'a}{\v{s}}},
  title={{Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks}},
  year={2022},
  journal={Computer Speech \& Language},
}

Acknowledgment

Training and tuning made possible thanks to GENCI on the Jean Zay supercomputer.