A high-performance speech recognition service based on Sherpa-ONNX, supporting real-time VAD (Voice Activity Detection), multi-language recognition, and speaker verification.
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y libc++1 libc++abi1 build-essential
# CentOS/RHEL
sudo yum install -y libcxx libcxxabi gcc gcc-c++
git clone https://github.com/quyangminddock/VoxID.git
cd VoxID
go mod download
⚠️ IMPORTANT: You must manually download the following model files for the project to run.
Option A: Using wget
# Create directory
mkdir -p models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17
# Download model file
wget -O models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx \
https://huggingface.co/csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/resolve/main/model.int8.onnx
# Download tokens file
wget -O models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
https://huggingface.co/csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/resolve/main/tokens.txt
Option B: Using git-lfs (Requires git-lfs installed)
# Install git-lfs
sudo apt-get install git-lfs # Ubuntu/Debian
# OR
brew install git-lfs # macOS
# Initialize git-lfs
git lfs install
# Clone model repository
git clone https://huggingface.co/csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17 \
models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17
Mainland China mirror acceleration: if downloading from HuggingFace is slow, you can use a mirror site:
# Use HF-Mirror
export HF_ENDPOINT=https://hf-mirror.com
# Then run the download commands above
The Silero VAD model file needs to be downloaded from the silero-vad project:
mkdir -p models/vad/silero_vad
# Download silero_vad.onnx
wget -O models/vad/silero_vad/silero_vad.onnx \
https://github.com/snakers4/silero-vad/raw/master/files/silero_vad.onnx
If speaker verification is needed:
mkdir -p models/speaker
# Download speaker model
wget -O models/speaker/3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.onnx \
https://huggingface.co/csukuangfj/speaker-embedding-models/resolve/main/3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.onnx
If speaker verification is not needed, you can disable it in config.json:
{
"speaker": {
"enabled": false,
...
}
}
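The same section can be read programmatically with Go's encoding/json; a minimal sketch (the struct names here are illustrative, not the project's internal types):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SpeakerConfig mirrors the "speaker" section of config.json.
// Illustrative only; VoxID's internal config types may differ.
type SpeakerConfig struct {
	Enabled   bool    `json:"enabled"`
	ModelPath string  `json:"model_path"`
	Threshold float64 `json:"threshold"`
}

type Config struct {
	Speaker SpeakerConfig `json:"speaker"`
}

// parseConfig decodes a config.json payload into the Config struct.
func parseConfig(data []byte) (Config, error) {
	var cfg Config
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	raw := []byte(`{"speaker": {"enabled": false, "threshold": 0.6}}`)
	cfg, err := parseConfig(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("speaker enabled:", cfg.Speaker.Enabled)
}
```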
# Copy dynamic libraries to system directory
sudo cp lib/*.so /usr/lib/
sudo cp lib/ten-vad/lib/Linux/x64/libten_vad.so /usr/lib/
# Or set LD_LIBRARY_PATH (Recommended)
export LD_LIBRARY_PATH=$PWD/lib:$PWD/lib/ten-vad/lib/Linux/x64:$LD_LIBRARY_PATH
mkdir -p logs data/speaker
# Option A: Run directly
go run main.go
# Option B: Build and run
go build -o asr_server
./asr_server
Visit http://localhost:8080/ to view the test page. Click the "Start System" button to begin the speech recognition test.
Main configuration file: config.json
The system supports two VAD engines:
{
"vad": {
"provider": "silero_vad",
"pool_size": 200,
"threshold": 0.5,
"silero_vad": {
"model_path": "models/vad/silero_vad/silero_vad.onnx",
"min_silence_duration": 0.1,
"min_speech_duration": 0.25,
"max_speech_duration": 8.0,
"window_size": 512,
"buffer_size_seconds": 10.0
}
}
}
{
"vad": {
"provider": "ten_vad",
"ten_vad": {
"hop_size": 512,
"min_speech_frames": 12,
"max_silence_frames": 5
}
}
}
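Since ten_vad counts in frames rather than seconds, converting helps when tuning. Assuming hop_size is in samples at the service's 16 kHz input rate, one hop of 512 samples is 32 ms, so the defaults above correspond to roughly 384 ms of speech to open a segment and 160 ms of silence to close it:

```go
package main

import "fmt"

// framesToMs converts a frame count to milliseconds, given the hop size
// in samples and the sample rate in Hz. Integer math keeps the result
// exact for the values used here.
func framesToMs(frames, hopSize, sampleRate int) int {
	return frames * hopSize * 1000 / sampleRate
}

func main() {
	// With hop_size 512 at 16 kHz:
	fmt.Println("min speech:", framesToMs(12, 512, 16000), "ms")
	fmt.Println("end silence:", framesToMs(5, 512, 16000), "ms")
}
```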
{
"recognition": {
"model_path": "models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx",
"tokens_path": "models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt",
"language": "auto",
"num_threads": 16,
"provider": "cpu"
}
}
{
"speaker": {
"enabled": true,
"model_path": "models/speaker/3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.onnx",
"num_threads": 8,
"threshold": 0.6,
"data_dir": "data/speaker"
}
}
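The speaker.threshold of 0.6 is compared against an embedding similarity score. The project's exact scoring function isn't shown here, but cosine similarity is the standard metric for speaker embeddings; a sketch:

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns the cosine similarity of two embedding vectors.
// Scores at or above the configured threshold (0.6 by default) would be
// treated as the same speaker.
func cosineSimilarity(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	enrolled := []float64{0.2, 0.8, 0.1} // toy vectors; real embeddings are much longer
	probe := []float64{0.25, 0.75, 0.05}
	score := cosineSimilarity(enrolled, probe)
	fmt.Printf("similarity = %.3f, match = %v\n", score, score >= 0.6)
}
```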
{
"server": {
"port": 8080,
"host": "0.0.0.0",
"read_timeout": 20
}
}
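These values map naturally onto Go's net/http server. A sketch of that mapping (not VoxID's actual bootstrap code), assuming read_timeout is in seconds:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// buildServer maps the "server" section of config.json onto an http.Server.
// Illustrative only; it assumes read_timeout is expressed in seconds.
func buildServer(host string, port, readTimeoutSec int) *http.Server {
	return &http.Server{
		Addr:        fmt.Sprintf("%s:%d", host, port),
		ReadTimeout: time.Duration(readTimeoutSec) * time.Second,
	}
}

func main() {
	srv := buildServer("0.0.0.0", 8080, 20)
	fmt.Println("would listen on", srv.Addr)
	// To actually serve: log.Fatal(srv.ListenAndServe())
}
```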
For more configuration options, please refer to the config.json file.
Connect to ws://localhost:8080/ws, and send audio data (16kHz, 16-bit PCM):
const ws = new WebSocket('ws://localhost:8080/ws');
ws.onopen = () => {
console.log('WebSocket connection established');
// Send audio buffer
ws.send(audioBuffer);
};
ws.onmessage = (event) => {
const result = JSON.parse(event.data);
console.log('Recognition result:', result);
};
ws.onerror = (error) => {
console.error('WebSocket error:', error);
};
ws.onclose = () => {
console.log('WebSocket connection closed');
};
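Browsers capture audio as normalized float32 samples, so they must be converted to 16-bit PCM before sending. The same conversion in Go, useful for non-browser clients (the function name is ours, not part of the project):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// floatToPCM16 converts normalized float32 samples (-1.0..1.0) into the
// 16-bit little-endian PCM bytes the /ws endpoint expects.
func floatToPCM16(samples []float32) []byte {
	out := make([]byte, 2*len(samples))
	for i, s := range samples {
		// Clamp before scaling to avoid overflow on out-of-range input.
		if s > 1 {
			s = 1
		} else if s < -1 {
			s = -1
		}
		v := int16(math.Round(float64(s) * 32767))
		binary.LittleEndian.PutUint16(out[2*i:], uint16(v))
	}
	return out
}

func main() {
	pcm := floatToPCM16([]float32{0, 0.5, -1})
	fmt.Printf("% x\n", pcm)
}
```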
curl http://localhost:8080/health
curl http://localhost:8080/stats
curl -X POST http://localhost:8080/api/speaker/register \
-H "Content-Type: application/json" \
-d '{"speaker_id": "user123", "audio_data": "..."}'
curl -X POST http://localhost:8080/api/speaker/recognize \
-H "Content-Type: application/json" \
-d '{"audio_data": "..."}'
The project provides test scripts to verify functionality:
cd test/asr
python audiofile_test.py
cd test/asr
python stress_test.py --connections 100 --audio-per-connection 2
Parameter description:
--connections: Number of concurrent connections
--audio-per-connection: Number of audio files sent per connection

┌────────────────────┐     ┌────────────────────┐     ┌────────────────────┐
│  WebSocket Client  │     │      VAD Pool      │     │     ASR Module     │
│                    │     │                    │     │  (Dynamic Stream)  │
│  ┌──────────────┐  │     │  ┌──────────────┐  │     │  ┌──────────────┐  │
│  │ Audio Stream │◄─┼────►│  │   VAD Inst   │◄─┼────►│  │  Recognizer  │  │
│  └──────────────┘  │     │  └──────────────┘  │     │  └──────────────┘  │
│  ┌──────────────┐  │     │                    │     │  ┌──────────────┐  │
│  │ Buffer Queue │  │     │                    │     │  │    Result    │  │
│  └──────────────┘  │     │                    │     │  └──────────────┘  │
└─────────┬──────────┘     └────────────────────┘     └────────────────────┘
          ▼
┌────────────────────┐     ┌────────────────────┐     ┌────────────────────┐
│  Session Manager   │     │ Speaker ID Module  │     │   Health/Monitor   │
│  ┌──────────────┐  │     │  ┌──────────────┐  │     │                    │
│  │  Conn State  │  │     │  │ Registration │  │     │    Status Stats    │
│  └──────────────┘  │     │  └──────────────┘  │     │                    │
│  ┌──────────────┐  │     │  ┌──────────────┐  │     └────────────────────┘
│  │ Res Release  │  │     │  │ Feature Ext  │  │
│  └──────────────┘  │     │  └──────────────┘  │
└────────────────────┘     └────────────────────┘
VoxID/
├── main.go          # Main entry point
├── config.json      # Configuration file
├── go.mod           # Go module definition
├── go.sum
├── internal/        # Internal packages
│   ├── bootstrap/   # App startup/init
│   ├── logger/      # Logging module
│   ├── router/      # Router config
│   └── ...
├── lib/             # Dynamic libraries
│   └── ten-vad/
├── models/          # Model files (download required)
│   ├── asr/         # ASR models
│   ├── vad/         # VAD models
│   └── speaker/     # Speaker models
├── static/          # Static assets
│   ├── index.html   # Test page
│   ├── css/
│   └── js/
├── data/            # Data storage
│   └── speaker/     # Speaker data
├── logs/            # Log files
└── test/            # Test scripts
    ├── asr/
    └── speaker/
Issue: Downloading from HuggingFace is slow or fails.
Solution:
export HF_ENDPOINT=https://hf-mirror.com

Issue: Runtime error saying a .so file was not found.
Solution:
# Set library path
export LD_LIBRARY_PATH=$PWD/lib:$PWD/lib/ten-vad/lib/Linux/x64:$LD_LIBRARY_PATH
# Or copy to system directory
sudo cp lib/*.so /usr/lib/
Issue: Frontend cannot connect to WebSocket.
Solution:
- Confirm the server is listening on the expected address (server.host in config.json).
- Check the logs in logs/app.log for errors.

Issue: Audio sent successfully but no result returned.
Solution:
- Confirm the audio is 16kHz, 16-bit PCM.
- Adjust the VAD parameters (threshold, min_speech_duration).

Issue: Memory usage is high after running for a while.
Solution:
- Reduce vad.pool_size.
- Reduce pool.worker_count.
- Enable rate_limit to limit concurrent connections.

| Parameter | Description | Recommended | Impact |
|---|---|---|---|
| vad.pool_size | VAD instance pool size | 200 | Affects concurrency capacity |
| recognition.num_threads | ASR threads | 8-16 | Affects recognition speed |
| pool.worker_count | Worker goroutines | 500 | Affects max connections |
| vad.threshold | VAD threshold | 0.5 | Affects detection sensitivity |
| speaker.threshold | Speaker similarity threshold | 0.6 | Affects identification accuracy |
Tuning tips:
- Set num_threads based on the machine's core count.
- Increase pool_size and worker_count for higher concurrency.
- Consider ten_vad instead of silero_vad if it better fits your workload.
- Enable rate_limit to prevent overload.

Contributions are welcome! Please follow these steps:
1. Fork the project
2. Create a feature branch (git checkout -b feature/AmazingFeature)
3. Commit your changes (git commit -m 'Add some AmazingFeature')
4. Push to the branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

This project is licensed under the MIT License. However, please note:
- If you use the TEN VAD engine (vad.provider set to ten_vad), you must comply with ten-vad's license.
- If you use the Silero VAD engine (vad.provider set to silero_vad), you can follow the MIT License directly.

Please comply with the corresponding open source license based on the VAD engine used.
This project is based on the following excellent open-source projects:
If you have any questions or suggestions, feel free to:
If this project helps you, please give it a Star ⭐️!