
🎤 VoxID: VAD ASR Speech Recognition Server

Chinese documentation (中文文档)

A high-performance speech recognition service based on Sherpa-ONNX, supporting real-time VAD (Voice Activity Detection), multi-language recognition, and speaker verification.

✨ Features

  • 🌍 Multi-language Support: Supports Chinese, English, Japanese, Korean, Cantonese, and more.
  • 🎯 Smart Voice Detection: Built-in VAD for automatic segmentation and silence filtering.
  • 🔊 Speaker Verification: Supports speaker registration and identification.
  • Real-time Communication: Low-latency real-time transmission based on WebSocket.
  • 📊 Health Monitoring: Provides health checks and status monitoring interfaces.

📋 Requirements

Basic Requirements

  • OS: Linux / macOS / Windows
  • Go Version: 1.21 or higher
  • Memory: 4GB+ recommended
  • Disk: At least 2GB available space (for model files)

Dependencies (Linux)

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y libc++1 libc++abi1 build-essential

# CentOS/RHEL
sudo yum install -y libcxx libcxxabi gcc gcc-c++

🚀 Quick Start

1. Clone the Project

git clone https://github.com/quyangminddock/VoxID.git
cd VoxID

2. Install Go Dependencies

go mod download

3. Download Model Files

⚠️ IMPORTANT: You must manually download the following model files for the project to run.

3.1 Download ASR Model (Required)

Option A: Using wget

# Create directory
mkdir -p models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17

# Download model file
wget -O models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx \
  https://huggingface.co/csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/resolve/main/model.int8.onnx

# Download tokens file
wget -O models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
  https://huggingface.co/csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/resolve/main/tokens.txt

Option B: Using git-lfs (Requires git-lfs installed)

# Install git-lfs
sudo apt-get install git-lfs   # Ubuntu/Debian
# OR
brew install git-lfs           # macOS

# Initialize git-lfs
git lfs install

# Clone model repository
git clone https://huggingface.co/csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17 \
  models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17

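Mainland China Mirror Acceleration

If downloads from HuggingFace are slow, you can use a mirror site:

# Use HF-Mirror
export HF_ENDPOINT=https://hf-mirror.com
# Then run the download commands above.
# Note: HF_ENDPOINT is read by the huggingface_hub tools; wget and git clone ignore it.
# For those commands, replace huggingface.co with hf-mirror.com in the URLs instead.
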
3.2 Download VAD Model (Required)

The Silero VAD model file is downloaded from the silero-vad project:

mkdir -p models/vad/silero_vad

# Download silero_vad.onnx
wget -O models/vad/silero_vad/silero_vad.onnx \
  https://github.com/snakers4/silero-vad/raw/master/files/silero_vad.onnx

3.3 Download Speaker Verification Model (Optional)

If speaker verification is needed:

mkdir -p models/speaker

# Download speaker model
wget -O models/speaker/3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.onnx \
  https://huggingface.co/csukuangfj/speaker-embedding-models/resolve/main/3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.onnx

If speaker verification is not needed, you can disable it in config.json:

{
  "speaker": {
    "enabled": false,
    ...
  }
}
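
Since the service cannot start without the files from steps 3.1 and 3.2, a small pre-flight check can save a failed launch. The following is a minimal sketch, not part of the project: `missingFiles` and `requiredModels` are hypothetical names, and the paths are simply the download targets used above. Add the speaker model path to the list if speaker verification is enabled.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// requiredModels lists the download targets from steps 3.1 and 3.2,
// relative to the project root (hypothetical helper, not project code).
var requiredModels = []string{
	"models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx",
	"models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt",
	"models/vad/silero_vad/silero_vad.onnx",
}

// missingFiles returns the entries of paths that are not regular,
// non-empty files under root.
func missingFiles(root string, paths []string) []string {
	var missing []string
	for _, p := range paths {
		info, err := os.Stat(filepath.Join(root, filepath.FromSlash(p)))
		if err != nil || info.IsDir() || info.Size() == 0 {
			missing = append(missing, p)
		}
	}
	return missing
}

func main() {
	missing := missingFiles(".", requiredModels)
	if len(missing) == 0 {
		fmt.Println("all required model files are present")
		return
	}
	for _, p := range missing {
		fmt.Println("missing:", p)
	}
}
```

Run it from the project root; an empty file counts as missing because an interrupted download can leave a zero-byte file behind.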

4. Configure Dynamic Libraries (Linux)

# Copy dynamic libraries to system directory
sudo cp lib/*.so /usr/lib/
sudo cp lib/ten-vad/lib/Linux/x64/libten_vad.so /usr/lib/

# Or set LD_LIBRARY_PATH (Recommended)
export LD_LIBRARY_PATH=$PWD/lib:$PWD/lib/ten-vad/lib/Linux/x64:$LD_LIBRARY_PATH

5. Create Necessary Directories

mkdir -p logs data/speaker

6. Run the Service

# Option A: Run directly
go run main.go

# Option B: Build and run
go build -o asr_server
./asr_server

7. Verify Service

Visit http://localhost:8080/ to view the test page. Click the "Start System" button to begin the speech recognition test. The page UI may be translated in a future update.


⚙️ Configuration

Main configuration file: config.json

VAD Configuration

The system supports two VAD engines:

Silero VAD (Default)

{
  "vad": {
    "provider": "silero_vad",
    "pool_size": 200,
    "threshold": 0.5,
    "silero_vad": {
      "model_path": "models/vad/silero_vad/silero_vad.onnx",
      "min_silence_duration": 0.1,
      "min_speech_duration": 0.25,
      "max_speech_duration": 8.0,
      "window_size": 512,
      "buffer_size_seconds": 10.0
    }
  }
}

Ten-VAD

{
  "vad": {
    "provider": "ten_vad",
    "ten_vad": {
      "hop_size": 512,
      "min_speech_frames": 12,
      "max_silence_frames": 5
    }
  }
}
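
The Ten-VAD settings are expressed in frames, while the Silero settings are in seconds, so it can help to convert one into the other when tuning. The sketch below assumes that `hop_size` is measured in samples and that the frame counts are counted in hops at the 16 kHz input rate given in the API section; neither is stated explicitly here, and `framesToDuration` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// sampleRate is the input rate stated in the API section (16 kHz).
const sampleRate = 16000

// framesToDuration converts a number of frames, each hopSize samples long,
// into audio time at the given sample rate.
func framesToDuration(frames, hopSize, rate int) time.Duration {
	samples := frames * hopSize
	return time.Duration(samples) * time.Second / time.Duration(rate)
}

func main() {
	const hopSize = 512 // ten_vad.hop_size from the example above
	fmt.Println("one frame:         ", framesToDuration(1, hopSize, sampleRate))
	fmt.Println("min_speech_frames: ", framesToDuration(12, hopSize, sampleRate))
	fmt.Println("max_silence_frames:", framesToDuration(5, hopSize, sampleRate))
}
```

Under these assumptions one frame covers 32 ms, so 12 speech frames correspond to 384 ms and 5 silence frames to 160 ms.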

ASR Configuration

{
  "recognition": {
    "model_path": "models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx",
    "tokens_path": "models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt",
    "language": "auto",
    "num_threads": 16,
    "provider": "cpu"
  }
}

Speaker Verification Configuration

{
  "speaker": {
    "enabled": true,
    "model_path": "models/speaker/3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.onnx",
    "num_threads": 8,
    "threshold": 0.6,
    "data_dir": "data/speaker"
  }
}
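
To give a feel for what `threshold` controls, here is a minimal sketch of threshold-based matching between two speaker embeddings. It assumes cosine similarity as the score, which is a common choice for speaker embeddings but is not confirmed by this README; the function names and the toy vectors are illustrative only.

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns the cosine of the angle between a and b,
// or 0 if the lengths differ, the input is empty, or a norm is zero.
func cosineSimilarity(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// sameSpeaker applies a cutoff such as speaker.threshold (0.6 above).
func sameSpeaker(enrolled, probe []float32, threshold float64) bool {
	return cosineSimilarity(enrolled, probe) >= threshold
}

func main() {
	// Toy vectors for illustration; real embeddings come from the model.
	enrolled := []float32{0.9, 0.1, 0.4}
	probe := []float32{0.8, 0.2, 0.5}
	fmt.Printf("similarity=%.3f match=%v\n",
		cosineSimilarity(enrolled, probe), sameSpeaker(enrolled, probe, 0.6))
}
```

With a score of this kind, raising the threshold makes false accepts less likely and false rejects more likely.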

Server Configuration

{
  "server": {
    "port": 8080,
    "host": "0.0.0.0",
    "read_timeout": 20
  }
}

For more configuration options, please refer to the config.json file.
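
As a rough illustration of how the sections above fit together, the sketch below decodes a subset of the keys into Go structs and applies two sanity checks. The `Config` type, its field names, and `parseConfig` are assumptions made for this example; the project's own loader in `internal/` may be organized differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config mirrors only some of the keys shown in this README. The struct and
// field names are illustrative and may differ from the project's own types.
type Config struct {
	VAD struct {
		Provider  string  `json:"provider"`
		PoolSize  int     `json:"pool_size"`
		Threshold float64 `json:"threshold"`
	} `json:"vad"`
	Speaker struct {
		Enabled   bool    `json:"enabled"`
		Threshold float64 `json:"threshold"`
	} `json:"speaker"`
	Server struct {
		Host string `json:"host"`
		Port int    `json:"port"`
	} `json:"server"`
}

// parseConfig decodes JSON text and rejects obviously invalid values.
func parseConfig(data string) (*Config, error) {
	var cfg Config
	if err := json.Unmarshal([]byte(data), &cfg); err != nil {
		return nil, fmt.Errorf("decode config: %w", err)
	}
	if p := cfg.VAD.Provider; p != "silero_vad" && p != "ten_vad" {
		return nil, fmt.Errorf("vad.provider %q is not silero_vad or ten_vad", p)
	}
	if cfg.Server.Port < 1 || cfg.Server.Port > 65535 {
		return nil, fmt.Errorf("server.port %d is out of range", cfg.Server.Port)
	}
	return &cfg, nil
}

func main() {
	sample := `{
		"vad": {"provider": "silero_vad", "pool_size": 200, "threshold": 0.5},
		"speaker": {"enabled": false},
		"server": {"host": "0.0.0.0", "port": 8080}
	}`
	cfg, err := parseConfig(sample)
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("listening on %s:%d, vad=%s\n", cfg.Server.Host, cfg.Server.Port, cfg.VAD.Provider)
}
```

Keys that the struct does not declare are ignored by `encoding/json`, so the same function also accepts a full config.json.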


🔌 API Usage

WebSocket API

Connect to ws://localhost:8080/ws, and send audio data (16kHz, 16-bit PCM):

const ws = new WebSocket('ws://localhost:8080/ws');

ws.onopen = () => {
  console.log('WebSocket connection established');
  // Send audio buffer
  ws.send(audioBuffer);
};

ws.onmessage = (event) => {
  const result = JSON.parse(event.data);
  console.log('Recognition result:', result);
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

ws.onclose = () => {
  console.log('WebSocket connection closed');
};
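
Many audio sources deliver floating-point samples, which have to be converted before they match the 16-bit PCM format the endpoint expects. The following sketch shows one way to do that in Go and to split the result into pieces for streaming. Little-endian byte order, mono audio, and the 100 ms chunk size are assumptions of this example, not documented requirements, and both helper names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// floatToPCM16 converts samples in [-1, 1] to 16-bit signed PCM bytes.
// Little-endian byte order is an assumption; the README does not state it.
func floatToPCM16(samples []float32) []byte {
	out := make([]byte, 2*len(samples))
	for i, s := range samples {
		if s > 1 {
			s = 1
		} else if s < -1 {
			s = -1
		}
		v := int16(s * 32767)
		binary.LittleEndian.PutUint16(out[2*i:], uint16(v))
	}
	return out
}

// chunkBytes splits pcm into consecutive slices of at most size bytes.
func chunkBytes(pcm []byte, size int) [][]byte {
	if size <= 0 {
		return [][]byte{pcm}
	}
	var chunks [][]byte
	for len(pcm) > 0 {
		n := size
		if n > len(pcm) {
			n = len(pcm)
		}
		chunks = append(chunks, pcm[:n])
		pcm = pcm[n:]
	}
	return chunks
}

func main() {
	// 100 ms of 16 kHz, 16-bit mono audio is 1600 samples, or 3200 bytes.
	const chunkSize = 3200
	silence := make([]float32, 16000) // one second of silence as stand-in audio
	chunks := chunkBytes(floatToPCM16(silence), chunkSize)
	fmt.Printf("%d chunks of up to %d bytes\n", len(chunks), chunkSize)
}
```

Each chunk could then be sent as one binary WebSocket message, in the same way the JavaScript example sends `audioBuffer`.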

HTTP API

Health Check

curl http://localhost:8080/health

Status Monitoring

curl http://localhost:8080/stats

Speaker Registration

curl -X POST http://localhost:8080/api/speaker/register \
  -H "Content-Type: application/json" \
  -d '{"speaker_id": "user123", "audio_data": "..."}'

Speaker Recognition

curl -X POST http://localhost:8080/api/speaker/recognize \
  -H "Content-Type: application/json" \
  -d '{"audio_data": "..."}'

🧪 Testing

The project provides test scripts to verify functionality:

Single File Test

cd test/asr
python audiofile_test.py

Concurrent Stress Test

cd test/asr
python stress_test.py --connections 100 --audio-per-connection 2

Parameter description:

  • --connections: Number of concurrent connections
  • --audio-per-connection: Number of audio files sent per connection

🏛️ System Architecture

┌────────────────────┐    ┌──────────────────────┐    ┌────────────────────┐
│  WebSocket Client  │    │       VAD Pool       │    │     ASR Module     │
│                    │    │                      │    │  (Dynamic Stream)  │
│  ┌──────────────┐  │    │   ┌──────────────┐   │    │  ┌──────────────┐  │
│  │ Audio Stream │◄─┼───►│   │   VAD Inst   │◄──┼───►│  │  Recognizer  │  │
│  └──────────────┘  │    │   └──────────────┘   │    │  └──────────────┘  │
│  ┌──────────────┐  │    │   ┌──────────────┐   │    │                    │
│  │    Result    │  │    │   │ Buffer Queue │   │    │                    │
│  └──────────────┘  │    │   └──────────────┘   │    └────────────────────┘
└────────────────────┘    └──────────────────────┘
          │
          ▼
┌────────────────────┐    ┌──────────────────────┐    ┌────────────────────┐
│  Session Manager   │    │  Speaker ID Module   │    │   Health/Monitor   │
│  ┌──────────────┐  │    │   ┌──────────────┐   │    │                    │
│  │  Conn State  │  │    │   │ Registration │   │    │    Status Stats    │
│  └──────────────┘  │    │   └──────────────┘   │    └────────────────────┘
│  ┌──────────────┐  │    │   ┌──────────────┐   │
│  │ Res Release  │  │    │   │ Feature Ext  │   │
│  └──────────────┘  │    │   └──────────────┘   │
└────────────────────┘    └──────────────────────┘

📂 Project Structure

VoxID/
├── main.go                 # Main entry point
├── config.json             # Configuration file
├── go.mod                  # Go module definition
├── go.sum
├── internal/               # Internal packages
│   ├── bootstrap/          # App startup/init
│   ├── logger/             # Logging module
│   ├── router/             # Router config
│   └── ...
├── lib/                    # Dynamic libraries
│   └── ten-vad/
├── models/                 # Model files (Download required)
│   ├── asr/                # ASR models
│   ├── vad/                # VAD models
│   └── speaker/            # Speaker models
├── static/                 # Static assets
│   ├── index.html          # Test page
│   ├── css/
│   └── js/
├── data/                   # Data storage
│   └── speaker/            # Speaker data
├── logs/                   # Log files
└── test/                   # Test scripts
    ├── asr/
    └── speaker/

🔧 FAQ

1. Model Download Failure

Issue: Downloading from HuggingFace is slow or fails.

Solution:

  • Use a mirror: export HF_ENDPOINT=https://hf-mirror.com
  • Use a proxy.
  • Download manually via browser and place in the corresponding directory.

2. Dynamic Library Load Failure

Issue: Runtime error saying .so file not found.

Solution:

# Set library path
export LD_LIBRARY_PATH=$PWD/lib:$PWD/lib/ten-vad/lib/Linux/x64:$LD_LIBRARY_PATH

# Or copy to system directory
sudo cp lib/*.so /usr/lib/

3. WebSocket Connection Failure

Issue: Frontend cannot connect to WebSocket.

Solution:

  • Check firewall settings, ensure port 8080 is open.
  • Check server.host in config.json.
  • Check logs in logs/app.log.

4. No Recognition Result

Issue: Audio sent successfully but no result returned.

Solution:

  • Confirm audio format: 16kHz, 16-bit PCM.
  • Adjust VAD parameters (threshold, min_speech_duration).
  • Check if audio actually contains speech.
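
If the audio originates from a WAV file, the format can be confirmed by reading the file header before anything is sent. The sketch below walks the RIFF chunk list and checks the `fmt ` chunk for uncompressed PCM at 16 kHz and 16 bits per sample. `checkWAV` is a hypothetical helper written for this example and is not part of the project.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"os"
)

// checkWAV returns nil only if data is a RIFF/WAVE file whose "fmt " chunk
// declares uncompressed PCM (format tag 1), 16 kHz, 16 bits per sample.
func checkWAV(data []byte) error {
	if len(data) < 12 || string(data[0:4]) != "RIFF" || string(data[8:12]) != "WAVE" {
		return errors.New("not a RIFF/WAVE file")
	}
	// Walk the chunk list: 4-byte ID, 4-byte little-endian size, payload.
	for pos := 12; pos+8 <= len(data); {
		id := string(data[pos : pos+4])
		size := int(binary.LittleEndian.Uint32(data[pos+4 : pos+8]))
		body := pos + 8
		if id == "fmt " {
			if size < 16 || body+16 > len(data) {
				return errors.New("truncated fmt chunk")
			}
			format := binary.LittleEndian.Uint16(data[body:])
			rate := binary.LittleEndian.Uint32(data[body+4:])
			bits := binary.LittleEndian.Uint16(data[body+14:])
			if format != 1 || rate != 16000 || bits != 16 {
				return fmt.Errorf("got format=%d rate=%d bits=%d, want 1/16000/16", format, rate, bits)
			}
			return nil
		}
		pos = body + size + size%2 // chunks are padded to an even length
	}
	return errors.New("no fmt chunk found")
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: checkwav FILE.wav")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("read:", err)
		return
	}
	if err := checkWAV(data); err != nil {
		fmt.Println("unsuitable:", err)
		return
	}
	fmt.Println("ok: 16 kHz, 16-bit PCM")
}
```

A file that fails this check would need to be resampled or re-encoded before its samples are streamed to the server.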

5. High Memory Usage

Issue: Memory usage is high after running for a while.

Solution:

  • Adjust vad.pool_size.
  • Reduce pool.worker_count.
  • Enable rate_limit to limit concurrent connections.

📊 Performance Tuning

Key Parameters

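| Parameter               | Description                  | Recommended | Impact                          |
|-------------------------|------------------------------|-------------|---------------------------------|
| vad.pool_size           | VAD Instance Pool Size       | 200         | Affects concurrency capacity    |
| recognition.num_threads | ASR Threads                  | 8-16        | Affects recognition speed       |
| pool.worker_count       | Worker Goroutines            | 500         | Affects max connections         |
| vad.threshold           | VAD Threshold                | 0.5         | Affects detection sensitivity   |
| speaker.threshold       | Speaker Similarity Threshold | 0.6         | Affects identification accuracy |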

Optimization Tips

  1. CPU Optimization: Adjust num_threads based on core count.
  2. Memory Optimization: Reduce pool_size and worker_count.
  3. Latency Optimization: Use ten_vad instead of silero_vad.
  4. Concurrency Optimization: Enable rate_limit to prevent overload.
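
For the CPU tip, one simple starting point is to derive `num_threads` from the number of logical cores and cap it at the upper end of the recommended 8-16 range. This is only a heuristic sketched for illustration; `suggestThreads` is a hypothetical helper, and the best value for a given machine should be confirmed with the stress test.

```go
package main

import (
	"fmt"
	"runtime"
)

// suggestThreads returns the core count, limited to the range [1, 16].
// The cap of 16 is taken from the recommended range in the table above;
// using one thread per core below that cap is an assumption of this sketch.
func suggestThreads(cores int) int {
	const maxThreads = 16
	switch {
	case cores < 1:
		return 1
	case cores > maxThreads:
		return maxThreads
	default:
		return cores
	}
}

func main() {
	cores := runtime.NumCPU()
	fmt.Printf("logical cores: %d, suggested recognition.num_threads: %d\n",
		cores, suggestThreads(cores))
}
```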

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License. However, please note:

  • If using ten-vad (vad.provider set to ten_vad), you must comply with ten-vad's License.
  • If using only silero-vad (vad.provider set to silero_vad), you can follow the MIT License directly.

Please comply with the corresponding open source license based on the VAD engine used.


🙏 Acknowledgements

This project is based on the following excellent open-source projects:


📞 Contact & Support

If you have any questions or suggestions, feel free to:


⭐ Star History

If this project helps you, please give it a Star ⭐️!



Built with ❤️ by quyangminddock
