
🎤 VoxID: VAD ASR Speech Recognition Server

Chinese Documentation (中文文档)

A high-performance speech recognition service based on Sherpa-ONNX, supporting real-time VAD (Voice Activity Detection), multi-language recognition, and speaker verification.

✨ Features

  • 🌍 Multi-language Support: Supports Chinese, English, Japanese, Korean, Cantonese, and more.
  • 🎯 Smart Voice Detection: Built-in VAD for automatic segmentation and silence filtering.
  • 🔊 Speaker Verification: Supports speaker registration and identification.
  • Real-time Communication: Low-latency real-time transmission based on WebSocket.
  • 📊 Health Monitoring: Provides health checks and status monitoring interfaces.

📋 Requirements

Basic Requirements

  • OS: Linux / macOS / Windows
  • Go Version: 1.21 or higher
  • Memory: 4GB+ recommended
  • Disk: At least 2GB available space (for model files)

Dependencies (Linux)

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y libc++1 libc++abi1 build-essential

# CentOS/RHEL
sudo yum install -y libcxx libcxxabi gcc gcc-c++

🚀 Quick Start

1. Clone the Project

git clone https://github.com/quyangminddock/VoxID.git
cd VoxID

2. Install Go Dependencies

go mod download

3. Download Model Files

⚠️ IMPORTANT: You must manually download the following model files for the project to run.

3.1 Download ASR Model (Required)

Option A: Using wget

# Create directory
mkdir -p models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17

# Download model file
wget -O models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx \
  https://huggingface.co/csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/resolve/main/model.int8.onnx

# Download tokens file
wget -O models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
  https://huggingface.co/csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/resolve/main/tokens.txt

Option B: Using git-lfs (Requires git-lfs installed)

# Install git-lfs
sudo apt-get install git-lfs  # Ubuntu/Debian
# OR
brew install git-lfs          # macOS

# Initialize git-lfs
git lfs install

# Clone model repository
git clone https://huggingface.co/csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17 \
  models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17

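Mainland China Mirror Acceleration

If downloads from HuggingFace are slow, you can use a mirror site:

# Use HF-Mirror
export HF_ENDPOINT=https://hf-mirror.com

# HF_ENDPOINT is only read by Hugging Face tooling such as huggingface-cli.
# For the wget and git clone commands above, replace huggingface.co with
# hf-mirror.com in the URL instead.

3.2 Download VAD Model (Required)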

Download the Silero VAD model file from the Silero VAD repository:

mkdir -p models/vad/silero_vad

# Download silero_vad.onnx
wget -O models/vad/silero_vad/silero_vad.onnx \
  https://github.com/snakers4/silero-vad/raw/master/files/silero_vad.onnx

3.3 Download Speaker Verification Model (Optional)

If speaker verification is needed:

mkdir -p models/speaker

# Download speaker model
wget -O models/speaker/3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.onnx \
  https://huggingface.co/csukuangfj/speaker-embedding-models/resolve/main/3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.onnx

If speaker verification is not needed, you can disable it in config.json:

{
  "speaker": {
    "enabled": false,
    ...
  }
}
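Because the server cannot start without its models, it can help to check the download step before launching. The following is a minimal sketch, not part of the project: the function name `missing_model_files` is hypothetical, and the paths are simply the ones used in the download commands above.

```python
from pathlib import Path

ASR_DIR = "models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17"

# Paths taken from the download steps in this section.
REQUIRED = [
    f"{ASR_DIR}/model.int8.onnx",
    f"{ASR_DIR}/tokens.txt",
    "models/vad/silero_vad/silero_vad.onnx",
]
OPTIONAL_SPEAKER = [
    "models/speaker/3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.onnx",
]


def missing_model_files(root=".", speaker_enabled=True):
    """Return the expected model paths that are absent or empty under root."""
    expected = REQUIRED + (OPTIONAL_SPEAKER if speaker_enabled else [])
    missing = []
    for rel in expected:
        path = Path(root) / rel
        # An empty file usually means an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

Calling `missing_model_files(".")` from the project root returns an empty list when every file is in place. Pass `speaker_enabled=False` if speaker verification is disabled in config.json.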

4. Configure Dynamic Libraries (Linux)

# Copy dynamic libraries to system directory
sudo cp lib/*.so /usr/lib/
sudo cp lib/ten-vad/lib/Linux/x64/libten_vad.so /usr/lib/

# Or set LD_LIBRARY_PATH (Recommended)
export LD_LIBRARY_PATH=$PWD/lib:$PWD/lib/ten-vad/lib/Linux/x64:$LD_LIBRARY_PATH
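If the server is launched from a script rather than an interactive shell, the same library path can be built programmatically. This is a sketch under assumptions: `with_library_path` is a hypothetical helper, and the two directories are the ones named in the export line above. It also drops empty entries, which the dynamic loader would otherwise interpret as the current working directory.

```python
import os


def with_library_path(project_root, env=None):
    """Return a copy of env with the project's library directories
    prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if env is None else env)
    dirs = [
        os.path.join(project_root, "lib"),
        os.path.join(project_root, "lib", "ten-vad", "lib", "Linux", "x64"),
    ]
    existing = env.get("LD_LIBRARY_PATH", "")
    # Skip empty elements so that no implicit "current directory" entry
    # ends up in the search path.
    parts = dirs + [p for p in existing.split(":") if p]
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    return env
```

A launcher could then run `subprocess.run(["./asr_server"], env=with_library_path(os.getcwd()))`, where `asr_server` is the binary produced by the build step.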

5. Create Necessary Directories

mkdir -p logs data/speaker

6. Run the Service

# Option A: Run directly
go run main.go

# Option B: Build and run
go build -o asr_server
./asr_server

7. Verify Service

Visit http://localhost:8080/ to view the test page. Click the "Start System" button to begin the speech recognition test. The page UI may be translated in a future update.


⚙️ Configuration

Main configuration file: config.json

VAD Configuration

The system supports two VAD engines:

Silero VAD (Default)

{
  "vad": {
    "provider": "silero_vad",
    "pool_size": 200,
    "threshold": 0.5,
    "silero_vad": {
      "model_path": "models/vad/silero_vad/silero_vad.onnx",
      "min_silence_duration": 0.1,
      "min_speech_duration": 0.25,
      "max_speech_duration": 8.0,
      "window_size": 512,
      "buffer_size_seconds": 10.0
    }
  }
}

Ten-VAD

{
  "vad": {
    "provider": "ten_vad",
    "ten_vad": {
      "hop_size": 512,
      "min_speech_frames": 12,
      "max_silence_frames": 5
    }
  }
}
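The VAD settings above mix seconds, samples, and frame counts, so it can be useful to convert them to a common unit when tuning. The sketch below assumes 16 kHz input (the format the WebSocket API expects). For Ten-VAD it further assumes that one frame spans `hop_size` samples and that `min_speech_frames` and `max_silence_frames` count such frames. That interpretation is an assumption, not something this README states.

```python
SAMPLE_RATE = 16000  # Hz; assumed from the 16 kHz audio format used by the API


def samples_to_ms(samples, sample_rate=SAMPLE_RATE):
    """Duration in milliseconds of a given number of samples."""
    return samples * 1000.0 / sample_rate


def seconds_to_windows(seconds, window_size, sample_rate=SAMPLE_RATE):
    """How many analysis windows fit into a duration given in seconds."""
    return seconds * sample_rate / window_size


# Values copied from the two example configurations.
silero = {"window_size": 512, "min_silence_duration": 0.1}
ten = {"hop_size": 512, "min_speech_frames": 12, "max_silence_frames": 5}

print("Silero window (ms):", samples_to_ms(silero["window_size"]))
print("Silero min silence (windows):",
      seconds_to_windows(silero["min_silence_duration"], silero["window_size"]))
# Assumed interpretation: frame count * hop_size samples.
print("Ten-VAD min speech (ms):",
      samples_to_ms(ten["min_speech_frames"] * ten["hop_size"]))
print("Ten-VAD max silence (ms):",
      samples_to_ms(ten["max_silence_frames"] * ten["hop_size"]))
```

Under these assumptions a 512-sample window covers 32 ms, so the Silero `min_silence_duration` of 0.1 s corresponds to roughly three windows.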

ASR Configuration

{
  "recognition": {
    "model_path": "models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx",
    "tokens_path": "models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt",
    "language": "auto",
    "num_threads": 16,
    "provider": "cpu"
  }
}

Speaker Verification Configuration

{
  "speaker": {
    "enabled": true,
    "model_path": "models/speaker/3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.onnx",
    "num_threads": 8,
    "threshold": 0.6,
    "data_dir": "data/speaker"
  }
}

Server Configuration

{
  "server": {
    "port": 8080,
    "host": "0.0.0.0",
    "read_timeout": 20
  }
}
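A client needs to turn the `server` block into concrete URLs. The following sketch shows one way to do that; `endpoints` is a hypothetical helper, and the `/ws`, `/health`, and `/stats` paths are the ones documented in the API section. Note that `0.0.0.0` is a bind-all address for the server, so a client on the same machine connects through loopback instead.

```python
import json


def endpoints(config):
    """Derive client-side URLs from the "server" block of config.json."""
    server = config["server"]
    host = server["host"]
    # 0.0.0.0 means "listen on all interfaces"; it is not a useful
    # destination address, so fall back to loopback for local clients.
    if host in ("0.0.0.0", ""):
        host = "127.0.0.1"
    base = f"{host}:{server['port']}"
    return {
        "ws": f"ws://{base}/ws",
        "health": f"http://{base}/health",
        "stats": f"http://{base}/stats",
    }


sample = json.loads('{"server": {"port": 8080, "host": "0.0.0.0", "read_timeout": 20}}')
print(endpoints(sample)["ws"])  # ws://127.0.0.1:8080/ws
```

For remote clients, replace the loopback address with the machine's reachable hostname or IP.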

For more configuration options, please refer to the config.json file.


🔌 API Usage

WebSocket API

Connect to ws://localhost:8080/ws, and send audio data (16kHz, 16-bit PCM):

const ws = new WebSocket('ws://localhost:8080/ws');

ws.onopen = () => {
  console.log('WebSocket connection established');
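  // Send audio: audioBuffer is an ArrayBuffer of 16 kHz, 16-bit PCM samples
  // produced by your own capture code (it is not defined in this snippet)
  ws.send(audioBuffer);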
};

ws.onmessage = (event) => {
  const result = JSON.parse(event.data);
  console.log('Recognition result:', result);
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

ws.onclose = () => {
  console.log('WebSocket connection closed');
};
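Since the endpoint expects raw 16 kHz, 16-bit PCM, a client that starts from a WAV file has to strip the header and verify the format first. The sketch below uses only Python's standard `wave` module. The function name `pcm_chunks`, the 100 ms chunk size, and the mono requirement are assumptions for illustration and are not stated by this README. Sending the chunks still requires a WebSocket client library of your choice.

```python
import wave


def pcm_chunks(wav_path, chunk_ms=100):
    """Yield raw 16-bit PCM byte chunks from a 16 kHz WAV file."""
    with wave.open(wav_path, "rb") as wav:
        if wav.getframerate() != 16000:
            raise ValueError(f"expected 16000 Hz, got {wav.getframerate()}")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        if wav.getnchannels() != 1:  # mono is an assumption, see lead-in
            raise ValueError("expected mono audio")
        frames_per_chunk = 16000 * chunk_ms // 1000
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:  # empty bytes signal end of file
                break
            yield data
```

Each yielded chunk can be passed as a binary message to the WebSocket connection, mirroring `ws.send(audioBuffer)` in the JavaScript example.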

HTTP API

Health Check

curl http://localhost:8080/health

Status Monitoring

curl http://localhost:8080/stats
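For scripted monitoring, the health endpoint can be polled without curl. This is a minimal sketch with a hypothetical helper name, `is_healthy`. Because the response body format is not documented here, it only looks at the HTTP status and treats 200 as healthy.

```python
import urllib.request


def is_healthy(base_url="http://localhost:8080", timeout=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError both derive from OSError).
        return False
```

A deployment script could call `is_healthy()` in a retry loop after starting the server and proceed only once it returns True.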

Speaker Registration

curl -X POST http://localhost:8080/api/speaker/register \
  -H "Content-Type: application/json" \
  -d '{"speaker_id": "user123", "audio_data": "..."}'

Speaker Recognition

curl -X POST http://localhost:8080/api/speaker/recognize \
  -H "Content-Type: application/json" \
  -d '{"audio_data": "..."}'
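The two curl calls above can be mirrored from Python with the standard library. In this sketch the helper names are hypothetical, and `audio_data` is passed through untouched because its expected encoding is elided in the examples above.

```python
import json
import urllib.request


def speaker_request(base_url, endpoint, payload):
    """Build a JSON POST request for /api/speaker/<endpoint>."""
    return urllib.request.Request(
        f"{base_url}/api/speaker/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def register_request(base_url, speaker_id, audio_data):
    # audio_data is forwarded as-is; its encoding is not specified here.
    return speaker_request(
        base_url, "register",
        {"speaker_id": speaker_id, "audio_data": audio_data},
    )


def recognize_request(base_url, audio_data):
    return speaker_request(base_url, "recognize", {"audio_data": audio_data})
```

To send a request, pass the returned object to `urllib.request.urlopen(...)` and read the response body. The response schema is not documented in this section, so inspect it before relying on specific fields.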

🧪 Testing

The project provides test scripts to verify functionality:

Single File Test

cd test/asr
python audiofile_test.py

Concurrent Stress Test

cd test/asr
python stress_test.py --connections 100 --audio-per-connection 2

Parameter description:

  • --connections: Number of concurrent connections
  • --audio-per-connection: Number of audio files sent per connection

🏛️ System Architecture

┌────────────────────┐    ┌──────────────────────┐    ┌────────────────────┐
│  WebSocket Client  │    │      VAD Pool        │    │     ASR Module     │
│                    │    │                      │    │ (Dynamic Stream)   │
│  ┌──────────────┐  │    │  ┌──────────────┐    │    │  ┌──────────────┐  │
│  │ Audio Stream │◄─┼───►│  │   VAD Inst   │◄───┼───►│  │  Recognizer  │  │
│  └──────────────┘  │    │  └──────────────┘    │    │  └──────────────┘  │
│  ┌──────────────┐  │    │  ┌──────────────┐    │    │                    │
│  │    Result    │  │    │  │ Buffer Queue │    │    │                    │
│  └──────────────┘  │    │  └──────────────┘    │    └────────────────────┘
└────────────────────┘    └──────────────────────┘             │
                                                               ▼
┌────────────────────┐    ┌──────────────────────┐    ┌────────────────────┐
│   Session Manager  │    │    Speaker ID Module │    │   Health/Monitor   │
│  ┌──────────────┐  │    │  ┌──────────────┐    │    │                    │
│  │  Conn State  │  │    │  │ Registration │    │    │   Status Stats     │
│  └──────────────┘  │    │  └──────────────┘    │    └────────────────────┘
│  ┌──────────────┐  │    │  ┌──────────────┐    │
│  │ Res Release  │  │    │  │ Feature Ext  │    │
│  └──────────────┘  │    │  └──────────────┘    │
└────────────────────┘    └──────────────────────┘

📂 Project Structure

VoxID/
├── main.go                 # Main entry point
├── config.json             # Configuration file
├── go.mod                  # Go module definition
├── go.sum
├── internal/               # Internal packages
│   ├── bootstrap/          # App startup/init
│   ├── logger/             # Logging module
│   ├── router/             # Router config
│   └── ...
├── lib/                    # Dynamic libraries
│   └── ten-vad/
├── models/                 # Model files (Download required)
│   ├── asr/                # ASR models
│   ├── vad/                # VAD models
│   └── speaker/            # Speaker models
├── static/                 # Static assets
│   ├── index.html          # Test page
│   ├── css/
│   └── js/
├── data/                   # Data storage
│   └── speaker/            # Speaker data
├── logs/                   # Log files
└── test/                   # Test scripts
    ├── asr/
    └── speaker/

🔧 FAQ

1. Model Download Failure

Issue: Downloading from HuggingFace is slow or fails.

Solution:

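  • Use a mirror: replace huggingface.co with hf-mirror.com in the download URLs, or set export HF_ENDPOINT=https://hf-mirror.com when downloading with Hugging Face tooling.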
  • Use a proxy.
  • Download manually via browser and place in the corresponding directory.

2. Dynamic Library Load Failure

Issue: A runtime error reports that a .so file cannot be found.

Solution:

# Set library path
export LD_LIBRARY_PATH=$PWD/lib:$PWD/lib/ten-vad/lib/Linux/x64:$LD_LIBRARY_PATH

# Or copy to system directory
sudo cp lib/*.so /usr/lib/

3. WebSocket Connection Failure

Issue: Frontend cannot connect to WebSocket.

Solution:

  • Check firewall settings, ensure port 8080 is open.
  • Check server.host in config.json.
  • Check logs in logs/app.log.

4. No Recognition Result

Issue: Audio is sent successfully, but no result is returned.

Solution:

  • Confirm audio format: 16kHz, 16-bit PCM.
  • Adjust VAD parameters (threshold, min_speech_duration).
  • Check if audio actually contains speech.

5. High Memory Usage

Issue: Memory usage is high after running for a while.

Solution:

  • Adjust vad.pool_size.
  • Reduce pool.worker_count.
  • Enable rate_limit to limit concurrent connections.

📊 Performance Tuning

Key Parameters

Parameter                 Description                    Recommended   Impact
vad.pool_size             VAD Instance Pool Size         200           Affects concurrency capacity
recognition.num_threads   ASR Threads                    8-16          Affects recognition speed
pool.worker_count         Worker Goroutines              500           Affects max connections
vad.threshold             VAD Threshold                  0.5           Affects detection sensitivity
speaker.threshold         Speaker Similarity Threshold   0.6           Affects identification accuracy

Optimization Tips

  1. CPU Optimization: Adjust num_threads based on core count.
  2. Memory Optimization: Reduce pool_size and worker_count.
  3. Latency Optimization: Use ten_vad instead of silero_vad.
  4. Concurrency Optimization: Enable rate_limit to prevent overload.
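The CPU tip above can be turned into a small helper that picks `recognition.num_threads` from the machine's core count. This is only a heuristic sketch: the function names are hypothetical, and the ceiling of 16 is simply the upper end of the recommended 8-16 range, not a limit enforced by the server.

```python
import copy
import os


def suggest_num_threads(cores=None, ceiling=16):
    """Clamp the thread count between 1 and ceiling, based on core count."""
    cores = cores or os.cpu_count() or 1
    return max(1, min(cores, ceiling))


def tune(config, cores=None):
    """Return a copy of the config with recognition.num_threads adjusted."""
    tuned = copy.deepcopy(config)
    tuned.setdefault("recognition", {})["num_threads"] = suggest_num_threads(cores)
    return tuned


print(tune({"recognition": {"num_threads": 16}}, cores=8))
```

On an 8-core machine this lowers `num_threads` from 16 to 8. Measure recognition latency under your own load before adopting the value.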

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License. However, please note:

  • If you use ten-vad (vad.provider set to ten_vad), you must comply with ten-vad's license.
  • If you use only silero-vad (vad.provider set to silero_vad), you only need to comply with the MIT License.

Comply with the open-source license that corresponds to the VAD engine you use.


🙏 Acknowledgements

This project is based on the following excellent open-source projects:


📞 Contact & Support

If you have any questions or suggestions, feel free to:


⭐ Star History

If this project helps you, please give it a Star ⭐️!



Built with ❤️ by quyangminddock
