MindDockAI/VoxID


🎤 VoxID: VAD ASR Speech Recognition Server

Chinese documentation

A high-performance speech recognition service based on Sherpa-ONNX, supporting real-time VAD (Voice Activity Detection), multi-language recognition, and speaker verification.

✨ Features

🌍 Multi-language Support: Supports Chinese, English, Japanese, Korean, Cantonese, and more.
🎯 Smart Voice Detection: Built-in VAD for automatic segmentation and silence filtering.
🔊 Speaker Verification: Supports speaker registration and identification.
⚡ Real-time Communication: Low-latency real-time transmission based on WebSocket.
📊 Health Monitoring: Provides health checks and status monitoring interfaces.

📋 Requirements

Basic Requirements

OS: Linux / macOS / Windows
Go Version: 1.21 or higher
Memory: 4GB+ recommended
Disk: At least 2GB available space (for model files)

Dependencies (Linux)

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y libc++1 libc++abi1 build-essential

# CentOS/RHEL
sudo yum install -y libcxx libcxxabi gcc gcc-c++

🚀 Quick Start

1. Clone the Project

git clone https://github.com/quyangminddock/VoxID.git
cd VoxID

2. Install Go Dependencies

go mod download

3. Download Model Files

⚠️ IMPORTANT: You must manually download the following model files for the project to run.

3.1 Download ASR Model (Required)

Option A: Using wget

# Create directory
mkdir -p models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17

# Download model file
wget -O models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx \
  https://huggingface.co/csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/resolve/main/model.int8.onnx

# Download tokens file
wget -O models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt \
  https://huggingface.co/csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/resolve/main/tokens.txt

Option B: Using git-lfs (Requires git-lfs installed)

# Install git-lfs
sudo apt-get install git-lfs  # Ubuntu/Debian
# OR
brew install git-lfs          # macOS

# Initialize git-lfs
git lfs install

# Clone model repository
git clone https://huggingface.co/csukuangfj/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17 \
  models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17

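Mainland China Mirror Acceleration

If HuggingFace downloads are slow, you can use a mirror site:

# Use HF-Mirror (HF_ENDPOINT is read by HuggingFace tools such as huggingface-cli)
export HF_ENDPOINT=https://hf-mirror.com

# wget and git do not read HF_ENDPOINT; for the download commands above,
# replace huggingface.co with hf-mirror.com in the URL instead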

3.2 Download VAD Model (Required)

The Silero VAD model file is downloaded from the Silero VAD project repository:

mkdir -p models/vad/silero_vad

# Download silero_vad.onnx
wget -O models/vad/silero_vad/silero_vad.onnx \
  https://github.com/snakers4/silero-vad/raw/master/files/silero_vad.onnx

3.3 Download Speaker Verification Model (Optional)

If speaker verification is needed:

mkdir -p models/speaker

# Download speaker model
wget -O models/speaker/3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.onnx \
  https://huggingface.co/csukuangfj/speaker-embedding-models/resolve/main/3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.onnx

If speaker verification is not needed, you can disable it in config.json:

{
  "speaker": {
    "enabled": false,
    ...
  }
}
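Before starting the service, it can help to confirm that the files from steps 3.1 and 3.2 landed where the configuration expects them. The following is a minimal sketch, not part of VoxID: the helper names `missingFiles` and `fileExists` are hypothetical, and the path list simply repeats the locations used in the download commands above. Add the speaker model path if speaker verification is enabled.

```go
package main

import (
	"fmt"
	"os"
)

// requiredModels repeats the paths used by the download steps above.
var requiredModels = []string{
	"models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx",
	"models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt",
	"models/vad/silero_vad/silero_vad.onnx",
}

// missingFiles returns every path for which exists reports false.
// The exists callback is injected so the logic can be checked without disk access.
func missingFiles(paths []string, exists func(string) bool) []string {
	var missing []string
	for _, p := range paths {
		if !exists(p) {
			missing = append(missing, p)
		}
	}
	return missing
}

// fileExists reports whether path names an existing regular file.
func fileExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.Mode().IsRegular()
}

func main() {
	missing := missingFiles(requiredModels, fileExists)
	if len(missing) == 0 {
		fmt.Println("all required model files are present")
		return
	}
	for _, p := range missing {
		fmt.Println("missing model file:", p)
	}
}
```

Run it from the project root so the relative paths resolve the same way they do for the server.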

4. Configure Dynamic Libraries (Linux)

# Copy dynamic libraries to system directory
sudo cp lib/*.so /usr/lib/
sudo cp lib/ten-vad/lib/Linux/x64/libten_vad.so /usr/lib/

# Or set LD_LIBRARY_PATH (Recommended)
export LD_LIBRARY_PATH=$PWD/lib:$PWD/lib/ten-vad/lib/Linux/x64:$LD_LIBRARY_PATH

5. Create Necessary Directories

mkdir -p logs data/speaker

6. Run the Service

# Option A: Run directly
go run main.go

# Option B: Build and run
go build -o asr_server
./asr_server

7. Verify Service

Visit http://localhost:8080/ to view the test page, then click the "Start System" button to begin the speech recognition test. The page UI may be translated in a future update.

⚙️ Configuration

Main configuration file: config.json

VAD Configuration

The system supports two VAD engines:

Silero VAD (Default)

{
  "vad": {
    "provider": "silero_vad",
    "pool_size": 200,
    "threshold": 0.5,
    "silero_vad": {
      "model_path": "models/vad/silero_vad/silero_vad.onnx",
      "min_silence_duration": 0.1,
      "min_speech_duration": 0.25,
      "max_speech_duration": 8.0,
      "window_size": 512,
      "buffer_size_seconds": 10.0
    }
  }
}
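To get a feel for these values, the durations can be converted to sample counts. The sketch below is plain arithmetic and rests on two assumptions that the configuration itself does not state: that `window_size` is measured in samples, and that audio arrives at the 16kHz rate given in the API section. The function names are illustrative only.

```go
package main

import "fmt"

// sampleRate is assumed from the audio format given in the API section (16kHz).
const sampleRate = 16000

// windowMillis returns the duration of one VAD window in milliseconds.
func windowMillis(windowSize, rate int) float64 {
	return float64(windowSize) * 1000 / float64(rate)
}

// samplesFor returns how many samples span durationMs milliseconds.
func samplesFor(durationMs, rate int) int {
	return durationMs * rate / 1000
}

func main() {
	fmt.Printf("window_size 512: %.0f ms per window\n", windowMillis(512, sampleRate))
	fmt.Printf("min_silence_duration 0.1 s: %d samples\n", samplesFor(100, sampleRate))
	fmt.Printf("min_speech_duration 0.25 s: %d samples\n", samplesFor(250, sampleRate))
	fmt.Printf("max_speech_duration 8.0 s: %d samples\n", samplesFor(8000, sampleRate))
	buf := samplesFor(10000, sampleRate)
	fmt.Printf("buffer_size_seconds 10.0: %d samples (%d bytes as 16-bit mono PCM)\n", buf, 2*buf)
}
```

Under these assumptions one window covers 32 ms, so a `min_silence_duration` of 0.1 s corresponds to roughly three consecutive windows.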

Ten-VAD

{
  "vad": {
    "provider": "ten_vad",
    "ten_vad": {
      "hop_size": 512,
      "min_speech_frames": 12,
      "max_silence_frames": 5
    }
  }
}

ASR Configuration

{
  "recognition": {
    "model_path": "models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/model.int8.onnx",
    "tokens_path": "models/asr/sherpa-onnx-sense-voice-zh-en-ja-ko-yue-2024-07-17/tokens.txt",
    "language": "auto",
    "num_threads": 16,
    "provider": "cpu"
  }
}

Speaker Verification Configuration

{
  "speaker": {
    "enabled": true,
    "model_path": "models/speaker/3dspeaker_speech_campplus_sv_zh_en_16k-common_advanced.onnx",
    "num_threads": 8,
    "threshold": 0.6,
    "data_dir": "data/speaker"
  }
}

Server Configuration

{
  "server": {
    "port": 8080,
    "host": "0.0.0.0",
    "read_timeout": 20
  }
}

For more configuration options, please refer to the config.json file.
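As an illustration of how the sections above fit together, here is a minimal sketch that reads config.json into Go structs and prints a summary. It is not the loader VoxID uses: the `Config` type and `parseConfig` function are hypothetical, only fields shown in this README are mapped, and the provider check is this sketch's own validation rather than documented server behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Config mirrors only the fields shown in this README; the real file has more.
type Config struct {
	VAD struct {
		Provider  string  `json:"provider"`
		PoolSize  int     `json:"pool_size"`
		Threshold float64 `json:"threshold"`
	} `json:"vad"`
	Recognition struct {
		ModelPath  string `json:"model_path"`
		TokensPath string `json:"tokens_path"`
		Language   string `json:"language"`
		NumThreads int    `json:"num_threads"`
		Provider   string `json:"provider"`
	} `json:"recognition"`
	Speaker struct {
		Enabled   bool    `json:"enabled"`
		ModelPath string  `json:"model_path"`
		Threshold float64 `json:"threshold"`
	} `json:"speaker"`
	Server struct {
		Port int    `json:"port"`
		Host string `json:"host"`
	} `json:"server"`
}

// parseConfig decodes JSON and rejects VAD providers other than the two
// engines named in this README. Unknown JSON fields are ignored.
func parseConfig(data []byte) (Config, error) {
	var cfg Config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return Config{}, fmt.Errorf("parse config: %w", err)
	}
	switch cfg.VAD.Provider {
	case "silero_vad", "ten_vad":
	default:
		return Config{}, fmt.Errorf("unknown vad.provider %q", cfg.VAD.Provider)
	}
	return cfg, nil
}

func main() {
	data, err := os.ReadFile("config.json")
	if err != nil {
		fmt.Println("read config:", err)
		return
	}
	cfg, err := parseConfig(data)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("vad=%s asr_threads=%d speaker=%t listen=%s:%d\n",
		cfg.VAD.Provider, cfg.Recognition.NumThreads, cfg.Speaker.Enabled,
		cfg.Server.Host, cfg.Server.Port)
}
```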

🔌 API Usage

WebSocket API

Connect to ws://localhost:8080/ws, and send audio data (16kHz, 16-bit PCM):

const ws = new WebSocket('ws://localhost:8080/ws');

ws.onopen = () => {
  console.log('WebSocket connection established');
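  // Send audio as a binary message; audioBuffer is an ArrayBuffer of
  // 16kHz, 16-bit PCM samples that your application must supply
  ws.send(audioBuffer);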
};

ws.onmessage = (event) => {
  const result = JSON.parse(event.data);
  console.log('Recognition result:', result);
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

ws.onclose = () => {
  console.log('WebSocket connection closed');
};
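Clients that capture audio as floating-point samples need to convert it to 16-bit PCM before sending. The sketch below shows one common conversion in Go. It assumes mono audio and little-endian byte order, neither of which this README specifies, and `floatToPCM16` is a hypothetical helper rather than part of VoxID.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// floatToPCM16 converts samples in [-1, 1] to signed 16-bit little-endian PCM.
// Samples outside that range are clamped; fractional values truncate toward zero.
func floatToPCM16(samples []float32) []byte {
	out := make([]byte, 2*len(samples))
	for i, s := range samples {
		if s > 1 {
			s = 1
		} else if s < -1 {
			s = -1
		}
		v := int16(s * 32767)
		binary.LittleEndian.PutUint16(out[2*i:], uint16(v))
	}
	return out
}

func main() {
	pcm := floatToPCM16([]float32{0, 0.5, 1, -1})
	fmt.Printf("%d bytes: % x\n", len(pcm), pcm)
}
```

The resulting byte slice is what a WebSocket client would send as a binary message to the /ws endpoint.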

HTTP API

Health Check

curl http://localhost:8080/health

Status Monitoring

curl http://localhost:8080/stats

Speaker Registration

curl -X POST http://localhost:8080/api/speaker/register \
  -H "Content-Type: application/json" \
  -d '{"speaker_id": "user123", "audio_data": "..."}'

Speaker Recognition

curl -X POST http://localhost:8080/api/speaker/recognize \
  -H "Content-Type: application/json" \
  -d '{"audio_data": "..."}'
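The same registration call can be made from Go. This is a minimal sketch that only builds the request: the endpoint, header, and field names come from the curl example above, while `registerPayload`, `registerBody`, and `newRegisterRequest` are hypothetical names. The README does not say how `audio_data` is encoded, so the value is passed through unchanged and the caller must supply it in whatever form the server expects.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// registerPayload mirrors the JSON body of the curl example.
type registerPayload struct {
	SpeakerID string `json:"speaker_id"`
	AudioData string `json:"audio_data"`
}

// registerBody marshals the payload; audioData is passed through as-is.
func registerBody(speakerID, audioData string) ([]byte, error) {
	return json.Marshal(registerPayload{SpeakerID: speakerID, AudioData: audioData})
}

// newRegisterRequest builds the POST request without sending it.
func newRegisterRequest(baseURL, speakerID, audioData string) (*http.Request, error) {
	body, err := registerBody(speakerID, audioData)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/speaker/register", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// "..." stands in for the audio payload, as in the curl example.
	req, err := newRegisterRequest("http://localhost:8080", "user123", "...")
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL)
	// To send it against a running server: resp, err := http.DefaultClient.Do(req)
}
```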

🧪 Testing

The project provides test scripts to verify functionality:

Single File Test

cd test/asr
python audiofile_test.py

Concurrent Stress Test

cd test/asr
python stress_test.py --connections 100 --audio-per-connection 2

Parameter description:

--connections: Number of concurrent connections
--audio-per-connection: Number of audio files sent per connection

🏛️ System Architecture

┌────────────────────┐    ┌──────────────────────┐    ┌────────────────────┐
│   WebSocket Client  │    │      VAD Pool        │    │     ASR Module     │
│                    │    │                      │    │ (Dynamic Stream)   │
│  ┌──────────────┐  │    │  ┌──────────────┐    │    │  ┌──────────────┐  │
│  │  Audio Stream │◄─┼───►│  │   VAD Inst   │◄──┼───►│  │  Recognizer  │  │
│  └──────────────┘  │    │  └──────────────┘    │    │  └──────────────┘  │
│  ┌──────────────┐  │    │  ┌──────────────┐    │    │                  │
│  │     Result    │  │    │  │ Buffer Queue │    │    │                  │
│  └──────────────┘  │    │  └──────────────┘    │    └────────────────────┘
└────────────────────┘    └──────────────────────┘             │
                                                               ▼
┌────────────────────┐    ┌──────────────────────┐    ┌────────────────────┐
│   Session Manager  │    │    Speaker ID Module │    │   Health/Monitor   │
│  ┌──────────────┐  │    │  ┌──────────────┐    │    │                    │
│  │  Conn State  │  │    │  │ Registration │    │    │   Status Stats     │
│  └──────────────┘  │    │  └──────────────┘    │    └────────────────────┘
│  ┌──────────────┐  │    │  ┌──────────────┐    │
│  │ Res Release  │  │    │  │ Feature Ext  │    │
│  └──────────────┘  │    │  └──────────────┘    │
└────────────────────┘    └──────────────────────┘

📂 Project Structure

VoxID/
├── main.go                 # Main entry point
├── config.json             # Configuration file
├── go.mod                  # Go module definition
├── go.sum
├── internal/               # Internal packages
│   ├── bootstrap/          # App startup/init
│   ├── logger/             # Logging module
│   ├── router/             # Router config
│   └── ...
├── lib/                    # Dynamic libraries
│   └── ten-vad/
├── models/                 # Model files (Download required)
│   ├── asr/                # ASR models
│   ├── vad/                # VAD models
│   └── speaker/            # Speaker models
├── static/                 # Static assets
│   ├── index.html          # Test page
│   ├── css/
│   └── js/
├── data/                   # Data storage
│   └── speaker/            # Speaker data
├── logs/                   # Log files
└── test/                   # Test scripts
    ├── asr/
    └── speaker/

🔧 FAQ

1. Model Download Failure

Issue: Downloading from HuggingFace is slow or fails.

Solution:

Use a mirror: export HF_ENDPOINT=https://hf-mirror.com for HuggingFace tools, or replace huggingface.co with hf-mirror.com in the wget URLs.
Use a proxy.
Download manually via browser and place in the corresponding directory.

2. Dynamic Library Load Failure

Issue: Runtime error saying .so file not found.

Solution:

# Set library path
export LD_LIBRARY_PATH=$PWD/lib:$PWD/lib/ten-vad/lib/Linux/x64:$LD_LIBRARY_PATH

# Or copy to system directory
sudo cp lib/*.so /usr/lib/
sudo cp lib/ten-vad/lib/Linux/x64/libten_vad.so /usr/lib/

3. WebSocket Connection Failure

Issue: Frontend cannot connect to WebSocket.

Solution:

Check firewall settings, ensure port 8080 is open.
Check server.host in config.json.
Check logs in logs/app.log.

4. No Recognition Result

Issue: Audio sent successfully but no result returned.

Solution:

Confirm audio format: 16kHz, 16-bit PCM.
Adjust VAD parameters (threshold, min_speech_duration).
Check if audio actually contains speech.
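When the audio comes from a WAV file, the format can be checked from the file header before sending. The following sketch assumes the canonical 44-byte PCM WAV layout, where the format chunk directly follows the RIFF header; files with extra chunks need a real parser. Both `checkWAVHeader` and `makeHeader` are hypothetical helpers written for this example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// makeHeader builds a canonical 44-byte PCM WAV header with zero data bytes.
// It exists only so the check below can be demonstrated without a file.
func makeHeader(sampleRate uint32, bits, channels uint16) []byte {
	h := make([]byte, 44)
	copy(h[0:], "RIFF")
	binary.LittleEndian.PutUint32(h[4:], 36) // RIFF chunk size with no audio data
	copy(h[8:], "WAVE")
	copy(h[12:], "fmt ")
	binary.LittleEndian.PutUint32(h[16:], 16) // size of the PCM format chunk
	binary.LittleEndian.PutUint16(h[20:], 1)  // audio format 1 means PCM
	binary.LittleEndian.PutUint16(h[22:], channels)
	binary.LittleEndian.PutUint32(h[24:], sampleRate)
	blockAlign := channels * bits / 8
	binary.LittleEndian.PutUint32(h[28:], sampleRate*uint32(blockAlign)) // byte rate
	binary.LittleEndian.PutUint16(h[32:], blockAlign)
	binary.LittleEndian.PutUint16(h[34:], bits)
	copy(h[36:], "data")
	return h
}

// checkWAVHeader returns nil when the header describes 16kHz, 16-bit PCM.
func checkWAVHeader(h []byte) error {
	if len(h) < 44 || string(h[0:4]) != "RIFF" || string(h[8:12]) != "WAVE" || string(h[12:16]) != "fmt " {
		return errors.New("not a canonical RIFF/WAVE header")
	}
	if f := binary.LittleEndian.Uint16(h[20:]); f != 1 {
		return fmt.Errorf("audio format %d is not PCM", f)
	}
	if r := binary.LittleEndian.Uint32(h[24:]); r != 16000 {
		return fmt.Errorf("sample rate is %d Hz, expected 16000", r)
	}
	if b := binary.LittleEndian.Uint16(h[34:]); b != 16 {
		return fmt.Errorf("bit depth is %d, expected 16", b)
	}
	return nil
}

func main() {
	fmt.Println("16kHz/16-bit:", checkWAVHeader(makeHeader(16000, 16, 1)))
	fmt.Println("44.1kHz/16-bit:", checkWAVHeader(makeHeader(44100, 16, 1)))
}
```

To check a real file, read its first 44 bytes and pass them to `checkWAVHeader`.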

5. High Memory Usage

Issue: Memory usage is high after running for a while.

Solution:

Adjust vad.pool_size.
Reduce pool.worker_count.
Enable rate_limit to limit concurrent connections.

📊 Performance Tuning

Key Parameters

| Parameter | Description | Recommended | Impact |
| --- | --- | --- | --- |
| `vad.pool_size` | VAD instance pool size | 200 | Affects concurrency capacity |
| `recognition.num_threads` | ASR threads | 8-16 | Affects recognition speed |
| `pool.worker_count` | Worker goroutines | 500 | Affects max connections |
| `vad.threshold` | VAD threshold | 0.5 | Affects detection sensitivity |
| `speaker.threshold` | Speaker similarity threshold | 0.6 | Affects identification accuracy |

Optimization Tips

CPU Optimization: Adjust num_threads based on core count.
Memory Optimization: Reduce pool_size and worker_count.
Latency Optimization: Use ten_vad instead of silero_vad.
Concurrency Optimization: Enable rate_limit to prevent overload.
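For the CPU tip, one possible starting point is to match `num_threads` to the number of logical cores, capped at the upper end of the recommended range in the table above. This is a rough heuristic offered as a sketch, not a rule from the project, and `suggestThreads` is a hypothetical helper; measure on your own hardware before settling on a value.

```go
package main

import (
	"fmt"
	"runtime"
)

// maxThreads is the upper end of the recommended num_threads range above.
const maxThreads = 16

// suggestThreads clamps the core count to the range [1, maxThreads].
func suggestThreads(cores int) int {
	if cores < 1 {
		return 1
	}
	if cores > maxThreads {
		return maxThreads
	}
	return cores
}

func main() {
	cores := runtime.NumCPU()
	fmt.Printf("logical cores: %d, suggested recognition.num_threads: %d\n",
		cores, suggestThreads(cores))
}
```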

🤝 Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📄 License

This project is licensed under the MIT License. However, please note:

If using ten-vad (vad.provider set to ten_vad), you must comply with ten-vad's License.
If using only silero-vad (vad.provider set to silero_vad), you can follow the MIT License directly.

Please comply with the corresponding open source license based on the VAD engine used.

🙏 Acknowledgements

This project is based on the following excellent open-source projects:

Sherpa-ONNX - Core Speech Recognition Engine
SenseVoice - Multi-language Speech Recognition Model
Silero VAD - Voice Activity Detection Model
ten-vad - Efficient Endpoint Detection Algorithm
3D-Speaker - Speaker Verification Model

📞 Contact & Support

If you have any questions or suggestions, feel free to:

📝 Submit an Issue
💬 Participate in Discussions
📧 Send an email: bbeyond.llove@gmail.com

⭐ Star History

If this project helps you, please give it a Star ⭐️!

Built with ❤️ by quyangminddock
