
Chatterbox Demo - Modern Web Interface

[Image: Chatterbox Demo Banner]

A modern, sci-fi themed web interface for the open-source Chatterbox TTS models by Resemble AI. Features a futuristic "camera wall" background with live webcam integration and a floating glassmorphism control panel.

✨ Features

  • 🎭 Dual Model Support: Switch between Chatterbox-Turbo (English, fast) and Chatterbox-Multilingual (23 languages)
  • 🌍 Multi-language: Support for Chinese, Japanese, Korean, Spanish, French, German, Arabic, and more
  • 🎨 Sci-Fi UI: Dynamic camera grid background with live webcam feeds and CCTV-style effects
  • 🎙️ Voice Cloning: Upload 5-10 second audio samples to clone any voice
  • 🎵 Audio Preview: Listen to reference audio before generation
  • 🏷️ Paralinguistic Tags: Use [laugh], [cough], [sigh] for expressive speech
  • 💎 Glassmorphism Design: Modern, sleek floating interface with neon accents

📋 Prerequisites

  • Python 3.10+
  • Node.js 18+ and npm
  • 8GB+ RAM (16GB+ recommended for multilingual model)
  • Hugging Face account and token (for gated models)

🚀 Quick Start

1. Clone Repository

    git clone https://github.com/quyangminddock/chatterbox_demo.git
    cd chatterbox_demo

2. Backend Setup

    # Install Python dependencies
    pip install -e .

    # Additional dependencies for API
    pip install fastapi uvicorn python-multipart librosa soundfile

    # Set your Hugging Face token (get one from https://huggingface.co/settings/tokens)
    export HF_TOKEN=your_huggingface_token_here

    # Start the API server (loads both Turbo and Multilingual models)
    python api.py

The server will start on http://localhost:8000. Model loading takes 2-3 minutes on first run.
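
Because model loading can take a few minutes, it can help to wait until the backend port accepts connections before opening the UI. The sketch below uses only the Python standard library; the helper name `wait_for_port` and its default timings are assumptions for illustration and are not part of this repository. An open port only shows that the server is listening, not that every model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Return True once host:port accepts a TCP connection, False after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 8000)` blocks for up to five minutes and returns `True` as soon as the API server is reachable.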

3. Frontend Setup

    cd ui

    # Install dependencies
    npm install

    # Start development server
    npm run dev

Open http://localhost:3000 in your browser.

🎯 Usage

  1. Select Model: Choose between Turbo (English only, faster) or Multilingual (23 languages)
  2. Choose Language: If using Multilingual, select your target language (e.g., "中文" for Chinese)
  3. Enter Text: Type or paste the text you want to synthesize
  4. Upload Audio (Optional): Upload a 5-10 second clear audio sample for voice cloning
  5. Preview Audio: Listen to your uploaded reference before generation
  6. Generate: Click "INITIATE_CLONE" to generate speech
  7. Play: Listen to the generated audio directly in the interface
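
When preparing text for step 3, the paralinguistic tags listed under Features are written inline in square brackets. The following is a minimal sketch for spotting them before submitting text. The set `KNOWN_TAGS` contains only the three tags named in this README (the models may accept others), and the helper `find_tags` is hypothetical, not part of the demo.

```python
import re

# Tags named in this README; treat the set as illustrative, not exhaustive.
KNOWN_TAGS = {"laugh", "cough", "sigh"}
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def find_tags(text: str) -> tuple[list[str], list[str]]:
    """Split bracketed tags found in text into (known, unknown) lists, in order of appearance."""
    known: list[str] = []
    unknown: list[str] = []
    for name in TAG_PATTERN.findall(text):
        (known if name in KNOWN_TAGS else unknown).append(name)
    return known, unknown
```

Calling `find_tags("That was close [sigh] but funny [laugh]")` returns the two tags as known, which is a quick way to catch typos such as `[laguh]` before generating audio.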

Supported Languages (Multilingual Model)

Arabic, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Turkish

📁 Project Structure

chatterbox_demo/
├── api.py                       # Unified FastAPI server (both models)
├── api_multilingual.py          # Standalone multilingual API (optional)
├── ui/                          # Next.js frontend
│   ├── components/
│   │   ├── CameraBackground.tsx # Surveillance-style grid background
│   │   └── FloatingHUD.tsx      # Main control interface
│   ├── app/
│   │   └── globals.css          # Sci-fi theme styling
│   └── public/                  # CCTV images and assets
├── src/chatterbox/              # Modified Chatterbox source (dtype fixes)
└── README.md

🔧 Technical Details

Backend (FastAPI)

  • Loads both Turbo and Multilingual models on startup
  • CPU-optimized with float32 dtype fixes for compatibility
  • Audio sanitization using librosa for consistent format
  • CORS enabled for local development

Frontend (Next.js + TypeScript)

  • React with TypeScript and Tailwind CSS
  • Framer Motion for smooth animations
  • Live webcam integration in corner cells
  • Model and language selection dropdowns
  • Audio preview and playback

Key Fixes Applied

  1. Float32 Dtype Consistency: Modified tts_turbo.py and mtl_tts.py to ensure float32 throughout the pipeline
  2. S3Tokenizer Fix: Added a .float() cast in the mel spectrogram computation
  3. CPU Map Location: Added map_location='cpu' for model loading on non-CUDA devices
  4. Librosa Audio Loading: Switched from torchaudio to librosa for consistent audio handling
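
The general pattern behind fixes 1 to 3 can be sketched in a few lines of PyTorch. This is not the code from tts_turbo.py or mtl_tts.py: the helper names `load_checkpoint_cpu` and `as_float32` and the in-memory checkpoint are assumptions made purely for illustration.

```python
import io

import torch


def load_checkpoint_cpu(source):
    """Load a checkpoint so that tensors saved on a GPU are mapped onto the CPU."""
    return torch.load(source, map_location="cpu")


def as_float32(state: dict) -> dict:
    """Cast floating-point tensors to float32; leave integer tensors and other values unchanged."""
    return {
        key: value.float() if torch.is_tensor(value) and value.is_floating_point() else value
        for key, value in state.items()
    }


# Build a toy checkpoint in memory with mixed dtypes, then load and normalize it.
buffer = io.BytesIO()
torch.save({"weight": torch.ones(2, 2, dtype=torch.float64), "steps": torch.tensor([3])}, buffer)
buffer.seek(0)
state = as_float32(load_checkpoint_cpu(buffer))
```

Keeping every floating-point tensor in one dtype avoids the mixed-precision mismatch errors that otherwise tend to surface deep inside a model's forward pass on CPU.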

📸 Screenshots

[Screenshot: UI Demo]

🐛 Troubleshooting

Models not loading

  • Multilingual model fails: Ensure you have enough RAM (16GB+)
  • Token errors: Verify your HF_TOKEN is set and has access to gated models
  • map_location errors: Make sure you're using the modified source files with CPU fixes

Voice cloning not working

  • Use clear, 5-10 second audio samples
  • WAV format recommended
  • Avoid background noise in reference audio
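
The duration and format guidance above can be checked before uploading. Below is a minimal sketch using Python's standard `wave` module; the helper names and the returned messages are assumptions for illustration, and the check covers only WAV readability and clip length, not background noise.

```python
import wave

MIN_SECONDS = 5.0   # lower bound suggested in this README
MAX_SECONDS = 10.0  # upper bound suggested in this README


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path: str) -> str:
    """Return "ok" or a short description of why the clip may not suit voice cloning."""
    try:
        duration = wav_duration_seconds(path)
    except (wave.Error, EOFError):
        return "not a readable PCM WAV file"
    if duration < MIN_SECONDS:
        return f"too short ({duration:.1f}s)"
    if duration > MAX_SECONDS:
        return f"too long ({duration:.1f}s)"
    return "ok"
```

For example, `check_reference_clip("my_voice.wav")` (a hypothetical file name) returns `"ok"` for a 7-second WAV recording and a short explanation otherwise.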

UI issues

  • Check that both backend (port 8000) and frontend (port 3000) are running
  • Clear the browser cache if the UI looks stale
  • Verify camera permissions if webcam feeds don't appear

📝 Credits

  • Chatterbox Models: Resemble AI
  • UI Design: Camera surveillance theme with sci-fi aesthetics
  • CCTV Images: Generated using AI for demonstration purposes

📄 License

This demo interface is released under the MIT License. The underlying Chatterbox models are licensed under Apache 2.0 by Resemble AI.

🤝 Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

⭐ Acknowledgments

Special thanks to Resemble AI for open-sourcing the amazing Chatterbox TTS models!