WhisperX Integration for vLLM
Production-ready WhisperX implementation for vLLM, adding forced alignment and speaker diarization to Whisper models.
Features
- Forced Alignment: Word-level timestamps with Wav2Vec2
- Speaker Diarization: Multi-speaker identification with pyannote
- High Performance: Optimized for NVIDIA H100/H200 GPUs
- Audio Chunking: Support for audio files of any length
- Multi-Language: Support for 30+ languages
- Production-Ready: Battle-tested on production workloads
Quick Start
Installation
# Clone the repository
git clone https://github.com/your-username/whispervllm.git
cd whispervllm
# Install vLLM with WhisperX
cd vllm
pip install -e .
pip install -r requirements-whisperx.txt
# For speaker diarization (optional)
export HF_TOKEN=your_huggingface_token
Basic Usage
from vllm import LLM
from vllm.model_executor.models.whisperx_pipeline import create_whisperx_pipeline
# Initialize model
llm = LLM(model="openai/whisper-large-v3", trust_remote_code=True)
# Access the underlying model instance from the engine internals
model = llm.llm_engine.model_executor.driver_worker.model_runner.model
# Create pipeline with alignment
pipeline = create_whisperx_pipeline(
    model=model,
    enable_alignment=True,
    enable_diarization=False,
    language="en",
)
# Transcribe with word-level timestamps
result = pipeline.transcribe("audio.wav")
print(f"Transcription: {result['text']}")
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")
With Speaker Diarization
# Enable diarization
pipeline = create_whisperx_pipeline(
    model=model,
    enable_alignment=True,
    enable_diarization=True,
    language="en",
    min_speakers=2,
    max_speakers=5,
)
# Transcribe multi-speaker audio
result = pipeline.transcribe("meeting.wav")
for segment in result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")
    print(f"[{speaker}] {segment['text']}")
Architecture
WhisperX extends OpenAI's Whisper with:
- Audio Chunking: Automatically splits long audio into 30-second chunks with 5-second overlap (see the sketch after the diagram below)
- Transcription: Uses Whisper encoder-decoder for text generation
- Forced Alignment: Wav2Vec2-based alignment for word-level timestamps
- Speaker Diarization: pyannote.audio for speaker identification
Audio → Chunking → Whisper → Alignment → Diarization → Output
                    (vLLM)    (Wav2Vec2)  (pyannote)
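Below is a minimal sketch of the chunking step, not the actual implementation: it assumes a 16 kHz mono NumPy waveform and simply slides a 30-second window with a 5-second overlap, as described above.

import numpy as np

SAMPLE_RATE = 16_000      # Whisper expects 16 kHz mono audio
CHUNK_SECONDS = 30        # window length
OVERLAP_SECONDS = 5       # overlap between consecutive windows

def chunk_waveform(waveform: np.ndarray):
    """Yield (start_time_s, chunk) pairs covering the whole waveform."""
    chunk_len = CHUNK_SECONDS * SAMPLE_RATE
    step = (CHUNK_SECONDS - OVERLAP_SECONDS) * SAMPLE_RATE
    for start in range(0, len(waveform), step):
        yield start / SAMPLE_RATE, waveform[start:start + chunk_len]
        if start + chunk_len >= len(waveform):
            break

The overlap gives the aligner context at chunk boundaries; the real pipeline also has to merge the overlapping transcripts, which is omitted here.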
Performance
Benchmarks on an NVIDIA H200 GPU:
| Model | Features | Throughput (× real-time) | RTF* | Memory |
|---|---|---|---|---|
| Whisper Large-v3 | Basic | 40x | 0.025 | 6GB |
| WhisperX | + Alignment | 37x | 0.027 | 8GB |
| WhisperX | + Alignment + Diarization | 30x | 0.033 | 12GB |
*RTF = Real-Time Factor (lower is better, 1.0 = real-time)
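RTF is processing time divided by audio duration, so the 40x throughput above corresponds to an RTF of 1/40 = 0.025. A small helper for reproducing the measurement with your own pipeline and audio (the duration is passed in rather than probed from the file):

import time

def measure_rtf(pipeline, audio_path: str, audio_duration_s: float):
    """Return (RTF, throughput) for a single transcription run."""
    start = time.perf_counter()
    pipeline.transcribe(audio_path)
    elapsed = time.perf_counter() - start
    rtf = elapsed / audio_duration_s   # lower is better, 1.0 = real-time
    return rtf, 1.0 / rtf              # throughput is the inverse, e.g. 40x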
Documentation
- Integration Guide - Architecture and implementation details
- Usage Guide - How to use WhisperX features
- API Reference - Complete API documentation
- Deployment Guide - Production deployment
Examples
Complete examples in vllm/examples/offline_inference/:
- whisperx_basic.py - Basic transcription
- whisperx_alignment.py - With word-level timestamps
- whisperx_diarization.py - With speaker labels
- whisperx_batch.py - Batch processing (sketched below)
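Batch processing amounts to reusing one pipeline instance across many files so the models are loaded only once. The following is an illustrative sketch in the spirit of whisperx_batch.py (the directory name and output format are placeholders), reusing the pipeline created in the Quick Start:

from pathlib import Path

audio_files = sorted(Path("recordings").glob("*.wav"))  # placeholder directory
for audio_path in audio_files:
    result = pipeline.transcribe(str(audio_path))
    # Write the plain-text transcript next to the audio file
    audio_path.with_suffix(".txt").write_text(result["text"])
    print(f"{audio_path.name}: {len(result['segments'])} segments")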
Supported Languages
WhisperX supports 30+ languages including:
- English (en), Spanish (es), French (fr), German (de)
- Chinese (zh), Japanese (ja), Korean (ko)
- Portuguese (pt), Russian (ru), Arabic (ar)
- And many more...
See the full list.
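Switching languages only requires passing the corresponding language code when creating the pipeline; alignment then relies on a language-specific Wav2Vec2 model (following the usual WhisperX convention). A short example for Spanish, assuming the same model object as in the Quick Start:

# Spanish pipeline; the language code selects the matching alignment model
pipeline_es = create_whisperx_pipeline(
    model=model,
    enable_alignment=True,
    language="es",
)
result = pipeline_es.transcribe("entrevista.wav")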
Requirements
Hardware
- Minimum: NVIDIA GPU with 24GB VRAM (e.g., RTX 4090)
- Recommended: NVIDIA H100/H200 with 80GB VRAM
Software
- Python 3.10+
- CUDA 12.1+
- PyTorch 2.0+
- vLLM 0.6+
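A quick sanity check that the software requirements are met before installing the WhisperX extras:

import sys
import torch
import vllm

print(f"Python : {sys.version.split()[0]}")   # needs 3.10+
print(f"PyTorch: {torch.__version__}")        # needs 2.0+
print(f"vLLM   : {vllm.__version__}")         # needs 0.6+
print(f"CUDA   : {torch.version.cuda}, GPU available: {torch.cuda.is_available()}")  # needs 12.1+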
Installation Details
Core Dependencies
pip install vllm torch transformers
Audio Processing
pip install librosa soundfile ffmpeg-python
Speaker Diarization
pip install pyannote.audio
# Set up authentication
export HF_TOKEN=your_token
# Accept terms at:
# - https://huggingface.co/pyannote/speaker-diarization-3.1
# - https://huggingface.co/pyannote/segmentation-3.0
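To verify that the token and accepted terms actually grant access to the gated models, you can load the diarization pipeline directly with pyannote.audio (a sanity check only; the WhisperX pipeline loads it for you when enable_diarization=True):

import os
from pyannote.audio import Pipeline

# Fails with an authorization error if the token is missing or the model
# terms have not been accepted on the Hugging Face Hub.
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],
)
print("pyannote diarization pipeline loaded successfully")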
Project Structure
whispervllm/
├── vllm/
│   ├── vllm/model_executor/models/
│   │   ├── whisperx.py               # Main model implementation
│   │   ├── whisperx_alignment.py     # Forced alignment module
│   │   ├── whisperx_diarization.py   # Speaker diarization module
│   │   ├── whisperx_audio.py         # Audio preprocessing
│   │   └── whisperx_pipeline.py      # Complete pipeline
│   ├── examples/offline_inference/
│   │   ├── whisperx_basic.py
│   │   ├── whisperx_alignment.py
│   │   ├── whisperx_diarization.py
│   │   └── whisperx_batch.py
│   ├── docs/
│   │   ├── whisperx_integration.md
│   │   ├── whisperx_usage.md
│   │   ├── whisperx_api.md
│   │   └── whisperx_deployment.md
│   └── requirements-whisperx.txt
├── whisperX/                         # Reference implementation
└── README.md
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Roadmap
- Core transcription with vLLM
- Forced alignment with Wav2Vec2
- Speaker diarization with pyannote
- Audio chunking for long files
- Multi-GPU support
- Production deployment guides
- OpenAI-compatible API server
- Streaming transcription
- Real-time processing
- More language models
Troubleshooting
Issue: Alignment model not found
# Specify custom alignment model
config = WhisperXConfig(alignment_model="facebook/wav2vec2-large-xlsr-53")
Issue: CUDA out of memory
# Use float16
config = WhisperXConfig(compute_type="float16")
# Cleanup after processing
pipeline.cleanup()
Issue: Diarization requires authentication
# Set HuggingFace token
export HF_TOKEN=your_token_here
# or
huggingface-cli login
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Acknowledgments
- OpenAI Whisper - Original Whisper model
- WhisperX - Inspiration and reference implementation
- vLLM - High-performance inference engine
- pyannote.audio - Speaker diarization
Citations
@misc{whisperx2023,
  title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
  author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
  year={2023},
  url={https://github.com/m-bain/whisperX}
}

@misc{vllm2023,
  title={vLLM: Easy, Fast, and Cheap LLM Serving},
  author={Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and others},
  year={2023},
  url={https://github.com/vllm-project/vllm}
}
Contact
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Star History
If you find this project useful, please consider giving it a star ⭐
Made with ❤️ for the speech recognition community