WhisperX Integration for vLLM

License: Apache 2.0 | Python 3.10+ | CUDA 12.1+

Production-ready WhisperX implementation for vLLM, adding forced alignment and speaker diarization to Whisper models.

Features

🎯 Forced Alignment: Word-level timestamps with Wav2Vec2
👥 Speaker Diarization: Multi-speaker identification with pyannote
⚡ High Performance: Optimized for NVIDIA H100/H200 GPUs
📦 Audio Chunking: Support for audio files of any length
🌍 Multi-Language: Support for 30+ languages
🔧 Production-Ready: Battle-tested on production workloads

Quick Start

Installation

# Clone the repository
git clone https://github.com/your-username/whispervllm.git
cd whispervllm

# Install vLLM with WhisperX
cd vllm
pip install -e .
pip install -r requirements-whisperx.txt

# For speaker diarization (optional)
export HF_TOKEN=your_huggingface_token

Basic Usage

from vllm import LLM
from vllm.model_executor.models.whisperx_pipeline import create_whisperx_pipeline

# Initialize model
llm = LLM(model="openai/whisper-large-v3", trust_remote_code=True)
# Access the underlying Whisper model instance from the vLLM engine
model = llm.llm_engine.model_executor.driver_worker.model_runner.model

# Create pipeline with alignment
pipeline = create_whisperx_pipeline(
    model=model,
    enable_alignment=True,
    enable_diarization=False,
    language="en"
)

# Transcribe with word-level timestamps
result = pipeline.transcribe("audio.wav")

print(f"Transcription: {result['text']}")
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")
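
The word-level output above is enough to build subtitle files directly. Below is a minimal sketch (not a helper shipped with this project) that turns the result dict into an SRT file, using only the fields shown: segment text and per-word start/end times.

def to_srt_time(seconds: float) -> str:
    # Format seconds as an SRT timestamp: HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(result: dict, path: str) -> None:
    # One SRT entry per segment, timed from its first and last aligned word.
    with open(path, "w", encoding="utf-8") as f:
        for i, segment in enumerate(result["segments"], start=1):
            words = segment.get("words") or []
            if not words:
                continue
            start = to_srt_time(words[0]["start"])
            end = to_srt_time(words[-1]["end"])
            f.write(f"{i}\n{start} --> {end}\n{segment['text'].strip()}\n\n")

write_srt(result, "audio.srt")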

With Speaker Diarization

# Enable diarization
pipeline = create_whisperx_pipeline(
    model=model,
    enable_alignment=True,
    enable_diarization=True,
    language="en",
    min_speakers=2,
    max_speakers=5
)

# Transcribe multi-speaker audio
result = pipeline.transcribe("meeting.wav")

for segment in result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")
    print(f"[{speaker}] {segment['text']}")
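
The same segment structure makes it easy to summarize how long each speaker talks. A small sketch that relies only on the speaker label and word timings shown above:

from collections import defaultdict

# Total speaking time per speaker, derived from the aligned word timestamps.
talk_time = defaultdict(float)
for segment in result["segments"]:
    words = segment.get("words") or []
    if words:
        speaker = segment.get("speaker", "UNKNOWN")
        talk_time[speaker] += words[-1]["end"] - words[0]["start"]

for speaker, seconds in sorted(talk_time.items(), key=lambda kv: -kv[1]):
    print(f"{speaker}: {seconds:.1f}s")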

Architecture

WhisperX extends OpenAI's Whisper with:

  1. Audio Chunking: Automatically splits long audio into 30-second chunks with 5-second overlap (sketched below)
  2. Transcription: Uses Whisper encoder-decoder for text generation
  3. Forced Alignment: Wav2Vec2-based alignment for word-level timestamps
  4. Speaker Diarization: pyannote.audio for speaker identification

Audio → Chunking → Whisper → Alignment → Diarization → Output
                   (vLLM)    (Wav2Vec2)  (pyannote)
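
For intuition, the chunking stage is a sliding window over the waveform. The sketch below only illustrates the 30-second window / 5-second overlap scheme described above; the project's own preprocessing lives in whisperx_audio.py.

import numpy as np

def chunk_audio(audio: np.ndarray, sr: int = 16_000,
                chunk_s: float = 30.0, overlap_s: float = 5.0):
    # Split audio into overlapping windows, keeping each window's offset (in
    # seconds) so word timestamps can be shifted back to absolute time later.
    chunk = int(chunk_s * sr)
    step = int((chunk_s - overlap_s) * sr)
    chunks, start = [], 0
    while start < len(audio):
        end = min(start + chunk, len(audio))
        chunks.append((start / sr, audio[start:end]))
        if end == len(audio):
            break
        start += step
    return chunks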

Performance

Benchmarks on an NVIDIA H200 GPU:

Model               Features                      Throughput   RTF*    Memory
Whisper Large-v3    Basic                         40x          0.025   6GB
WhisperX            + Alignment                   37x          0.027   8GB
WhisperX            + Alignment + Diarization     30x          0.033   12GB

*RTF = Real-Time Factor (lower is better, 1.0 = real-time)
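
RTF is processing time divided by audio duration, and throughput is its inverse. A quick way to sanity-check the numbers on your own hardware, assuming the pipeline from the Quick Start and librosa (already an audio dependency) for reading the duration:

import time
import librosa

audio_duration = librosa.get_duration(path="audio.wav")  # seconds of audio
t0 = time.perf_counter()
result = pipeline.transcribe("audio.wav")
elapsed = time.perf_counter() - t0

rtf = elapsed / audio_duration          # lower is better; 1.0 = real-time
throughput = audio_duration / elapsed   # e.g. 40x = 40 s of audio per wall-clock second
print(f"RTF: {rtf:.3f}  Throughput: {throughput:.0f}x")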

Documentation

Guides are in vllm/docs/:

  • whisperx_integration.md
  • whisperx_usage.md
  • whisperx_api.md
  • whisperx_deployment.md

Examples

Complete examples are in vllm/examples/offline_inference/:

  • whisperx_basic.py
  • whisperx_alignment.py
  • whisperx_diarization.py
  • whisperx_batch.py

Supported Languages

WhisperX supports 30+ languages including:

  • English (en), Spanish (es), French (fr), German (de)
  • Chinese (zh), Japanese (ja), Korean (ko)
  • Portuguese (pt), Russian (ru), Arabic (ar)
  • And many more...

See the documentation for the full list of language codes.
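
Switching languages only requires a different language code in the same factory call used in the Quick Start, for example Spanish:

# Same pipeline factory as in the Quick Start, but for Spanish audio.
pipeline_es = create_whisperx_pipeline(
    model=model,
    enable_alignment=True,
    enable_diarization=False,
    language="es",
)
result_es = pipeline_es.transcribe("audio_es.wav")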

Requirements

Hardware

  • Minimum: NVIDIA GPU with 24GB VRAM (e.g., RTX 4090)
  • Recommended: NVIDIA H100 (80GB) or H200 (141GB)

Software

  • Python 3.10+
  • CUDA 12.1+
  • PyTorch 2.0+
  • vLLM 0.6+

Installation Details

Core Dependencies

pip install vllm torch transformers

Audio Processing

pip install librosa soundfile ffmpeg-python

Speaker Diarization

pip install pyannote.audio

# Set up authentication
export HF_TOKEN=your_token
# Accept terms at:
# - https://huggingface.co/pyannote/speaker-diarization-3.1
# - https://huggingface.co/pyannote/segmentation-3.0

Project Structure

whispervllm/
├── vllm/
│   ├── vllm/model_executor/models/
│   │   ├── whisperx.py                 # Main model implementation
│   │   ├── whisperx_alignment.py       # Forced alignment module
│   │   ├── whisperx_diarization.py     # Speaker diarization module
│   │   ├── whisperx_audio.py           # Audio preprocessing
│   │   └── whisperx_pipeline.py        # Complete pipeline
│   ├── examples/offline_inference/
│   │   ├── whisperx_basic.py
│   │   ├── whisperx_alignment.py
│   │   ├── whisperx_diarization.py
│   │   └── whisperx_batch.py
│   ├── docs/
│   │   ├── whisperx_integration.md
│   │   ├── whisperx_usage.md
│   │   ├── whisperx_api.md
│   │   └── whisperx_deployment.md
│   └── requirements-whisperx.txt
├── whisperX/                           # Reference implementation
└── README.md

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Roadmap

  • Core transcription with vLLM
  • Forced alignment with Wav2Vec2
  • Speaker diarization with pyannote
  • Audio chunking for long files
  • Multi-GPU support
  • Production deployment guides
  • OpenAI-compatible API server
  • Streaming transcription
  • Real-time processing
  • More language models

Troubleshooting

Issue: Alignment model not found

# Specify custom alignment model
config = WhisperXConfig(alignment_model="facebook/wav2vec2-large-xlsr-53")

Issue: CUDA out of memory

# Use float16
config = WhisperXConfig(compute_type="float16")

# Cleanup after processing
pipeline.cleanup()
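
When transcribing many files back to back, release the pipeline's models once at the end rather than per file. A sketch, assuming cleanup() frees the alignment and diarization models as shown above:

import torch

try:
    for path in ["a.wav", "b.wav", "c.wav"]:
        result = pipeline.transcribe(path)
        print(path, result["text"][:60])
finally:
    pipeline.cleanup()          # release alignment / diarization models
    torch.cuda.empty_cache()    # return cached blocks to the CUDA allocator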

Issue: Diarization requires authentication

# Set HuggingFace token
export HF_TOKEN=your_token_here
# or
huggingface-cli login

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

This integration builds on WhisperX, vLLM, Wav2Vec2, and pyannote.audio.

Citations

@misc{whisperx2023,
  title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
  author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
  year={2023},
  url={https://github.com/m-bain/whisperX}
}

@misc{vllm2023,
  title={vLLM: Easy, Fast, and Cheap LLM Serving},
  author={Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and others},
  year={2023},
  url={https://github.com/vllm-project/vllm}
}

Contact

Star History

If you find this project useful, please consider giving it a star ⭐


Made with ❤️ for the speech recognition community
