WhisperX Integration for vLLM

License: Apache 2.0 | Python 3.10+ | CUDA 12.1+

Production-ready WhisperX implementation for vLLM, adding forced alignment and speaker diarization to Whisper models.

Features

🎯 Forced Alignment: Word-level timestamps with Wav2Vec2
👥 Speaker Diarization: Multi-speaker identification with pyannote
⚡ High Performance: Optimized for NVIDIA H100/H200 GPUs
📦 Audio Chunking: Support for audio files of any length
🌍 Multi-Language: Support for 30+ languages
🔧 Production-Ready: Battle-tested on production workloads

Quick Start

Installation

# Clone the repository
git clone https://github.com/your-username/whispervllm.git
cd whispervllm

# Install vLLM with WhisperX
cd vllm
pip install -e .
pip install -r requirements-whisperx.txt

# For speaker diarization (optional)
export HF_TOKEN=your_huggingface_token

Basic Usage

from vllm import LLM
from vllm.model_executor.models.whisperx_pipeline import create_whisperx_pipeline

# Initialize model
llm = LLM(model="openai/whisper-large-v3", trust_remote_code=True)
# Access the underlying Whisper model instance from the vLLM engine
model = llm.llm_engine.model_executor.driver_worker.model_runner.model

# Create pipeline with alignment
pipeline = create_whisperx_pipeline(
    model=model,
    enable_alignment=True,
    enable_diarization=False,
    language="en"
)

# Transcribe with word-level timestamps
result = pipeline.transcribe("audio.wav")

print(f"Transcription: {result['text']}")
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word']} [{word['start']:.2f}s - {word['end']:.2f}s]")
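
The word-level output above is enough to build subtitle files directly. Below is a minimal sketch (not a helper shipped with this project) that turns the result dict into an SRT file, using only the fields shown: segment text and per-word start/end times.

def to_srt_time(seconds: float) -> str:
    # Format seconds as an SRT timestamp: HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(result: dict, path: str) -> None:
    # One SRT entry per segment, timed from its first and last aligned word.
    with open(path, "w", encoding="utf-8") as f:
        for i, segment in enumerate(result["segments"], start=1):
            words = segment.get("words") or []
            if not words:
                continue
            start = to_srt_time(words[0]["start"])
            end = to_srt_time(words[-1]["end"])
            f.write(f"{i}\n{start} --> {end}\n{segment['text'].strip()}\n\n")

write_srt(result, "audio.srt")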

With Speaker Diarization

# Enable diarization
pipeline = create_whisperx_pipeline(
    model=model,
    enable_alignment=True,
    enable_diarization=True,
    language="en",
    min_speakers=2,
    max_speakers=5
)

# Transcribe multi-speaker audio
result = pipeline.transcribe("meeting.wav")

for segment in result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")
    print(f"[{speaker}] {segment['text']}")
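
The same segment structure makes it easy to summarize how long each speaker talks. A small sketch that relies only on the speaker label and word timings shown above:

from collections import defaultdict

# Total speaking time per speaker, derived from the aligned word timestamps.
talk_time = defaultdict(float)
for segment in result["segments"]:
    words = segment.get("words") or []
    if words:
        speaker = segment.get("speaker", "UNKNOWN")
        talk_time[speaker] += words[-1]["end"] - words[0]["start"]

for speaker, seconds in sorted(talk_time.items(), key=lambda kv: -kv[1]):
    print(f"{speaker}: {seconds:.1f}s")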

Architecture

WhisperX extends OpenAI's Whisper with:

  1. Audio Chunking: Automatically splits long audio into 30-second chunks with 5-second overlap (sketched below)
  2. Transcription: Uses Whisper encoder-decoder for text generation
  3. Forced Alignment: Wav2Vec2-based alignment for word-level timestamps
  4. Speaker Diarization: pyannote.audio for speaker identification

Audio → Chunking → Whisper → Alignment → Diarization → Output
                   (vLLM)    (Wav2Vec2)  (pyannote)
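
For intuition, the chunking stage is a sliding window over the waveform. The sketch below only illustrates the 30-second window / 5-second overlap scheme described above; the project's own preprocessing lives in whisperx_audio.py.

import numpy as np

def chunk_audio(audio: np.ndarray, sr: int = 16_000,
                chunk_s: float = 30.0, overlap_s: float = 5.0):
    # Split audio into overlapping windows, keeping each window's offset (in
    # seconds) so word timestamps can be shifted back to absolute time later.
    chunk = int(chunk_s * sr)
    step = int((chunk_s - overlap_s) * sr)
    chunks, start = [], 0
    while start < len(audio):
        end = min(start + chunk, len(audio))
        chunks.append((start / sr, audio[start:end]))
        if end == len(audio):
            break
        start += step
    return chunks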

Performance

Benchmarks on an NVIDIA H200 GPU:

Model               Features                      Throughput   RTF*    Memory
Whisper Large-v3    Basic                         40x          0.025   6GB
WhisperX            + Alignment                   37x          0.027   8GB
WhisperX            + Alignment + Diarization     30x          0.033   12GB

*RTF = Real-Time Factor (lower is better, 1.0 = real-time)
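
RTF is processing time divided by audio duration, and throughput is its inverse. A quick way to sanity-check the numbers on your own hardware, assuming the pipeline from the Quick Start and librosa (already an audio dependency) for reading the duration:

import time
import librosa

audio_duration = librosa.get_duration(path="audio.wav")  # seconds of audio
t0 = time.perf_counter()
result = pipeline.transcribe("audio.wav")
elapsed = time.perf_counter() - t0

rtf = elapsed / audio_duration          # lower is better; 1.0 = real-time
throughput = audio_duration / elapsed   # e.g. 40x = 40 s of audio per wall-clock second
print(f"RTF: {rtf:.3f}  Throughput: {throughput:.0f}x")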

Documentation

Guides are in vllm/docs/:

  • whisperx_integration.md
  • whisperx_usage.md
  • whisperx_api.md
  • whisperx_deployment.md

Examples

Complete examples are in vllm/examples/offline_inference/:

  • whisperx_basic.py
  • whisperx_alignment.py
  • whisperx_diarization.py
  • whisperx_batch.py

Supported Languages

WhisperX supports 30+ languages including:

  • English (en), Spanish (es), French (fr), German (de)
  • Chinese (zh), Japanese (ja), Korean (ko)
  • Portuguese (pt), Russian (ru), Arabic (ar)
  • And many more...

See the documentation for the full list of language codes.
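
Switching languages only requires a different language code in the same factory call used in the Quick Start, for example Spanish:

# Same pipeline factory as in the Quick Start, but for Spanish audio.
pipeline_es = create_whisperx_pipeline(
    model=model,
    enable_alignment=True,
    enable_diarization=False,
    language="es",
)
result_es = pipeline_es.transcribe("audio_es.wav")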

Requirements

Hardware

  • Minimum: NVIDIA GPU with 24GB VRAM (e.g., RTX 4090)
  • Recommended: NVIDIA H100 (80GB) or H200 (141GB)

Software

  • Python 3.10+
  • CUDA 12.1+
  • PyTorch 2.0+
  • vLLM 0.6+

Installation Details

Core Dependencies

pip install vllm torch transformers

Audio Processing

pip install librosa soundfile ffmpeg-python

Speaker Diarization

pip install pyannote.audio

# Set up authentication
export HF_TOKEN=your_token
# Accept terms at:
# - https://huggingface.co/pyannote/speaker-diarization-3.1
# - https://huggingface.co/pyannote/segmentation-3.0

Project Structure

whispervllm/
├── vllm/
│   ├── vllm/model_executor/models/
│   │   ├── whisperx.py                 # Main model implementation
│   │   ├── whisperx_alignment.py       # Forced alignment module
│   │   ├── whisperx_diarization.py     # Speaker diarization module
│   │   ├── whisperx_audio.py           # Audio preprocessing
│   │   └── whisperx_pipeline.py        # Complete pipeline
│   ├── examples/offline_inference/
│   │   ├── whisperx_basic.py
│   │   ├── whisperx_alignment.py
│   │   ├── whisperx_diarization.py
│   │   └── whisperx_batch.py
│   ├── docs/
│   │   ├── whisperx_integration.md
│   │   ├── whisperx_usage.md
│   │   ├── whisperx_api.md
│   │   └── whisperx_deployment.md
│   └── requirements-whisperx.txt
├── whisperX/                           # Reference implementation
└── README.md

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Roadmap

  • Core transcription with vLLM
  • Forced alignment with Wav2Vec2
  • Speaker diarization with pyannote
  • Audio chunking for long files
  • Multi-GPU support
  • Production deployment guides
  • OpenAI-compatible API server
  • Streaming transcription
  • Real-time processing
  • More language models

Troubleshooting

Issue: Alignment model not found

# Specify custom alignment model
config = WhisperXConfig(alignment_model="facebook/wav2vec2-large-xlsr-53")

Issue: CUDA out of memory

# Use float16
config = WhisperXConfig(compute_type="float16")

# Cleanup after processing
pipeline.cleanup()
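
When transcribing many files back to back, release the pipeline's models once at the end rather than per file. A sketch, assuming cleanup() frees the alignment and diarization models as shown above:

import torch

try:
    for path in ["a.wav", "b.wav", "c.wav"]:
        result = pipeline.transcribe(path)
        print(path, result["text"][:60])
finally:
    pipeline.cleanup()          # release alignment / diarization models
    torch.cuda.empty_cache()    # return cached blocks to the CUDA allocator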

Issue: Diarization requires authentication

# Set HuggingFace token
export HF_TOKEN=your_token_here
# or
huggingface-cli login

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

This integration builds on WhisperX, vLLM, Wav2Vec2, and pyannote.audio.

Citations

@misc{whisperx2023,
  title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
  author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
  year={2023},
  url={https://github.com/m-bain/whisperX}
}

@misc{vllm2023,
  title={vLLM: Easy, Fast, and Cheap LLM Serving},
  author={Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and others},
  year={2023},
  url={https://github.com/vllm-project/vllm}
}

Contact

Star History

If you find this project useful, please consider giving it a star ⭐


Made with ❤️ for the speech recognition community
