WhisperX Usage Guide
Quick Start
1. Installation
# Install vLLM
pip install vllm
# Install WhisperX dependencies
pip install -r requirements-whisperx.txt
# For speaker diarization, set up HuggingFace authentication
export HF_TOKEN=your_huggingface_token
2. Basic Transcription
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
# Load model
llm = LLM(model="openai/whisper-large-v3")
# Load audio
audio_asset = AudioAsset("path/to/audio.wav")
audio = audio_asset.audio_and_sample_rate
# Create prompt
prompt = {
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {"audio": audio},
    },
    "decoder_prompt": "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>",
}
# Generate
outputs = llm.generate(prompt, sampling_params=SamplingParams(temperature=0.0))
print(outputs[0].outputs[0].text)
Advanced Features
Forced Alignment (Word-Level Timestamps)
from vllm import LLM
from vllm.model_executor.models.whisperx_pipeline import create_whisperx_pipeline
# Initialize
llm = LLM(model="openai/whisper-large-v3", trust_remote_code=True)
model = llm.llm_engine.model_executor.driver_worker.model_runner.model
# Create pipeline
pipeline = create_whisperx_pipeline(
    model=model,
    enable_alignment=True,
    language="en",
)
# Transcribe
result = pipeline.transcribe("audio.wav")
# Access word timestamps
for segment in result["segments"]:
    print(f"Segment: {segment['text']}")
    for word in segment["words"]:
        print(f"  {word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")
Speaker Diarization
# Create pipeline with diarization
pipeline = create_whisperx_pipeline(
    model=model,
    enable_alignment=True,
    enable_diarization=True,
    language="en",
    min_speakers=2,
    max_speakers=4,
)
# Transcribe
result = pipeline.transcribe("multi_speaker.wav")
# Access speaker labels
for segment in result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")
    print(f"[{speaker}] {segment['text']}")
Long Audio Files
WhisperX automatically splits long audio files into overlapping chunks and merges the per-chunk results:
# Process long audio (>30 seconds)
result = pipeline.transcribe("long_audio.wav")
# Results are automatically merged
print(f"Total duration: {result['duration']:.2f}s")
print(f"Total segments: {len(result['segments'])}")
Language Support
Specify the language for better accuracy:
# Transcribe in different languages
languages = ["en", "es", "fr", "de", "zh", "ja"]
for lang in languages:
    result = pipeline.transcribe(f"audio_{lang}.wav", language=lang)
    print(f"{lang}: {result['text']}")
Supported Languages
- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Russian (ru)
- Chinese (zh)
- Japanese (ja)
- Korean (ko)
- Arabic (ar)
- And 20+ more...
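If the language of a recording is unknown ahead of time, the detected language can be read back from the result. The sketch below assumes that transcribe() falls back to Whisper's built-in language detection when the language argument is omitted; that behaviour is an assumption, but the result's language field is documented in the Output Format section.
# Assumption: transcribe() auto-detects the language when none is given.
result = pipeline.transcribe("unknown_language.wav")
print(f"Detected language: {result['language']}")  # e.g. "en"
print(result["text"])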
Configuration
Pipeline Configuration
from vllm.model_executor.models.whisperx_pipeline import WhisperXConfig
config = WhisperXConfig(
    # Alignment
    enable_alignment=True,
    alignment_model=None,  # Auto-select
    # Diarization
    enable_diarization=True,
    diarization_model="pyannote/speaker-diarization-3.1",
    min_speakers=1,
    max_speakers=10,
    # Audio processing
    chunk_length_s=30.0,
    overlap_length_s=5.0,
    # Performance
    compute_type="float16",
    device="cuda",
)
Batch Processing
# Process multiple files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = []
for audio_file in audio_files:
    result = pipeline.transcribe(audio_file)
    results.append(result)
# Cleanup memory
pipeline.cleanup()
Output Format
Basic Output
result = {
    "text": "Full transcription",
    "language": "en",
    "duration": 120.5,
    "segments": [...],
}
With Alignment
segment = {
    "start": 0.0,
    "end": 5.2,
    "text": "Hello world",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.5, "score": 0.95},
        {"word": "world", "start": 0.6, "end": 1.0, "score": 0.92},
    ],
}
With Diarization
segment = {
    "start": 0.0,
    "end": 5.2,
    "text": "Hello world",
    "speaker": "SPEAKER_00",
    "words": [
        {
            "word": "Hello",
            "start": 0.0,
            "end": 0.5,
            "score": 0.95,
            "speaker": "SPEAKER_00",
        },
    ],
}
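A common use of the diarized output is speaker-labelled subtitles. The helper below is a minimal sketch that relies only on the segment fields documented above (start, end, text, speaker):
def write_srt(result, path="transcript.srt"):
    """Write result['segments'] as an SRT file with speaker prefixes."""
    def fmt(seconds):
        h, rem = divmod(seconds, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{int((s % 1) * 1000):03d}"

    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            speaker = seg.get("speaker", "UNKNOWN")
            f.write(f"{i}\n{fmt(seg['start'])} --> {fmt(seg['end'])}\n")
            f.write(f"[{speaker}] {seg['text'].strip()}\n\n")

write_srt(result)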
Best Practices
1. Memory Management
# Use float16 for efficiency
config = WhisperXConfig(compute_type="float16")
# Clean up after processing
pipeline.cleanup()
# Or use context manager (if implemented)
with create_whisperx_pipeline(...) as pipeline:
    result = pipeline.transcribe("audio.wav")
    # Auto-cleanup
2. Error Handling
try:
    result = pipeline.transcribe("audio.wav")
except Exception as e:
    print(f"Transcription failed: {e}")
    # Handle error
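When transcribing many files, it helps to isolate failures per file and to guarantee cleanup even if something goes wrong; a sketch built only from the transcribe() and cleanup() calls shown earlier:
results, failures = {}, {}
try:
    for audio_file in audio_files:
        try:
            results[audio_file] = pipeline.transcribe(audio_file)
        except Exception as e:
            failures[audio_file] = str(e)  # record the error, continue with the next file
finally:
    pipeline.cleanup()  # release memory even if a file failed

print(f"Transcribed {len(results)} files, {len(failures)} failed")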
3. Quality Optimization
# For best quality:
# - Use larger models: whisper-large-v3
# - Enable alignment: enable_alignment=True
# - Specify language: language="en"
# - Use float32 if memory allows: compute_type="float32"
# For speed:
# - Use smaller models: whisper-base
# - Disable alignment: enable_alignment=False
# - Use float16: compute_type="float16"
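Expressed as configurations, the two profiles above might look as follows. This is a sketch only: how a WhisperXConfig is wired into create_whisperx_pipeline() is not shown in this guide, and openai/whisper-base is assumed to be loadable the same way as whisper-large-v3 in the Quick Start.
# Quality-first: larger model, alignment on, float32 (needs more GPU memory)
quality_llm = LLM(model="openai/whisper-large-v3", trust_remote_code=True)
quality_config = WhisperXConfig(enable_alignment=True, compute_type="float32")

# Speed-first: smaller model, alignment off, float16
speed_llm = LLM(model="openai/whisper-base", trust_remote_code=True)
speed_config = WhisperXConfig(enable_alignment=False, compute_type="float16")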
Troubleshooting
Issue: Alignment fails
Solution: Ensure the language is specified correctly:
pipeline = create_whisperx_pipeline(model=model, language="en")
Issue: Diarization requires a HuggingFace token
Solution: Set environment variable:
export HF_TOKEN=your_token_here
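The token can also be set from Python before the pipeline is created, assuming the diarization backend reads HF_TOKEN from the environment as described in the installation step:
import os

# Assumption: the diarization backend picks up HF_TOKEN from the environment.
os.environ["HF_TOKEN"] = "your_token_here"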
Issue: Out of memory
Solution: Use float16 and clean up after each file:
config = WhisperXConfig(compute_type="float16")
pipeline.cleanup() # Call after each file
Examples
Complete examples are available in examples/offline_inference/:
- whisperx_basic.py - Simple transcription
- whisperx_alignment.py - With timestamps
- whisperx_diarization.py - With speaker labels
- whisperx_batch.py - Batch processing
Next Steps
- Read the Architecture Guide for implementation details
- Check the API Reference for detailed API docs
- See the Deployment Guide for production setup