
WhisperX Usage Guide

Quick Start

1. Installation

# Install vLLM
pip install vllm

# Install WhisperX dependencies
pip install -r requirements-whisperx.txt

# For speaker diarization, set up HuggingFace authentication
export HF_TOKEN=your_huggingface_token
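
The token can also be supplied from Python instead of the shell via huggingface_hub; this is an optional alternative, not an extra requirement.

# Optional: authenticate from Python instead of exporting HF_TOKEN
from huggingface_hub import login

login(token="your_huggingface_token")  # same token the pyannote diarization models need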

2. Basic Transcription

from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

# Load model
llm = LLM(model="openai/whisper-large-v3")

# Load audio
audio_asset = AudioAsset("path/to/audio.wav")
audio = audio_asset.audio_and_sample_rate

# Create prompt
prompt = {
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {"audio": audio},
    },
    "decoder_prompt": "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>",
}

# Generate (raise max_tokens; the SamplingParams default of 16 truncates transcriptions)
sampling_params = SamplingParams(temperature=0.0, max_tokens=200)
outputs = llm.generate(prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
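
Whisper itself can emit coarse segment-level timestamps if the `<|notimestamps|>` token is omitted from the decoder prompt. The following is a minimal sketch of that variant, separate from the alignment pipeline described below; it reuses the `audio` tuple loaded above.

# Ask Whisper for segment-level timestamps by dropping <|notimestamps|>
prompt_with_timestamps = {
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {"audio": audio},
    },
    "decoder_prompt": "<|startoftranscript|><|en|><|transcribe|>",
}

# keep special tokens so the <|x.xx|> timestamp markers survive detokenization
sampling_params = SamplingParams(temperature=0.0, max_tokens=200, skip_special_tokens=False)
outputs = llm.generate(prompt_with_timestamps, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)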

Advanced Features

Forced Alignment (Word-Level Timestamps)

from vllm import LLM
from vllm.model_executor.models.whisperx_pipeline import create_whisperx_pipeline

# Initialize
llm = LLM(model="openai/whisper-large-v3", trust_remote_code=True)
model = llm.llm_engine.model_executor.driver_worker.model_runner.model

# Create pipeline
pipeline = create_whisperx_pipeline(
    model=model,
    enable_alignment=True,
    language="en"
)

# Transcribe
result = pipeline.transcribe("audio.wav")

# Access word timestamps
for segment in result["segments"]:
    print(f"Segment: {segment['text']}")
    for word in segment["words"]:
        print(f"  {word['word']}: {word['start']:.2f}s - {word['end']:.2f}s")

Speaker Diarization

# Create pipeline with diarization
pipeline = create_whisperx_pipeline(
    model=model,
    enable_alignment=True,
    enable_diarization=True,
    language="en",
    min_speakers=2,
    max_speakers=4
)

# Transcribe
result = pipeline.transcribe("multi_speaker.wav")

# Access speaker labels
for segment in result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")
    print(f"[{speaker}] {segment['text']}")

Long Audio Files

WhisperX automatically chunks long audio files:

# Process long audio (>30 seconds)
result = pipeline.transcribe("long_audio.wav")

# Results are automatically merged
print(f"Total duration: {result['duration']:.2f}s")
print(f"Total segments: {len(result['segments'])}")
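
As a rough illustration of how chunking relates to the defaults shown in the Configuration section below (30 s chunks with 5 s overlap), the number of chunks for a given duration can be estimated as follows; the exact boundaries are decided by the pipeline.

import math

# Rough chunk-count estimate, assuming chunk_length_s=30.0 and overlap_length_s=5.0
chunk_length_s = 30.0
overlap_length_s = 5.0
stride = chunk_length_s - overlap_length_s  # seconds of new audio per chunk

duration = result["duration"]
estimated_chunks = max(1, math.ceil((duration - overlap_length_s) / stride))
print(f"~{estimated_chunks} chunks for {duration:.1f}s of audio")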

Language Support

Specify the language explicitly for better accuracy:

# Transcribe in different languages
languages = ["en", "es", "fr", "de", "zh", "ja"]

for lang in languages:
    result = pipeline.transcribe(f"audio_{lang}.wav", language=lang)
    print(f"{lang}: {result['text']}")

Supported Languages

  • English (en)
  • Spanish (es)
  • French (fr)
  • German (de)
  • Italian (it)
  • Portuguese (pt)
  • Russian (ru)
  • Chinese (zh)
  • Japanese (ja)
  • Korean (ko)
  • Arabic (ar)
  • And 20+ more...
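
A small guard like the one below can catch unsupported codes before transcription starts; the set is illustrative and only covers the codes listed above, not the full Whisper language list.

# Illustrative check against the codes listed above (not the full Whisper list)
SUPPORTED_LANGS = {"en", "es", "fr", "de", "it", "pt", "ru", "zh", "ja", "ko", "ar"}

def check_language(code: str) -> str:
    if code not in SUPPORTED_LANGS:
        raise ValueError(f"Unsupported or untested language code: {code!r}")
    return code

result = pipeline.transcribe("audio.wav", language=check_language("en"))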

Configuration

Pipeline Configuration

from vllm.model_executor.models.whisperx_pipeline import WhisperXConfig

config = WhisperXConfig(
    # Alignment
    enable_alignment=True,
    alignment_model=None,  # Auto-select based on language
    
    # Diarization
    enable_diarization=True,
    diarization_model="pyannote/speaker-diarization-3.1",
    min_speakers=1,
    max_speakers=10,
    
    # Audio processing
    chunk_length_s=30.0,
    overlap_length_s=5.0,
    
    # Performance
    compute_type="float16",
    device="cuda",
)
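
The guide does not show how the config object is handed to the pipeline. One way that stays within the calls already demonstrated is to pass the same fields as keyword arguments to create_whisperx_pipeline; whether the factory also accepts the config object directly (e.g. config=config) is an assumption to verify against whisperx_pipeline.py.

# Reuse the config fields with the factory call shown earlier.
# Passing the WhisperXConfig instance directly may also work, but that is an
# assumption -- check the whisperx_pipeline source.
pipeline = create_whisperx_pipeline(
    model=model,
    enable_alignment=config.enable_alignment,
    enable_diarization=config.enable_diarization,
    min_speakers=config.min_speakers,
    max_speakers=config.max_speakers,
    language="en",
)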

Batch Processing

# Process multiple files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]

results = []
for audio_file in audio_files:
    result = pipeline.transcribe(audio_file)
    results.append(result)

# Cleanup memory
pipeline.cleanup()
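
For batch runs it is often convenient to persist each result next to its source file. The sketch below relies only on the output fields documented in the next section (text, language, duration, segments).

import json
from pathlib import Path

# Write one JSON file per input, using only the documented output fields
for audio_file, result in zip(audio_files, results):
    out_path = Path(audio_file).with_suffix(".json")
    with out_path.open("w", encoding="utf-8") as f:
        json.dump(
            {
                "text": result["text"],
                "language": result["language"],
                "duration": result["duration"],
                "segments": result["segments"],
            },
            f,
            ensure_ascii=False,
            indent=2,
        )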

Output Format

Basic Output

result = {
    "text": "Full transcription",
    "language": "en",
    "duration": 120.5,
    "segments": [...],
}

With Alignment

segment = {
    "start": 0.0,
    "end": 5.2,
    "text": "Hello world",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.5, "score": 0.95},
        {"word": "world", "start": 0.6, "end": 1.0, "score": 0.92},
    ]
}

With Diarization

segment = {
    "start": 0.0,
    "end": 5.2,
    "text": "Hello world",
    "speaker": "SPEAKER_00",
    "words": [
        {
            "word": "Hello",
            "start": 0.0,
            "end": 0.5,
            "score": 0.95,
            "speaker": "SPEAKER_00"
        },
    ]
}
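
The segment structure above maps directly onto subtitle formats. The following is a minimal SRT writer built only on the start, end, text, and optional speaker fields shown above.

def to_srt(segments) -> str:
    """Render WhisperX segments as an SRT subtitle string."""
    def fmt(t: float) -> str:
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int((t - int(t)) * 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    lines = []
    for i, seg in enumerate(segments, start=1):
        speaker = seg.get("speaker")
        text = f"[{speaker}] {seg['text']}" if speaker else seg["text"]
        lines.append(f"{i}\n{fmt(seg['start'])} --> {fmt(seg['end'])}\n{text}\n")
    return "\n".join(lines)

with open("audio.srt", "w", encoding="utf-8") as f:
    f.write(to_srt(result["segments"]))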

Best Practices

1. Memory Management

# Use float16 for efficiency
config = WhisperXConfig(compute_type="float16")

# Clean up after processing
pipeline.cleanup()

# Or use context manager (if implemented)
with create_whisperx_pipeline(...) as pipeline:
    result = pipeline.transcribe("audio.wav")
# Auto-cleanup
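
If the pipeline does not ship its own context-manager support, a thin wrapper built on contextlib gives the same auto-cleanup behaviour; it assumes only the cleanup() method documented above.

from contextlib import contextmanager

@contextmanager
def whisperx_session(**kwargs):
    """Create a pipeline and guarantee cleanup() runs, even on errors."""
    pipeline = create_whisperx_pipeline(**kwargs)
    try:
        yield pipeline
    finally:
        pipeline.cleanup()

with whisperx_session(model=model, enable_alignment=True, language="en") as pipeline:
    result = pipeline.transcribe("audio.wav")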

2. Error Handling

try:
    result = pipeline.transcribe("audio.wav")
except Exception as e:
    print(f"Transcription failed: {e}")
    # Handle error

3. Quality Optimization

# For best quality:
# - Use larger models: whisper-large-v3
# - Enable alignment: enable_alignment=True
# - Specify language: language="en"
# - Use float32 if memory allows: compute_type="float32"

# For speed:
# - Use smaller models: whisper-base
# - Disable alignment: enable_alignment=False
# - Use float16: compute_type="float16"
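
The trade-offs above can be captured as two ready-made configurations; these only combine WhisperXConfig fields already listed in the Configuration section (the model size itself is chosen when constructing LLM).

# Quality-first preset: alignment on, full precision (pair with whisper-large-v3)
quality_config = WhisperXConfig(
    enable_alignment=True,
    compute_type="float32",
    device="cuda",
)

# Speed-first preset: alignment off, half precision (pair with whisper-base)
speed_config = WhisperXConfig(
    enable_alignment=False,
    compute_type="float16",
    device="cuda",
)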

Troubleshooting

Issue: Alignment fails

Solution: Ensure language is specified correctly:

pipeline = create_whisperx_pipeline(model=model, language="en")

Issue: Diarization requires HF token

Solution: Set environment variable:

export HF_TOKEN=your_token_here

Issue: Out of memory

Solution: Use float16 and cleanup:

config = WhisperXConfig(compute_type="float16")
pipeline.cleanup()  # Call after each file

Examples

Complete examples are available in examples/offline_inference/:

  • whisperx_basic.py - Simple transcription
  • whisperx_alignment.py - With timestamps
  • whisperx_diarization.py - With speaker labels
  • whisperx_batch.py - Batch processing

Next Steps