WhisperX API Reference

Core Components

WhisperXForConditionalGeneration

Main model class for WhisperX.

from vllm import LLM

llm = LLM(model="openai/whisper-large-v3")

Inherits from: WhisperForConditionalGeneration

Additional Features:

  • Support for forced alignment
  • Support for speaker diarization
  • Audio chunking for long files

WhisperXPipeline

High-level pipeline for the complete transcription workflow.

from vllm.model_executor.models.whisperx_pipeline import WhisperXPipeline, WhisperXConfig

pipeline = WhisperXPipeline(model=model, config=config, language="en")

Constructor

WhisperXPipeline(
    model: WhisperXForConditionalGeneration,
    config: WhisperXConfig,
    language: Optional[str] = None
)

Parameters:

  • model: WhisperX model instance
  • config: Pipeline configuration
  • language: Default language code (e.g., "en", "es")
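
Putting these pieces together, a minimal construction might look as follows. This is a sketch only: it assumes `model` is an already-loaded WhisperXForConditionalGeneration instance.

from vllm.model_executor.models.whisperx_pipeline import WhisperXPipeline, WhisperXConfig

# `model` is assumed to be loaded elsewhere (see WhisperXForConditionalGeneration above)
config = WhisperXConfig(enable_alignment=True, enable_diarization=False)
pipeline = WhisperXPipeline(model=model, config=config, language="en")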

Methods

transcribe()

Transcribe audio with the full pipeline.

result = pipeline.transcribe(
    audio: Union[str, np.ndarray, torch.Tensor],
    batch_size: int = 8,
    language: Optional[str] = None,
    task: str = "transcribe",
    **kwargs
) -> Dict

Parameters:

  • audio: Audio file path or waveform
  • batch_size: Batch size for processing
  • language: Override default language
  • task: "transcribe" or "translate"

Returns: Dictionary containing:

  • text: Full transcription
  • segments: List of segments with timestamps
  • word_segments: Word-level segments
  • language: Language code
  • duration: Audio duration in seconds
  • speaker_embeddings: (if diarization enabled) Speaker embeddings
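
A typical call and a walk over the returned dictionary (illustrative; assumes audio.wav exists locally and the pipeline was built as above):

result = pipeline.transcribe("audio.wav", batch_size=8, language="en")

print(result["text"])                       # full transcription
for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])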

cleanup()

Free memory by unloading auxiliary models.

pipeline.cleanup()

Configuration Classes

WhisperXConfig

Pipeline configuration.

from vllm.model_executor.models.whisperx_pipeline import WhisperXConfig

config = WhisperXConfig(
    enable_alignment=True,
    enable_diarization=False,
    chunk_length_s=30.0,
    overlap_length_s=5.0,
    compute_type="float16",
    device="cuda"
)

Parameters:

Alignment Settings

  • enable_alignment (bool): Enable forced alignment. Default: True
  • alignment_model (Optional[str]): Custom alignment model name. Default: None (auto-select)

Diarization Settings

  • enable_diarization (bool): Enable speaker diarization. Default: False
  • diarization_model (Optional[str]): Diarization model name. Default: None (uses pyannote/speaker-diarization-3.1)
  • min_speakers (Optional[int]): Minimum number of speakers. Default: None
  • max_speakers (Optional[int]): Maximum number of speakers. Default: None
  • num_speakers (Optional[int]): Exact number of speakers. Default: None

Audio Processing

  • chunk_length_s (float): Chunk length in seconds. Default: 30.0
  • overlap_length_s (float): Overlap between chunks. Default: 5.0

Performance

  • compute_type (str): Computation type. Options: "float16", "float32", "int8". Default: "float16"
  • device (str): Device to use. Options: "cuda", "cpu". Default: "cuda"
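
As a hedged example, a diarization-enabled configuration using the speaker-count hints listed above:

config = WhisperXConfig(
    enable_alignment=True,
    enable_diarization=True,
    min_speakers=2,      # assumed bounds, for illustration only
    max_speakers=4,
    compute_type="float16",
    device="cuda"
)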

Audio Processing

AudioChunker

Handles chunking of long audio files.

from vllm.model_executor.models.whisperx_audio import AudioChunker

chunker = AudioChunker(
    chunk_length_s=30.0,
    overlap_length_s=5.0,
    sample_rate=16000
)

Methods

chunk()

Split audio into chunks.

chunks = chunker.chunk(audio: Union[np.ndarray, torch.Tensor]) -> List[AudioChunk]

Returns: List of AudioChunk objects with:

  • audio: Audio samples
  • start_time: Start time in seconds
  • end_time: End time in seconds
  • chunk_idx: Chunk index
  • is_last: Whether this is the last chunk
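
For example, chunking a waveform and inspecting the resulting chunks (a sketch; `audio` is assumed to be a 16 kHz waveform):

chunks = chunker.chunk(audio)
for c in chunks:
    print(c.chunk_idx, c.start_time, c.end_time, c.is_last)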

merge_chunk_results()

Merge transcription results from multiple chunks.

merged = chunker.merge_chunk_results(
    chunk_results: List[dict],
    chunks: List[AudioChunk]
) -> dict
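
A hedged sketch of the chunked workflow. How each chunk is transcribed depends on your setup, so `transcribe_chunk` below is a hypothetical per-chunk helper, not part of this API:

chunks = chunker.chunk(audio)
chunk_results = [transcribe_chunk(c.audio) for c in chunks]  # hypothetical helper
merged = chunker.merge_chunk_results(chunk_results, chunks)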

Audio Utilities

load_audio()

Load an audio file as a waveform at the given sample rate.

from vllm.model_executor.models.whisperx_audio import load_audio

audio = load_audio(file_path: str, sr: int = 16000) -> np.ndarray

pad_or_trim()

Pad or trim audio to the specified length in samples.

from vllm.model_executor.models.whisperx_audio import pad_or_trim

audio = pad_or_trim(
    array: Union[np.ndarray, torch.Tensor],
    length: int = 480000,
    axis: int = -1
) -> Union[np.ndarray, torch.Tensor]

get_audio_duration()

Get audio duration in seconds.

from vllm.model_executor.models.whisperx_audio import get_audio_duration

duration = get_audio_duration(
    audio: Union[np.ndarray, torch.Tensor],
    sr: int = 16000
) -> float
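
A small combined example of these utilities (illustrative; assumes speech.wav exists locally):

audio = load_audio("speech.wav", sr=16000)
print(get_audio_duration(audio, sr=16000), "seconds")

# Pad or trim to a single 30-second window (480000 samples at 16 kHz)
window = pad_or_trim(audio, length=480000)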

Alignment

AlignmentModel

Forced alignment using Wav2Vec2.

from vllm.model_executor.models.whisperx_alignment import load_align_model

alignment_model = load_align_model(
    language_code="en",
    device="cuda",
    model_name=None,  # Auto-select
    model_dir=None
)

Methods

align()

Perform forced alignment.

result = alignment_model.align(
    transcript_segments: List[dict],
    audio: Union[np.ndarray, torch.Tensor],
    return_char_alignments: bool = False
) -> dict

Parameters:

  • transcript_segments: List of segments with text, start, end
  • audio: Audio waveform
  • return_char_alignments: Include character-level alignments

Returns: Dictionary with:

  • segments: Aligned segments with word-level timestamps
  • word_segments: Flat list of all words
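
An illustrative alignment pass, assuming `result` came from pipeline.transcribe() (or any other source of segments with text, start, end) and `audio` is the matching 16 kHz waveform:

alignment_model = load_align_model(language_code="en", device="cuda")
aligned = alignment_model.align(
    transcript_segments=result["segments"],
    audio=audio,
    return_char_alignments=False
)
for word in aligned["word_segments"]:
    print(word["word"], word["start"], word["end"])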

Diarization

DiarizationModel

Speaker diarization using pyannote.

from vllm.model_executor.models.whisperx_diarization import load_diarization_model

diarization_model = load_diarization_model(
    model_name="pyannote/speaker-diarization-3.1",
    use_auth_token=None,  # Uses HF_TOKEN from environment
    device="cuda"
)

Methods

__call__()

Perform speaker diarization.

result = diarization_model(
    audio: Union[str, np.ndarray, torch.Tensor],
    num_speakers: Optional[int] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
    return_embeddings: bool = False
) -> Union[pd.DataFrame, tuple]

Parameters:

  • audio: Audio path or waveform
  • num_speakers: Exact number of speakers
  • min_speakers: Minimum speakers
  • max_speakers: Maximum speakers
  • return_embeddings: Return speaker embeddings

Returns: pandas DataFrame with columns:

  • segment: Segment object
  • label: Internal label
  • speaker: Speaker ID
  • start: Start time
  • end: End time

Or tuple of (dataframe, embeddings) if return_embeddings=True.
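
For example (a sketch; assumes HF_TOKEN is set in the environment and audio.wav exists locally):

diarize_df = diarization_model("audio.wav", min_speakers=2, max_speakers=4)
for _, row in diarize_df.iterrows():
    print(row["speaker"], row["start"], row["end"])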

assign_word_speakers()

Assign speaker labels to a transcription result.

from vllm.model_executor.models.whisperx_diarization import assign_word_speakers

result = assign_word_speakers(
    diarize_df: pd.DataFrame,
    transcript_result: dict,
    speaker_embeddings: Optional[Dict] = None,
    fill_nearest: bool = False
) -> dict

Parameters:

  • diarize_df: Diarization dataframe
  • transcript_result: Transcription result
  • speaker_embeddings: Optional embeddings
  • fill_nearest: Assign the nearest speaker even when there is no temporal overlap

Returns: Updated transcript with speaker labels.
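
Combining the two previous steps (assumes `diarize_df` and `result` were produced as in the earlier examples):

labeled = assign_word_speakers(diarize_df, result)
for segment in labeled["segments"]:
    print(segment.get("speaker"), segment["text"])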

Factory Functions

create_whisperx_pipeline()

Create a WhisperX pipeline with a default configuration.

from vllm.model_executor.models.whisperx_pipeline import create_whisperx_pipeline

pipeline = create_whisperx_pipeline(
    model: WhisperXForConditionalGeneration,
    enable_alignment: bool = True,
    enable_diarization: bool = False,
    language: Optional[str] = None,
    **kwargs
) -> WhisperXPipeline

Parameters:

  • model: WhisperX model instance
  • enable_alignment: Enable forced alignment
  • enable_diarization: Enable speaker diarization
  • language: Default language
  • **kwargs: Additional config parameters

Returns: Configured WhisperXPipeline instance.
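
A minimal sketch, again assuming `model` is an already-loaded WhisperXForConditionalGeneration instance; extra keyword arguments such as chunk_length_s are forwarded to WhisperXConfig:

pipeline = create_whisperx_pipeline(
    model,
    enable_alignment=True,
    enable_diarization=False,
    language="en",
    chunk_length_s=30.0  # forwarded via **kwargs
)
result = pipeline.transcribe("audio.wav")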

Data Structures

AudioChunk

@dataclass
class AudioChunk:
    audio: np.ndarray  # Audio samples
    start_time: float  # Start time in seconds
    end_time: float    # End time in seconds
    chunk_idx: int     # Chunk index
    is_last: bool      # Last chunk flag

Segment

segment = {
    "start": float,        # Start time
    "end": float,          # End time
    "text": str,           # Transcribed text
    "speaker": str,        # Speaker ID (if diarization enabled)
    "words": List[Word]    # Word-level data (if alignment enabled)
}

Word

word = {
    "word": str,        # Word text
    "start": float,     # Start time
    "end": float,       # End time
    "score": float,     # Confidence score
    "speaker": str      # Speaker ID (if diarization enabled)
}
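
These dictionaries nest as shown above. For instance, printing word-level timings per segment (illustrative; assumes alignment was enabled so that "words" is present):

for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f'{word["start"]:.2f}-{word["end"]:.2f}', word["word"])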

Constants

# Audio processing constants
SAMPLE_RATE = 16000         # Target sample rate
CHUNK_LENGTH_S = 30         # Default chunk length
OVERLAP_LENGTH_S = 5        # Default overlap
N_SAMPLES_PER_CHUNK = 480000  # Samples per 30s chunk

Type Hints

from typing import Any, Dict, List, Optional, Union
import numpy as np
import torch

AudioInput = Union[str, np.ndarray, torch.Tensor]
TranscriptionResult = Dict[str, Any]
SegmentList = List[Dict[str, Any]]

Error Handling

All methods may raise:

  • ValueError: Invalid parameters
  • RuntimeError: Model loading or inference errors
  • FileNotFoundError: Audio file not found
  • ImportError: Missing dependencies

Example:

try:
    result = pipeline.transcribe("audio.wav")
except ValueError as e:
    print(f"Invalid audio: {e}")
except RuntimeError as e:
    print(f"Transcription failed: {e}")

Version Compatibility

  • vLLM: >=0.6.0
  • PyTorch: >=2.0.0
  • Transformers: >=4.30.0
  • pyannote.audio: >=3.1.0 (for diarization)