# WhisperX API Reference

## Core Components

### WhisperXForConditionalGeneration

Main model class for WhisperX.

```python
from vllm import LLM

llm = LLM(model="openai/whisper-large-v3")
```

Inherits from: `WhisperForConditionalGeneration`
**Additional Features:**
- Support for forced alignment
- Support for speaker diarization
- Audio chunking for long files
### WhisperXPipeline

High-level pipeline for the complete transcription workflow.

```python
from vllm.model_executor.models.whisperx_pipeline import WhisperXPipeline, WhisperXConfig

pipeline = WhisperXPipeline(model=model, config=config, language="en")
```

#### Constructor

```python
WhisperXPipeline(
    model: WhisperXForConditionalGeneration,
    config: WhisperXConfig,
    language: Optional[str] = None
)
```
**Parameters:**

- `model`: WhisperX model instance
- `config`: Pipeline configuration
- `language`: Default language code (e.g., "en", "es")
#### Methods

##### transcribe()

Transcribe audio with the full pipeline.

```python
result = pipeline.transcribe(
    audio: Union[str, np.ndarray, torch.Tensor],
    batch_size: int = 8,
    language: Optional[str] = None,
    task: str = "transcribe",
    **kwargs
) -> Dict
```
**Parameters:**

- `audio`: Audio file path or waveform
- `batch_size`: Batch size for processing
- `language`: Override default language
- `task`: `"transcribe"` or `"translate"`
**Returns:** Dictionary containing:

- `text`: Full transcription
- `segments`: List of segments with timestamps
- `word_segments`: Word-level segments
- `language`: Language code
- `duration`: Audio duration in seconds
- `speaker_embeddings`: Speaker embeddings (if diarization enabled)
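A minimal usage sketch based on the signature above, assuming the pipeline constructed earlier and a local file `audio.wav` (a placeholder path):

```python
# Transcribe a file and walk the documented result keys.
result = pipeline.transcribe("audio.wav", batch_size=8, language="en")

print(result["language"], f"{result['duration']:.1f}s")
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text']}")
```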
##### cleanup()

Free memory by unloading auxiliary models.

```python
pipeline.cleanup()
```
## Configuration Classes

### WhisperXConfig

Pipeline configuration.

```python
from vllm.model_executor.models.whisperx_pipeline import WhisperXConfig

config = WhisperXConfig(
    enable_alignment=True,
    enable_diarization=False,
    chunk_length_s=30.0,
    overlap_length_s=5.0,
    compute_type="float16",
    device="cuda"
)
```
**Parameters:**

#### Alignment Settings

- `enable_alignment` (bool): Enable forced alignment. Default: `True`
- `alignment_model` (Optional[str]): Custom alignment model name. Default: `None` (auto-select)

#### Diarization Settings

- `enable_diarization` (bool): Enable speaker diarization. Default: `False`
- `diarization_model` (Optional[str]): Diarization model name. Default: `None` (uses `pyannote/speaker-diarization-3.1`)
- `min_speakers` (Optional[int]): Minimum number of speakers. Default: `None`
- `max_speakers` (Optional[int]): Maximum number of speakers. Default: `None`
- `num_speakers` (Optional[int]): Exact number of speakers. Default: `None`

#### Audio Processing

- `chunk_length_s` (float): Chunk length in seconds. Default: `30.0`
- `overlap_length_s` (float): Overlap between chunks in seconds. Default: `5.0`

#### Performance

- `compute_type` (str): Computation type. Options: `"float16"`, `"float32"`, `"int8"`. Default: `"float16"`
- `device` (str): Device to use. Options: `"cuda"`, `"cpu"`. Default: `"cuda"`
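For illustration, a sketch of a diarization-enabled configuration using only the parameters listed above; the speaker count is an example value:

```python
# Example: word-aligned transcription with diarization for a known
# two-speaker recording (num_speakers is optional).
config = WhisperXConfig(
    enable_alignment=True,
    enable_diarization=True,
    num_speakers=2,          # exact speaker count, if known in advance
    compute_type="float16",
    device="cuda",
)
```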
## Audio Processing

### AudioChunker

Handles chunking of long audio files.

```python
from vllm.model_executor.models.whisperx_audio import AudioChunker

chunker = AudioChunker(
    chunk_length_s=30.0,
    overlap_length_s=5.0,
    sample_rate=16000
)
```
#### Methods

##### chunk()

Split audio into chunks.

```python
chunks = chunker.chunk(audio: Union[np.ndarray, torch.Tensor]) -> List[AudioChunk]
```
**Returns:** List of `AudioChunk` objects with:

- `audio`: Audio samples
- `start_time`: Start time in seconds
- `end_time`: End time in seconds
- `chunk_idx`: Chunk index
- `is_last`: Whether this is the last chunk
##### merge_chunk_results()

Merge transcription results from multiple chunks.

```python
merged = chunker.merge_chunk_results(
    chunk_results: List[dict],
    chunks: List[AudioChunk]
) -> dict
```
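A sketch of the chunk-transcribe-merge loop these two methods support. `transcribe_chunk` below is a hypothetical stand-in for whatever per-chunk inference you run, not part of the documented API:

```python
import numpy as np

# 95 s of silence as a stand-in for real 16 kHz audio.
audio = np.zeros(16000 * 95, dtype=np.float32)

chunks = chunker.chunk(audio)                                # split with overlap
chunk_results = [transcribe_chunk(c.audio) for c in chunks]  # hypothetical helper
merged = chunker.merge_chunk_results(chunk_results, chunks)  # stitch timestamps
```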
### Audio Utilities

#### load_audio()

Load an audio file.

```python
from vllm.model_executor.models.whisperx_audio import load_audio

audio = load_audio(file_path: str, sr: int = 16000) -> np.ndarray
```
#### pad_or_trim()

Pad or trim audio to a specified length.

```python
from vllm.model_executor.models.whisperx_audio import pad_or_trim

audio = pad_or_trim(
    array: Union[np.ndarray, torch.Tensor],
    length: int = 480000,
    axis: int = -1
) -> Union[np.ndarray, torch.Tensor]
```
#### get_audio_duration()

Get audio duration.

```python
from vllm.model_executor.models.whisperx_audio import get_audio_duration

duration = get_audio_duration(
    audio: Union[np.ndarray, torch.Tensor],
    sr: int = 16000
) -> float
```
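The three utilities compose naturally. A short sketch, assuming a local file `speech.wav` (placeholder path):

```python
from vllm.model_executor.models.whisperx_audio import (
    get_audio_duration,
    load_audio,
    pad_or_trim,
)

# Load at 16 kHz, report the duration, then normalize to one
# 30 s window (480000 samples).
audio = load_audio("speech.wav", sr=16000)
print(f"{get_audio_duration(audio):.1f} s")
window = pad_or_trim(audio, length=480000)
```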
## Alignment

### AlignmentModel

Forced alignment using Wav2Vec2.

```python
from vllm.model_executor.models.whisperx_alignment import load_align_model

alignment_model = load_align_model(
    language_code="en",
    device="cuda",
    model_name=None,  # Auto-select
    model_dir=None
)
```
#### Methods

##### align()

Perform forced alignment.

```python
result = alignment_model.align(
    transcript_segments: List[dict],
    audio: Union[np.ndarray, torch.Tensor],
    return_char_alignments: bool = False
) -> dict
```
**Parameters:**

- `transcript_segments`: List of segments with `text`, `start`, `end`
- `audio`: Audio waveform
- `return_char_alignments`: Include character-level alignments
**Returns:** Dictionary with:

- `segments`: Aligned segments with word-level timestamps
- `word_segments`: Flat list of all words
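A sketch of refining segment timestamps into word timestamps, assuming `alignment_model` was loaded as shown above and `audio` is a 16 kHz waveform; the segment is an example value:

```python
# Segments typically come from a prior transcription pass.
segments = [{"text": "hello world", "start": 0.0, "end": 1.5}]

aligned = alignment_model.align(segments, audio)
for word in aligned["word_segments"]:
    print(word["word"], word["start"], word["end"])
```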
## Diarization

### DiarizationModel

Speaker diarization using pyannote.

```python
from vllm.model_executor.models.whisperx_diarization import load_diarization_model

diarization_model = load_diarization_model(
    model_name="pyannote/speaker-diarization-3.1",
    use_auth_token=None,  # Uses HF_TOKEN from environment
    device="cuda"
)
```
#### Methods

##### `__call__()`

Perform speaker diarization by calling the model instance directly.
```python
result = diarization_model(
    audio: Union[str, np.ndarray, torch.Tensor],
    num_speakers: Optional[int] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
    return_embeddings: bool = False
) -> Union[pd.DataFrame, tuple]
```
**Parameters:**

- `audio`: Audio path or waveform
- `num_speakers`: Exact number of speakers
- `min_speakers`: Minimum speakers
- `max_speakers`: Maximum speakers
- `return_embeddings`: Return speaker embeddings
**Returns:** DataFrame with columns:

- `segment`: Segment object
- `label`: Internal label
- `speaker`: Speaker ID
- `start`: Start time
- `end`: End time

Or a tuple of `(dataframe, embeddings)` if `return_embeddings=True`.
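A sketch of a diarization pass with speaker-count hints, assuming the model loaded above and a placeholder path `meeting.wav`:

```python
# Returns a DataFrame with the columns listed above.
diarize_df = diarization_model("meeting.wav", min_speakers=2, max_speakers=4)
print(diarize_df[["speaker", "start", "end"]].head())
```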
### assign_word_speakers()

Assign speaker labels to a transcription.

```python
from vllm.model_executor.models.whisperx_diarization import assign_word_speakers

result = assign_word_speakers(
    diarize_df: pd.DataFrame,
    transcript_result: dict,
    speaker_embeddings: Optional[Dict] = None,
    fill_nearest: bool = False
) -> dict
```
**Parameters:**

- `diarize_df`: Diarization DataFrame
- `transcript_result`: Transcription result
- `speaker_embeddings`: Optional embeddings
- `fill_nearest`: Assign speakers even without overlap

**Returns:** Updated transcript with speaker labels.
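A sketch of attaching speaker labels to an existing transcript, assuming `diarize_df` from the diarization call above and `result` from `pipeline.transcribe()`:

```python
# Segments that overlap a diarized turn gain a "speaker" key.
labeled = assign_word_speakers(diarize_df, result)
for seg in labeled["segments"]:
    print(seg.get("speaker", "UNKNOWN"), seg["text"])
```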
## Factory Functions

### create_whisperx_pipeline()

Create a WhisperX pipeline with a default configuration.

```python
from vllm.model_executor.models.whisperx_pipeline import create_whisperx_pipeline

pipeline = create_whisperx_pipeline(
    model: WhisperXForConditionalGeneration,
    enable_alignment: bool = True,
    enable_diarization: bool = False,
    language: Optional[str] = None,
    **kwargs
) -> WhisperXPipeline
```
**Parameters:**

- `model`: WhisperX model instance
- `enable_alignment`: Enable forced alignment
- `enable_diarization`: Enable speaker diarization
- `language`: Default language
- `**kwargs`: Additional config parameters

**Returns:** Configured `WhisperXPipeline` instance.
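End to end, the factory replaces manual config construction. A sketch, assuming `model` is a loaded `WhisperXForConditionalGeneration` and `audio.wav` is a placeholder path:

```python
pipeline = create_whisperx_pipeline(
    model,
    enable_alignment=True,
    enable_diarization=False,
    language="en",
)
result = pipeline.transcribe("audio.wav")
pipeline.cleanup()  # unload auxiliary models when done
```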
## Data Structures

### AudioChunk

```python
@dataclass
class AudioChunk:
    audio: np.ndarray   # Audio samples
    start_time: float   # Start time in seconds
    end_time: float     # End time in seconds
    chunk_idx: int      # Chunk index
    is_last: bool       # Last chunk flag
```
### Segment

```python
segment = {
    "start": float,       # Start time
    "end": float,         # End time
    "text": str,          # Transcribed text
    "speaker": str,       # Speaker ID (if diarization enabled)
    "words": List[Word],  # Word-level data (if alignment enabled)
}
```
### Word

```python
word = {
    "word": str,      # Word text
    "start": float,   # Start time
    "end": float,     # End time
    "score": float,   # Confidence score
    "speaker": str,   # Speaker ID (if diarization enabled)
}
```
## Constants

```python
# Audio processing constants
SAMPLE_RATE = 16000           # Target sample rate (Hz)
CHUNK_LENGTH_S = 30           # Default chunk length (s)
OVERLAP_LENGTH_S = 5          # Default overlap (s)
N_SAMPLES_PER_CHUNK = 480000  # Samples per 30 s chunk
```
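The last constant is just the product of the first two, as a quick check confirms:

```python
# 16000 samples/s * 30 s = 480000 samples per chunk.
assert SAMPLE_RATE * CHUNK_LENGTH_S == N_SAMPLES_PER_CHUNK
```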
## Type Hints

```python
from typing import Any, Dict, List, Optional, Union

import numpy as np
import torch

AudioInput = Union[str, np.ndarray, torch.Tensor]
TranscriptionResult = Dict[str, Any]
SegmentList = List[Dict[str, Any]]
```
## Error Handling

All methods may raise:

- `ValueError`: Invalid parameters
- `RuntimeError`: Model loading or inference errors
- `FileNotFoundError`: Audio file not found
- `ImportError`: Missing dependencies
Example:

```python
try:
    result = pipeline.transcribe("audio.wav")
except ValueError as e:
    print(f"Invalid audio: {e}")
except RuntimeError as e:
    print(f"Transcription failed: {e}")
```
## Version Compatibility
- vLLM: >=0.6.0
- PyTorch: >=2.0.0
- Transformers: >=4.30.0
- pyannote.audio: >=3.1.0 (for diarization)