# WhisperX API Reference
## Core Components
### WhisperXForConditionalGeneration
Main model class for WhisperX.
```python
from vllm import LLM
llm = LLM(model="openai/whisper-large-v3")
```
**Inherits from**: `WhisperForConditionalGeneration`
**Additional Features**:
- Support for forced alignment
- Support for speaker diarization
- Audio chunking for long files
### WhisperXPipeline
High-level pipeline for the complete transcription workflow.
```python
from vllm.model_executor.models.whisperx_pipeline import WhisperXPipeline, WhisperXConfig
pipeline = WhisperXPipeline(model=model, config=config, language="en")
```
#### Constructor
```python
WhisperXPipeline(
    model: WhisperXForConditionalGeneration,
    config: WhisperXConfig,
    language: Optional[str] = None
)
```
**Parameters**:
- `model`: WhisperX model instance
- `config`: Pipeline configuration
- `language`: Default language code (e.g., "en", "es")
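A minimal construction sketch, assuming `model` is an already-loaded `WhisperXForConditionalGeneration` instance (obtaining the model handle is outside the scope of this snippet) and using only the configuration fields documented below:
```python
from vllm.model_executor.models.whisperx_pipeline import WhisperXConfig, WhisperXPipeline

# Assumption: `model` is a loaded WhisperXForConditionalGeneration instance.
config = WhisperXConfig(
    enable_alignment=True,     # word-level timestamps via forced alignment
    enable_diarization=False,  # no speaker labels
    chunk_length_s=30.0,
    overlap_length_s=5.0,
)
pipeline = WhisperXPipeline(model=model, config=config, language="en")
```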
#### Methods
##### transcribe()
Transcribe audio with the full pipeline.
```python
result = pipeline.transcribe(
    audio: Union[str, np.ndarray, torch.Tensor],
    batch_size: int = 8,
    language: Optional[str] = None,
    task: str = "transcribe",
    **kwargs
) -> Dict
```
**Parameters**:
- `audio`: Audio file path or waveform
- `batch_size`: Batch size for processing
- `language`: Override the default language
- `task`: "transcribe" or "translate"
**Returns**: Dictionary containing:
- `text`: Full transcription
- `segments`: List of segments with timestamps
- `word_segments`: Word-level segments
- `language`: Language code
- `duration`: Audio duration in seconds
- `speaker_embeddings`: Speaker embeddings (if diarization enabled)
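A usage sketch based on the return fields listed above ("audio.wav" is a placeholder path):
```python
result = pipeline.transcribe("audio.wav", batch_size=8, language="en", task="transcribe")

print(result["language"], result["duration"])
print(result["text"])

# Segment-level timestamps; `word_segments` additionally carries word-level
# detail when alignment is enabled.
for segment in result["segments"]:
    print(f"[{segment['start']:.2f} - {segment['end']:.2f}] {segment['text']}")
```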
##### cleanup()
Free memory by unloading auxiliary models.
```python
pipeline.cleanup()
```
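A suggested (not required) pattern is to pair transcription with cleanup so the auxiliary models are released even if transcription fails:
```python
try:
    result = pipeline.transcribe("audio.wav")
finally:
    # Unload alignment/diarization models and free their memory.
    pipeline.cleanup()
```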
## Configuration Classes
### WhisperXConfig
Pipeline configuration.
```python
from vllm.model_executor.models.whisperx_pipeline import WhisperXConfig
config = WhisperXConfig(
    enable_alignment=True,
    enable_diarization=False,
    chunk_length_s=30.0,
    overlap_length_s=5.0,
    compute_type="float16",
    device="cuda"
)
```
**Parameters**:
#### Alignment Settings
- `enable_alignment` (bool): Enable forced alignment. Default: `True`
- `alignment_model` (Optional[str]): Custom alignment model name. Default: `None` (auto-select)
#### Diarization Settings
- `enable_diarization` (bool): Enable speaker diarization. Default: `False`
- `diarization_model` (Optional[str]): Diarization model name. Default: `None` (uses pyannote/speaker-diarization-3.1)
- `min_speakers` (Optional[int]): Minimum number of speakers. Default: `None`
- `max_speakers` (Optional[int]): Maximum number of speakers. Default: `None`
- `num_speakers` (Optional[int]): Exact number of speakers. Default: `None`
#### Audio Processing
- `chunk_length_s` (float): Chunk length in seconds. Default: `30.0`
- `overlap_length_s` (float): Overlap between chunks in seconds. Default: `5.0`
#### Performance
- `compute_type` (str): Computation type. Options: "float16", "float32", "int8". Default: `"float16"`
- `device` (str): Device to use. Options: "cuda", "cpu". Default: `"cuda"`
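For example, a diarization-enabled configuration (the speaker counts here are illustrative; the Hugging Face token for the pyannote model is read from the environment, as noted in the Diarization section below):
```python
config = WhisperXConfig(
    enable_alignment=True,
    enable_diarization=True,  # requires pyannote.audio >= 3.1.0 and an HF token
    min_speakers=2,
    max_speakers=4,
    chunk_length_s=30.0,
    overlap_length_s=5.0,
    compute_type="float16",
    device="cuda",
)
```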
## Audio Processing
### AudioChunker
Handles chunking of long audio files.
```python
from vllm.model_executor.models.whisperx_audio import AudioChunker
chunker = AudioChunker(
    chunk_length_s=30.0,
    overlap_length_s=5.0,
    sample_rate=16000
)
```
#### Methods
##### chunk()
Split audio into chunks.
```python
chunks = chunker.chunk(audio: Union[np.ndarray, torch.Tensor]) -> List[AudioChunk]
```
**Returns**: List of `AudioChunk` objects with:
- `audio`: Audio samples
- `start_time`: Start time in seconds
- `end_time`: End time in seconds
- `chunk_idx`: Chunk index
- `is_last`: Whether this is the last chunk
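A short sketch of chunking a long waveform and inspecting the chunk metadata (the silent array is a stand-in for a real recording):
```python
import numpy as np

# Ten minutes of silence at 16 kHz.
audio = np.zeros(16000 * 600, dtype=np.float32)

chunks = chunker.chunk(audio)
for chunk in chunks:
    print(f"chunk {chunk.chunk_idx}: {chunk.start_time:.1f}s - {chunk.end_time:.1f}s, "
          f"last={chunk.is_last}")
```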
##### merge_chunk_results()
Merge transcription results from multiple chunks.
```python
merged = chunker.merge_chunk_results(
    chunk_results: List[dict],
    chunks: List[AudioChunk]
) -> dict
```
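A rough sketch of the chunked workflow, assuming `pipeline` and `audio` from the earlier examples (the high-level pipeline is expected to do this internally for long files; the exact keys of the merged dict are assumed to mirror a transcription result):
```python
chunks = chunker.chunk(audio)

# Transcribe each chunk independently, then stitch the per-chunk results back
# together, shifting chunk-local timestamps onto the global timeline.
chunk_results = [pipeline.transcribe(chunk.audio) for chunk in chunks]
merged = chunker.merge_chunk_results(chunk_results, chunks)
```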
### Audio Utilities
#### load_audio()
Load an audio file.
```python
from vllm.model_executor.models.whisperx_audio import load_audio
audio = load_audio(file_path: str, sr: int = 16000) -> np.ndarray
```
#### pad_or_trim()
Pad or trim audio to a specified length.
```python
from vllm.model_executor.models.whisperx_audio import pad_or_trim
audio = pad_or_trim(
    array: Union[np.ndarray, torch.Tensor],
    length: int = 480000,
    axis: int = -1
) -> Union[np.ndarray, torch.Tensor]
```
#### get_audio_duration()
Get the audio duration in seconds.
```python
from vllm.model_executor.models.whisperx_audio import get_audio_duration
duration = get_audio_duration(
    audio: Union[np.ndarray, torch.Tensor],
    sr: int = 16000
) -> float
```
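Putting the utilities together ("audio.wav" is a placeholder path; 480000 samples is one 30-second chunk at 16 kHz, matching the constants listed at the end of this reference):
```python
from vllm.model_executor.models.whisperx_audio import (
    get_audio_duration, load_audio, pad_or_trim,
)

audio = load_audio("audio.wav", sr=16000)
print(f"duration: {get_audio_duration(audio, sr=16000):.1f}s")

# Pad or trim to exactly one 30-second window (480000 samples at 16 kHz).
window = pad_or_trim(audio, length=480000)
```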
## Alignment
### AlignmentModel
Forced alignment using Wav2Vec2.
```python
from vllm.model_executor.models.whisperx_alignment import load_align_model
alignment_model = load_align_model(
    language_code="en",
    device="cuda",
    model_name=None,  # Auto-select
    model_dir=None
)
```
#### Methods
##### align()
Perform forced alignment.
```python
result = alignment_model.align(
    transcript_segments: List[dict],
    audio: Union[np.ndarray, torch.Tensor],
    return_char_alignments: bool = False
) -> dict
```
**Parameters**:
- `transcript_segments`: List of segments with `text`, `start`, `end`
- `audio`: Audio waveform
- `return_char_alignments`: Include character-level alignments
**Returns**: Dictionary with:
- `segments`: Aligned segments with word-level timestamps
- `word_segments`: Flat list of all words
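A sketch of aligning an existing transcription, assuming `result` came from `pipeline.transcribe()` and `audio` is the matching 16 kHz waveform:
```python
alignment_model = load_align_model(language_code="en", device="cuda")

aligned = alignment_model.align(
    transcript_segments=result["segments"],  # each segment needs text/start/end
    audio=audio,
    return_char_alignments=False,
)

for word in aligned["word_segments"]:
    print(f"{word['start']:.2f}-{word['end']:.2f}: {word['word']}")
```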
## Diarization
### DiarizationModel
Speaker diarization using pyannote.
```python
from vllm.model_executor.models.whisperx_diarization import load_diarization_model
diarization_model = load_diarization_model(
    model_name="pyannote/speaker-diarization-3.1",
    use_auth_token=None,  # Uses HF_TOKEN from the environment
    device="cuda"
)
```
#### Methods
##### __call__()
Perform speaker diarization.
```python
result = diarization_model(
    audio: Union[str, np.ndarray, torch.Tensor],
    num_speakers: Optional[int] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
    return_embeddings: bool = False
) -> Union[pd.DataFrame, tuple]
```
**Parameters**:
- `audio`: Audio path or waveform
- `num_speakers`: Exact number of speakers
- `min_speakers`: Minimum number of speakers
- `max_speakers`: Maximum number of speakers
- `return_embeddings`: Return speaker embeddings
**Returns**: A `pandas.DataFrame` with the columns:
- `segment`: Segment object
- `label`: Internal label
- `speaker`: Speaker ID
- `start`: Start time
- `end`: End time
Or a tuple of `(dataframe, embeddings)` if `return_embeddings=True`.
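A sketch of running the diarizer on its own (requires a valid Hugging Face token with access to the pyannote model; "audio.wav" is a placeholder path):
```python
diarize_df = diarization_model(
    "audio.wav",
    min_speakers=2,
    max_speakers=4,
)

# One row per speech turn: speaker ID plus start/end times.
for row in diarize_df.itertuples():
    print(f"{row.speaker}: {row.start:.2f} - {row.end:.2f}")
```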
### assign_word_speakers()
Assign speaker labels to a transcription result.
```python
from vllm.model_executor.models.whisperx_diarization import assign_word_speakers
result = assign_word_speakers(
    diarize_df: pd.DataFrame,
    transcript_result: dict,
    speaker_embeddings: Optional[Dict] = None,
    fill_nearest: bool = False
) -> dict
```
**Parameters**:
- `diarize_df`: Diarization dataframe
- `transcript_result`: Transcription result
- `speaker_embeddings`: Optional speaker embeddings
- `fill_nearest`: Assign speakers even without temporal overlap
**Returns**: Updated transcript with speaker labels.
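Combining the pieces, speaker labels can be projected onto an existing transcription (assumes `result` came from `pipeline.transcribe()` with alignment enabled so word timings exist):
```python
diarize_df = diarization_model("audio.wav", min_speakers=2, max_speakers=4)
result = assign_word_speakers(diarize_df, result)

for segment in result["segments"]:
    print(f"{segment.get('speaker', 'UNKNOWN')}: {segment['text']}")
```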
## Factory Functions
### create_whisperx_pipeline()
Create a WhisperX pipeline with a default configuration.
```python
from vllm.model_executor.models.whisperx_pipeline import create_whisperx_pipeline
pipeline = create_whisperx_pipeline(
    model: WhisperXForConditionalGeneration,
    enable_alignment: bool = True,
    enable_diarization: bool = False,
    language: Optional[str] = None,
    **kwargs
) -> WhisperXPipeline
```
**Parameters**:
- `model`: WhisperX model instance
- `enable_alignment`: Enable forced alignment
- `enable_diarization`: Enable speaker diarization
- `language`: Default language
- `**kwargs`: Additional config parameters
**Returns**: Configured WhisperXPipeline instance.
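For most use cases the factory is the shortest path to a working pipeline (again assuming `model` is an already-loaded `WhisperXForConditionalGeneration` instance):
```python
pipeline = create_whisperx_pipeline(
    model,
    enable_alignment=True,
    enable_diarization=False,
    language="en",
    chunk_length_s=30.0,  # extra config fields are forwarded via **kwargs
)
result = pipeline.transcribe("audio.wav")
```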
## Data Structures
### AudioChunk
```python
@dataclass
class AudioChunk:
    audio: np.ndarray   # Audio samples
    start_time: float   # Start time in seconds
    end_time: float     # End time in seconds
    chunk_idx: int      # Chunk index
    is_last: bool       # Last chunk flag
```
### Segment
```python
segment = {
    "start": float,       # Start time
    "end": float,         # End time
    "text": str,          # Transcribed text
    "speaker": str,       # Speaker ID (if diarization enabled)
    "words": List[Word]   # Word-level data (if alignment enabled)
}
```
### Word
```python
word = {
    "word": str,      # Word text
    "start": float,   # Start time
    "end": float,     # End time
    "score": float,   # Confidence score
    "speaker": str    # Speaker ID (if diarization enabled)
}
```
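A sketch of walking these structures in a fully enabled run of `pipeline.transcribe()` (alignment and diarization both on, which is when the optional `speaker` and `words` fields are present):
```python
for segment in result["segments"]:
    speaker = segment.get("speaker", "UNKNOWN")
    print(f"[{segment['start']:.2f}-{segment['end']:.2f}] {speaker}: {segment['text']}")
    for word in segment.get("words", []):
        print(f"    {word['word']} ({word['start']:.2f}-{word['end']:.2f}, "
              f"score={word['score']:.2f})")
```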
## Constants
```python
# Audio processing constants
SAMPLE_RATE = 16000           # Target sample rate (Hz)
CHUNK_LENGTH_S = 30           # Default chunk length (seconds)
OVERLAP_LENGTH_S = 5          # Default overlap (seconds)
N_SAMPLES_PER_CHUNK = 480000  # Samples per 30s chunk (SAMPLE_RATE * CHUNK_LENGTH_S)
```
## Type Hints
```python
from typing import Any, Dict, List, Optional, Union
import numpy as np
import torch
AudioInput = Union[str, np.ndarray, torch.Tensor]
TranscriptionResult = Dict[str, Any]
SegmentList = List[Dict[str, Any]]
```
## Error Handling
All methods may raise:
- `ValueError`: Invalid parameters
- `RuntimeError`: Model loading or inference errors
- `FileNotFoundError`: Audio file not found
- `ImportError`: Missing dependencies
Example:
```python
try:
    result = pipeline.transcribe("audio.wav")
except ValueError as e:
    print(f"Invalid audio: {e}")
except RuntimeError as e:
    print(f"Transcription failed: {e}")
```
## Version Compatibility
- vLLM: >=0.6.0
- PyTorch: >=2.0.0
- Transformers: >=4.30.0
- pyannote.audio: >=3.1.0 (for diarization)