# WhisperX API Reference
## Core Components
### WhisperXForConditionalGeneration
Main model class for WhisperX.
```python
from vllm import LLM
llm = LLM(model="openai/whisper-large-v3")
```
**Inherits from**: `WhisperForConditionalGeneration`
**Additional Features**:
- Support for forced alignment
- Support for speaker diarization
- Audio chunking for long files
### WhisperXPipeline
High-level pipeline for complete transcription workflow.
```python
from vllm.model_executor.models.whisperx_pipeline import WhisperXPipeline, WhisperXConfig
pipeline = WhisperXPipeline(model=model, config=config, language="en")
```
#### Constructor
```python
WhisperXPipeline(
    model: WhisperXForConditionalGeneration,
    config: WhisperXConfig,
    language: Optional[str] = None
)
```
**Parameters**:
- `model`: WhisperX model instance
- `config`: Pipeline configuration
- `language`: Default language code (e.g., "en", "es")
#### Methods
##### transcribe()
Transcribe audio through the full pipeline: transcription, plus alignment and diarization when enabled.
```python
result = pipeline.transcribe(
    audio: Union[str, np.ndarray, torch.Tensor],
    batch_size: int = 8,
    language: Optional[str] = None,
    task: str = "transcribe",
    **kwargs
) -> Dict
```
**Parameters**:
- `audio`: Audio file path or waveform
- `batch_size`: Batch size for processing
- `language`: Override default language
- `task`: "transcribe" or "translate"
**Returns**: Dictionary containing:
- `text`: Full transcription
- `segments`: List of segments with timestamps
- `word_segments`: Word-level segments
- `language`: Language code
- `duration`: Audio duration in seconds
- `speaker_embeddings`: (if diarization enabled) Speaker embeddings
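A minimal usage sketch, assuming `pipeline` was constructed as above and that `audio.wav` is a local file (hypothetical path):
```python
# Sketch: transcribe a file and inspect the result structure described above.
result = pipeline.transcribe("audio.wav", batch_size=8, language="en")

print(result["language"], f"({result['duration']:.1f}s)")
print(result["text"])

# Each segment carries absolute start/end timestamps in seconds.
for seg in result["segments"]:
    print(f"[{seg['start']:.2f} -> {seg['end']:.2f}] {seg['text']}")
```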
##### cleanup()
Free memory by unloading auxiliary models.
```python
pipeline.cleanup()
```
## Configuration Classes
### WhisperXConfig
Pipeline configuration.
```python
from vllm.model_executor.models.whisperx_pipeline import WhisperXConfig
config = WhisperXConfig(
    enable_alignment=True,
    enable_diarization=False,
    chunk_length_s=30.0,
    overlap_length_s=5.0,
    compute_type="float16",
    device="cuda"
)
```
**Parameters**:
#### Alignment Settings
- `enable_alignment` (bool): Enable forced alignment. Default: `True`
- `alignment_model` (Optional[str]): Custom alignment model name. Default: `None` (auto-select)
#### Diarization Settings
- `enable_diarization` (bool): Enable speaker diarization. Default: `False`
- `diarization_model` (Optional[str]): Diarization model name. Default: `None` (uses pyannote/speaker-diarization-3.1)
- `min_speakers` (Optional[int]): Minimum number of speakers. Default: `None`
- `max_speakers` (Optional[int]): Maximum number of speakers. Default: `None`
- `num_speakers` (Optional[int]): Exact number of speakers. Default: `None`
#### Audio Processing
- `chunk_length_s` (float): Chunk length in seconds. Default: `30.0`
- `overlap_length_s` (float): Overlap between chunks. Default: `5.0`
#### Performance
- `compute_type` (str): Computation type. Options: "float16", "float32", "int8". Default: `"float16"`
- `device` (str): Device to use. Options: "cuda", "cpu". Default: `"cuda"`
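A sketch combining the options above for an alignment-plus-diarization run; `model` is assumed to be an already loaded `WhisperXForConditionalGeneration` instance, and diarization additionally requires pyannote.audio and a Hugging Face token:
```python
from vllm.model_executor.models.whisperx_pipeline import WhisperXConfig, WhisperXPipeline

# Sketch: enable alignment and diarization, bounding the expected speaker count.
config = WhisperXConfig(
    enable_alignment=True,
    enable_diarization=True,   # needs pyannote.audio and an HF token
    min_speakers=2,
    max_speakers=4,
    chunk_length_s=30.0,
    overlap_length_s=5.0,
    compute_type="float16",
    device="cuda",
)

# `model` is assumed to be a loaded WhisperXForConditionalGeneration instance.
pipeline = WhisperXPipeline(model=model, config=config, language="en")
```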
## Audio Processing
### AudioChunker
Handles chunking of long audio files.
```python
from vllm.model_executor.models.whisperx_audio import AudioChunker
chunker = AudioChunker(
    chunk_length_s=30.0,
    overlap_length_s=5.0,
    sample_rate=16000
)
```
#### Methods
##### chunk()
Split audio into chunks.
```python
chunks = chunker.chunk(audio: Union[np.ndarray, torch.Tensor]) -> List[AudioChunk]
```
**Returns**: List of `AudioChunk` objects with:
- `audio`: Audio samples
- `start_time`: Start time in seconds
- `end_time`: End time in seconds
- `chunk_idx`: Chunk index
- `is_last`: Whether this is the last chunk
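A chunking sketch on synthetic audio, so the attributes above can be inspected without a real recording:
```python
import numpy as np

from vllm.model_executor.models.whisperx_audio import AudioChunker

# Sketch: split 90 s of silent 16 kHz audio into overlapping 30 s windows.
sample_rate = 16000
audio = np.zeros(90 * sample_rate, dtype=np.float32)  # placeholder waveform

chunker = AudioChunker(chunk_length_s=30.0, overlap_length_s=5.0, sample_rate=sample_rate)
chunks = chunker.chunk(audio)

for c in chunks:
    print(f"chunk {c.chunk_idx}: {c.start_time:.1f}s -> {c.end_time:.1f}s (last={c.is_last})")
```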
##### merge_chunk_results()
Merge transcription results from multiple chunks.
```python
merged = chunker.merge_chunk_results(
    chunk_results: List[dict],
    chunks: List[AudioChunk]
) -> dict
```
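A merging sketch that reuses `chunker` and `chunks` from the example above; the per-chunk dictionary layout (a `segments` list with chunk-relative times) is an assumption based on the segment structure documented under "Data Structures":
```python
# Sketch: fabricate one placeholder result per chunk, then merge them back
# into a single transcript with absolute timestamps.
chunk_results = [
    {"segments": [{"start": 0.0, "end": 2.0, "text": f"text for chunk {c.chunk_idx}"}]}
    for c in chunks
]
merged = chunker.merge_chunk_results(chunk_results, chunks)
print(merged)  # expected to contain merged segments covering the full audio
```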
### Audio Utilities
#### load_audio()
Load audio file.
```python
from vllm.model_executor.models.whisperx_audio import load_audio
audio = load_audio(file_path: str, sr: int = 16000) -> np.ndarray
```
#### pad_or_trim()
Pad or trim audio to specified length.
```python
from vllm.model_executor.models.whisperx_audio import pad_or_trim
audio = pad_or_trim(
    array: Union[np.ndarray, torch.Tensor],
    length: int = 480000,
    axis: int = -1
) -> Union[np.ndarray, torch.Tensor]
```
#### get_audio_duration()
Get audio duration in seconds.
```python
from vllm.model_executor.models.whisperx_audio import get_audio_duration
duration = get_audio_duration(
    audio: Union[np.ndarray, torch.Tensor],
    sr: int = 16000
) -> float
```
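A sketch chaining the three utilities, assuming `audio.wav` is a local file (hypothetical path):
```python
from vllm.model_executor.models.whisperx_audio import get_audio_duration, load_audio, pad_or_trim

# Sketch: load a file, report its duration, then pad/trim it to exactly
# one 30-second chunk (480000 samples at 16 kHz).
audio = load_audio("audio.wav", sr=16000)
print(f"duration: {get_audio_duration(audio, sr=16000):.2f}s")

chunk = pad_or_trim(audio, length=480000)
assert chunk.shape[-1] == 480000
```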
## Alignment
### AlignmentModel
Forced alignment using Wav2Vec2.
```python
from vllm.model_executor.models.whisperx_alignment import load_align_model
alignment_model = load_align_model(
    language_code="en",
    device="cuda",
    model_name=None,  # Auto-select
    model_dir=None
)
```
#### Methods
##### align()
Perform forced alignment.
```python
result = alignment_model.align(
    transcript_segments: List[dict],
    audio: Union[np.ndarray, torch.Tensor],
    return_char_alignments: bool = False
) -> dict
```
**Parameters**:
- `transcript_segments`: List of segments with `text`, `start`, `end`
- `audio`: Audio waveform
- `return_char_alignments`: Include character-level alignments
**Returns**: Dictionary with:
- `segments`: Aligned segments with word-level timestamps
- `word_segments`: Flat list of all words
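An alignment sketch, assuming `audio` is a 16 kHz waveform (e.g. from `load_audio()`) and using a single placeholder segment:
```python
from vllm.model_executor.models.whisperx_alignment import load_align_model

# Sketch: refine coarse segment timestamps down to word level.
# `audio`: 16 kHz waveform (np.ndarray), e.g. from load_audio().
alignment_model = load_align_model(language_code="en", device="cuda")

transcript_segments = [{"text": "hello world", "start": 0.0, "end": 2.0}]
aligned = alignment_model.align(transcript_segments, audio, return_char_alignments=False)

for word in aligned["word_segments"]:
    # start/end/score may be missing for words that could not be aligned
    print(word.get("word"), word.get("start"), word.get("end"), word.get("score"))
```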
## Diarization
### DiarizationModel
Speaker diarization using pyannote.
```python
from vllm.model_executor.models.whisperx_diarization import load_diarization_model
diarization_model = load_diarization_model(
    model_name="pyannote/speaker-diarization-3.1",
    use_auth_token=None,  # Uses HF_TOKEN from environment
    device="cuda"
)
```
#### Methods
##### __call__()
Perform speaker diarization.
```python
result = diarization_model(
    audio: Union[str, np.ndarray, torch.Tensor],
    num_speakers: Optional[int] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
    return_embeddings: bool = False
) -> Union[pd.DataFrame, tuple]
```
**Parameters**:
- `audio`: Audio path or waveform
- `num_speakers`: Exact number of speakers
- `min_speakers`: Minimum speakers
- `max_speakers`: Maximum speakers
- `return_embeddings`: Return speaker embeddings
**Returns**: A pandas `DataFrame` with columns:
- `segment`: Segment object
- `label`: Internal label
- `speaker`: Speaker ID
- `start`: Start time
- `end`: End time
Or a tuple of `(DataFrame, embeddings)` if `return_embeddings=True`.
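A diarization sketch, assuming `audio.wav` is a local file (hypothetical path) and that `HF_TOKEN` grants access to the pyannote model:
```python
from vllm.model_executor.models.whisperx_diarization import load_diarization_model

# Sketch: diarize a file with a bounded speaker count.
diarization_model = load_diarization_model(
    model_name="pyannote/speaker-diarization-3.1",
    device="cuda",
)

diarize_df = diarization_model("audio.wav", min_speakers=2, max_speakers=4)
print(diarize_df[["speaker", "start", "end"]].head())
```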
### assign_word_speakers()
Assign speaker labels to a transcription result.
```python
from vllm.model_executor.models.whisperx_diarization import assign_word_speakers
result = assign_word_speakers(
    diarize_df: pd.DataFrame,
    transcript_result: dict,
    speaker_embeddings: Optional[Dict] = None,
    fill_nearest: bool = False
) -> dict
```
**Parameters**:
- `diarize_df`: Diarization dataframe
- `transcript_result`: Transcription result
- `speaker_embeddings`: Optional embeddings
- `fill_nearest`: Assign the nearest speaker even when a segment has no temporal overlap with any diarization turn
**Returns**: Updated transcript with speaker labels.
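A labeling sketch that combines the pieces above; `diarize_df` comes from the diarization call and `transcript_result` from `pipeline.transcribe()` or `alignment_model.align()`:
```python
from vllm.model_executor.models.whisperx_diarization import assign_word_speakers

# Sketch: attach speaker labels to an existing (ideally aligned) transcript.
labeled = assign_word_speakers(diarize_df, transcript_result, fill_nearest=False)

for seg in labeled["segments"]:
    print(f"{seg.get('speaker', 'UNKNOWN')}: {seg['text']}")
```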
## Factory Functions
### create_whisperx_pipeline()
Create WhisperX pipeline with default configuration.
```python
from vllm.model_executor.models.whisperx_pipeline import create_whisperx_pipeline
pipeline = create_whisperx_pipeline(
    model: WhisperXForConditionalGeneration,
    enable_alignment: bool = True,
    enable_diarization: bool = False,
    language: Optional[str] = None,
    **kwargs
) -> WhisperXPipeline
```
**Parameters**:
- `model`: WhisperX model instance
- `enable_alignment`: Enable forced alignment
- `enable_diarization`: Enable speaker diarization
- `language`: Default language
- `**kwargs`: Additional config parameters
**Returns**: Configured WhisperXPipeline instance.
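A factory sketch; `model` is assumed to be a loaded `WhisperXForConditionalGeneration`, and the extra keyword argument is assumed to be forwarded to `WhisperXConfig`:
```python
from vllm.model_executor.models.whisperx_pipeline import create_whisperx_pipeline

# Sketch: alignment-only pipeline with one config option passed through **kwargs.
pipeline = create_whisperx_pipeline(
    model=model,
    enable_alignment=True,
    enable_diarization=False,
    language="en",
    chunk_length_s=30.0,  # assumed to be forwarded to WhisperXConfig
)
result = pipeline.transcribe("audio.wav")  # hypothetical local file
```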
## Data Structures
### AudioChunk
```python
@dataclass
class AudioChunk:
    audio: np.ndarray    # Audio samples
    start_time: float    # Start time in seconds
    end_time: float      # End time in seconds
    chunk_idx: int       # Chunk index
    is_last: bool        # Last chunk flag
```
### Segment
```python
segment = {
"start": float, # Start time
"end": float, # End time
"text": str, # Transcribed text
"speaker": str, # Speaker ID (if diarization enabled)
"words": List[Word] # Word-level data (if alignment enabled)
}
```
### Word
```python
word = {
"word": str, # Word text
"start": float, # Start time
"end": float, # End time
"score": float, # Confidence score
"speaker": str # Speaker ID (if diarization enabled)
}
```
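A small sketch that walks the nested structures above and prints a speaker-labeled, word-timed transcript (assumes alignment and diarization were both enabled):
```python
# Sketch: pretty-print a transcription result using the Segment and Word layouts above.
def print_transcript(result: dict) -> None:
    for seg in result["segments"]:
        speaker = seg.get("speaker", "SPEAKER_??")
        print(f"[{seg['start']:7.2f} - {seg['end']:7.2f}] {speaker}: {seg['text']}")
        for w in seg.get("words", []):
            # start/end/score may be missing for words that could not be aligned
            print(f"    {w['word']:<15} {w.get('start')} -> {w.get('end')} (score={w.get('score')})")
```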
## Constants
```python
# Audio processing constants
SAMPLE_RATE = 16000 # Target sample rate
CHUNK_LENGTH_S = 30 # Default chunk length
OVERLAP_LENGTH_S = 5 # Default overlap
N_SAMPLES_PER_CHUNK = 480000 # Samples per 30s chunk
```
## Type Hints
```python
from typing import Union, Optional, Dict, List, Any
import numpy as np
import torch
AudioInput = Union[str, np.ndarray, torch.Tensor]
TranscriptionResult = Dict[str, Any]
SegmentList = List[Dict[str, Any]]
```
## Error Handling
All methods may raise:
- `ValueError`: Invalid parameters
- `RuntimeError`: Model loading or inference errors
- `FileNotFoundError`: Audio file not found
- `ImportError`: Missing dependencies
Example:
```python
try:
    result = pipeline.transcribe("audio.wav")
except ValueError as e:
    print(f"Invalid audio: {e}")
except RuntimeError as e:
    print(f"Transcription failed: {e}")
```
## Version Compatibility
- vLLM: >=0.6.0
- PyTorch: >=2.0.0
- Transformers: >=4.30.0
- pyannote.audio: >=3.1.0 (for diarization)