---
license: apache-2.0
language: en
library_name: pytorch
tags:
- audio
- automatic-speech-recognition
- speaker-verification
- wavlm
---

# WavLM-Base+ with Multi-Head Factorized Attentive Pooling (MHFA) for Speaker Embeddings

This repository contains the PyTorch implementation of a speaker embedding model that combines the powerful **WavLM-Base+** model with a **Multi-Head Factorized Attentive Pooling (MHFA)** layer. The model extracts fixed-size speaker embeddings from raw 16 kHz audio waveforms.

---

## How to Use

The following example shows how to load the model from this repository and use it to extract a speaker embedding from an audio signal.

### Installation

First, ensure you have PyTorch and the original `SSL_WavLM` library installed.

```bash
pip install torch
# You may also need the original WavLM library from Microsoft
```

### Usage Example

```python
import torch

from Transformer_WavLM import WavLM_MHFA

# 1. Load the model
model_path = './SSL_WavLM/model_convert.pt'
print(f"Loading model from: {model_path}...")
model = WavLM_MHFA(model_path=model_path)
model.eval()  # Set the model to evaluation mode for inference
print("Model loaded successfully.")

# 2. Prepare your audio data
# In a real application, you would load a 16 kHz audio file here.
# The input tensor shape should be: [batch_size, number_of_samples]
batch_size = 4
audio_samples = 32000  # ~2 seconds of audio at 16 kHz
dummy_audio = torch.randn(batch_size, audio_samples)
print(f"\nInput audio shape: {dummy_audio.shape}")

# 3. Extract the speaker embeddings
speaker_embedding = model(dummy_audio)
print("\nEmbedding extracted successfully!")
print(f"Output embedding shape: {speaker_embedding.shape}")
```

## Citation

If you find MHFA useful, please cite it as:

```bibtex
@inproceedings{peng2023attention,
  title={An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification},
  author={Peng, Junyi and Plchot, Old{\v{r}}ich and Stafylakis, Themos and Mo{\v{s}}ner, Ladislav and Burget, Luk{\'a}{\v{s}} and {\v{C}}ernock{\`y}, Jan},
  booktitle={2022 IEEE Spoken Language Technology Workshop (SLT)},
  pages={555--562},
  year={2023},
  organization={IEEE}
}
```
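
## Comparing Embeddings

A common downstream use of the extracted embeddings is speaker verification: score a pair of utterances with cosine similarity, and embeddings from the same speaker should score higher than embeddings from different speakers. The sketch below is illustrative only; `verify_speakers`, the embedding dimension of 256, and the threshold of 0.5 are hypothetical placeholders (random tensors stand in for real model outputs, and the threshold would normally be tuned on a development set):

```python
import torch
import torch.nn.functional as F

def verify_speakers(emb_a, emb_b, threshold=0.5):
    """Return (cosine score, same-speaker decision) for two embeddings.

    emb_a, emb_b: 1-D speaker embedding tensors of equal length.
    The threshold here is a placeholder; in practice it is tuned on
    held-out data (e.g. to the equal-error-rate operating point).
    """
    score = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
    return score, score >= threshold

# Random stand-ins for model outputs such as speaker_embedding[0] above.
emb_a = torch.randn(256)
emb_b = torch.randn(256)
score, same = verify_speakers(emb_a, emb_b)
print(f"Cosine score: {score:.3f} -> {'same' if same else 'different'} speaker")
```

Cosine scoring works well because attentive-pooling backends like MHFA are typically trained with margin-based softmax losses that make embedding direction, rather than magnitude, carry the speaker identity.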