---
license: apache-2.0
language: en
library_name: pytorch
tags:
- audio
- automatic-speech-recognition
- speaker-verification
- wavlm
---

# WavLM-Base+ with Multi-Head Factorized Attentive Pooling (MHFA) for Speaker Embeddings

This repository contains the PyTorch implementation of a speaker embedding model that combines the powerful **WavLM-Base+** model with a **Multi-Head Factorized Attentive Pooling (MHFA)** layer. The model extracts fixed-size speaker embeddings from raw 16 kHz audio waveforms.

---

## How to Use

The following example shows how to load the model from this repository and use it to extract a speaker embedding from an audio signal.

### Installation

First, ensure you have PyTorch and the original `SSL_WavLM` library installed.

```bash
pip install torch
# You may also need the original WavLM library from Microsoft
```

### Usage Example

```python
import torch

from Transformer_WavLM import WavLM_MHFA

# 1. Load the model
model_path = './SSL_WavLM/model_convert.pt'
print(f"Loading model from: {model_path}...")
model = WavLM_MHFA(model_path=model_path)
model.eval()  # Set the model to evaluation mode for inference
print("Model loaded successfully.")

# 2. Prepare your audio data
# In a real application, you would load a 16 kHz audio file here.
# The input tensor shape should be: [batch_size, number_of_samples]
batch_size = 4
audio_samples = 32000  # ~2 seconds of audio at 16 kHz
dummy_audio = torch.randn(batch_size, audio_samples)
print(f"\nInput audio shape: {dummy_audio.shape}")

# 3. Extract the speaker embeddings
speaker_embedding = model(dummy_audio)
print("\nEmbedding extracted successfully!")
print(f"Output embedding shape: {speaker_embedding.shape}")
```

## Citation

If you find MHFA useful, please cite it as:

```bibtex
@inproceedings{peng2023attention,
  title={An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification},
  author={Peng, Junyi and Plchot, Old{\v{r}}ich and Stafylakis, Themos and Mo{\v{s}}ner, Ladislav and Burget, Luk{\'a}{\v{s}} and {\v{C}}ernock{\`y}, Jan},
  booktitle={2022 IEEE Spoken Language Technology Workshop (SLT)},
  pages={555--562},
  year={2023},
  organization={IEEE}
}
```
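
## Comparing Embeddings

A common downstream use of the extracted embeddings is speaker verification: score a pair of utterances with cosine similarity, and embeddings from the same speaker should score higher than embeddings from different speakers. The sketch below is illustrative only; `verify_speakers`, the embedding dimension of 256, and the threshold of 0.5 are hypothetical placeholders (random tensors stand in for real model outputs, and the threshold would normally be tuned on a development set):

```python
import torch
import torch.nn.functional as F

def verify_speakers(emb_a, emb_b, threshold=0.5):
    """Return (cosine score, same-speaker decision) for two embeddings.

    emb_a, emb_b: 1-D speaker embedding tensors of equal length.
    The threshold here is a placeholder; in practice it is tuned on
    held-out data (e.g. to the equal-error-rate operating point).
    """
    score = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
    return score, score >= threshold

# Random stand-ins for model outputs such as speaker_embedding[0] above.
emb_a = torch.randn(256)
emb_b = torch.randn(256)
score, same = verify_speakers(emb_a, emb_b)
print(f"Cosine score: {score:.3f} -> {'same' if same else 'different'} speaker")
```

Cosine scoring works well because attentive-pooling backends like MHFA are typically trained with margin-based softmax losses that make embedding direction, rather than magnitude, carry the speaker identity.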