---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: voice-activity-detection
tags:
- speaker
- speaker-diarization
- meeting
- wavlm
- wespeaker
- diarizen
- pyannote
- pyannote-audio-pipeline
---

## Overview
This repository hosts a pre-trained model from [DiariZen](https://github.com/BUTSpeechFIT/DiariZen). Its EEND component is built on WavLM Large and Conformer layers. The model was trained on far-field, single-channel audio from a diverse set of public datasets: AMI, AISHELL-4, AliMeeting, NOTSOFAR-1, MSDWild, DIHARD3, RAMC, and VoxConverse.

Structured pruning at 80% sparsity is then applied. Pruning reduces the number of parameters in WavLM Large from **316.6M to 63.3M**, and the computational cost from **17.8G to 3.8G** MACs per second of audio. When using this model, please ensure **non-commercial** usage, in accordance with the CC BY-NC 4.0 license.
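
For reference, verifying a parameter count like the one above is a one-liner on any PyTorch module. How to reach the pruned WavLM encoder inside the loaded pipeline depends on DiariZen's internals and is not documented here, so the sketch below only defines the helper:

```python
import torch

def count_parameters_m(module: torch.nn.Module) -> float:
    """Total parameter count of a torch module, in millions."""
    return sum(p.numel() for p in module.parameters()) / 1e6

# Applied to the pruned WavLM Large encoder, this should report ~63.3M
# (vs. ~316.6M unpruned). Locating that submodule inside the loaded
# pipeline depends on DiariZen's internals and is left as an exercise.
```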



## Usage
```python
from diarizen.pipelines.inference import DiariZenPipeline

# load pre-trained model
diar_pipeline = DiariZenPipeline.from_pretrained("BUT-FIT/diarizen-wavlm-large-s80-md")
# apply diarization pipeline
diar_results = diar_pipeline('audio.wav')

# print results
for turn, _, speaker in diar_results.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")

# load pre-trained model and save RTTM result
diar_pipeline = DiariZenPipeline.from_pretrained(
    "BUT-FIT/diarizen-wavlm-large-s80-md",
    rttm_out_dir='.'
)
# apply diarization pipeline
diar_results = diar_pipeline('audio.wav', sess_name='session_name')
```
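
The returned `diar_results` supports the `pyannote.core.Annotation` interface (as the `itertracks` call above suggests), so the output can also be serialized manually; a minimal sketch:

```python
# Write the diarization result to an RTTM file by hand
# (assumes diar_results is a pyannote.core.Annotation).
with open("audio.rttm", "w") as f:
    diar_results.write_rttm(f)
```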

## Results (DER in %, collar=0s)
| Dataset       | [Pyannote v3.1](https://github.com/pyannote/pyannote-audio) | DiariZen |
|:---------------|:-----------:|:-----------:|
| AMI           | 22.4      | 14.0 |
| AISHELL-4     | 12.2      | 9.8 |
| AliMeeting    | 24.4      | 12.5 |
| NOTSOFAR-1    | -      | 17.9 |
| MSDWild       | 25.3      | 15.6 | 
| DIHARD3       | 21.7      | 14.5 | 
| RAMC          | 22.2      | 11.0 | 
| VoxConverse   | 11.3      | 9.2 |
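
Numbers like these can be reproduced with `pyannote.metrics`; a sketch, assuming a ground-truth RTTM file (`reference.rttm` is a placeholder path) and the `diar_results` hypothesis from the Usage section above:

```python
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

# Load the ground-truth annotation (load_rttm maps URI -> Annotation).
reference = next(iter(load_rttm("reference.rttm").values()))

# collar=0.0 matches the table above; overlapped speech is scored.
metric = DiarizationErrorRate(collar=0.0)
print(f"DER = {metric(reference, diar_results) * 100:.1f}%")
```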

## Citation
If you find this work helpful, please consider citing:
```
@inproceedings{han2025leveraging,
  title={Leveraging self-supervised learning for speaker diarization},
  author={Han, Jiangyu and Landini, Federico and Rohdin, Johan and Silnova, Anna and Diez, Mireia and Burget, Luk{\'a}{\v{s}}},
  booktitle={Proc. ICASSP},
  year={2025}
}

@article{han2025fine,
  title={Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization},
  author={Han, Jiangyu and Landini, Federico and Rohdin, Johan and Silnova, Anna and Diez, Mireia and {\v{C}}ernock{\'y}, Jan and Burget, Luk{\'a}{\v{s}}},
  journal={arXiv preprint arXiv:2505.24111},
  year={2025}
}

@article{han2025efficient,
  title={Efficient and Generalizable Speaker Diarization via Structured Pruning of Self-Supervised Models},
  author={Han, Jiangyu and P{\'a}lka, Petr and Delcroix, Marc and Landini, Federico and Rohdin, Johan and {\v{C}}ernock{\'y}, Jan and Burget, Luk{\'a}{\v{s}}},
  journal={arXiv preprint arXiv:2506.18623},
  year={2025}
}
```

## License
- **Source code**: MIT (see the [project’s GitHub repository](https://github.com/BUTSpeechFIT/DiariZen)).
- **Model weights**: CC BY-NC 4.0 (non-commercial).
- Rationale: some training datasets are research-only or non-commercial, so the released weights cannot be used commercially.