---
license: apache-2.0
language:
- sl
- hr
- sr
- cs
- pl
base_model:
- facebook/w2v-bert-2.0
pipeline_tag: audio-classification
metrics:
- f1
- recall
- precision
---
# Frame classification for filled pauses
## Model Details
This model classifies individual 20 ms audio frames according to whether
they contain a filled pause ("eee", "errm", ...).
### Model Description
- **Developed by:** Peter Rupnik, Nikola Ljubešić, Ivan Porupski, Darinka Verdonik
- **Funded by:** MEZZANINE project
- **Model type:** Wav2Vec2Bert for Audio Frame Classification
- **Language(s) (NLP):** Trained and tested on Slovenian [ROG-Artur](http://hdl.handle.net/11356/1992), evaluated also on Croatian, Serbian, Polish, and Czech samples from the [ParlaSpeech corpus](http://clarinsi.github.io/parlaspeech)
- **Finetuned from model:** facebook/w2v-bert-2.0
## Model reference
If you wish to cite this model, use
```bibtex
@misc{wav2vecbert2-filledPause,
author = { Rupnik, Peter and Ljubešić, Nikola and Porupski, Ivan and Verdonik, Darinka },
title = { wav2vecbert2-filledPause (Revision 5e75061) },
year = 2024,
url = { https://huggingface.co/classla/wav2vecbert2-filledPause },
doi = { 10.57967/hf/6732 },
publisher = { Hugging Face }
}
```
## Paper
```bibtex
@inproceedings{ljubesic-etal-2025-identifying,
title = "Identifying Filled Pauses in Speech Across South and {W}est {S}lavic Languages",
author = "Ljube{\v{s}}i{\'c}, Nikola and Porupski, Ivan and Rupnik, Peter",
editor = "Piskorski, Jakub and P{\v{r}}ib{\'a}{\v{n}}, Pavel and Nakov, Preslav and Yangarber, Roman and Marcinczuk, Michal",
booktitle = "Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.bsnlp-1.1/",
doi = "10.18653/v1/2025.bsnlp-1.1",
pages = "1--8",
ISBN = "978-1-959429-57-9",
abstract = "Filled pauses are among the most common paralinguistic features of speech, yet they are mainly omitted from transcripts. We propose a transformer-based approach for detecting filled pauses directly from the speech signal, fine-tuned on Slovenian and evaluated across South and West Slavic languages. Our results show that speech transformers achieve excellent performance in detecting filled pauses when evaluated in the in-language scenario. We further evaluate cross-lingual capabilities of the model on two closely related South Slavic languages (Croatian and Serbian) and two less closely related West Slavic languages (Czech and Polish). Our results reveal strong cross-lingual generalization capabilities of the model, with only minor performance drops. Moreover, error analysis reveals that the model outperforms human annotators in recall and F1 score, while trailing slightly in precision. In addition to evaluating the capabilities of speech transformers for filled pause detection across Slavic languages, we release new multilingual test datasets and make our fine-tuned model publicly available to support further research and applications in spoken language processing."
}
```
# Training data
The model was trained on the human-annotated Slovenian speech corpus
[ROG-Artur](http://hdl.handle.net/11356/1992). Recordings from the train split were segmented into
chunks of at most 30 s.
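
To make the 30 s limit concrete, here is a minimal sketch of a fixed-window split. It is an assumption for illustration only: the card does not say how chunk boundaries were chosen for training, the helper name `chunk_bounds` is hypothetical, and the 16 kHz sampling rate is taken from the usage example further down.

```python
def chunk_bounds(num_samples, sampling_rate=16_000, max_chunk_s=30.0):
    """Return (start, end) sample offsets of consecutive chunks.

    Every chunk is at most max_chunk_s seconds long; only the last one
    may be shorter. This is a naive fixed-window split (an assumption),
    not necessarily the segmentation used to train the model.
    """
    max_len = int(max_chunk_s * sampling_rate)
    return [
        (start, min(start + max_len, num_samples))
        for start in range(0, num_samples, max_len)
    ]


# 70 s of 16 kHz audio is split into chunks of 30 s, 30 s, and 10 s.
bounds = chunk_bounds(num_samples=16_000 * 70)
print(bounds)
```

A fixed-window cut can land in the middle of a filled pause, so predictions that touch a chunk edge deserve extra scrutiny.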
## Training Details
| hyperparameter | value |
| --------------------------- | ----- |
| learning rate | 3e-5 |
| effective batch size | 16 |
| num train epochs | 20 |
# Evaluation
Although the model outputs a sequence of 0s and 1s, one label per 20 ms frame,
the evaluation was done at the event level: spans of consecutive 1s were
merged into a single event. A true event and a predicted event that overlap,
even partially, count as a true positive.
We report precision, recall, and F1 score for the positive class.
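
The event-level scoring can be sketched as follows. This is a minimal illustration, not the evaluation script behind the reported numbers. The card only defines true positives, so the handling of the other cases is an assumption: a predicted event with no overlapping true event is counted as a false positive, and a true event with no overlapping prediction as a false negative. The names `overlaps` and `event_level_scores` are hypothetical.

```python
def overlaps(a, b):
    """True if intervals a=(start, end) and b=(start, end) share any time."""
    return a[0] < b[1] and b[0] < a[1]


def event_level_scores(true_events, pred_events):
    # Assumption: a true event is a true positive if any prediction overlaps it.
    tp = sum(any(overlaps(t, p) for p in pred_events) for t in true_events)
    # Assumption: true events without any overlapping prediction are misses.
    fn = len(true_events) - tp
    # Assumption: predictions without any overlapping true event are false alarms.
    fp = sum(not any(overlaps(p, t) for t in true_events) for p in pred_events)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example with intervals in seconds: two of three true events are found,
# and one of three predictions matches nothing.
true_events = [(0.50, 0.90), (2.00, 2.30), (4.10, 4.40)]
pred_events = [(0.60, 1.00), (3.00, 3.20), (4.00, 4.20)]
scores = event_level_scores(true_events, pred_events)
print(scores)
```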
## Evaluation on ROG corpus
Results on the ROG-Artur test split:
|lang| postprocessing | recall | precision | F1 |
|---|------:|---------:|------------:|------:|
|SL|none| 0.973 | 0.914 | 0.943 |
## Evaluation on ParlaSpeech corpora
<div style="border: 5px solid #ff6700; padding: 10px; margin: 10px 0;">
<strong>Notice:</strong> ParlaSpeech corpora are currently in the process of enrichment with new features. Follow our progress here: <a href="http://clarinsi.github.io/parlaspeech">http://clarinsi.github.io/parlaspeech</a>
</div>
For every language in the
[ParlaSpeech collection](https://huggingface.co/collections/classla/parlaspeech-670923f23ab185f413d40795),
400 instances were sampled and annotated by human annotators.
Because the ParlaSpeech corpora are too large to be segmented manually, as ROG was,
we observed a few failure modes during inference and found that
post-processing improves the results. False positives
were often caused by improper audio segmentation, so
discarding predictions that start at the very beginning of the audio or
end at its very end can be beneficial. Another failure mode
is the prediction of very short events, which
can be safely discarded.
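
The two filters can be sketched without pandas as follows. This is an illustrative sketch, not the code used for the reported results. The helper names `frames_to_events` and `postprocess` are hypothetical, and the 0.08 s cutoff is borrowed from the default `short_cutoff_s` of `frames_to_intervals` in the usage example at the end of this card.

```python
from itertools import groupby

FRAME_S = 0.02  # frame duration in seconds (20 ms per frame)


def frames_to_events(frames):
    """Group consecutive 1-frames into (start_s, end_s) events."""
    events, pos = [], 0
    for label, run in groupby(frames):
        n = len(list(run))
        if label == 1:
            events.append((round(pos * FRAME_S, 3), round((pos + n) * FRAME_S, 3)))
        pos += n
    return events


def postprocess(events, audio_len_s, min_dur_s=0.08):
    """Drop short events and events that touch either edge of the audio."""
    kept = []
    for start, end in events:
        too_short = round(end - start, 3) < min_dur_s
        at_start = start == 0.0
        at_end = abs(end - audio_len_s) < 1e-9
        if not (too_short or at_start or at_end):
            kept.append((start, end))
    return kept


# Toy frame labels: an event at the start, a one-frame blip,
# a 0.1 s event in the middle, and an event running to the end.
frames = [1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
events = frames_to_events(frames)
kept = postprocess(events, audio_len_s=round(len(frames) * FRAME_S, 3))
print(events)
print(kept)
```

In this toy input only the middle event survives: the first and last events touch the audio edges, and the one-frame blip is shorter than the cutoff.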
With added postprocessing, the model achieves the following metrics:
| lang | postprocessing | recall | precision | F1 |
|:-------|:-----------------------|---------:|------------:|------:|
| CZ | drop_short_initial_and_final | 0.889 | 0.859 | 0.874 |
| HR | drop_short_initial_and_final | 0.94 | 0.887 | 0.913 |
| PL | drop_short_initial_and_final | 0.903 | 0.947 | 0.924 |
| RS | drop_short_initial_and_final | 0.966 | 0.915 | 0.94 |
For details on the post-processing, see the function `frames_to_intervals` in the code snippet below.
# Example use:
```python
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification
from datasets import Dataset, Audio
import torch
import numpy as np

# Use the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "classla/wav2vecbert2-filledPause"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)

# Replace this path with your own audio file(s).
ds = Dataset.from_dict(
    {
        "audio": [
            "/cache/peterr/mezzanine_resources/filled_pauses/data/dev/Iriss-J-Gvecg-P500001-avd_2082.293_2112.194.wav"
        ],
    }
).cast_column("audio", Audio(sampling_rate=16_000, mono=True))


def frames_to_intervals(
    frames: list[int],
    drop_short=True,
    drop_initial=True,
    drop_final=True,
    short_cutoff_s=0.08,
) -> list[tuple[float, float]]:
    """Transforms a list of ones or zeros, corresponding to annotations on frame
    levels, to a list of intervals ([start second, end second]).

    Allows for additional filtering on duration (false positives are often
    short) and start times (false positives starting at 0.0 are often an
    artifact of poor segmentation).

    :param list[int] frames: Input frame labels
    :param bool drop_short: Drop everything shorter than short_cutoff_s,
        defaults to True
    :param bool drop_initial: Drop predictions starting at 0.0, defaults to True
    :param bool drop_final: Drop predictions ending at audio end, defaults to True
    :param float short_cutoff_s: Duration in seconds of shortest allowable
        prediction, defaults to 0.08
    :return list[tuple[float, float]]: List of intervals [start_s, end_s]
    """
    from itertools import pairwise  # requires Python 3.10+

    import pandas as pd

    results = []
    # One row per frame; time_s is the start time of the frame (20 ms per frame).
    ndf = pd.DataFrame(
        data={
            "time_s": [0.020 * i for i in range(len(frames))],
            "frames": frames,
        }
    )
    ndf = ndf.dropna()
    # Indices where the label changes. Index 0 is always included because
    # diff() yields NaN for the first row.
    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
    # Each pair of consecutive change points delimits one run of identical labels.
    # The run after the last change point has no closing index here, so a run
    # that extends to the end of the audio is never emitted.
    for si, ei in pairwise(indices_of_change):
        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
            pass
        else:
            results.append(
                (
                    round(ndf.loc[si, "time_s"], 3),
                    round(ndf.loc[ei, "time_s"], 3),
                )
            )
    if drop_short and (len(results) > 0):
        results = [i for i in results if (i[1] - i[0] >= short_cutoff_s)]
    if drop_initial and (len(results) > 0):
        results = [i for i in results if i[0] != 0.0]
    if drop_final and (len(results) > 0):
        results = [i for i in results if i[1] != 0.02 * len(frames)]
    return results


def evaluator(chunks):
    sampling_rate = chunks["audio"][0]["sampling_rate"]
    with torch.no_grad():
        inputs = feature_extractor(
            [i["array"] for i in chunks["audio"]],
            return_tensors="pt",
            sampling_rate=sampling_rate,
        ).to(device)
        logits = model(**inputs).logits
    y_pred = np.array(logits.cpu()).argmax(axis=-1)
    intervals = [frames_to_intervals(i) for i in y_pred]
    return {"y_pred": y_pred.tolist(), "intervals": intervals}


ds = ds.map(evaluator, batched=True)
print(ds["y_pred"][0])
# Prints a list of 20ms frames: [0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0....]
# with 0 indicating no filled pause detected in that frame
print(ds["intervals"][0])
# Prints the identified intervals as a list of [start_s, end_s]:
# [[0.08, 0.28 ], ...]
```