---
license: apache-2.0
language:
- sl
- hr
- sr
- cs
- pl
base_model:
- facebook/w2v-bert-2.0
pipeline_tag: audio-classification
metrics:
- f1
- recall
- precision
---


# Frame classification for filled pauses


## Model Details

This model classifies individual 20 ms frames of audio based on the 
presence of filled pauses ("eee", "errm", ...).

### Model Description



- **Developed by:** Peter Rupnik, Nikola Ljubešić, Ivan Porupski, Darinka Verdonik
- **Funded by:** MEZZANINE project
- **Model type:** Wav2Vec2Bert for Audio Frame Classification
- **Language(s) (NLP):** Trained and tested on Slovenian [ROG-Artur](http://hdl.handle.net/11356/1992), evaluated also on Croatian, Serbian, Polish, and Czech samples from the [ParlaSpeech corpus](http://clarinsi.github.io/parlaspeech)
- **Finetuned from model:** facebook/w2v-bert-2.0

## Model reference
If you wish to cite this model, use
```bibtex
@misc{wav2vecbert2-filledPause,
	author       = { Rupnik, Peter and Ljubešić, Nikola and Porupski, Ivan and Verdonik, Darinka },
	title        = { wav2vecbert2-filledPause (Revision 5e75061) },
	year         = 2024,
	url          = { https://huggingface.co/classla/wav2vecbert2-filledPause },
	doi          = { 10.57967/hf/6732 },
	publisher    = { Hugging Face }
}
```

## Paper
```bibtex
@inproceedings{ljubesic-etal-2025-identifying,
    title = "Identifying Filled Pauses in Speech Across South and {W}est {S}lavic Languages",
    author = "Ljube{\v{s}}i{\'c}, Nikola  and Porupski, Ivan  and Rupnik, Peter",
    editor = "Piskorski, Jakub  and P{\v{r}}ib{\'a}{\v{n}}, Pavel  and Nakov, Preslav  and Yangarber, Roman  and Marcinczuk, Michal",
    booktitle = "Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.bsnlp-1.1/",
    doi = "10.18653/v1/2025.bsnlp-1.1",
    pages = "1--8",
    ISBN = "978-1-959429-57-9",
    abstract = "Filled pauses are among the most common paralinguistic features of speech, yet they are mainly omitted from transcripts. We propose a transformer-based approach for detecting filled pauses directly from the speech signal, fine-tuned on Slovenian and evaluated across South and West Slavic languages. Our results show that speech transformers achieve excellent performance in detecting filled pauses when evaluated in the in-language scenario. We further evaluate cross-lingual capabilities of the model on two closely related South Slavic languages (Croatian and Serbian) and two less closely related West Slavic languages (Czech and Polish). Our results reveal strong cross-lingual generalization capabilities of the model, with only minor performance drops. Moreover, error analysis reveals that the model outperforms human annotators in recall and F1 score, while trailing slightly in precision. In addition to evaluating the capabilities of speech transformers for filled pause detection across Slavic languages, we release new multilingual test datasets and make our fine-tuned model publicly available to support further research and applications in spoken language processing."
}
```




# Training data

The model was trained on the human-annotated Slovenian speech corpus 
[ROG-Artur](http://hdl.handle.net/11356/1992). Recordings from the train split were segmented into 
chunks of at most 30 s. 
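
The segmentation script itself is not part of this card; the snippet below is only a minimal sketch of cutting a long mono recording into chunks of at most 30 s (the file name and the use of `soundfile` are assumptions, not the original preprocessing):

```python
import soundfile as sf

MAX_CHUNK_S = 30

# Load a (hypothetical) long recording; soundfile returns the waveform and its sampling rate.
audio, sr = sf.read("long_recording.wav")
chunk_len = MAX_CHUNK_S * sr

# Slice the waveform into consecutive chunks of at most 30 s.
chunks = [audio[i : i + chunk_len] for i in range(0, len(audio), chunk_len)]
```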

## Training Details

| hyperparameter              | value |
| --------------------------- | ----- |
| learning rate               | 3e-5  |
| effective batch size        | 16    |
| num train epochs            | 20    |
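
These hyperparameters map directly onto 🤗 `TrainingArguments`; the sketch below only shows how such a configuration could look and is not the original training script (the output directory and all omitted arguments are assumptions):

```python
from transformers import TrainingArguments

# Sketch mirroring the table above; every argument not listed in the table is
# left at its default and does not come from the original training setup.
training_args = TrainingArguments(
    output_dir="wav2vecbert2-filledPause",  # hypothetical output path
    learning_rate=3e-5,
    per_device_train_batch_size=16,  # effective batch size of 16
    num_train_epochs=20,
)
```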



# Evaluation

Although the model outputs a sequence of 0s and 1s, one label per 20 ms frame, 
the evaluation was done at the event level: spans of consecutive 1s were 
bundled together into one event. When a true and a predicted
event partially overlap, this is counted as a true positive. 
We report precision, recall, and F1-score of the positive class.
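
As an illustration of this matching scheme (our own sketch, not the original evaluation code), any overlap between a true and a predicted interval counts as a match:

```python
def overlaps(a, b):
    # Two [start_s, end_s] intervals overlap if neither ends before the other starts.
    return a[0] < b[1] and b[0] < a[1]


def event_level_prf(true_events, pred_events):
    # A predicted event is a true positive if it overlaps any true event;
    # a true event counts as recalled if any prediction overlaps it.
    tp = sum(any(overlaps(p, t) for t in true_events) for p in pred_events)
    matched_true = sum(any(overlaps(t, p) for p in pred_events) for t in true_events)
    precision = tp / len(pred_events) if pred_events else 0.0
    recall = matched_true / len(true_events) if true_events else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

For example, `event_level_prf([(0.5, 0.9)], [(0.7, 1.1)])` returns `(1.0, 1.0, 1.0)`, because the two intervals partially overlap.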

## Evaluation on ROG corpus

Results for the ROG-Artur test split:


|lang| postprocessing |   recall |   precision |    F1 |
|---|------:|---------:|------------:|------:|
|SL|none|   0.973  | 0.914     | 0.943 |


## Evaluation on ParlaSpeech corpora

<div style="border: 5px solid #ff6700;  padding: 10px; margin: 10px 0;">
  <strong>Notice:</strong> ParlaSpeech corpora are currently in the process of enrichment with new features. Follow our progress here: <a href="http://clarinsi.github.io/parlaspeech">http://clarinsi.github.io/parlaspeech</a>
</div>

For every language in the 
[ParlaSpeech collection](https://huggingface.co/collections/classla/parlaspeech-670923f23ab185f413d40795), 
400 instances were sampled and annotated by human annotators. 


Since the ParlaSpeech corpora are too big to be manually segmented as ROG is,
we observed a few failure modes at inference time and found that
post-processing improves the results. False positives were often caused by
improper audio segmentation, so dropping predictions that start at the very
beginning or end at the very end of the audio is beneficial. Another failure
mode is predicting very short events, so very short predictions can be safely
discarded as well.

With added postprocessing, the model achieves the following metrics:


| lang   | postprocessing         |   recall |   precision |    F1 |
|:-------|:-----------------------|---------:|------------:|------:|
| CZ     | drop_short_initial_and_final  |    0.889 |       0.859 | 0.874 |
| HR     | drop_short_initial_and_final  |    0.94  |       0.887 | 0.913 |
| PL     | drop_short_initial_and_final  |    0.903 |       0.947 | 0.924 |
| RS     | drop_short_initial_and_final  |    0.966 |       0.915 | 0.94  |

For details on the postprocessing, see the function `frames_to_intervals` in the code snippet below.

# Example use:
```python

from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification
from datasets import Dataset, Audio
import torch
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU if no GPU is available
model_name = "classla/wav2vecbert2-filledPause"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)

ds = Dataset.from_dict(
    {
        "audio": [
            "/cache/peterr/mezzanine_resources/filled_pauses/data/dev/Iriss-J-Gvecg-P500001-avd_2082.293_2112.194.wav"
        ],
    }
).cast_column("audio", Audio(sampling_rate=16_000, mono=True))


def frames_to_intervals(
    frames: list[int],
    drop_short=True,
    drop_initial=True,
    drop_final=True,
    short_cutoff_s=0.08,
) -> list[tuple[float]]:
    """Transforms a list of ones or zeros, corresponding to annotations on frame
    levels, to a list of intervals ([start second, end second]).

    Allows for additional filtering on duration (false positives are often
    short) and start times (false positives starting at 0.0 are often an
    artifact of poor segmentation).

    :param list[int] frames: Input frame labels
    :param bool drop_short: Drop everything shorter than short_cutoff_s,
        defaults to True
    :param bool drop_initial: Drop predictions starting at 0.0, defaults to True
    :param bool drop_final: Drop predictions ending at audio end, defaults to True
    :param float short_cutoff_s: Duration in seconds of shortest allowable
        prediction, defaults to 0.08

    :return list[tuple[float]]: List of intervals [start_s, end_s]
    """
    from itertools import pairwise
    import pandas as pd

    results = []
    ndf = pd.DataFrame(
        data={
            "time_s": [0.020 * i for i in range(len(frames))],
            "frames": frames,
        }
    )
    ndf = ndf.dropna()
    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
    for si, ei in pairwise(indices_of_change):
        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
            pass
        else:
            results.append(
                (
                    round(ndf.loc[si, "time_s"], 3),
                    round(ndf.loc[ei, "time_s"], 3),
                )
            )
    if drop_short and (len(results) > 0):
        results = [i for i in results if (i[1] - i[0] >= short_cutoff_s)]
    if drop_initial and (len(results) > 0):
        results = [i for i in results if i[0] != 0.0]
    if drop_final and (len(results) > 0):
        results = [i for i in results if i[1] != 0.02 * len(frames)]
    return results


def evaluator(chunks):
    sampling_rate = chunks["audio"][0]["sampling_rate"]
    with torch.no_grad():
        inputs = feature_extractor(
            [i["array"] for i in chunks["audio"]],
            return_tensors="pt",
            sampling_rate=sampling_rate,
        ).to(device)
        logits = model(**inputs).logits
    y_pred = np.array(logits.cpu()).argmax(axis=-1)
    intervals = [frames_to_intervals(i) for i in y_pred]
    return {"y_pred": y_pred.tolist(), "intervals": intervals}


ds = ds.map(evaluator, batched=True)
print(ds["y_pred"][0])
# Prints a list of 20ms frames: [0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0....]
# with 0 indicating no filled pause detected in that frame

print(ds["intervals"][0])
# Prints the identified intervals as a list of [start_s, ends_s]:
# [[0.08, 0.28 ], ...]
```