# Frame classification for filled pauses

This model classifies individual 20ms frames of audio based on the presence of filled pauses ("eee", "errm", ...).

# Training data

The model was trained on the human-annotated Slovenian speech corpus [ROG-Artur](http://hdl.handle.net/11356/1992). Recordings from the train split were segmented into at most 30s long chunks.
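The 30s segmentation amounts to plain slicing of the sample array. A minimal sketch (the function name and the 16 kHz sampling rate are illustrative assumptions, not taken from the training pipeline):

```python
# Minimal sketch: split a recording into consecutive chunks of at most
# `max_s` seconds. `sr` (sampling rate) is an assumed default.

def chunk_audio(samples, sr=16_000, max_s=30):
    step = max_s * sr  # maximum chunk length in samples
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# A 70 s recording yields chunks of 30 s, 30 s, and 10 s.
```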

# Evaluation

Although the output of the model is a series of 0s and 1s describing the 20ms frames, the evaluation was done on the event level: spans of consecutive 1 outputs were bundled together into one event. When a true and a predicted event partially overlap, this is counted as a true positive. We report the precision, recall, and F1-score of the positive class.
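The bundling and overlap matching described above can be sketched as follows (a minimal illustration; the function names are ours, not the actual evaluation code):

```python
# Illustrative sketch of event-level evaluation: frame labels are 0/1 per
# 20 ms frame; runs of consecutive 1s become (start, end) events.

def frames_to_events(frames):
    """Bundle runs of consecutive 1s into (start, end) events, end exclusive."""
    events, start = [], None
    for i, f in enumerate(frames):
        if f == 1 and start is None:
            start = i
        elif f == 0 and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(frames)))
    return events

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def event_prf(true_frames, pred_frames):
    true_ev = frames_to_events(true_frames)
    pred_ev = frames_to_events(pred_frames)
    # A predicted event that partially overlaps any true event is a true positive.
    tp_pred = sum(any(overlaps(p, t) for t in true_ev) for p in pred_ev)
    tp_true = sum(any(overlaps(t, p) for p in pred_ev) for t in true_ev)
    precision = tp_pred / len(pred_ev) if pred_ev else 0.0
    recall = tp_true / len(true_ev) if true_ev else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One true event [2, 5); predictions [4, 6) (overlaps -> TP) and [7, 8) (FP):
# precision 0.5, recall 1.0.
p, r, f1 = event_prf([0, 0, 1, 1, 1, 0, 0, 0], [0, 0, 0, 0, 1, 1, 0, 1])
```

As a sanity check on the ROG numbers, 2 · 0.955 · 0.981 / (0.955 + 0.981) ≈ 0.968.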
## Evaluation on ROG corpus

The model was evaluated on the test split of the ROG-Artur corpus.

| postprocessing | recall | precision | F1 |
|---------------:|-------:|----------:|------:|
| none | 0.981 | 0.955 | 0.968 |
## Evaluation on ParlaSpeech corpora

For every language in the [ParlaSpeech collection](https://huggingface.co/collections/classla/parlaspeech-670923f23ab185f413d40795), 400 instances were sampled and annotated by human annotators.

Since the ParlaSpeech corpora are too big to be manually segmented the way ROG is, we observed a few failure modes at inference time and found that post-processing can be used to improve results. False positives were often caused by improper audio segmentation, which is why discarding predictions that start at the very beginning or end at the very end of the audio can be beneficial. Another failure mode is predicting very short events, which is why very short predictions can be safely discarded.
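The two heuristics can be sketched as an event filter (the minimum-length threshold is an illustrative assumption; the exact value is not stated here):

```python
# Sketch of the post-processing described above. Events are (start, end)
# frame spans, end exclusive; `min_frames` is an assumed threshold.

def postprocess(events, n_frames, min_frames=3):
    kept = []
    for start, end in events:
        if start == 0 or end == n_frames:
            continue  # touches an audio boundary: likely a segmentation artefact
        if end - start < min_frames:
            continue  # very short events can be safely discarded
        kept.append((start, end))
    return kept

# In a 100-frame clip, events touching frame 0 or frame 100 and a 1-frame
# event are dropped; a well-formed interior event is kept.
filtered = postprocess([(0, 5), (10, 11), (20, 30), (95, 100)], n_frames=100)
```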

With added postprocessing, the model achieves the following metrics: