# Frame classification for filled pauses

This model classifies individual 20ms frames of audio based on the presence of filled pauses ("eee", "errm", ...).

# Training data

The model was trained on the human-annotated Slovenian speech corpus [ROG-Artur](http://hdl.handle.net/11356/1992). Recordings from the train split were segmented into at most 30s long chunks.
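The 30s segmentation amounts to plain slicing of the sample array. A minimal sketch (the function name and the 16 kHz sampling rate are illustrative assumptions, not taken from the training pipeline):

```python
# Minimal sketch: split a recording into consecutive chunks of at most
# `max_s` seconds. `sr` (sampling rate) is an assumed default.

def chunk_audio(samples, sr=16_000, max_s=30):
    step = max_s * sr  # maximum chunk length in samples
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# A 70 s recording yields chunks of 30 s, 30 s, and 10 s.
```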

# Evaluation

Although the output of the model is a series of 0s and 1s describing the 20ms frames, the evaluation was done on the event level: spans of consecutive 1 outputs were bundled together into one event. When a true and a predicted event partially overlap, this is counted as a true positive. We report the precision, recall, and F1-score of the positive class.
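The bundling and overlap matching described above can be sketched as follows (a minimal illustration; the function names are ours, not the actual evaluation code):

```python
# Illustrative sketch of event-level evaluation: frame labels are 0/1 per
# 20 ms frame; runs of consecutive 1s become (start, end) events.

def frames_to_events(frames):
    """Bundle runs of consecutive 1s into (start, end) events, end exclusive."""
    events, start = [], None
    for i, f in enumerate(frames):
        if f == 1 and start is None:
            start = i
        elif f == 0 and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(frames)))
    return events

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def event_prf(true_frames, pred_frames):
    true_ev = frames_to_events(true_frames)
    pred_ev = frames_to_events(pred_frames)
    # A predicted event that partially overlaps any true event is a true positive.
    tp_pred = sum(any(overlaps(p, t) for t in true_ev) for p in pred_ev)
    tp_true = sum(any(overlaps(t, p) for p in pred_ev) for t in true_ev)
    precision = tp_pred / len(pred_ev) if pred_ev else 0.0
    recall = tp_true / len(true_ev) if true_ev else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One true event [2, 5); predictions [4, 6) (overlaps -> TP) and [7, 8) (FP):
# precision 0.5, recall 1.0.
p, r, f1 = event_prf([0, 0, 1, 1, 1, 0, 0, 0], [0, 0, 0, 0, 1, 1, 0, 1])
```

As a sanity check on the ROG numbers, 2 · 0.955 · 0.981 / (0.955 + 0.981) ≈ 0.968.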
## Evaluation on ROG corpus

The model was evaluated on the test split of the ROG-Artur corpus.

| postprocessing | recall | precision | F1 |
|---------------:|-------:|----------:|------:|
| none | 0.981 | 0.955 | 0.968 |
## Evaluation on ParlaSpeech corpora

For every language in the [ParlaSpeech collection](https://huggingface.co/collections/classla/parlaspeech-670923f23ab185f413d40795), 400 instances were sampled and annotated by human annotators.

Since the ParlaSpeech corpora are too big to be manually segmented the way ROG is, we observed a few failure modes at inference time and found that post-processing can be used to improve results. False positives were often caused by improper audio segmentation, which is why discarding predictions that start at the very beginning or end at the very end of the audio can be beneficial. Another failure mode is predicting very short events, which is why very short predictions can be safely discarded.
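The two heuristics can be sketched as an event filter (the minimum-length threshold is an illustrative assumption; the exact value is not stated here):

```python
# Sketch of the post-processing described above. Events are (start, end)
# frame spans, end exclusive; `min_frames` is an assumed threshold.

def postprocess(events, n_frames, min_frames=3):
    kept = []
    for start, end in events:
        if start == 0 or end == n_frames:
            continue  # touches an audio boundary: likely a segmentation artefact
        if end - start < min_frames:
            continue  # very short events can be safely discarded
        kept.append((start, end))
    return kept

# In a 100-frame clip, events touching frame 0 or frame 100 and a 1-frame
# event are dropped; a well-formed interior event is kept.
filtered = postprocess([(0, 5), (10, 11), (20, 30), (95, 100)], n_frames=100)
```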

With added postprocessing, the model achieves the following metrics: