classla/wav2vecbert2-filledPause

Audio Classification

Safetensors

wav2vec2-bert

5roop committed on Apr 10

Commit

424f11f

1 Parent(s): 7b97465

Update README.md

README.md +11 -33

@@ -28,21 +28,13 @@ Although the output of the model is a series 0 or 1, describing their  20ms fram
 event level; spans of consecutive outputs 1 were bundled together into one event. When the true and predicted
 events partially overlap, this is counted as a true positive. We report precisions, recalls, and f1-scores of the positive class.
-We observed several failure modes of the automatic inference process and designed post-processing steps to mitigate them.
-False positives were observed to be caused by improper audio segmentation, which is why discarding predictions that start at the start of the audio or
-end at the end of the audio can be beneficial. Another failure mode is predicting very short events, which is why very short predictions
-can be safely discarded.
 ## Evaluation on ROG corpus
-| postprocessing                    |   recall |   precision |    F1 |
-|:-----------------------|---------:|------------:|------:|
-| raw                    |    0.981 |       0.955 | 0.968 |
-| drop_short             |    0.981 |       0.957 | 0.969 |
-| drop_short_initial_and_final  |    0.964 |       0.966 | 0.965 |
-| drop_short_and_initial |    0.964 |       0.966 | 0.965 |
-| drop_initial           |    0.964 |       0.963 | 0.963 |
 ## Evaluation on ParlaSpeech corpora
@@ -50,35 +42,21 @@ can be safely discarded.
 For every language in the [ParlaSpeech collection](https://huggingface.co/collections/classla/parlaspeech-670923f23ab185f413d40795),
 400 instances were sampled and annotated by human annotators.
-Evaluation on human-annotated instances  produced the following metrics:
 | lang   | postprocessing         |   recall |   precision |    F1 |
 |:-------|:-----------------------|---------:|------------:|------:|
 | CZ     | drop_short_initial_and_final  |    0.889 |       0.859 | 0.874 |
-| CZ     | drop_short_and_initial |    0.889 |       0.859 | 0.874 |
-| CZ     | drop_short             |    0.905 |       0.833 | 0.868 |
-| CZ     | drop_initial           |    0.889 |       0.846 | 0.867 |
-| CZ     | raw                    |    0.905 |       0.814 | 0.857 |
 | HR     | drop_short_initial_and_final  |    0.94  |       0.887 | 0.913 |
-| HR     | drop_short_and_initial |    0.94  |       0.887 | 0.913 |
-| HR     | drop_short             |    0.94  |       0.884 | 0.911 |
-| HR     | drop_initial           |    0.94  |       0.875 | 0.906 |
-| HR     | raw                    |    0.94  |       0.872 | 0.905 |
-| PL     | drop_short             |    0.906 |       0.947 | 0.926 |
 | PL     | drop_short_initial_and_final  |    0.903 |       0.947 | 0.924 |
-| PL     | drop_short_and_initial |    0.903 |       0.947 | 0.924 |
-| PL     | raw                    |    0.91  |       0.924 | 0.917 |
-| PL     | drop_initial           |    0.908 |       0.924 | 0.916 |
-| RS     | drop_short             |    0.966 |       0.915 | 0.94  |
 | RS     | drop_short_initial_and_final  |    0.966 |       0.915 | 0.94  |
-| RS     | drop_short_and_initial |    0.966 |       0.915 | 0.94  |
-| RS     | drop_initial           |    0.974 |       0.9   | 0.936 |
-| RS     | raw                    |    0.974 |       0.9   | 0.936 |
-The metrics reported are on event level, which means that if true and
-predicted filled pauses at least partially overlap, we count them as a
-True Positive event.
@@ -109,7 +87,7 @@ def frames_to_intervals(
     frames: list[int],
     drop_short=True,
     drop_initial=True,
-    drop_final=False,
     short_cutoff_s=0.08,
 ) -> list[tuple[float]]:
     """Transforms a list of ones or zeros, corresponding to annotations on frame

 event level; spans of consecutive outputs 1 were bundled together into one event. When the true and predicted
 events partially overlap, this is counted as a true positive. We report precisions, recalls, and f1-scores of the positive class.
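
The event-level scoring described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names are hypothetical, and it assumes that a predicted event counts toward precision if it overlaps any true event, and that a true event counts toward recall if it overlaps any predicted event. The README does not say how one-to-many overlaps are handled, so that part is an assumption.

```python
def frames_to_events(frames):
    """Bundle runs of consecutive 1s into (start, end) frame spans, end exclusive."""
    events = []
    start = None
    for i, value in enumerate(frames):
        if value == 1 and start is None:
            start = i  # a new event begins
        elif value != 1 and start is not None:
            events.append((start, i))  # the event ended on the previous frame
            start = None
    if start is not None:
        events.append((start, len(frames)))  # event runs to the end of the audio
    return events


def overlaps(a, b):
    """True if two half-open spans share at least one frame."""
    return a[0] < b[1] and b[0] < a[1]


def event_level_scores(true_frames, pred_frames):
    """Return (precision, recall, f1) of the positive class on event level."""
    true_events = frames_to_events(true_frames)
    pred_events = frames_to_events(pred_frames)
    # Assumption: any partial overlap is enough for a match.
    hit_preds = sum(any(overlaps(p, t) for t in true_events) for p in pred_events)
    hit_trues = sum(any(overlaps(t, p) for p in pred_events) for t in true_events)
    precision = hit_preds / len(pred_events) if pred_events else 0.0
    recall = hit_trues / len(true_events) if true_events else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    y_true = [0, 1, 1, 0, 0, 1, 1, 0, 0, 0]
    y_pred = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1]
    print(event_level_scores(y_true, y_pred))
```

In this toy input, one of the two predicted events overlaps a true event and one of the two true events is found, so both precision and recall are 0.5.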
 ## Evaluation on ROG corpus
+|   recall |   precision |    F1 |
+|---------:|------------:|------:|
+|    0.981 |       0.955 | 0.968 |
 ## Evaluation on ParlaSpeech corpora
 For every language in the [ParlaSpeech collection](https://huggingface.co/collections/classla/parlaspeech-670923f23ab185f413d40795),
 400 instances were sampled and annotated by human annotators.
+Because the ParlaSpeech corpora are too large to be segmented manually, as ROG was, we observed a few failure modes during inference, and found that post-processing
+improves the results. False positives were observed to be caused by improper audio segmentation, which is why discarding predictions that start at the start of the audio or
+end at the end of the audio can be beneficial. Another failure mode is predicting very short events, which is why very short predictions
+can be safely discarded.
+With added postprocessing, the model achieves the following metrics:
 | lang   | postprocessing         |   recall |   precision |    F1 |
 |:-------|:-----------------------|---------:|------------:|------:|
 | CZ     | drop_short_initial_and_final  |    0.889 |       0.859 | 0.874 |
 | HR     | drop_short_initial_and_final  |    0.94  |       0.887 | 0.913 |
 | PL     | drop_short_initial_and_final  |    0.903 |       0.947 | 0.924 |
 | RS     | drop_short_initial_and_final  |    0.966 |       0.915 | 0.94  |
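
As a rough illustration of the `drop_short_initial_and_final` post-processing named in the table, the sketch below filters frame-level predictions before converting them to time intervals. It is not the `frames_to_intervals` function from the README: the name `sketch_postprocess` is hypothetical, the 20 ms frame duration is taken from the model description and passed in as an assumed default, and treating the cutoff as a strict "shorter than" comparison is also an assumption.

```python
def sketch_postprocess(
    frames,
    drop_short=True,
    drop_initial=True,
    drop_final=True,
    short_cutoff_s=0.08,
    frame_s=0.02,  # assumed frame duration (20 ms)
):
    """Turn a 0/1 frame sequence into (start_s, end_s) intervals, with filtering."""
    # Collect runs of consecutive 1s as (start, end) frame indices, end exclusive.
    runs = []
    start = None
    for i, value in enumerate(frames):
        if value == 1 and start is None:
            start = i
        elif value != 1 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(frames)))

    intervals = []
    for begin, end in runs:
        if drop_initial and begin == 0:
            continue  # starts at the start of the audio: likely a segmentation artifact
        if drop_final and end == len(frames):
            continue  # ends at the end of the audio: likely a segmentation artifact
        if drop_short and (end - begin) * frame_s < short_cutoff_s:
            continue  # shorter than the cutoff
        intervals.append((round(begin * frame_s, 3), round(end * frame_s, 3)))
    return intervals


if __name__ == "__main__":
    preds = [1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1]
    print(sketch_postprocess(preds))
```

With all three filters enabled, only the five-frame event in the middle of the toy input survives; the initial event, the two-frame event, and the event touching the end of the audio are discarded.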
     frames: list[int],
     drop_short=True,
     drop_initial=True,
+    drop_final=True,
     short_cutoff_s=0.08,
 ) -> list[tuple[float]]:
     """Transforms a list of ones or zeros, corresponding to annotations on frame