classla/wav2vecbert2-filledPause

Peter Rupnik committed on Mar 5

Commit 8941911 (1 parent: 1cdf067)

Add frames_to_intervals function with filtering

Files changed (1)

README.md +59 -9

@@ -22,13 +22,13 @@ te test split of the same dataset.
 # Evaluation
-Although the output of the model is a series 0 or 1, describing their  20ms frames, the evaluation was done on
-event level; spans of consecutive outputs 1 were bundled together into one event. When the true and predicted
 events partially overlap, this is counted as a true positive.
 ## Evaluation on ROG corpus
-In evaluation, we only evaluate positive events, i.e.
 ```
               precision    recall  f1-score   support
@@ -41,18 +41,18 @@ Evaluation on 800 human-annotated instances  ParlaSpeech-HR and ParlaSpeech-RS p
 ```
 Performance on RS:
-Classification report for human vs model on event level:
               precision    recall  f1-score   support
            1       0.95      0.99      0.97       542
 Performance on HR:
-Classification report for human vs model on event level:
               precision    recall  f1-score   support
            1       0.93      0.98      0.95       531
 ```
-The metrics reported are on event level, which means that if true and
-predicted filled pauses at least partially overlap, we count them as a
 True Positive event.
@@ -80,6 +80,51 @@ ds = Dataset.from_dict(
 ).cast_column("audio", Audio(sampling_rate=16_000, mono=True))
 def evaluator(chunks):
     sampling_rate = chunks["audio"][0]["sampling_rate"]
     with torch.no_grad():
@@ -90,13 +135,18 @@ def evaluator(chunks):
         ).to(device)
         logits = model(**inputs).logits
     y_pred = np.array(logits.cpu()).argmax(axis=-1)
-    return {"y_pred": y_pred.tolist()}
 ds = ds.map(evaluator, batched=True)
 print(ds["y_pred"][0])
-# Returns a list of 20ms frames: [0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,....]
 # with 0 indicating no filled pause detected in that frame
 ```

 # Evaluation
+Although the model outputs a series of 0s and 1s, one label per 20 ms frame, the evaluation was done on
+event level; spans of consecutive 1 outputs were bundled together into one event. When the true and predicted
 events partially overlap, this is counted as a true positive.
 ## Evaluation on ROG corpus
+In evaluation, we only evaluate positive events, i.e.
 ```
               precision    recall  f1-score   support
 ```
 Performance on RS:
+Classification report for human vs model on event level:
               precision    recall  f1-score   support
            1       0.95      0.99      0.97       542
 Performance on HR:
+Classification report for human vs model on event level:
               precision    recall  f1-score   support
            1       0.93      0.98      0.95       531
 ```
+The metrics reported are on event level, which means that if true and
+predicted filled pauses at least partially overlap, we count them as a
 True Positive event.
 ).cast_column("audio", Audio(sampling_rate=16_000, mono=True))
+def frames_to_intervals(
+    frames: list[int], drop_short=True, drop_initial=True, short_cutoff_s=0.08
+) -> list[tuple[float, float]]:
+    """Transforms a list of ones and zeros, corresponding to frame-level
+    annotations, into a list of intervals (start second, end second).
+    Allows for additional filtering on duration (false positives are often short)
+    and start times (false positives starting at 0.0 are often an artifact of
+    poor segmentation).
+    :param list[int] frames: Input frame labels
+    :param bool drop_short: Drop everything shorter than short_cutoff_s, defaults to True
+    :param bool drop_initial: Drop predictions starting at 0.0, defaults to True
+    :param float short_cutoff_s: Duration in seconds of shortest allowable prediction, defaults to 0.08
+    :return list[tuple[float, float]]: List of intervals (start_s, end_s)
+    """
+    from itertools import pairwise
+    import pandas as pd
+    results = []
+    ndf = pd.DataFrame(
+        data={
+            "time_s": [0.020 * i for i in range(len(frames))],
+            "frames": frames,
+        }
+    )
+    ndf = ndf.dropna()
+    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
+    for si, ei in pairwise(indices_of_change):
+        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
+            pass
+        else:
+            results.append(
+                (
+                    round(ndf.loc[si, "time_s"], 3),
+                    round(ndf.loc[ei - 1, "time_s"], 3),
+                )
+            )
+    if drop_short and (len(results) > 0):
+        results = [i for i in results if (i[1] - i[0] >= short_cutoff_s)]
+    if drop_initial and (len(results) > 0):
+        results = [i for i in results if i[0] != 0.0]
+    return results
 def evaluator(chunks):
     sampling_rate = chunks["audio"][0]["sampling_rate"]
     with torch.no_grad():
         ).to(device)
         logits = model(**inputs).logits
     y_pred = np.array(logits.cpu()).argmax(axis=-1)
+    intervals = [frames_to_intervals(i) for i in y_pred]
+    return {"y_pred": y_pred.tolist(), "intervals": intervals}
 ds = ds.map(evaluator, batched=True)
 print(ds["y_pred"][0])
+# Prints a list of 20ms frames: [0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0....]
 # with 0 indicating no filled pause detected in that frame
+print(ds["intervals"][0])
+# Prints the identified intervals as a list of [start_s, end_s]:
+# [[0.08, 0.28], ...]
 ```
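
For readers who want to check the interval logic without pandas, the following is a minimal sketch, not the code from this commit. It assumes 20 ms frames as stated in the README and keeps the same end-time convention as `frames_to_intervals` (the end of an interval is the start time of its last positive frame). The names `runs_to_intervals` and `count_event_hits` are hypothetical, and treating intervals that merely touch as overlapping is an assumption; the README only says that partially overlapping events count as true positives. Reading the committed function, it appears that a run of 1s reaching the final frame is not emitted, because `pairwise` only yields segments that are closed by a later change; the sketch below does emit such a run, so the two can differ on trailing events. The sketch also applies no duration or start-time filtering.

```python
from itertools import groupby

FRAME_S = 0.02  # frame length in seconds (20 ms, as stated in the README)


def runs_to_intervals(frames, frame_s=FRAME_S):
    """Return (start_s, end_s) for every run of consecutive 1s.

    end_s is the start time of the last frame in the run, which mirrors
    the convention used by frames_to_intervals in this commit.
    """
    intervals = []
    position = 0  # index of the first frame of the current run
    for label, group in groupby(frames):
        length = sum(1 for _ in group)
        if label == 1:
            start = round(position * frame_s, 3)
            end = round((position + length - 1) * frame_s, 3)
            intervals.append((start, end))
        position += length
    return intervals


def count_event_hits(true_events, predicted_events):
    """Count predicted events that overlap at least one true event.

    Assumption: closed intervals, so events that only touch at an
    endpoint are treated as overlapping.
    """
    return sum(
        any(p_start <= t_end and t_start <= p_end for t_start, t_end in true_events)
        for p_start, p_end in predicted_events
    )


if __name__ == "__main__":
    frames = [0, 0, 0, 0] + [1] * 11 + [0]
    predicted = runs_to_intervals(frames)
    print(predicted)  # [(0.08, 0.28)]
    print(count_event_hits([(0.1, 0.3)], predicted))  # 1
```

Using `itertools.groupby` avoids the Python 3.10 requirement of `itertools.pairwise`; whether trailing runs should be kept or dropped is a design choice that should match how the evaluation figures above were produced.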