variants
README.md
CHANGED
@@ -60,15 +60,16 @@ the [VoxCeleb1-clean test set](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/me
 (EER, lower value denotes a better identification, random prediction leads to a value of 50%) and the associated threshold.
 This value can be interpreted as the ability to identify speakers only with non-timbral cues. Tests between two utterances leading to a cosine similarity above the threshold should be considered as similar in terms of prosodic cues.
 
+A discussion about this interpretation can be
+found in the paper mentioned hereafter, as well as other experiments showing correlations between these embeddings and non-timbral voice attributes.
+
 The table below provides the EER and threshold of the different [variants](#variants) of this model.
 
 | Variant name| EER (%) | threshold |
 | --- | --- | --- |
 | W-PRO | 10.68 | 0.467 |
 | WNTA128 | 5.00 | 0.282 |
-
-A discussion about this interpretation can be
-found in the paper mentioned hereafter, as well as other experiments showing correlations between these embeddings and non-timbral voice attributes.
+| WNTA64 | 5.13 | 0.332 |
 
 Please note that the EER value can vary a little depending on the `max_size` defined to reduce long audios (max 30 seconds in our case).
 
@@ -112,6 +113,7 @@ The table below provides a short description of the variants and their performan
 | --- | --- | --- | --- |
 | W-PRO | main | baseline, description in paper | 250 |
 | WNTA128 | wnta128 | enriched training dataset, more conversions | 128 |
+| WNTA64 | wnta64 | enriched training dataset, more conversions | 64 |
 
 # License
 
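
The README text in this diff says that two utterances whose cosine similarity exceeds a variant's threshold should be treated as similar in prosodic cues. A minimal sketch of that decision rule follows. It assumes the embeddings have already been extracted and are available as plain lists of floats. The names `EER_THRESHOLDS`, `cosine_similarity`, and `similar_prosody` are hypothetical and not part of the model's API, and the vectors in the usage note are toy values, not real embeddings.

```python
import math

# Thresholds copied from the EER table in the README diff.
# The dictionary itself is an illustrative helper, not part of the model.
EER_THRESHOLDS = {"W-PRO": 0.467, "WNTA128": 0.282, "WNTA64": 0.332}


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def similar_prosody(emb_a, emb_b, variant):
    """True when the similarity is above the threshold listed for `variant`."""
    return cosine_similarity(emb_a, emb_b) > EER_THRESHOLDS[variant]
```

For example, `similar_prosody([1.0, 0.0], [1.0, 1.0], "W-PRO")` compares a similarity of about 0.707 against 0.467. As the README notes, the EER can shift slightly with `max_size`, so the thresholds are best read as indicative rather than exact.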