Update README.md

README.md

@@ -27,6 +27,7 @@ This model is a Hebrew finetune (continued training) of the OpenAI Whisper Large

- **Language(s) (NLP):** Hebrew
- **License:** Apache-2.0
- **Finetuned from model:** openai/whisper-large-v3-turbo
- **Training Date:** Apr 2025

## Bias, Risks, and Limitations

@@ -40,7 +41,7 @@ Additionally, the translation task was not trained and also degraded. This model

Please follow the original [model card](https://huggingface.co/openai/whisper-large-v3-turbo#usage) for usage details, replacing the model name with this one.
You can also find other weight formats and quantizations on the [ivrit.ai](https://huggingface.co/ivrit-ai) HF page.
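
As a quick start, a minimal sketch using the Hugging Face `transformers` ASR pipeline could look like the following (the model id below is a placeholder - substitute the actual repo name of this model):

```python
import torch
from transformers import pipeline

# Placeholder repo id - replace with this model's actual name on the ivrit-ai HF page.
MODEL_ID = "ivrit-ai/whisper-large-v3-turbo"

asr = pipeline(
    "automatic-speech-recognition",
    model=MODEL_ID,
    torch_dtype=torch.float16,
    device="cuda:0",  # or "cpu" (drop torch_dtype on CPU)
)

# Long-form audio is chunked automatically; timestamps are optional.
result = asr(
    "hebrew_sample.wav",
    return_timestamps=True,
    generate_kwargs={"language": "hebrew", "task": "transcribe"},
)
print(result["text"])
```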

We created some simple example scripts that use this model's weights with other inference runtimes.
Find those in the ["examples"](https://github.com/ivrit-ai/asr-training/tree/master/examples) folder within the training GitHub repo.

## Training Details

@@ -49,13 +50,19 @@ Find those in the ["examples"](https://github.com/ivrit-ai/asr-training/tree/mas

This model was trained on the following datasets:

- [ivrit-ai/crowd-transcribe-v5](https://huggingface.co/datasets/ivrit-ai/crowd-transcribe-v5) - Publicly accessible audio sources that have been crowd-transcribed segment-by-segment - ~300h
- [ivrit-ai/crowd-recital-whisper-training](https://huggingface.co/datasets/ivrit-ai/crowd-recital-whisper-training) - Crowd-sourced recordings of Wikipedia article snippets - ~50h
- [ivrit-ai/knesset-plenums-whisper-training](https://huggingface.co/datasets/ivrit-ai/knesset-plenums-whisper-training) - A subset of Knesset (the Israeli house of representatives) plenum protocols - ~4700h

### Training Procedure

This model was trained in two main phases:
- Knesset-based pre-training - over all ~4700h of data - 3 epochs, ~48h run
- Mixed post-training over all of crowd-transcribe-v5 (~300h), crowd-recital-whisper-training (~50h), and the highest-quality filtered Knesset data (~150h) - 2 epochs
  - Datasets were interleaved with sampling probabilities (0.9, 0.025, 0.075), respectively - see the sketch below
  - Note that crowd-transcribe-v5 samples are about 5x shorter on average, hence the over-sampling.
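
The post-training mixture can be expressed with the `interleave_datasets` utility from the Hugging Face `datasets` library. This is only an illustrative sketch - split names are assumptions, and the real pipeline (including the Knesset quality filtering) lives in the asr-training repo:

```python
from datasets import load_dataset, interleave_datasets

# Illustrative sketch only - split names are assumptions; see the asr-training repo
# for the actual data loading, filtering, and feature preparation.
crowd_transcribe = load_dataset("ivrit-ai/crowd-transcribe-v5", split="train", streaming=True)
crowd_recital = load_dataset("ivrit-ai/crowd-recital-whisper-training", split="train", streaming=True)
knesset = load_dataset("ivrit-ai/knesset-plenums-whisper-training", split="train", streaming=True)

# Post-training sampling probabilities; crowd-transcribe-v5 is over-sampled
# because its segments are roughly 5x shorter on average.
mixed = interleave_datasets(
    [crowd_transcribe, crowd_recital, knesset],
    probabilities=[0.9, 0.025, 0.075],
    seed=42,
    stopping_strategy="all_exhausted",
)
```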

This model is a weighted average of the two lowest-eval-loss checkpoints (from around the end of epoch 2) from two separate runs with the same setup.
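
Averaging checkpoints amounts to averaging their state dicts; a minimal sketch, assuming hypothetical local checkpoint paths and an equal 0.5/0.5 weighting (the card only states "weighted average"):

```python
from transformers import WhisperForConditionalGeneration

# Hypothetical local paths - the lowest-eval-loss checkpoint from each of the two runs.
model_a = WhisperForConditionalGeneration.from_pretrained("run_a/checkpoint-best")
model_b = WhisperForConditionalGeneration.from_pretrained("run_b/checkpoint-best")

sd_a, sd_b = model_a.state_dict(), model_b.state_dict()

# Equal weighting is an assumption, not confirmed by the model card.
merged = {name: 0.5 * sd_a[name] + 0.5 * sd_b[name] for name in sd_a}

model_a.load_state_dict(merged)
model_a.save_pretrained("whisper-large-v3-turbo-he-merged")
```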

Training code can be found on the ivrit-ai GitHub [here](https://github.com/ivrit-ai/asr-training).

#### Preprocessing

@@ -75,10 +82,10 @@ Datasets were interleaved with 0.15:0.8:0.05 ratio (knesset:crowd-transcribe:cro

- **Learning Rate:** 1e-5, linear decay, 800 warmup steps, over 3 epochs
- **Batch Size:** 32
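
For reference, such a schedule is commonly wired up with `get_linear_schedule_with_warmup`; this is a sketch only - the actual setup is in the training code linked above:

```python
import torch
from transformers import WhisperForConditionalGeneration, get_linear_schedule_with_warmup

# Illustrative sketch - the real optimizer/scheduler wiring is in the asr-training repo.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

total_steps = 30_000  # placeholder: steps per epoch x 3 epochs for the real data mix
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=800,
    num_training_steps=total_steps,
)

# Per training step: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```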

#### Training Hardware / Duration

- **GPU Type:** 8 x Nvidia A40 machine
- **Duration:** ~55h run across both phases

## Evaluation