Update README.md

README.md

@@ -27,6 +27,7 @@ This model is a Hebrew finetune (continued training) of the OpenAI Whisper Large

- **Language(s) (NLP):** Hebrew
- **License:** Apache-2.0
- **Finetuned from model:** openai/whisper-large-v3-turbo
- **Training Date:** Apr 2025

## Bias, Risks, and Limitations

@@ -40,7 +41,7 @@ Additionally, the translation task was not trained and also degraded. This model

Please follow the original [model card](https://huggingface.co/openai/whisper-large-v3-turbo#usage) for usage details, replacing the model name with this one.
You can also find other weight formats and quantizations on the [ivrit.ai](https://huggingface.co/ivrit-ai) HF page.
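
As a quick start, a minimal sketch using the Hugging Face `transformers` ASR pipeline could look like the following (the model id below is a placeholder - substitute the actual repo name of this model):

```python
import torch
from transformers import pipeline

# Placeholder repo id - replace with this model's actual name on the ivrit-ai HF page.
MODEL_ID = "ivrit-ai/whisper-large-v3-turbo"

asr = pipeline(
    "automatic-speech-recognition",
    model=MODEL_ID,
    torch_dtype=torch.float16,
    device="cuda:0",  # or "cpu" (drop torch_dtype on CPU)
)

# Long-form audio is chunked automatically; timestamps are optional.
result = asr(
    "hebrew_sample.wav",
    return_timestamps=True,
    generate_kwargs={"language": "hebrew", "task": "transcribe"},
)
print(result["text"])
```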

We created some simple example scripts that use this model's weights with other inference runtimes.
Find those in the ["examples"](https://github.com/ivrit-ai/asr-training/tree/master/examples) folder within the training GitHub repo.

## Training Details

@@ -49,13 +50,19 @@ Find those in the ["examples"](https://github.com/ivrit-ai/asr-training/tree/mas

This model was trained on the following datasets:

- [ivrit-ai/crowd-transcribe-v5](https://huggingface.co/datasets/ivrit-ai/crowd-transcribe-v5) - Publicly accessible audio sources that have been crowd-transcribed segment-by-segment - ~300h
- [ivrit-ai/crowd-recital-whisper-training](https://huggingface.co/datasets/ivrit-ai/crowd-recital-whisper-training) - Crowd-sourced recordings of Wikipedia article snippets - ~50h
- [ivrit-ai/knesset-plenums-whisper-training](https://huggingface.co/datasets/ivrit-ai/knesset-plenums-whisper-training) - A subset of Knesset (the Israeli house of representatives) plenum protocols - ~4700h

### Training Procedure

This model was trained in two main phases:
- Knesset-based pre-training - over all ~4700h of data - 3 epochs, ~48h run
- Mixed post-training over all of crowd-transcribe-v5 (~300h), crowd-recital-whisper-training (~50h), and the highest-quality filtered Knesset data (~150h) - 2 epochs
  - Datasets were interleaved with sampling probabilities (0.9, 0.025, 0.075), respectively - see the sketch below
  - Note that crowd-transcribe-v5 samples are about 5x shorter on average, hence the over-sampling.
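
The post-training mixture can be expressed with the `interleave_datasets` utility from the Hugging Face `datasets` library. This is only an illustrative sketch - split names are assumptions, and the real pipeline (including the Knesset quality filtering) lives in the asr-training repo:

```python
from datasets import load_dataset, interleave_datasets

# Illustrative sketch only - split names are assumptions; see the asr-training repo
# for the actual data loading, filtering, and feature preparation.
crowd_transcribe = load_dataset("ivrit-ai/crowd-transcribe-v5", split="train", streaming=True)
crowd_recital = load_dataset("ivrit-ai/crowd-recital-whisper-training", split="train", streaming=True)
knesset = load_dataset("ivrit-ai/knesset-plenums-whisper-training", split="train", streaming=True)

# Post-training sampling probabilities; crowd-transcribe-v5 is over-sampled
# because its segments are roughly 5x shorter on average.
mixed = interleave_datasets(
    [crowd_transcribe, crowd_recital, knesset],
    probabilities=[0.9, 0.025, 0.075],
    seed=42,
    stopping_strategy="all_exhausted",
)
```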

This model is a weighted average of the two lowest-eval-loss checkpoints (from around the end of epoch 2) from two separate runs with the same setup.
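
Averaging checkpoints amounts to averaging their state dicts; a minimal sketch, assuming hypothetical local checkpoint paths and an equal 0.5/0.5 weighting (the card only states "weighted average"):

```python
from transformers import WhisperForConditionalGeneration

# Hypothetical local paths - the lowest-eval-loss checkpoint from each of the two runs.
model_a = WhisperForConditionalGeneration.from_pretrained("run_a/checkpoint-best")
model_b = WhisperForConditionalGeneration.from_pretrained("run_b/checkpoint-best")

sd_a, sd_b = model_a.state_dict(), model_b.state_dict()

# Equal weighting is an assumption, not confirmed by the model card.
merged = {name: 0.5 * sd_a[name] + 0.5 * sd_b[name] for name in sd_a}

model_a.load_state_dict(merged)
model_a.save_pretrained("whisper-large-v3-turbo-he-merged")
```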

Training code can be found on the ivrit-ai GitHub [here](https://github.com/ivrit-ai/asr-training).

#### Preprocessing

@@ -75,10 +82,10 @@ Datasets were interleaved with 0.15:0.8:0.05 ratio (knesset:crowd-transcribe:cro

- **Learning Rate:** 1e-5, linear decay, 800 warmup steps, over 3 epochs
- **Batch Size:** 32
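
For reference, such a schedule is commonly wired up with `get_linear_schedule_with_warmup`; this is a sketch only - the actual setup is in the training code linked above:

```python
import torch
from transformers import WhisperForConditionalGeneration, get_linear_schedule_with_warmup

# Illustrative sketch - the real optimizer/scheduler wiring is in the asr-training repo.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

total_steps = 30_000  # placeholder: steps per epoch x 3 epochs for the real data mix
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=800,
    num_training_steps=total_steps,
)

# Per training step: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```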

#### Training Hardware / Duration

- **GPU Type:** 8 x Nvidia A40 machine
- **Duration:** ~55h run across both phases

## Evaluation