---
base_model: openai/whisper-small
language:
- tr
license: cc0-1.0
tags:
- automatic-speech-recognition
- whisper
- robust-speech
- audio-augmentation
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_23_0
metrics:
- wer
model-index:
- name: whisper-small-tr
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 23.0 (Turkish)
      type: mozilla-foundation/common_voice_23_0
    metrics:
    - name: Wer
      type: wer
      value: 20
---

# Model Card for Whisper Small Turkish

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the [**Mozilla Common Voice 23.0 Turkish**](https://datacollective.mozillafoundation.org/datasets/cmflnuzw71qkz8x3kil3tgjvk) dataset.

## Key Features & Robustness

Standard ASR models often fail in noisy environments. This model tackles that problem by applying **JIT (Just-In-Time) Augmentation**: instead of pre-computing augmented copies of the dataset, each audio sample is degraded on the fly inside the training loop, so the model sees a different variant of the same clip every epoch.

The following synthetic degradations were applied:

* **Gaussian Noise Injection:** Simulates background static and environmental noise.
* **Time Stretching:** Randomly speeds up or slows down speech (0.8x to 1.2x) to handle fast and slow speakers.
* **Frequency Masking:** Simulates codec loss or poor microphone quality.

**Result:** The model is resilient to noise. As the table below shows, WER stays at roughly the same level on noisy or distorted input as on clean audio, including input with a low Signal-to-Noise Ratio (SNR).

## Performance

| Metric | Condition | Performance |
| :--- | :--- | :--- |
| **WER (Word Error Rate)** | Clean Audio | ~20% |
| **WER (Word Error Rate)** | Noisy/Distorted Audio | **~20% (Robust)** |

### WandB

[WandB report](https://wandb.ai/ogulcanakca-none/huggingface/runs/rb7qrhxo)

## Usage

You can use this model directly with the Hugging Face `pipeline`.

```python
import torch
from transformers import pipeline

# 1. Load the pipeline on the GPU if one is available, otherwise on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="ogulcanakca/whisper-small-tr",
    device=device,
    # Generation settings forwarded to the model's generate() call
    generate_kwargs={
        "length_penalty": 1.5,
        "no_repeat_ngram_size": 2,
        "language": "turkish",
        "task": "transcribe",
        "compression_ratio_threshold": 1.35,
    },
)

# 2. Transcribe audio (can be a file path or URL)
# The pipeline (not the model itself) decodes the file and resamples it
# to the sampling rate the model expects.
result = pipe("path_to_your_audio.mp3")
print(result["text"])
```

## Parameter Details

The model was trained with the following hyperparameters:

* **`per_device_train_batch_size=64`**
* **`gradient_accumulation_steps=1`**
* **`gradient_checkpointing=False`**
* **`fp16=True`**
* **`dataloader_num_workers=8`**
* **`dataloader_pin_memory=True`**
* **`learning_rate=1e-5`**
* **`num_train_epochs=5`**
* **`per_device_eval_batch_size=32`**
* **`predict_with_generate=True`**
* **`generation_max_length=225`**
* **`save_steps=1000`**
* **`eval_steps=1000`**
* **`warmup_steps=500`**
* **`logging_steps=10`**

> Training took approximately 67 minutes on an A100 GPU (80 GB).
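
## Augmentation Sketch

The three degradations listed under Key Features & Robustness can be illustrated with a minimal NumPy sketch. This is not the training code used for this model: the function names (`add_gaussian_noise`, `speed_perturb`, `frequency_mask`) and their parameters (`snr_db`, `rate`, `max_width`, `fill_value`) are hypothetical, and the noise level and mask width are illustrative values that the card does not specify. Time stretching is approximated here by linear-interpolation resampling, which also shifts pitch; a pitch-preserving stretch would need a phase vocoder or a similar method.

```python
import numpy as np


def add_gaussian_noise(wave, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the given SNR (dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise


def speed_perturb(wave, rate):
    """Change speed by resampling with linear interpolation.

    rate > 1 shortens the clip (faster speech), rate < 1 lengthens it.
    Assumption: this simple resampling also shifts pitch.
    """
    n_out = max(1, int(round(len(wave) / rate)))
    old_idx = np.arange(len(wave))
    new_idx = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_idx, old_idx, wave)


def frequency_mask(spec, max_width, rng, fill_value=0.0):
    """Overwrite a random band of frequency bins in a (n_bins, n_frames) array."""
    n_bins = spec.shape[0]
    width = int(rng.integers(0, min(max_width, n_bins) + 1))
    start = int(rng.integers(0, n_bins - width + 1))
    out = spec.copy()
    out[start:start + width, :] = fill_value
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # One second of a 440 Hz tone at 16 kHz stands in for a speech clip.
    wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)

    # Draw fresh random parameters each time a sample is loaded (the JIT idea).
    rate = rng.uniform(0.8, 1.2)  # range taken from the card
    augmented = add_gaussian_noise(speed_perturb(wave, rate), snr_db=10, rng=rng)
    print(augmented.shape)
```

Because the random parameters are drawn at load time rather than baked into the dataset, the same clip yields a different degraded variant on every pass. In a real pipeline, `frequency_mask` would be applied to the log-mel features after feature extraction, not to the raw waveform.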