---
base_model: openai/whisper-small
language:
- tr
license: cc0-1.0
tags:
- automatic-speech-recognition
- whisper
- robust-speech
- audio-augmentation
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_23_0
metrics:
- wer
model-index:
- name: whisper-small-tr
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 23.0 (Turkish)
      type: mozilla-foundation/common_voice_23_0
    metrics:
    - name: Wer
      type: wer
      value: 20
---

# Model Card for Whisper Small Turkish 

This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the [**Mozilla Common Voice 23.0 Turkish**](https://datacollective.mozillafoundation.org/datasets/cmflnuzw71qkz8x3kil3tgjvk) dataset. 

## Key Features & Robustness

Standard ASR models often fail in noisy environments. This model tackles that problem by applying **JIT (Just-In-Time) Augmentation** during training. 

The model was exposed to the following synthetic degradations dynamically during the training loop:
* **Gaussian Noise Injection:** Simulating background static and environmental noise.
* **Time Stretching:** Randomly speeding up or slowing down speech (0.8x - 1.2x) to handle fast/slow speakers.
* **Frequency Masking:** Simulating codec loss or bad microphone quality.

**Result:** The model demonstrates high resilience to noise, maintaining transcription accuracy even when the input audio has a low Signal-to-Noise Ratio (SNR).

## Performance

| Metric | Condition | Performance |
| :--- | :--- | :--- |
| **WER (Word Error Rate)** | Clean Audio | ~20% |
| **WER (Word Error Rate)** | Noisy/Distorted Audio | **~20% (Robust)** |

### WandB

[WandB report](https://wandb.ai/ogulcanakca-none/huggingface/runs/rb7qrhxo)

## Usage

You can use this model directly with the Hugging Face `pipeline`.

```python
import torch
from transformers import pipeline

# 1. Load the pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    "automatic-speech-recognition", 
    model="ogulcanakca/whisper-small-tr",
    device=device,
    generate_kwargs={
        "length_penalty": 1.5,  
        "no_repeat_ngram_size": 2, 
        "language": "turkish",   
        "task": "transcribe",   
        "compression_ratio_threshold": 1.35
        }
)

# 2. Transcribe audio (can be a file path or URL)
# The model handles resampling automatically.
result = pipe("path_to_your_audio.mp3")

print(result["text"])
```

## Parameter Details

* **`per_device_train_batch_size=64`**
* **`gradient_accumulation_steps=1`**
* **`gradient_checkpointing=False`**
* **`fp16=True`**
* **`dataloader_num_workers=8`**
* **`dataloader_pin_memory=True`**
* **`learning_rate=1e-5`**
* **`num_train_epochs=5`**
* **`per_device_eval_batch_size=32`**
* **`predict_with_generate=True`**
* **`generation_max_length=225`**
* **`save_steps=1000`**
* **`eval_steps=1000`**
* **`warmup_steps=500`**
* **`logging_steps=10`**
  
> The training lasted approximately 67 minutes on the A100 GPU (80 gb).