---
base_model: openai/whisper-small
language:
- tr
license: cc0-1.0
tags:
- automatic-speech-recognition
- whisper
- robust-speech
- audio-augmentation
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_23_0
metrics:
- wer
model-index:
- name: whisper-small-tr
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 23.0 (Turkish)
      type: mozilla-foundation/common_voice_23_0
    metrics:
    - name: WER
      type: wer
      value: 20
---
# Model Card for Whisper Small Turkish
This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the Mozilla Common Voice 23.0 Turkish dataset.
## Key Features & Robustness
Standard ASR models often degrade sharply in noisy environments. This model tackles that problem with JIT (Just-In-Time) augmentation: degradations are synthesized on the fly during training rather than precomputed.

The model was exposed to the following synthetic degradations, applied dynamically inside the training loop (see the sketch after this list):
- Gaussian Noise Injection: Simulating background static and environmental noise.
- Time Stretching: Randomly speeding up or slowing down speech (0.8x - 1.2x) to handle fast/slow speakers.
- Frequency Masking: Simulating codec loss or bad microphone quality.
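The exact augmentation code is not published in this card; the snippet below is a minimal, hypothetical sketch of the three degradations, assuming 16 kHz mono `float32` waveforms and an 80-bin log-mel spectrogram (all function names are illustrative):

```python
import numpy as np

def jit_augment(waveform: np.ndarray) -> np.ndarray:
    """Randomly degrade a mono float32 waveform on the fly (time-domain ops)."""
    # Gaussian noise injection: simulate background static at a random level.
    if np.random.rand() < 0.5:
        noise_level = np.random.uniform(0.001, 0.01)
        waveform = waveform + np.random.randn(len(waveform)) * noise_level
    # Speed change in the 0.8x-1.2x range via index resampling. Note this
    # also shifts pitch; a pitch-preserving stretch (e.g. librosa's
    # time_stretch) would be the more faithful choice in production.
    if np.random.rand() < 0.5:
        rate = np.random.uniform(0.8, 1.2)
        old_idx = np.arange(len(waveform))
        new_idx = np.linspace(0, len(waveform) - 1, int(len(waveform) / rate))
        waveform = np.interp(new_idx, old_idx, waveform)
    return waveform.astype(np.float32)

def freq_mask(log_mel: np.ndarray, max_width: int = 27) -> np.ndarray:
    """SpecAugment-style frequency masking on an (n_mels, frames) array."""
    width = np.random.randint(1, max_width)
    start = np.random.randint(0, log_mel.shape[0] - width)
    masked = log_mel.copy()
    masked[start:start + width, :] = log_mel.min()  # blank a band of mel bins
    return masked
```

Because the degradations are sampled fresh on every pass, each epoch sees a different corrupted view of the same utterance, which is what makes the augmentation "just-in-time" rather than a fixed preprocessed copy of the dataset.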
**Result:** The model demonstrates high resilience to noise, maintaining transcription accuracy even when the input audio has a low signal-to-noise ratio (SNR).
## Performance
| Condition | WER (Word Error Rate) |
|---|---|
| Clean audio | ~20% |
| Noisy/distorted audio | ~20% (robust) |
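These figures use WER, which can be computed with the Hugging Face `evaluate` library. A minimal sketch (the reference/prediction pair below is a placeholder, not drawn from the actual evaluation set):

```python
import evaluate

# Load the standard WER metric.
wer = evaluate.load("wer")

# Placeholder Turkish sentences: one substituted word out of four -> WER 0.25.
references = ["bu bir deneme cümlesidir"]
predictions = ["bu bir deneme cümlesi"]

print(wer.compute(references=references, predictions=predictions))  # 0.25
```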
## Usage
You can use this model directly with the Hugging Face `pipeline` API.
```python
import torch
from transformers import pipeline

# 1. Load the pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    "automatic-speech-recognition",
    model="ogulcanakca/whisper-small-tr",
    device=device,
    generate_kwargs={
        "length_penalty": 1.5,
        "no_repeat_ngram_size": 2,
        "language": "turkish",
        "task": "transcribe",
        "compression_ratio_threshold": 1.35,
    },
)

# 2. Transcribe audio (can be a file path or URL)
# The pipeline handles resampling automatically.
result = pipe("path_to_your_audio.mp3")
print(result["text"])
```
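For long recordings, the same pipeline supports chunked inference with timestamps through standard `pipeline` call arguments (the file name below is a placeholder):

```python
# Chunked long-form transcription with segment-level timestamps.
result = pipe(
    "long_recording.mp3",
    chunk_length_s=30,
    return_timestamps=True,
)
print(result["chunks"])
```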
## Parameter Details
The following hyperparameters were used during training:

- `per_device_train_batch_size=64`
- `gradient_accumulation_steps=1`
- `gradient_checkpointing=False`
- `fp16=True`
- `dataloader_num_workers=8`
- `dataloader_pin_memory=True`
- `learning_rate=1e-5`
- `num_train_epochs=5`
- `per_device_eval_batch_size=32`
- `predict_with_generate=True`
- `generation_max_length=225`
- `save_steps=1000`
- `eval_steps=1000`
- `warmup_steps=500`
- `logging_steps=10`
Training took approximately 67 minutes on a single A100 GPU (80 GB).
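For context, these settings map directly onto `transformers.Seq2SeqTrainingArguments`. A sketch of the equivalent configuration (`output_dir` is an assumption, not taken from the card):

```python
from transformers import Seq2SeqTrainingArguments

# Reconstruction of the training configuration listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-tr",  # assumed; not stated in the card
    per_device_train_batch_size=64,
    gradient_accumulation_steps=1,
    gradient_checkpointing=False,
    fp16=True,
    dataloader_num_workers=8,
    dataloader_pin_memory=True,
    learning_rate=1e-5,
    num_train_epochs=5,
    per_device_eval_batch_size=32,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    warmup_steps=500,
    logging_steps=10,
)
```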