metadata
base_model: openai/whisper-small
language:
  - tr
license: cc0-1.0
tags:
  - automatic-speech-recognition
  - whisper
  - robust-speech
  - audio-augmentation
  - generated_from_trainer
datasets:
  - mozilla-foundation/common_voice_23_0
metrics:
  - wer
model-index:
  - name: whisper-small-tr
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 23.0 (Turkish)
          type: mozilla-foundation/common_voice_23_0
        metrics:
          - name: Wer
            type: wer
            value: 20

Model Card for Whisper Small Turkish

This model is a fine-tuned version of openai/whisper-small on the Mozilla Common Voice 23.0 Turkish dataset.

Key Features & Robustness

ASR models often degrade in noisy environments. This model addresses that by applying just-in-time (JIT) augmentation during training, meaning the degradations are generated on the fly rather than precomputed and stored.

The model was exposed to the following synthetic degradations dynamically during the training loop:

  • Gaussian Noise Injection: Simulating background static and environmental noise.
  • Time Stretching: Randomly speeding up or slowing down speech (by a factor of 0.8x to 1.2x) to handle fast and slow speakers.
  • Frequency Masking: Simulating codec loss or bad microphone quality.

Result: Transcription accuracy holds up on input with a low Signal-to-Noise Ratio (SNR). WER on noisy or distorted audio stays at roughly 20%, on par with clean audio (see Performance).
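The three degradations listed above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the training code used for this model: the function names, the SNR range, and the mask width are assumptions, and the time stretch is a naive linear-interpolation resample (which also shifts pitch) standing in for whatever method was actually used. Only the 0.8x to 1.2x stretch range comes from the list above.

```python
import random


def add_gaussian_noise(samples, snr_db, rng):
    """Add white Gaussian noise scaled to roughly the requested SNR in dB."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = noise_power ** 0.5
    return [s + rng.gauss(0.0, sigma) for s in samples]


def time_stretch(samples, rate):
    """Naive stretch by linear interpolation.

    rate > 1 shortens the clip (faster speech), rate < 1 lengthens it.
    Unlike a phase vocoder, this simplification also changes pitch.
    """
    new_len = max(1, int(round(len(samples) / rate)))
    out = []
    for i in range(new_len):
        pos = i * rate
        lo = int(pos)
        if lo >= len(samples) - 1:
            out.append(samples[-1])  # clamp at the last sample
        else:
            frac = pos - lo
            out.append(samples[lo] * (1 - frac) + samples[lo + 1] * frac)
    return out


def frequency_mask(spectrogram, max_width, rng, fill=0.0):
    """Blank a random band of consecutive frequency bins (rows) in all frames."""
    n_bins = len(spectrogram)
    width = rng.randint(0, min(max_width, n_bins))
    start = rng.randint(0, n_bins - width)
    return [
        [fill] * len(row) if start <= b < start + width else list(row)
        for b, row in enumerate(spectrogram)
    ]


def augment_waveform(samples, rng):
    """Apply a fresh random stretch and noise level on every call (the JIT part)."""
    rate = rng.uniform(0.8, 1.2)      # range stated in the model card
    snr_db = rng.uniform(5.0, 20.0)   # assumed range, for illustration only
    return add_gaussian_noise(time_stretch(samples, rate), snr_db, rng)
```

Because `augment_waveform` draws new random parameters each time it is called, invoking it from the data collator or dataset `__getitem__` gives the model a different degraded copy of the same clip on every pass. A production pipeline would typically operate on NumPy arrays or tensors rather than Python lists.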

Performance

| Metric                | Condition             | Performance   |
|-----------------------|-----------------------|---------------|
| WER (Word Error Rate) | Clean Audio           | ~20%          |
| WER (Word Error Rate) | Noisy/Distorted Audio | ~20% (Robust) |
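For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the hypothesis (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the evaluation script behind the figures above; the function name and the example sentences are made up for illustration, and real evaluations usually normalize casing and punctuation first, which changes the score.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate(
    "ben yarın okula gidiyorum galiba",
    "ben yarın okula gidiyor galiba",
))  # 0.2
```

A WER of about 20% therefore corresponds to roughly one word-level error for every five reference words.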

WandB

WandB report

Usage

You can use this model directly with the Hugging Face pipeline.

import torch
from transformers import pipeline

# 1. Load the pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    "automatic-speech-recognition", 
    model="ogulcanakca/whisper-small-tr",
    device=device,
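    generate_kwargs={
        "length_penalty": 1.5,  # only has an effect with beam search (num_beams > 1)
        "no_repeat_ngram_size": 2,  # no 2-gram may appear twice in the output
        "language": "turkish",
        "task": "transcribe",
        "compression_ratio_threshold": 1.35,  # only relevant for long-form transcription
    },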
)

# 2. Transcribe audio (a local file path or a URL)
# The pipeline decodes the file with ffmpeg and resamples it to the 16 kHz the model expects.
result = pipe("path_to_your_audio.mp3")

print(result["text"])

Training Parameters

  • per_device_train_batch_size=64
  • gradient_accumulation_steps=1
  • gradient_checkpointing=False
  • fp16=True
  • dataloader_num_workers=8
  • dataloader_pin_memory=True
  • learning_rate=1e-5
  • num_train_epochs=5
  • per_device_eval_batch_size=32
  • predict_with_generate=True
  • generation_max_length=225
  • save_steps=1000
  • eval_steps=1000
  • warmup_steps=500
  • logging_steps=10

Training took approximately 67 minutes on an A100 GPU (80 GB).
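The values listed above can be collected into a single dictionary, which makes derived quantities such as the effective batch size easy to check. This is a sketch only: the dictionary name, the helper, and the single-device default are assumptions, not part of the original training script.

```python
# Hyperparameters as listed in this model card.
training_hparams = {
    "per_device_train_batch_size": 64,
    "gradient_accumulation_steps": 1,
    "gradient_checkpointing": False,
    "fp16": True,
    "dataloader_num_workers": 8,
    "dataloader_pin_memory": True,
    "learning_rate": 1e-5,
    "num_train_epochs": 5,
    "per_device_eval_batch_size": 32,
    "predict_with_generate": True,
    "generation_max_length": 225,
    "save_steps": 1000,
    "eval_steps": 1000,
    "warmup_steps": 500,
    "logging_steps": 10,
}


def effective_batch_size(hparams, num_devices=1):
    """Samples consumed per optimizer step (num_devices=1 is an assumption)."""
    return (
        hparams["per_device_train_batch_size"]
        * hparams["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_hparams))  # 64
```

The keys match argument names of `transformers.Seq2SeqTrainingArguments`, so the dictionary could be unpacked into it together with an `output_dir` of your choosing. Note that `eval_steps` only takes effect when the evaluation strategy is set to run by steps.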