whisper-small-tr / README.md

ogulcanakca

base_model: openai/whisper-small
language:
  - tr
license: cc0-1.0
tags:
  - automatic-speech-recognition
  - whisper
  - robust-speech
  - audio-augmentation
  - generated_from_trainer
datasets:
  - mozilla-foundation/common_voice_23_0
metrics:
  - wer
model-index:
  - name: whisper-small-tr
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 23.0 (Turkish)
          type: mozilla-foundation/common_voice_23_0
        metrics:
          - name: Wer
            type: wer
            value: 20

Model Card for Whisper Small Turkish

This model is a fine-tuned version of openai/whisper-small on the Mozilla Common Voice 23.0 Turkish dataset.

Key Features & Robustness

Standard ASR models often degrade in noisy environments. This model addresses that with JIT (Just-In-Time) augmentation: rather than precomputing augmented copies of the dataset, the following synthetic degradations were applied on the fly inside the training loop:

Gaussian Noise Injection: Simulating background static and environmental noise.
Time Stretching: Randomly speeding up or slowing down speech (0.8x - 1.2x) to handle fast/slow speakers.
Frequency Masking: Simulating codec loss or bad microphone quality.
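
The three degradations above could be sketched roughly as follows. This is an illustrative NumPy sketch, not the training code used for this model: the function names, the noise range, the mask width, the fill value, and the application probability are all assumptions. The speed change here uses linear-interpolation resampling, which also shifts pitch; a pitch-preserving stretch would need something like a phase vocoder.

```python
import numpy as np


def add_gaussian_noise(wave, rng, max_std=0.01):
    """Add zero-mean Gaussian noise with a randomly drawn standard deviation."""
    std = rng.uniform(0.0, max_std)  # max_std is an assumed value
    noise = rng.normal(0.0, std, size=wave.shape).astype(wave.dtype)
    return wave + noise


def change_speed(wave, rng, low=0.8, high=1.2):
    """Resample so the clip plays `rate` times faster (rate drawn from 0.8x to 1.2x)."""
    rate = rng.uniform(low, high)
    new_len = max(1, int(round(len(wave) / rate)))
    old_pos = np.arange(len(wave))
    new_pos = np.linspace(0, len(wave) - 1, num=new_len)
    return np.interp(new_pos, old_pos, wave).astype(wave.dtype)


def mask_frequencies(features, rng, max_width=10, fill=0.0):
    """Overwrite a random band of frequency bins in a (n_bins, n_frames) array."""
    n_bins = features.shape[0]
    width = min(int(rng.integers(0, max_width + 1)), n_bins)
    start = int(rng.integers(0, n_bins - width + 1))
    masked = features.copy()  # leave the caller's array untouched
    masked[start:start + width, :] = fill
    return masked


def augment_waveform(wave, rng, p=0.5):
    """Apply each waveform-level degradation independently with probability p."""
    if rng.random() < p:
        wave = change_speed(wave, rng)
    if rng.random() < p:
        wave = add_gaussian_noise(wave, rng)
    return wave
```

Because the random draws happen each time a sample is fetched, the same utterance is seen with different degradations in every epoch, which is the point of doing the augmentation just in time.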

Result: WER stays at roughly 20% on both clean and noisy/distorted audio (see Performance), so transcription accuracy holds up even when the input audio has a low Signal-to-Noise Ratio (SNR).
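
To probe robustness at a specific SNR, noise can be scaled so that the speech-to-noise power ratio hits a chosen value in decibels before the mixture is transcribed. The sketch below only shows that arithmetic; the function names are hypothetical and the model card does not state which SNR levels were used in evaluation.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db, then add it."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR_dB = 10 * log10(P_speech / P_noise)  =>  P_noise = P_speech / 10^(SNR_dB / 10)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise


def measured_snr_db(speech, mixture):
    """Recover the SNR of a mixture when the clean speech is known."""
    speech = np.asarray(speech, dtype=np.float64)
    residual = np.asarray(mixture, dtype=np.float64) - speech
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))
```

Lower values of `snr_db` mean louder noise; at 0 dB the noise carries as much power as the speech.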

Performance

| Metric | Condition | Performance |
|---|---|---|
| WER (Word Error Rate) | Clean Audio | ~20% |
| WER (Word Error Rate) | Noisy/Distorted Audio | ~20% (Robust) |
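
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition; it is not the evaluation script used for this model, and it applies no text normalization, which real evaluations usually do.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words gives a WER of 0.2 (20%).
print(word_error_rate("bugün hava gerçekten çok güzel",
                      "bugün hava gerçekten çok güzeldi"))
```

A WER of about 20% therefore means roughly one word-level error for every five reference words.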

WandB

WandB report

Usage

You can use this model directly with the Hugging Face pipeline.

import torch
from transformers import pipeline

# 1. Load the pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
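pipe = pipeline(
    "automatic-speech-recognition",
    model="ogulcanakca/whisper-small-tr",
    device=device,
    generate_kwargs={
        "length_penalty": 1.5,       # only takes effect with beam search (num_beams > 1)
        "no_repeat_ngram_size": 2,   # never emit the same bigram twice
        "language": "turkish",       # skip language detection and decode as Turkish
        "task": "transcribe",        # transcribe rather than translate to English
        # Whisper-specific: segments whose zlib compression ratio exceeds this
        # value are treated as too repetitive and trigger temperature fallback.
        "compression_ratio_threshold": 1.35,
    },
)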

# 2. Transcribe audio (can be a file path or URL)
# The pipeline decodes the file and resamples it to the 16 kHz rate Whisper expects
# (decoding files requires ffmpeg).
result = pipe("path_to_your_audio.mp3")

print(result["text"])
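
If the audio is already in memory rather than on disk, the pipeline also accepts a dictionary with `raw` and `sampling_rate` keys. The helper below is a hypothetical convenience function, not part of the model or of transformers; it assumes a NumPy array shaped `(n_samples,)` or `(n_samples, n_channels)` and converts it to mono float32. Resampling of in-memory audio is believed to rely on torchaudio being installed.

```python
import numpy as np


def to_pipeline_input(waveform, sampling_rate):
    """Convert a NumPy waveform to the dict form accepted by the ASR pipeline."""
    audio = np.asarray(waveform)
    if np.issubdtype(audio.dtype, np.integer):
        # Integer PCM (e.g. int16) is rescaled to roughly [-1.0, 1.0].
        scale = float(np.iinfo(audio.dtype).max)
        audio = audio.astype(np.float32) / scale
    else:
        audio = audio.astype(np.float32)
    if audio.ndim == 2:
        # Assumes (n_samples, n_channels); average the channels to get mono.
        audio = audio.mean(axis=1)
    return {"raw": audio, "sampling_rate": int(sampling_rate)}


# Usage with the `pipe` object created above (samples and rate are placeholders):
# result = pipe(to_pipeline_input(samples, 44100))
# print(result["text"])
```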

Parameter Details

per_device_train_batch_size=64
gradient_accumulation_steps=1
gradient_checkpointing=False
fp16=True
dataloader_num_workers=8
dataloader_pin_memory=True
learning_rate=1e-5
num_train_epochs=5
per_device_eval_batch_size=32
predict_with_generate=True
generation_max_length=225
save_steps=1000
eval_steps=1000
warmup_steps=500
logging_steps=10

Training took approximately 67 minutes on an A100 GPU (80 GB).
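
As a rough illustration, the values listed above might map onto `Seq2SeqTrainingArguments` as shown below. This is a sketch rather than the actual training script: `output_dir`, the helper names, and the single-device default are assumptions, and the evaluation strategy is left unset because the model card does not state it.

```python
# Hyperparameters copied from the list above.
HPARAMS = {
    "per_device_train_batch_size": 64,
    "gradient_accumulation_steps": 1,
    "gradient_checkpointing": False,
    "fp16": True,  # generally requires a CUDA device
    "dataloader_num_workers": 8,
    "dataloader_pin_memory": True,
    "learning_rate": 1e-5,
    "num_train_epochs": 5,
    "per_device_eval_batch_size": 32,
    "predict_with_generate": True,
    "generation_max_length": 225,
    "save_steps": 1000,
    "eval_steps": 1000,  # only used when evaluation runs on a step schedule
    "warmup_steps": 500,
    "logging_steps": 10,
}


def effective_batch_size(hparams, num_devices=1):
    """Samples per optimizer step; num_devices=1 is an assumption."""
    return (
        hparams["per_device_train_batch_size"]
        * hparams["gradient_accumulation_steps"]
        * num_devices
    )


def build_training_args(output_dir="./whisper-small-tr"):
    """Build the arguments object; output_dir is a hypothetical path."""
    # Imported lazily so HPARAMS can be inspected without transformers installed.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **HPARAMS)
```

With a per-device batch size of 64 and no gradient accumulation, each optimizer step on a single GPU sees 64 samples.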