Urdu Turn Detection (Audio-Only V2)

This model is a high-speed, audio-native turn detection system designed for real-time Urdu voice applications. It combines a Whisper-Tiny encoder with an attention pooling mechanism to detect whether a speaker has finished their turn or is merely pausing.

Key Features

  • Low Latency: Optimized for real-time inference (~45 ms on GPU; see Performance below).
  • Audio-Only: No ASR or text transcription required, making it faster and more privacy-friendly.
  • Attention-Based: Uses cross-frame attention to focus on prosodic cues like intonation and sentence-ending phonemes.
  • Robustness: Trained specifically to handle silence and "thinking" pauses without false positives.

Architecture

  • Backbone: openai/whisper-tiny (Encoder only).
  • Pooling: Masked Attention Pooling (ignores padding/silence).
  • Classifier: 2-layer MLP head.
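The pooling stage above can be illustrated with a minimal NumPy sketch. This is not the model's actual implementation: the scoring vector `w` stands in for the learned attention head, and the encoder outputs are random placeholders. The key property shown is that padded/silent frames receive zero attention weight.

```python
import numpy as np

def masked_attention_pool(frames, mask, w):
    # frames: (T, D) encoder outputs; mask: (T,) 1 = real frame, 0 = padding
    # w: (D,) scoring vector (illustrative stand-in for the learned head)
    scores = frames @ w                                     # one score per frame
    scores = np.where(mask.astype(bool), scores, -np.inf)   # mask out padding
    scores = scores - scores.max()                          # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()                       # softmax over valid frames only
    return weights @ frames                                 # (D,) pooled summary vector

# Toy check: the last two (padded) frames contribute nothing to the result.
T, D = 6, 4
rng = np.random.default_rng(0)
frames = rng.normal(size=(T, D))
mask = np.array([1, 1, 1, 1, 0, 0])
w = rng.normal(size=D)
pooled = masked_attention_pool(frames, mask, w)
```

The pooled vector would then be passed to the 2-layer MLP head for the turn/no-turn decision.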

Performance

| Metric                    | Value            |
|---------------------------|------------------|
| Inference Latency (CPU)   | ~120 ms          |
| Inference Latency (CUDA)  | ~45 ms           |
| F1 Score (Turn Detection) | 95%+ (estimated) |

Usage

πŸš€ High-Level API (Recommended)

The model can be used directly via the urdu-turn-detector library:

```bash
pip install urdu-turn-detector
```

```python
from urdu_turn_detection import UrduTurnDetector

# Auto-downloads from the Hub
detector = UrduTurnDetector.from_pretrained("PuristanLabs1/urdu-turn-v2")

# Predict on a file or audio buffer
result = detector.predict("audio.wav")
print(f"Turn is {result.label} (Conf: {result.confidence})")
```

☁️ Hugging Face Inference API

This model is also compatible with Hugging Face Inference Endpoints via the included `handler.py`.
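For reference, an Inference Endpoints custom handler follows a fixed shape: an `EndpointHandler` class with `__init__` (model loading) and `__call__` (request handling). The sketch below is hypothetical, not the repo's actual `handler.py`; the detector is injected as a parameter so the skeleton stays self-contained, and the payload/response fields are assumptions.

```python
class EndpointHandler:
    """Hedged sketch of the custom-handler interface used by HF Inference Endpoints."""

    def __init__(self, path="", detector=None):
        # The real handler would load the model from `path`, e.g. something like
        # UrduTurnDetector.from_pretrained(path). Here a detector is injected
        # so this sketch has no external dependencies.
        self.detector = detector

    def __call__(self, data):
        # Endpoints pass the request payload as a dict; raw audio is assumed
        # to arrive under "inputs".
        audio = data["inputs"]
        result = self.detector.predict(audio)
        # Return a JSON-serializable classification-style response.
        return [{"label": result.label, "score": result.confidence}]
```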

Dataset

Trained on a combination of Common Voice 13 (Urdu) and synthetically augmented samples simulating natural turn transitions and interruptions.
