# Urdu Turn Detection (Audio-Only V2)

This model is a high-speed, audio-native turn detection system designed for real-time Urdu voice applications. It combines a Whisper-Tiny encoder with an attention-pooling mechanism to detect whether a speaker has finished their turn or is just pausing.
## Key Features
- Low Latency: Optimized for real-time inference (~45 ms on GPU, ~120 ms on CPU).
- Audio-Only: No ASR/Text needed, making it faster and privacy-friendly.
- Attention-Based: Uses cross-frame attention to focus on prosodic cues like intonation and sentence-ending phonemes.
- Robustness: Trained specifically to handle silence and "thinking" pauses without false positives.
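In a streaming application, per-chunk predictions from a detector like this are usually debounced so a single noisy frame cannot end the turn prematurely. A minimal sketch of that idea (the `TurnEndDebouncer` class, its threshold, and its patience window are illustrative assumptions, not part of this model):

```python
class TurnEndDebouncer:
    """Commit to end-of-turn only after several consecutive high-confidence
    'end' predictions, so brief thinking pauses don't cut the speaker off.

    The threshold and patience values here are illustrative assumptions,
    not values shipped with the model.
    """

    def __init__(self, threshold=0.8, patience=3):
        self.threshold = threshold  # min P(turn ended) to count as a hit
        self.patience = patience    # consecutive hits required to commit
        self.hits = 0

    def update(self, end_prob):
        """Feed one per-chunk P(turn ended); return True once committed."""
        self.hits = self.hits + 1 if end_prob >= self.threshold else 0
        return self.hits >= self.patience


deb = TurnEndDebouncer()
stream = [0.2, 0.9, 0.3, 0.85, 0.9, 0.95]  # a short pause, then a real end
decisions = [deb.update(p) for p in stream]
print(decisions)  # only the final chunk triggers end-of-turn
```

Tuning `patience` trades responsiveness against the risk of interrupting a speaker mid-thought.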
## Architecture
- Backbone: `openai/whisper-tiny` (encoder only).
- Pooling: Masked attention pooling (ignores padding/silence).
- Classifier: 2-layer MLP head.
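The pooling step can be illustrated with a small NumPy sketch: frames are scored, padded positions are masked out before the softmax, and the result is a confidence-weighted average of the valid frames. This is a generic masked-attention-pooling sketch, not the model's actual weights; the scoring vector is random and `D=384` matches Whisper-tiny's hidden size.

```python
import numpy as np

def masked_attention_pool(frames, mask, w):
    """Pool variable-length encoder frames into one utterance vector.

    frames: (T, D) encoder outputs (e.g. Whisper-tiny, D=384)
    mask:   (T,)   1 for real frames, 0 for padding/silence
    w:      (D,)   attention scoring vector (random here; learned in practice)
    """
    scores = frames @ w                           # (T,) one score per frame
    scores = np.where(mask > 0, scores, -np.inf)  # padding is never attended
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over valid frames
    return weights @ frames                       # (D,) weighted average


rng = np.random.default_rng(0)
T, D = 10, 384
frames = rng.standard_normal((T, D))
mask = np.array([1] * 6 + [0] * 4)  # last 4 frames are padding
w = rng.standard_normal(D)

pooled = masked_attention_pool(frames, mask, w)
print(pooled.shape)  # (384,)
```

Because masked positions receive zero attention weight, padded silence cannot dilute the pooled representation; the 2-layer MLP head then classifies this single vector.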
## Performance
| Metric | Value |
|---|---|
| Inference Latency (CPU) | ~120ms |
| Inference Latency (CUDA) | ~45ms |
| F1 Score (Turn Detection) | 95%+ (estimated) |
## Usage
### High-Level API (Recommended)
The model can be used directly via the `urdu-turn-detector` library:

```bash
pip install urdu-turn-detector
```
```python
from urdu_turn_detection import UrduTurnDetector

# Auto-downloads from the Hub
detector = UrduTurnDetector.from_pretrained("PuristanLabs1/urdu-turn-v2")

# Predict on a file or buffer
result = detector.predict("audio.wav")
print(f"Turn is {result.label} (Conf: {result.confidence})")
```
### Hugging Face Inference API

This model is also compatible with HF Inference Endpoints via the included `handler.py`.
## Dataset
Trained on a combination of Common Voice 13 (Urdu) and synthetically augmented samples simulating natural turn transitions and interruptions.
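The exact augmentation pipeline is not published; the NumPy sketch below is one plausible assumption of how such samples could be generated: inserting mid-utterance silence to simulate a "thinking" pause (an incomplete turn) and truncating a clip to simulate an interruption. All names and parameter values here are illustrative.

```python
import numpy as np

SR = 16000  # sample rate expected by Whisper-family encoders

def simulate_pause(audio, pause_s=0.6, sr=SR):
    """Insert mid-utterance silence: yields an 'incomplete turn' example."""
    cut = len(audio) // 2
    silence = np.zeros(int(pause_s * sr), dtype=audio.dtype)
    return np.concatenate([audio[:cut], silence, audio[cut:]])

def simulate_interruption(audio, keep_frac=0.6):
    """Truncate the clip mid-sentence: the speaker gets cut off."""
    return audio[: int(len(audio) * keep_frac)]


# Stand-in for a real 2-second Urdu clip.
clip = np.random.default_rng(0).standard_normal(SR * 2).astype(np.float32)
paused = simulate_pause(clip)
cut = simulate_interruption(clip)
print(len(paused) / SR, len(cut) / SR)  # 2.6 1.2
```

Pairing such augmented clips with "not finished" labels is what lets a detector learn that silence alone does not mean the turn is over.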