🎯 Namo Turn Detector v1 - Multilingual


🚀 Namo Turn Detection Model for Multiple Languages

🇸🇦 Arabic, 🇮🇳 Bengali, 🇨🇳 Chinese, 🇩🇰 Danish, 🇳🇱 Dutch, 🇩🇪 German, 🇬🇧🇺🇸 English, 🇫🇮 Finnish, 🇫🇷 French, 🇮🇳 Hindi, 🇮🇩 Indonesian, 🇮🇹 Italian, 🇯🇵 Japanese, 🇰🇷 Korean, 🇮🇳 Marathi, 🇳🇴 Norwegian, 🇵🇱 Polish, 🇵🇹 Portuguese, 🇷🇺 Russian, 🇪🇸 Spanish, 🇹🇷 Turkish, 🇺🇦 Ukrainian, and 🇻🇳 Vietnamese


📋 Overview

The Namo Turn Detector is a specialized AI model designed to solve one of the most challenging problems in conversational AI: knowing when a user has finished speaking.

This multilingual model uses natural language understanding to classify a transcript as one of two cases:

  • ✅ Complete utterances (user is done speaking)
  • 🔄 Incomplete utterances (user will continue speaking)

Built on the mmBERT architecture and exported to a quantized ONNX format, it reaches an average accuracy of 90.25% with inference latency under 29 ms.

🔑 Key Features

  • Turn Detection Specialist: Detects end-of-turn vs. continuation in multilingual speech transcripts.
  • Low Latency: Quantized ONNX model with inference under 29 ms.
  • Robust Performance: Average accuracy of 90.25% on multilingual utterances.
  • Easy Integration: Compatible with Python, ONNX Runtime, and VideoSDK Agents SDK.
  • Enterprise Ready: Supports real-time conversational AI and voice assistants.

📊 Performance Metrics

| Metric | Score |
|---|---|
| ⚡ Latency | <29 ms |
| 💾 Model Size | ~295 MB |

| Language | Accuracy | Precision | Recall | F1 Score | Samples |
|---|---|---|---|---|---|
| 🇹🇷 Turkish | 0.9731 | 0.9611 | 0.9853 | 0.9730 | 966 |
| 🇰🇷 Korean | 0.9685 | 0.9541 | 0.9842 | 0.9690 | 890 |
| 🇩🇪 German | 0.9425 | 0.9135 | 0.9772 | 0.9443 | 1322 |
| 🇯🇵 Japanese | 0.9436 | 0.9099 | 0.9857 | 0.9463 | 834 |
| 🇮🇳 Hindi | 0.9398 | 0.9276 | 0.9603 | 0.9436 | 1295 |
| 🇳🇱 Dutch | 0.9279 | 0.8959 | 0.9738 | 0.9332 | 1401 |
| 🇳🇴 Norwegian | 0.9165 | 0.8717 | 0.9801 | 0.9227 | 1976 |
| 🇨🇳 Chinese | 0.9164 | 0.8859 | 0.9608 | 0.9219 | 945 |
| 🇫🇮 Finnish | 0.9158 | 0.8746 | 0.9702 | 0.9199 | 1010 |
| 🇬🇧 English | 0.9086 | 0.8507 | 0.9801 | 0.9108 | 2845 |
| 🇮🇩 Indonesian | 0.9022 | 0.8514 | 0.9707 | 0.9071 | 971 |
| 🇮🇹 Italian | 0.9015 | 0.8562 | 0.9640 | 0.9069 | 782 |
| 🇵🇱 Polish | 0.9068 | 0.8619 | 0.9568 | 0.9069 | 976 |
| 🇵🇹 Portuguese | 0.8956 | 0.8410 | 0.9676 | 0.8999 | 1398 |
| 🇩🇰 Danish | 0.8973 | 0.8517 | 0.9644 | 0.9045 | 779 |
| 🇪🇸 Spanish | 0.8888 | 0.8304 | 0.9681 | 0.8940 | 1295 |
| 🇮🇳 Marathi | 0.8850 | 0.8762 | 0.9008 | 0.8883 | 774 |
| 🇷🇺 Russian | 0.8748 | 0.8318 | 0.9547 | 0.8890 | 1470 |
| 🇺🇦 Ukrainian | 0.8794 | 0.8164 | 0.9587 | 0.8819 | 929 |
| 🇻🇳 Vietnamese | 0.8645 | 0.8135 | 0.9439 | 0.8738 | 1004 |
| 🇸🇦 Arabic | 0.8490 | 0.7965 | 0.9439 | 0.8639 | 947 |
| 🇮🇳 Bengali | 0.7940 | 0.7874 | 0.7939 | 0.7907 | 1000 |

📊 Evaluated on 25,000+ multilingual utterances from diverse conversational contexts
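The F1 score is the harmonic mean of precision and recall, so each row of the per-language results can be recomputed from its own precision and recall values. The sketch below does this for the Turkish row as a sanity check. The helper name `f1_score` is ours for illustration and is not taken from the model's evaluation scripts.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Turkish row: precision 0.9611, recall 0.9853
turkish_f1 = f1_score(0.9611, 0.9853)
print(f"{turkish_f1:.4f}")  # compare with the reported F1 of 0.9730
```

Small differences in the last digit are expected, because the published precision and recall are themselves rounded to four decimals.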

⚡️ Speed Analysis

[Figure: speed analysis]
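To check the latency figure on your own hardware, a simple wall-clock benchmark is usually enough. The following is a minimal harness under stated assumptions, not the authors' benchmark: `measure_latency_ms` is a hypothetical helper that times any callable after a few untimed warm-up calls, and `fake_predict` is a stand-in so the snippet runs without downloading the model. In practice you would pass `detector.predict` from the Quick Start instead.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_latency_ms(
    fn: Callable[[str], object],
    texts: Sequence[str],
    warmup: int = 3,
    runs: int = 20,
) -> dict:
    """Time fn(text) over several runs and summarize per-call latency in ms."""
    # Warm-up calls are not timed, since first calls often include lazy setup.
    for _ in range(warmup):
        fn(texts[0])

    samples = []
    for i in range(runs):
        text = texts[i % len(texts)]  # cycle through the example inputs
        start = time.perf_counter()
        fn(text)
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
        "runs": len(samples),
    }


def fake_predict(text: str) -> tuple:
    """Stand-in for detector.predict so the example runs offline."""
    return (1 if text.endswith((".", "?", "!")) else 0, 1.0)


stats = measure_latency_ms(
    fake_predict,
    ["What are you doing tonight?", "I think the next logical step is to"],
)
print(stats)
```

Reporting the median alongside the mean helps because occasional slow calls can skew the average.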

🔧 Train & Test Scripts

Train Script Test Script

🛠️ Installation

To use this model, you will need to install the following libraries.

pip install onnxruntime transformers huggingface_hub
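If you are unsure whether the environment is set up, a quick check like the one below can help. It is a small sketch that uses only the standard library. The helper name `missing_packages` is hypothetical, and `numpy` is listed because the Quick Start code imports it directly.

```python
import importlib.util


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["numpy", "onnxruntime", "transformers", "huggingface_hub"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```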

🚀 Quick Start

You can run inference directly from the Hugging Face repository.

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

class TurnDetector:
    def __init__(self, repo_id="videosdk-live/Namo-Turn-Detector-v1-Multilingual"):
        """
        Initializes the detector by downloading the model and tokenizer
        from the Hugging Face Hub.
        """
        print(f"Loading model from repo: {repo_id}")
        
        # Download the model and tokenizer from the Hub
        # Authentication is handled automatically if you are logged in
        model_path = hf_hub_download(repo_id=repo_id, filename="model_quant.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(repo_id)
        
        # Set up the ONNX Runtime inference session
        self.session = ort.InferenceSession(model_path)
        self.max_length = 8192
        print("✅ Model and tokenizer loaded successfully.")

    def predict(self, text: str) -> tuple:
        """
        Predicts if a given text utterance is the end of a turn.
        Returns (predicted_label, confidence) where:
        - predicted_label: 0 for "Not End of Turn", 1 for "End of Turn"
        - confidence: confidence score between 0 and 1
        """
        # Tokenize the input text
        inputs = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            return_tensors="np"
        )
        
        # Prepare the feed dictionary for the ONNX model
        feed_dict = {
            "input_ids": inputs["input_ids"],
            "attention_mask": inputs["attention_mask"]
        }
        
        # Run inference
        outputs = self.session.run(None, feed_dict)
        logits = outputs[0]

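        # Convert the logits for the single input into class probabilities
        probabilities = self._softmax(logits[0])
        # Cast to a plain int so callers do not receive a NumPy scalar
        predicted_label = int(np.argmax(probabilities))
        confidence = float(np.max(probabilities))

        return predicted_label, confidence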

    def _softmax(self, x, axis=-1):
        # Subtract the maximum before exponentiating for numerical stability
        exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

# --- Example Usage ---
if __name__ == "__main__":
    detector = TurnDetector()
    
    sentences = [
        "They're often made with oil or sugar.",                         # Expected: End of Turn
        "I think the next logical step is to",                           # Expected: Not End of Turn
        "What are you doing tonight?",                                   # Expected: End of Turn
        "The Revenue Act of 1862 adopted rates that increased with",     # Expected: Not End of Turn
    ]
    
    for sentence in sentences:
        predicted_label, confidence = detector.predict(sentence)
        result = "End of Turn" if predicted_label == 1 else "Not End of Turn"
        print(f"'{sentence}' -> {result} (confidence: {confidence:.3f})")
        print("-" * 50)
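In a voice pipeline, the `(predicted_label, confidence)` pair returned by `predict` typically feeds a decision about whether to respond now or keep listening. One possible policy, shown purely as an illustration, is to end the turn only when the model predicts label 1 with confidence at or above a threshold. The names `should_end_turn` and `decide` and the default threshold of 0.7 are assumptions, not values recommended by the model authors. The softmax is reimplemented with the standard library so the snippet runs without NumPy or the model.

```python
import math
from typing import Sequence, Tuple


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over a flat sequence of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def should_end_turn(label: int, confidence: float, threshold: float = 0.7) -> bool:
    """Hypothetical policy: end the turn only on a confident 'End of Turn'."""
    return label == 1 and confidence >= threshold


def decide(logits: Sequence[float], threshold: float = 0.7) -> Tuple[int, float, bool]:
    """Mirror predict(): argmax label, max probability, then apply the policy."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[label]
    return label, confidence, should_end_turn(label, confidence, threshold)


# Made-up logits in the order [not end of turn, end of turn]
print(decide([-1.0, 2.0]))  # confident end of turn
print(decide([0.1, 0.3]))   # label 1, but confidence is below the threshold
print(decide([2.0, -1.0]))  # not end of turn
```

A higher threshold makes the agent wait longer before replying, which reduces interruptions at the cost of slower responses. The right value depends on the application.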

🤖 VideoSDK Agents Integration

Integrate this turn detector directly with VideoSDK Agents for production-ready conversational AI applications.

from videosdk_agents import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Download the model ahead of time
pre_download_namo_turn_v1_model()

# Initialize Multilingual turn detector for VideoSDK Agents
turn_detector = NamoTurnDetectorV1()

📚 Complete Integration Guide - Learn how to use NamoTurnDetectorV1 with VideoSDK Agents

📖 Citation

@misc{namo_turn_detector_en_2025,
  title={Namo Turn Detector v1: Multilingual},
  author={VideoSDK Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/videosdk-live/Namo-Turn-Detector-v1-Multilingual},
  note={ONNX-optimized mmBERT for turn detection in 23 Languages}
}

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Made with ❤️ by the VideoSDK Team
