Voicemail Detector - CNN (ONNX)

Fast and lightweight CNN model for real-time voicemail detection. On a 101-sample test set it reaches 80.20% overall accuracy with 100% precision on voicemail, and CPU inference stays under 20ms, which makes it a good fit for production phone systems.

Model Description

Model Type: Audio Classification (Binary CNN)
Architecture: Convolutional Neural Network with Mel-spectrogram features
Format: ONNX with external data
Input: 4 seconds of audio at 16kHz (64,000 samples)
Output: Binary classification (live_human vs voicemail)
Model Size: ~13.2 MB

Performance Metrics

Accuracy

Overall Accuracy: 80.20% (81/101 test samples)
Live Human Detection: 100.00% (34/34 correct)
Voicemail Detection: 70.15% (47/67 correct)
Precision (Voicemail): 100.00% (no false positives)
Recall (Voicemail): 70.15%
F1 Score: 82.46%

Inference Speed

Average Inference Time: 10.82ms (CPU)
Min/Max Time: 10.01ms / 16.00ms
Real-time Capable: Yes (< 50ms)
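
Latency figures like these depend on hardware, so it can help to re-measure them locally. The following is a minimal sketch of a timing harness; the `benchmark` helper and its parameters are illustrative names, not part of this repository, and it times any zero-argument callable rather than assuming a specific model file.

```python
import statistics
import time
from typing import Callable


def benchmark(run_once: Callable[[], object], n_warmup: int = 5, n_runs: int = 50) -> dict:
    """Time a zero-argument callable and report avg/min/max latency in ms.

    Hypothetical helper for illustration; not part of the model repository.
    """
    # Warm-up runs are excluded so one-time initialization does not skew results
    for _ in range(n_warmup):
        run_once()

    timings_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_once()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "avg_ms": statistics.mean(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
    }
```

With a loaded session and a prepared spectrogram, a call such as `benchmark(lambda: session.run(None, {"input": mel_spec}))` would time inference only, excluding feature extraction.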

Resource Efficiency

Model Size: 18.19 MB (in memory)
Inference Memory: ~373 MB
Multi-worker Friendly: Yes (model is 67x smaller than Wav2Vec2)

Comparison with Wav2Vec2 Model

The CNN model excels at:

Speed: 65x faster (11ms vs 705ms)
Size: 67x smaller (18MB vs 1.2GB)
Live Human Detection: 100% accuracy on the test set (34/34)
Precision: 100% on the test set (no false positives on voicemail detection)
Simple deployment: No transformers dependency

Use Cases

This model is ideal for:

📞 Real-time phone systems requiring instant voicemail detection
🏭 Production environments with multiple concurrent workers
⚡ Low-latency applications where response time is critical
💻 Resource-constrained deployments with limited memory
🎯 High-precision scenarios where false positives must be avoided (100% precision on the test set)
👤 Live human detection where missing a live caller is costly (34/34 live humans detected on the test set)

Best suited for: Production systems prioritizing speed, scalability, and no false positives on voicemail detection (as measured on the 101-sample test set).

Installation

pip install onnxruntime numpy librosa

Usage

Basic Inference

import numpy as np
import onnxruntime as ort
import librosa

def extract_mel_spectrogram(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Extract mel-spectrogram features from audio.
    
    Args:
        audio: Audio array of shape (64000,) - 4 seconds at 16kHz
        sr: Sample rate (default: 16000)
    
    Returns:
        Mel-spectrogram of shape (1, 1, 128, 251)
    """
    # Compute mel-spectrogram
    mel_spec = librosa.feature.melspectrogram(
        y=audio,
        sr=sr,
        n_fft=512,
        hop_length=256,
        n_mels=128,
        fmin=0,
        fmax=8000,
    )
    
    # Convert to log scale (dB)
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
    
    # Normalize to [0, 1]
    mel_spec_normalized = (mel_spec_db - mel_spec_db.min()) / (
        mel_spec_db.max() - mel_spec_db.min() + 1e-8
    )
    
    # Reshape to (1, 1, 128, 251)
    return mel_spec_normalized.reshape(1, 1, 128, -1)

# Load ONNX model
session = ort.InferenceSession("model.onnx")

# Load audio (4 seconds at 16kHz = 64,000 samples)
audio, sr = librosa.load("audio.wav", sr=16000, mono=True)
audio_segment = audio[:64000]

# Pad if shorter
if len(audio_segment) < 64000:
    audio_segment = np.pad(audio_segment, (0, 64000 - len(audio_segment)))

# Extract features
mel_spec = extract_mel_spectrogram(audio_segment)

# Run inference
outputs = session.run(None, {"input": mel_spec.astype(np.float32)})
logits = outputs[0]

# Get prediction
prediction_idx = np.argmax(logits, axis=-1)[0]
result = "voicemail" if prediction_idx == 1 else "live_human"

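# Get confidence scores (numerically stable softmax; this assumes the model
# outputs raw logits, so skip it if the exported graph already ends in Softmax)
exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
probabilities = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)
confidence = probabilities[0][prediction_idx]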

print(f"Detection: {result} (confidence: {confidence:.2%})")

Real-time Audio Processing

import numpy as np
import onnxruntime as ort
# extract_mel_spectrogram() is the helper defined in the Basic Inference example

class VoicemailDetector:
    """Real-time voicemail detector using CNN model."""
    
    def __init__(self, model_path: str, sample_rate: int = 16000):
        self.session = ort.InferenceSession(model_path)
        self.sample_rate = sample_rate
        self.buffer_duration = 4.0  # seconds
        self.buffer_size = int(sample_rate * self.buffer_duration)
        self.audio_buffer = np.zeros(self.buffer_size, dtype=np.float32)
    
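    def add_audio(self, audio_chunk: np.ndarray):
        """Add audio chunk to buffer (rolling window)."""
        # Keep only the newest samples if the chunk is longer than the buffer
        audio_chunk = np.asarray(audio_chunk, dtype=np.float32)[-self.buffer_size:]
        chunk_size = len(audio_chunk)
        if chunk_size == 0:
            return  # nothing to add; also avoids the [-0:] slicing pitfall
        
        # Shift buffer and add new audio
        self.audio_buffer = np.roll(self.audio_buffer, -chunk_size)
        self.audio_buffer[-chunk_size:] = audio_chunk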
    
    def detect(self) -> tuple[str, float]:
        """Detect voicemail from current buffer.
        
        Returns:
            Tuple of (prediction, confidence)
        """
        # Extract features
        mel_spec = extract_mel_spectrogram(self.audio_buffer, self.sample_rate)
        
        # Run inference
        outputs = self.session.run(None, {"input": mel_spec.astype(np.float32)})
        logits = outputs[0]
        
        # Get prediction
        prediction_idx = np.argmax(logits, axis=-1)[0]
        result = "voicemail" if prediction_idx == 1 else "live_human"
        
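        # Calculate confidence (numerically stable softmax; assumes raw logits)
        exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
        probabilities = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)
        confidence = probabilities[0][prediction_idx]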
        
        return result, float(confidence)

# Usage
detector = VoicemailDetector("model.onnx")

# Simulate streaming audio (audio_stream is a placeholder for any iterable of
# float32 mono chunks sampled at 16kHz)
for audio_chunk in audio_stream:
    detector.add_audio(audio_chunk)
    result, confidence = detector.detect()
    print(f"{result}: {confidence:.2%}")
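
The streaming loop expects an iterable of audio chunks. As a rough sketch of one way to produce such an iterable for offline testing, the generator below slices an in-memory array into fixed-size chunks. The name `chunk_audio` and the 1,600-sample (100ms at 16kHz) chunk size are assumptions chosen for illustration; a real phone system would supply chunks from its own audio source.

```python
from typing import Iterator

import numpy as np


def chunk_audio(audio: np.ndarray, chunk_size: int = 1600) -> Iterator[np.ndarray]:
    """Yield consecutive float32 chunks of `audio` (hypothetical test helper).

    The last chunk may be shorter than `chunk_size`.
    """
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size].astype(np.float32)


# Example: four seconds of silence split into 100ms chunks
audio = np.zeros(64000, dtype=np.float32)
audio_stream = chunk_audio(audio)
```

Because the detector keeps a 4-second rolling buffer, predictions made before the buffer has filled are based partly on the initial zero padding.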

Model Architecture

Input: Audio (4s @ 16kHz) → Mel-Spectrogram (128 mels, 251 time steps)
  ↓
Conv2D (32 filters, 3x3) + ReLU + MaxPool2D
  ↓
Conv2D (64 filters, 3x3) + ReLU + MaxPool2D
  ↓
Conv2D (128 filters, 3x3) + ReLU + MaxPool2D
  ↓
Flatten + Dropout (0.5)
  ↓
Dense (128) + ReLU + Dropout (0.5)
  ↓
Dense (2) → Softmax
  ↓
Output: [live_human_prob, voicemail_prob]

Important Implementation Notes

Audio Requirements

Duration: Exactly 4 seconds (64,000 samples)
Sample Rate: 16kHz
Channels: Mono
Format: Float32 numpy array
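
These requirements can be enforced in one place before feature extraction. The sketch below is one possible way to do it; `prepare_audio` is a hypothetical helper, and the stereo handling assumes a `(channels, samples)` layout, which is an assumption rather than something this model card specifies. Resampling to 16kHz is not covered here.

```python
import numpy as np

TARGET_SAMPLES = 64000  # 4 seconds at 16kHz


def prepare_audio(audio: np.ndarray, target_len: int = TARGET_SAMPLES) -> np.ndarray:
    """Return a mono float32 array of exactly `target_len` samples."""
    audio = np.asarray(audio, dtype=np.float32)

    # Assumption: 2-D input is laid out as (channels, samples); average to mono
    if audio.ndim == 2:
        audio = audio.mean(axis=0)

    # Truncate long clips, then zero-pad short ones at the end
    audio = audio[:target_len]
    if len(audio) < target_len:
        audio = np.pad(audio, (0, target_len - len(audio)))
    return audio
```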

Feature Extraction

The model expects mel-spectrograms with these parameters:

n_fft: 512
hop_length: 256
n_mels: 128
fmin: 0 Hz
fmax: 8000 Hz
Normalization: Min-max scaling to [0, 1] after log-scale conversion
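
The 251 time steps in the input shape follow from these parameters. As a quick check, assuming librosa's default centered framing (`center=True`), the frame count is `1 + n_samples // hop_length`:

```python
def num_frames(n_samples: int, hop_length: int = 256) -> int:
    """Number of STFT frames with centered framing (librosa's default)."""
    return 1 + n_samples // hop_length


n_frames = num_frames(64000)
print(n_frames)  # 251
```

A clip that is not exactly 64,000 samples long produces a different frame count and therefore a spectrogram that does not match the expected input shape.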

Model Input/Output

Input:

Name: input
Shape: [1, 1, 128, 251]
Type: float32
Format: Normalized mel-spectrogram

Output:

Name: output
Shape: [1, 2]
Type: float32
Classes: [0: live_human, 1: voicemail]
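
The names and shapes listed here can be checked against the actual file. The sketch below uses ONNX Runtime's `get_inputs()` and `get_outputs()` methods on a loaded session; `describe_io` itself is an illustrative helper, not part of the repository.

```python
def describe_io(session) -> dict:
    """Summarize input/output names, shapes, and types of an ONNX Runtime session.

    `session` is expected to be an onnxruntime.InferenceSession.
    """
    return {
        "inputs": [(arg.name, arg.shape, arg.type) for arg in session.get_inputs()],
        "outputs": [(arg.name, arg.shape, arg.type) for arg in session.get_outputs()],
    }
```

Calling `describe_io(ort.InferenceSession("model.onnx"))` should report an input named `input` and an output named `output` if the file matches this card. A symbolic (non-integer) first dimension in the reported shape would indicate that the batch axis is dynamic.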

Training Details

Architecture: Custom CNN for audio classification
Training Data: Curated dataset of voicemail greetings and live human responses
Optimization: Focused on voicemail beep and silence detection
Export Method: PyTorch → ONNX

Strengths & Weaknesses

Strengths ✅

Perfect live human detection on the test set (34/34 - no live person misclassified)
Perfect precision on voicemail on the test set (100% - zero false positives)
Very fast inference (11ms - 65x faster than Wav2Vec2)
Small model (18MB in memory - 67x smaller than Wav2Vec2)
Simple preprocessing (just mel-spectrograms, no transformers)
Real-time capable for production systems
Multi-worker friendly
F1 score of 82.46%, driven by perfect precision with lower recall

Weaknesses ❌

Lower recall on voicemail detection (70.15% - misses some voicemails)
May classify some voicemails as live humans (about 30% false negatives: 20 of 67 in the test set)
Less sophisticated than transformer-based models
Trade-off: prioritizes not missing live humans over catching all voicemails

Evaluation Results

Full Test Dataset (101 samples)

Category	Correct	Total	Accuracy
Live Human	34	34	100.0%
Voicemail	47	67	70.15%
Overall	81	101	80.20%

Confusion Matrix

	Predicted Live Human	Predicted Voicemail
Actual Live Human	34 (True Negative)	0 (False Positive)
Actual Voicemail	20 (False Negative)	47 (True Positive)

Key Metrics Summary

Precision (Voicemail): 100.00% - On the test set, every voicemail prediction was correct
Recall (Voicemail): 70.15% - Catches 70% of all voicemails
F1 Score: 82.46% - Harmonic mean of precision and recall
Live Human Accuracy: 100.00% - No live person was misclassified in the test set
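
These figures can be reproduced directly from the confusion matrix, treating voicemail as the positive class. The helper name `binary_metrics` is illustrative:

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "live_human_accuracy": tn / (tn + fp),
    }


# Counts from the confusion matrix: 47 TP, 0 FP, 20 FN, 34 TN
metrics = binary_metrics(tp=47, fp=0, fn=20, tn=34)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With only 101 samples, each misclassified clip shifts overall accuracy by about one percentage point, so these numbers are best read as indicative.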

Note: The model is tuned to never miss a live human caller, which results in some voicemails being classified as live humans. This is ideal for customer service scenarios where missing a live caller is worse than occasionally forwarding a voicemail to a human agent.

Optimization Tips

For Production Deployment

Use ONNX Runtime optimizations:

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options)

Batch processing for multiple calls:

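# Process multiple audio samples at once. Each mel_spec already has shape
# (1, 1, 128, 251), so concatenate along the batch axis instead of stacking.
# Batching only works if the model was exported with a dynamic batch dimension.
batch_input = np.concatenate([mel_spec1, mel_spec2, mel_spec3], axis=0)  # Shape: (3, 1, 128, 251)
outputs = session.run(None, {"input": batch_input.astype(np.float32)})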

Reuse feature extraction: Cache mel-filterbank computation for faster repeated processing.

License

MIT License - Free for commercial and non-commercial use.

Model Card Contact

For questions or issues, please open an issue in the repository.

Related Models:

Wav2Vec2 Voicemail Detector - Higher accuracy on live humans (100%)
This CNN model is recommended for most production use cases due to its superior speed and efficiency
