# Voicemail Detector - CNN (ONNX)

Fast and lightweight CNN model for real-time voicemail detection. On a 101-sample test set the model reaches 80.20% overall accuracy with 100% voicemail precision, while keeping CPU inference under 20ms, which makes it well suited to production phone systems.

## Model Description

- **Model Type:** Audio Classification (Binary CNN)
- **Architecture:** Convolutional Neural Network with Mel-spectrogram features
- **Format:** ONNX with external data
- **Input:** 4 seconds of audio at 16kHz (64,000 samples)
- **Output:** Binary classification (live_human vs voicemail)
- **Model Size:** ~13.2 MB

## Performance Metrics

### Accuracy

Measured on a test set of 101 samples (34 live human, 67 voicemail):

- **Overall Accuracy:** 80.20% (81/101 test samples)
- **Live Human Detection:** 100.00% (34/34 correct)
- **Voicemail Detection:** 70.15% (47/67 correct)
- **Precision (Voicemail):** 100.00% (no false positives)
- **Recall (Voicemail):** 70.15%
- **F1 Score:** 82.46%

### Inference Speed

- **Average Inference Time:** 10.82ms (CPU)
- **Min/Max Time:** 10.01ms / 16.00ms
- **Real-time Capable:** Yes (< 50ms)

### Resource Efficiency

- **Model Size:** 18.19 MB (in memory)
- **Inference Memory:** ~373 MB
- **Multi-worker Friendly:** Yes (67x more efficient than Wav2Vec2)

### Comparison with Wav2Vec2 Model

The CNN model excels at:

- **Speed:** 65x faster (11ms vs 705ms)
- **Size:** 67x smaller (18MB vs 1.2GB)
- **Live Human Detection:** 100% accuracy on the test set (34/34)
- **Precision:** 100% (no false positives on voicemail detection)
- **Simple deployment:** No transformers dependency

## Use Cases

This model is ideal for:

- 📞 **Real-time phone systems** requiring instant voicemail detection
- 🏭 **Production environments** with multiple concurrent workers
- ⚡ **Low-latency applications** where response time is critical
- 💻 **Resource-constrained deployments** with limited memory
- 🎯 **High-precision scenarios** where false positives must be avoided (100% precision on the test set)
- 👤 **Live human detection** where a live caller must not be missed (100% on the test set's live human samples)

**Best suited for:** Production systems prioritizing speed, scalability, and zero false positives on voicemail detection.

## Installation

```bash
pip install onnxruntime numpy librosa
```

## Usage

### Basic Inference

```python
import numpy as np
import onnxruntime as ort
import librosa


def extract_mel_spectrogram(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Extract mel-spectrogram features from audio.

    Args:
        audio: Audio array of shape (64000,) - 4 seconds at 16kHz
        sr: Sample rate (default: 16000)

    Returns:
        Mel-spectrogram of shape (1, 1, 128, 251)
    """
    # Compute mel-spectrogram
    mel_spec = librosa.feature.melspectrogram(
        y=audio,
        sr=sr,
        n_fft=512,
        hop_length=256,
        n_mels=128,
        fmin=0,
        fmax=8000,
    )

    # Convert to log scale (dB)
    mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

    # Normalize to [0, 1]
    mel_spec_normalized = (mel_spec_db - mel_spec_db.min()) / (
        mel_spec_db.max() - mel_spec_db.min() + 1e-8
    )

    # Reshape to (1, 1, 128, 251)
    return mel_spec_normalized.reshape(1, 1, 128, -1)


# Load ONNX model (the model uses external data, so keep the accompanying
# data file in the same directory as model.onnx)
session = ort.InferenceSession("model.onnx")

# Load audio (4 seconds at 16kHz = 64,000 samples)
audio, sr = librosa.load("audio.wav", sr=16000, mono=True)
audio_segment = audio[:64000]

# Pad if shorter
if len(audio_segment) < 64000:
    audio_segment = np.pad(audio_segment, (0, 64000 - len(audio_segment)))

# Extract features
mel_spec = extract_mel_spectrogram(audio_segment)

# Run inference
outputs = session.run(None, {"input": mel_spec.astype(np.float32)})
logits = outputs[0]

# Get prediction
prediction_idx = np.argmax(logits, axis=-1)[0]
result = "voicemail" if prediction_idx == 1 else "live_human"

# Get confidence scores (softmax, shifted by the max for numerical stability).
# Note: this treats the output as logits. If your exported graph already ends
# in Softmax, as the architecture diagram suggests, use the output directly.
exp_logits = np.exp(logits - logits.max(axis=-1, keepdims=True))
probabilities = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)
confidence = probabilities[0][prediction_idx]

print(f"Detection: {result} (confidence: {confidence:.2%})")
```

### Real-time Audio Processing

```python
import numpy as np
import onnxruntime as ort

# Reuses extract_mel_spectrogram() from the Basic Inference example above.


class VoicemailDetector:
    """Real-time voicemail detector using CNN model."""

    def __init__(self, model_path: str, sample_rate: int = 16000):
        self.session = ort.InferenceSession(model_path)
        self.sample_rate = sample_rate
        self.buffer_duration = 4.0  # seconds
        self.buffer_size = int(sample_rate * self.buffer_duration)
        self.audio_buffer = np.zeros(self.buffer_size, dtype=np.float32)

    def add_audio(self, audio_chunk: np.ndarray):
        """Add audio chunk to buffer (rolling window)."""
        # Keep only the newest buffer_size samples of an oversized chunk
        audio_chunk = np.asarray(audio_chunk, dtype=np.float32)[-self.buffer_size:]
        chunk_size = len(audio_chunk)
        if chunk_size == 0:
            return  # nothing to add

        # Shift buffer and add new audio
        self.audio_buffer = np.roll(self.audio_buffer, -chunk_size)
        self.audio_buffer[-chunk_size:] = audio_chunk

    def detect(self) -> tuple[str, float]:
        """Detect voicemail from current buffer.

        Returns:
            Tuple of (prediction, confidence)
        """
        # Extract features (the model input must be float32)
        mel_spec = extract_mel_spectrogram(self.audio_buffer, self.sample_rate)

        # Run inference
        outputs = self.session.run(None, {"input": mel_spec.astype(np.float32)})
        logits = outputs[0]

        # Get prediction
        prediction_idx = np.argmax(logits, axis=-1)[0]
        result = "voicemail" if prediction_idx == 1 else "live_human"

        # Calculate confidence (numerically stable softmax)
        exp_logits = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probabilities = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)
        confidence = probabilities[0][prediction_idx]

        return result, float(confidence)


# Usage
detector = VoicemailDetector("model.onnx")

# Simulate streaming audio. `audio_stream` is a placeholder for any iterable
# of mono float32 chunks sampled at 16kHz.
for audio_chunk in audio_stream:
    detector.add_audio(audio_chunk)
    result, confidence = detector.detect()
    print(f"{result}: {confidence:.2%}")
```

## Model Architecture

```
Input: Audio (4s @ 16kHz) → Mel-Spectrogram (128 mels, 251 time steps)
    ↓
Conv2D (32 filters, 3x3) + ReLU + MaxPool2D
    ↓
Conv2D (64 filters, 3x3) + ReLU + MaxPool2D
    ↓
Conv2D (128 filters, 3x3) + ReLU + MaxPool2D
    ↓
Flatten + Dropout (0.5)
    ↓
Dense (128) + ReLU + Dropout (0.5)
    ↓
Dense (2) → Softmax
    ↓
Output: [live_human_prob, voicemail_prob]
```

## Important Implementation Notes

### Audio Requirements

- **Duration:** Exactly 4 seconds (64,000 samples)
- **Sample Rate:** 16kHz
- **Channels:** Mono
- **Format:** Float32 numpy array

### Feature Extraction

The model expects mel-spectrograms with these parameters:

- **n_fft:** 512
- **hop_length:** 256
- **n_mels:** 128
- **fmin:** 0 Hz
- **fmax:** 8000 Hz
- **Normalization:** Min-max scaling to [0, 1] after log-scale conversion

### Model Input/Output

**Input:**

- Name: `input`
- Shape: `[1, 1, 128, 251]`
- Type: `float32`
- Format: Normalized mel-spectrogram

**Output:**

- Name: `output`
- Shape: `[1, 2]`
- Type: `float32`
- Classes: `[0: live_human, 1: voicemail]`

## Training Details

- **Architecture:** Custom CNN for audio classification
- **Training Data:** Curated dataset of voicemail greetings and live human responses
- **Optimization:** Focused on voicemail beep and silence detection
- **Export Method:** PyTorch → ONNX

## Strengths & Weaknesses

### Strengths ✅

- Perfect live human detection on the test set (100% accuracy - no live person was missed)
- Perfect precision on voicemail (100% - zero false positives on the test set)
- Very fast inference (11ms - 65x faster than the Wav2Vec2 model)
- Small model footprint (18MB in memory - 67x smaller than the Wav2Vec2 model)
- Simple preprocessing (just mel-spectrograms, no transformers)
- Real-time capable for production systems
- Multi-worker friendly
- F1 score of 82.46% on voicemail detection

### Weaknesses ❌

- Lower recall on voicemail detection (70.15% - misses some voicemails)
- Classifies some voicemails as live humans (20 of 67 in the test set, about 30% false negatives)
- Less sophisticated than transformer-based models
- Trade-off: prioritizes not missing live humans over catching all voicemails

## Evaluation Results

### Full Test Dataset (101 samples)

| Category | Correct | Total | Accuracy |
|----------|---------|-------|----------|
| Live Human | 34 | 34 | 100.00% |
| Voicemail | 47 | 67 | 70.15% |
| **Overall** | **81** | **101** | **80.20%** |

### Confusion Matrix

| | Predicted Live Human | Predicted Voicemail |
|----------------|---------------------|---------------------|
| **Actual Live Human** | 34 (True Negative) | 0 (False Positive) |
| **Actual Voicemail** | 20 (False Negative) | 47 (True Positive) |

### Key Metrics Summary

- **Precision (Voicemail):** 100.00% - Every voicemail prediction in the test set was correct
- **Recall (Voicemail):** 70.15% - Catches 70% of all voicemails
- **F1 Score:** 82.46% - Harmonic mean of precision and recall
- **Live Human Accuracy:** 100.00% - No live person in the test set was misclassified

**Note:** The model is tuned to never miss a live human caller, which results in some voicemails being classified as live humans. This is ideal for customer service scenarios where missing a live caller is worse than occasionally forwarding a voicemail to a human agent.

## Optimization Tips

### For Production Deployment

1. **Use ONNX Runtime optimizations:**

   ```python
   sess_options = ort.SessionOptions()
   sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
   session = ort.InferenceSession("model.onnx", sess_options)
   ```

2. **Batch processing for multiple calls** (requires a model exported with a dynamic batch dimension; the documented input shape has a batch size of 1):

   ```python
   # Process multiple audio samples at once.
   # Each mel_spec has shape (1, 1, 128, 251), so join along the batch axis.
   batch_input = np.concatenate([mel_spec1, mel_spec2, mel_spec3], axis=0)
   batch_input = batch_input.astype(np.float32)  # Shape: (3, 1, 128, 251)
   outputs = session.run(None, {"input": batch_input})
   ```

3. **Reuse feature extraction:** Cache mel-filterbank computation for faster repeated processing.

## License

MIT License - Free for commercial and non-commercial use.

## Model Card Contact

For questions or issues, please open an issue in the repository.

---

**Related Models:**

- [Wav2Vec2 Voicemail Detector](../voicemail-detector-wav2vec2-onnx) - Higher accuracy on live humans (100%)

This CNN model is recommended for most production use cases due to its superior speed and efficiency.
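
## Reproducing the Metrics

The precision, recall, F1, and accuracy figures reported under Evaluation Results follow directly from the confusion matrix counts, with voicemail treated as the positive class. The sketch below recomputes them from those four counts; the helper names `safe_div` and `summarize` are illustrative and are not part of the model's tooling.

```python
def safe_div(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute binary classification metrics from confusion matrix counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        # Share of live human samples predicted as live human
        "live_human_accuracy": safe_div(tn, tn + fp),
    }


# Counts copied from the confusion matrix (positive class = voicemail)
metrics = summarize(tp=47, fp=0, tn=34, fn=20)

for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Running the same function on counts from your own call recordings is a quick way to check whether the 70.15% voicemail recall holds for your traffic before relying on the model in production.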