# Voice-Based Stress Recognition (StudentNet)
Model Card for forwarder1121/voice-based-stress-recognition
## Model Details
- Model name: Voice-Based Stress Recognition (StudentNet)
- Repository: https://huggingface.co/forwarder1121/voice-based-stress-recognition
- License: MIT
- Library version: PyTorch ≥1.7
Model architecture:
A lightweight MLP-based StudentNet distilled from a multimodal TeacherNet trained on the StressID dataset.
- Inputs: 512-dim audio embedding 
- Embedding spec: 512-dimensional embeddings generated by fairseq's Wav2Vec2 (base) model
- Layers:
  - Linear(512→128) → ReLU → Dropout(0.3) → LayerNorm
  - Dropout(0.3) → Linear(128→128) → ReLU → Dropout(0.3)
  - Linear(128→2) → Softmax
 
Output:
Two-class stress probability:  
- index 0 → “not stressed”
- index 1 → “stressed”
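
For reference, here is a minimal PyTorch sketch of the layer stack listed above. It is an illustration only; the actual implementation ships as `models.py` in this repository and may differ in naming and details.

```python
import torch
import torch.nn as nn

class StudentNetSketch(nn.Module):
    """Illustrative re-statement of the StudentNet MLP described above
    (not the repository's models.py)."""

    def __init__(self, dim_w2v: int = 512, hidden: int = 128,
                 num_classes: int = 2, p_drop: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_w2v, hidden), nn.ReLU(), nn.Dropout(p_drop), nn.LayerNorm(hidden),
            nn.Dropout(p_drop), nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x_w2v: torch.Tensor) -> torch.Tensor:
        # x_w2v: (batch, 512) pre-computed audio embedding
        # returns: (batch, 2) probabilities -> [not stressed, stressed]
        return torch.softmax(self.net(x_w2v), dim=-1)
```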
## Intended Use & Limitations
Intended use:
- Real-time binary stress detection on edge devices or mobile apps using only audio input.
- Lightweight inference where only pre-computed audio embeddings are available.
Limitations:
- Not designed for multiclass stress intensity prediction.
- Trained on StressID data — performance may degrade on other languages or recording setups.
- Assumes clean audio and accurate W2V embeddings; high background noise may reduce accuracy.
## Training Data
- Dataset: StressID
- Modalities collected: ECG, RR, EDA, face/video, voice
- Labels: Self-assessment on a 0–10 scale, converted to binary stress labels (0 if < 5, 1 if ≥ 5; see the sketch after this list)
- Split:
  - Used only the train split for Teacher training; the test split was held out for final evaluation
  - Ensured no subject's tasks appeared in more than one split
 
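For clarity, the binary label rule described above corresponds to the following (illustrative) conversion:

```python
def binarize_stress(self_assessment: float) -> int:
    # StressID self-assessment on a 0-10 scale -> binary stress label
    # 0 = not stressed (score < 5), 1 = stressed (score >= 5)
    return 1 if self_assessment >= 5 else 0
```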
## Training Procedure
- TeacherNet trained on all four modalities (ECG, RR, EDA, Video) with CrossEntropyLoss.
- StudentNet trained on audio embeddings with a distillation loss (see the sketch after this list): `loss = CE(student_logits, labels) + α * MSE(student_features, teacher_features)`
- α ∈ {0, 1e−7, 1e−6}, best performance at α = 1e−6
- Optimizer: AdamW, lr=1e−4, batch_size=8, epochs=100, early stopping patience=100
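
As a concrete reference, the distillation objective above can be written as the following sketch (tensor names are illustrative, not taken from the training code):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      student_features: torch.Tensor,
                      teacher_features: torch.Tensor,
                      labels: torch.Tensor,
                      alpha: float = 1e-6) -> torch.Tensor:
    # Cross-entropy on the ground-truth binary stress labels
    ce = F.cross_entropy(student_logits, labels)
    # Feature-level knowledge distillation from the multimodal TeacherNet
    mse = F.mse_loss(student_features, teacher_features)
    return ce + alpha * mse
```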
## Evaluation
| Model                    | Accuracy | Macro-F1 | UAR    |
|--------------------------|----------|----------|--------|
| TeacherNet (multimodal)  | ≈ 0.82   | ≈ 0.80   | ≈ 0.79 |
| StudentNet (α = 0)       | ≈ 0.65   | ≈ 0.62   | ≈ 0.61 |
| StudentNet (α = 1e−6)    | ≈ 0.76   | ≈ 0.74   | ≈ 0.73 |
## ⚡️ Wav2Vec2 Embedding Notice
- Audio input for this model should be converted to a 512-dimensional embedding using fairseq's Wav2Vec2 (base) model (torchaudio.pipelines.WAV2VEC2_BASE).
- The exact model weights used for embedding extraction during training are provided as `wav2vec_large.pt` in the root directory of this repository.
- To use this model for inference on raw audio:
  1. Load `wav2vec_large.pt` with torchaudio/fairseq (see the sketch below).
  2. Generate the 512-dim audio embedding for your input audio.
  3. Pass this embedding to StudentNet.
 
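The snippet below is a minimal sketch of steps 1–2, assuming `wav2vec_large.pt` is a fairseq wav2vec checkpoint loadable via `fairseq.models.wav2vec.Wav2VecModel` and that the 512-dim per-frame features are mean-pooled over time into a single embedding; verify this against the preprocessing actually used for training.

```python
import torch
from fairseq.models.wav2vec import Wav2VecModel

# Load the checkpoint shipped in this repository (local path is an assumption).
cp = torch.load("wav2vec_large.pt", map_location="cpu")
w2v = Wav2VecModel.build_model(cp["args"], task=None)
w2v.load_state_dict(cp["model"])
w2v.eval()

# Mono 16 kHz waveform, shape (1, num_samples); replace with e.g. torchaudio.load(...)
waveform = torch.randn(1, 16000)

with torch.no_grad():
    z = w2v.feature_extractor(waveform)   # (1, 512, T) frame-level features
    c = w2v.feature_aggregator(z)         # (1, 512, T) context representations
    x_w2v = c.mean(dim=-1)                # (1, 512) time-averaged embedding

# x_w2v can now be passed to StudentNet as shown in "How to Use" below.
```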
## How to Use
Below is a self-contained example that dynamically downloads both the model code (models.py) and the weights from the Hub, then runs inference via the Hugging Face Transformers API—all in one script:
```python
from huggingface_hub import hf_hub_download
import importlib.util
from transformers import AutoConfig, AutoModelForAudioClassification
import torch
import torch.nn.functional as F

def main():
    repo = "forwarder1121/voice-based-stress-recognition"

    # 1) Dynamically download & load the custom models.py
    code_path = hf_hub_download(repo_id=repo, filename="models.py")
    spec = importlib.util.spec_from_file_location("models", code_path)
    models = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(models)
    # now we have models.StudentForAudioClassification and models.StressConfig

    # 2) Load config & model via Transformers (with remote code trust)
    cfg = AutoConfig.from_pretrained(repo, trust_remote_code=True)
    model = AutoModelForAudioClassification.from_pretrained(
        repo,
        trust_remote_code=True,
        torch_dtype="auto",
    )
    model.eval()

    # 3) Prepare a dummy W2V embedding for testing
    #    In real use, replace this with your (1, 512) pre-computed W2V tensor.
    batch_size = 1
    DIM_W2V = 512
    x_w2v = torch.randn(batch_size, DIM_W2V, dtype=next(model.parameters()).dtype)

    # 4) Inference
    with torch.no_grad():
        outputs = model(x_w2v)                # SequenceClassifierOutput
        probs = F.softmax(outputs.logits, dim=-1)

    print(f"Not stressed: {probs[0, 0] * 100:.1f}%")
    print(f"Stressed    : {probs[0, 1] * 100:.1f}%")

if __name__ == "__main__":
    main()
```
## Citation
If you use this model in your research, please cite:
```bibtex
@inproceedings{your2025voice,
  title={Lightweight Audio-Embedding-Based Stress Recognition via Multimodal Knowledge Distillation},
  author={Your Name and …},
  booktitle={Conference/Journal},
  year={2025}
}
```
Contact: [email protected]. Feel free to open an issue or discussion for questions!