AST Fine-Tuned Model for Emotion Classification
This is a fine-tuned Audio Spectrogram Transformer (AST) model, specifically designed for classifying emotions in speech audio. The model was fine-tuned on the CREMA-D dataset, focusing on six emotional categories. The base model was sourced from MIT's pre-trained AST model.
Model Details
- Base Model: MIT/ast-finetuned-audioset-10-10-0.4593
- Fine-Tuned Dataset: CREMA-D
- Architecture: Audio Spectrogram Transformer (AST)
- Model Type: Single-label classification
- Input Features: Log-Mel Spectrograms (128 mel bins)
- Output Classes:
  - ANG: Anger
  - DIS: Disgust
  - FEA: Fear
  - HAP: Happiness
  - NEU: Neutral
  - SAD: Sadness
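The six classes map to integer label ids in the model's `config.json`. A minimal sketch of the single-label prediction rule; the dictionary ordering here is an assumption (alphabetical CREMA-D codes), and the logits are dummy values for illustration:

```python
# Hypothetical id2label mapping for the six CREMA-D emotion codes;
# the authoritative ordering lives in the model's config.json.
id2label = {0: "ANG", 1: "DIS", 2: "FEA", 3: "HAP", 4: "NEU", 5: "SAD"}
label2id = {v: k for k, v in id2label.items()}

# Single-label classification: pick the class with the highest logit
logits = [0.2, -1.1, 0.5, 2.3, 0.0, -0.4]  # dummy logits for illustration
predicted = id2label[max(range(len(logits)), key=logits.__getitem__)]
print(predicted)  # HAP
```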
Model Configuration
- Hidden Size: 768
- Number of Attention Heads: 12
- Number of Hidden Layers: 12
- Patch Size: 16
- Maximum Length: 1024
- Dropout Probability: 0.0
- Activation Function: GELU (Gaussian Error Linear Unit)
- Optimizer: Adam
- Learning Rate: 1e-4
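As a sanity check, the per-head dimension and the AST patch grid follow directly from these hyperparameters. The 10-frame/10-bin patch stride below is an assumption inferred from the "10-10" in the base model name, not a value stated in this card:

```python
hidden_size = 768
num_heads = 12
patch_size = 16
max_length = 1024    # spectrogram time frames
num_mel_bins = 128
stride = 10          # assumed frequency/time stride, from "10-10" in the base model name

head_dim = hidden_size // num_heads                       # 64 dims per attention head
freq_patches = (num_mel_bins - patch_size) // stride + 1  # patches along the mel axis
time_patches = (max_length - patch_size) // stride + 1    # patches along the time axis
num_patches = freq_patches * time_patches                 # total patch tokens per clip
print(head_dim, freq_patches, time_patches, num_patches)  # 64 12 101 1212
```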
Training Details
- Dataset: CREMA-D (Emotion-Labeled Speech Data)
- Data Augmentation:
  - Noise injection
  - Time shifting
  - Speed perturbation
- Fine-Tuning Epochs: 5
- Batch Size: 16
- Learning Rate Scheduler: Linear decay
- Best Validation Accuracy: 60.71%
- Best Checkpoint: ./results/checkpoint-1119
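The three augmentations listed above can be sketched with plain NumPy. The parameter values here are illustrative assumptions, not the values used in training:

```python
import numpy as np

def inject_noise(wave: np.ndarray, noise_factor: float = 0.005) -> np.ndarray:
    """Add Gaussian noise scaled by noise_factor."""
    return wave + noise_factor * np.random.randn(len(wave))

def time_shift(wave: np.ndarray, max_shift: int = 1600) -> np.ndarray:
    """Circularly shift the waveform by a random number of samples."""
    return np.roll(wave, np.random.randint(-max_shift, max_shift))

def speed_perturb(wave: np.ndarray, rate: float = 1.1) -> np.ndarray:
    """Change playback speed by resampling with linear interpolation."""
    old_idx = np.arange(len(wave))
    new_idx = np.arange(0, len(wave), rate)
    return np.interp(new_idx, old_idx, wave)
```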
How to Use
Load the Model
```python
import torchaudio
from transformers import AutoModelForAudioClassification, AutoProcessor

# Load the fine-tuned model and its processor
model = AutoModelForAudioClassification.from_pretrained("forwarder1121/ast-finetuned-model")
processor = AutoProcessor.from_pretrained("forwarder1121/ast-finetuned-model")

# Load the audio and resample to the 16 kHz rate the model expects
waveform, sample_rate = torchaudio.load("path_to_audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# The processor expects a 1-D waveform array (not a file path) and
# converts it into a log-mel spectrogram
inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")

# Make predictions
outputs = model(**inputs)
predicted_class = outputs.logits.argmax(-1).item()
print(f"Predicted emotion: {model.config.id2label[predicted_class]}")
```
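If you need class probabilities rather than just the argmax, apply a softmax to the logits. A minimal pure-Python sketch with dummy logit values:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)                           # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Dummy logits for the six emotion classes
probs = softmax([2.0, 1.0, 0.1, -0.5, 0.0, -1.2])
print(max(probs))  # confidence of the argmax class
```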
Metrics
Validation Results
- Best Validation Accuracy: 60.71%
- Validation Loss: 1.1126
Evaluation Details
- Eval Dataset: CREMA-D test split
- Batch Size: 16
- Number of Steps: 94
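The step count bounds the size of the evaluation split: 94 batches of 16 means between 1,489 and 1,504 examples, assuming the last batch may be partial. A quick arithmetic check:

```python
import math

batch_size = 16
num_steps = 94

max_examples = batch_size * num_steps            # all 94 batches full
min_examples = batch_size * (num_steps - 1) + 1  # last batch holds a single example

# Both extremes round up to exactly 94 steps
assert math.ceil(min_examples / batch_size) == num_steps
assert math.ceil(max_examples / batch_size) == num_steps
print(min_examples, max_examples)  # 1489 1504
```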
Limitations
- The model was trained only on CREMA-D, a corpus of acted English speech from a limited pool of speakers. It may not generalize well to audio with different accents, speaking styles, recording conditions, or languages.
- The best validation accuracy is 60.71%, indicating substantial room for improvement before real-world deployment.
Acknowledgments
This work is based on the Audio Spectrogram Transformer (AST) model by MIT, fine-tuned for emotion classification. Special thanks to the developers of Hugging Face and the CREMA-D dataset contributors.
License
The model is shared under the MIT License. Refer to the licensing details in the repository.
Citation
If you use this model in your work, please cite:
@misc{ast-finetuned-model,
  author = {forwarder1121},
  title = {Fine-Tuned Audio Spectrogram Transformer for Emotion Classification},
  year = {2024},
  url = {https://huggingface.co/forwarder1121/ast-finetuned-model},
}
Contact
For questions, reach out to [email protected].