AST Fine-Tuned Model for Emotion Classification
This is a fine-tuned Audio Spectrogram Transformer (AST) model, specifically designed for classifying emotions in speech audio. The model was fine-tuned on the CREMA-D dataset, focusing on six emotional categories. The base model was sourced from MIT's pre-trained AST model.
Model Details
- Base Model: MIT/ast-finetuned-audioset-10-10-0.4593
- Fine-Tuned Dataset: CREMA-D
- Architecture: Audio Spectrogram Transformer (AST)
- Model Type: Single-label classification
- Input Features: Log-Mel Spectrograms (128 mel bins)
- Output Classes:
  - ANG: Anger
  - DIS: Disgust
  - FEA: Fear
  - HAP: Happiness
  - NEU: Neutral
  - SAD: Sadness
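The six classes map to integer label ids in the model's `config.json`. A minimal sketch of the single-label prediction rule; the dictionary ordering here is an assumption (alphabetical CREMA-D codes), and the logits are dummy values for illustration:

```python
# Hypothetical id2label mapping for the six CREMA-D emotion codes;
# the authoritative ordering lives in the model's config.json.
id2label = {0: "ANG", 1: "DIS", 2: "FEA", 3: "HAP", 4: "NEU", 5: "SAD"}
label2id = {v: k for k, v in id2label.items()}

# Single-label classification: pick the class with the highest logit
logits = [0.2, -1.1, 0.5, 2.3, 0.0, -0.4]  # dummy logits for illustration
predicted = id2label[max(range(len(logits)), key=logits.__getitem__)]
print(predicted)  # HAP
```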
Model Configuration
- Hidden Size: 768
- Number of Attention Heads: 12
- Number of Hidden Layers: 12
- Patch Size: 16
- Maximum Length: 1024
- Dropout Probability: 0.0
- Activation Function: GELU (Gaussian Error Linear Unit)
- Optimizer: Adam
- Learning Rate: 1e-4
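As a sanity check, the per-head dimension and the AST patch grid follow directly from these hyperparameters. The 10-frame/10-bin patch stride below is an assumption inferred from the "10-10" in the base model name, not a value stated in this card:

```python
hidden_size = 768
num_heads = 12
patch_size = 16
max_length = 1024    # spectrogram time frames
num_mel_bins = 128
stride = 10          # assumed frequency/time stride, from "10-10" in the base model name

head_dim = hidden_size // num_heads                       # 64 dims per attention head
freq_patches = (num_mel_bins - patch_size) // stride + 1  # patches along the mel axis
time_patches = (max_length - patch_size) // stride + 1    # patches along the time axis
num_patches = freq_patches * time_patches                 # total patch tokens per clip
print(head_dim, freq_patches, time_patches, num_patches)  # 64 12 101 1212
```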
Training Details
- Dataset: CREMA-D (Emotion-Labeled Speech Data)
- Data Augmentation:
  - Noise injection
  - Time shifting
  - Speed perturbation
- Fine-Tuning Epochs: 5
- Batch Size: 16
- Learning Rate Scheduler: Linear decay
- Best Validation Accuracy: 60.71%
- Best Checkpoint: ./results/checkpoint-1119
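The three augmentations listed above can be sketched with plain NumPy. The parameter values here are illustrative assumptions, not the values used in training:

```python
import numpy as np

def inject_noise(wave: np.ndarray, noise_factor: float = 0.005) -> np.ndarray:
    """Add Gaussian noise scaled by noise_factor."""
    return wave + noise_factor * np.random.randn(len(wave))

def time_shift(wave: np.ndarray, max_shift: int = 1600) -> np.ndarray:
    """Circularly shift the waveform by a random number of samples."""
    return np.roll(wave, np.random.randint(-max_shift, max_shift))

def speed_perturb(wave: np.ndarray, rate: float = 1.1) -> np.ndarray:
    """Change playback speed by resampling with linear interpolation."""
    old_idx = np.arange(len(wave))
    new_idx = np.arange(0, len(wave), rate)
    return np.interp(new_idx, old_idx, wave)
```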
How to Use
Load the Model
```python
import torchaudio
from transformers import AutoModelForAudioClassification, AutoProcessor

# Load the fine-tuned model and its processor
model = AutoModelForAudioClassification.from_pretrained("forwarder1121/ast-finetuned-model")
processor = AutoProcessor.from_pretrained("forwarder1121/ast-finetuned-model")

# Load the audio and resample to the 16 kHz rate the model expects
waveform, sample_rate = torchaudio.load("path_to_audio.wav")
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# The processor expects a 1-D waveform array (not a file path) and
# converts it into a log-mel spectrogram
inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")

# Make predictions
outputs = model(**inputs)
predicted_class = outputs.logits.argmax(-1).item()
print(f"Predicted emotion: {model.config.id2label[predicted_class]}")
```
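If you need class probabilities rather than just the argmax, apply a softmax to the logits. A minimal pure-Python sketch with dummy logit values:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)                           # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Dummy logits for the six emotion classes
probs = softmax([2.0, 1.0, 0.1, -0.5, 0.0, -1.2])
print(max(probs))  # confidence of the argmax class
```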
Metrics
Validation Results
- Best Validation Accuracy: 60.71%
- Validation Loss: 1.1126
Evaluation Details
- Eval Dataset: CREMA-D test split
- Batch Size: 16
- Number of Steps: 94
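The step count bounds the size of the evaluation split: 94 batches of 16 means between 1,489 and 1,504 examples, assuming the last batch may be partial. A quick arithmetic check:

```python
import math

batch_size = 16
num_steps = 94

max_examples = batch_size * num_steps            # all 94 batches full
min_examples = batch_size * (num_steps - 1) + 1  # last batch holds a single example

# Both extremes round up to exactly 94 steps
assert math.ceil(min_examples / batch_size) == num_steps
assert math.ceil(max_examples / batch_size) == num_steps
print(min_examples, max_examples)  # 1489 1504
```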
Limitations
- The model was trained only on CREMA-D, a corpus of acted English speech from a limited pool of speakers. It may not generalize well to audio with different accents, speaking styles, recording conditions, or languages.
- The best validation accuracy is 60.71%, indicating substantial room for improvement before real-world deployment.
Acknowledgments
This work is based on the Audio Spectrogram Transformer (AST) model by MIT, fine-tuned for emotion classification. Special thanks to the developers of Hugging Face and the CREMA-D dataset contributors.
License
The model is shared under the MIT License. Refer to the licensing details in the repository.
Citation
If you use this model in your work, please cite:
@misc{ast-finetuned-model,
  author = {forwarder1121},
  title = {Fine-Tuned Audio Spectrogram Transformer for Emotion Classification},
  year = {2024},
  url = {https://huggingface.co/forwarder1121/ast-finetuned-model},
}
Contact
For questions, reach out to [email protected].