Audio Genre Classifier 🎵
A PyTorch-based convolutional neural network for music genre classification trained on the GTZAN dataset.
Model Description
This model classifies audio files into 10 different music genres:
- 🎵 Blues
- 🎼 Classical
- 🤠 Country
- 🕺 Disco
- 🎤 Hip-Hop
- 🎷 Jazz
- 🤘 Metal
- 🎤 Pop
- 🏝️ Reggae
- 🎸 Rock
The model uses a CNN architecture with mel-spectrogram features extracted from 30-second audio clips.
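For illustration, here is a minimal torchaudio sketch of the feature extraction described above (16 kHz audio, 64 mel bands). The `n_fft` and `hop_length` values are assumptions for the sketch; the repository's actual preprocessing lives in model_utils.py.

```python
# Illustrative sketch only: load a clip, resample to 16 kHz, mix to mono,
# and compute a 64-band log-mel spectrogram. n_fft and hop_length are
# assumed values, not taken from this repository.
import torch
import torchaudio

def extract_mel(path: str, sample_rate: int = 16000, n_mels: int = 64) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)           # (channels, samples)
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
    waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=n_mels
    )(waveform)
    return torchaudio.transforms.AmplitudeToDB()(mel)  # log scale for the CNN
```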
Model Architecture
- Input: Mel-spectrogram (64 mel bands, 16kHz sample rate)
- Architecture: 3-layer CNN with batch normalization and dropout
- Output: 10 genre classes with softmax probabilities
- Parameters: ~2.5M trainable parameters
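For reference, here is a hedged sketch of what a 3-layer CNN matching this description could look like. The channel widths, kernel sizes, and dropout rate are assumptions (so the parameter count will differ); the real architecture is defined in audio_model.py.

```python
# Hypothetical sketch of a 3-layer CNN with batch norm and dropout over
# log-mel spectrogram input. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SketchGenreCNN(nn.Module):
    def __init__(self, n_classes: int = 10, dropout: float = 0.3):
        super().__init__()

        def block(c_in: int, c_out: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Dropout(dropout),
            )

        self.features = nn.Sequential(block(1, 32), block(32, 64), block(64, 128))
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, n_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, time) log-mel spectrogram
        return self.classifier(self.features(x))  # raw logits; softmax at inference
```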
Quick Start
```python
import torch
import torch.nn.functional as F
import json
from audio_model import GenureClassifier

# Load the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GenureClassifier(device=device)
model.load('models/model.pth')
model.eval()

# Load class names
with open('models/classes.json', 'r') as f:
    classes = json.load(f)

# Classify an audio file
audio_paths = ['path/to/your/audio.wav']
with torch.no_grad():
    logits = model(audio_paths)
    probabilities = F.softmax(logits, dim=1)
    predicted_class = torch.argmax(probabilities, dim=1)

print(f"Predicted genre: {classes[predicted_class.item()]}")
```
Detailed Usage Example
```python
import torch
import torch.nn.functional as F
import json
from audio_model import GenureClassifier

# Initialize model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GenureClassifier(device=device)
model.load('models/model.pth')
model.eval()

# Load class names
with open('models/classes.json', 'r') as f:
    classes = json.load(f)

# Map each genre to a display emoji (defined once, outside the loop)
emoji_map = {
    'blues': '🎵', 'classical': '🎼', 'country': '🤠',
    'disco': '🕺', 'hiphop': '🎤', 'jazz': '🎷',
    'metal': '🤘', 'pop': '🎤', 'reggae': '🏝️', 'rock': '🎸'
}

# Predict on multiple files
audio_paths = ['song1.wav', 'song2.mp3', 'song3.wav']
with torch.no_grad():
    logits = model(audio_paths)
    probabilities = F.softmax(logits, dim=1)

# Display detailed results
for i, probs in enumerate(probabilities):
    print(f"\n📁 File: {audio_paths[i]}")
    print("🎯 Top 3 predictions:")

    # Get top 3 predictions in a single topk call
    top3_probs, top3_indices = torch.topk(probs, 3)
    for j, (idx, prob) in enumerate(zip(top3_indices, top3_probs)):
        genre = classes[idx.item()]
        emoji = emoji_map.get(genre, '🎵')
        print(f"  {j+1}. {emoji} {genre}: {prob.item():.3f}")
```
Supported Audio Formats
- ✅ WAV (recommended)
- ✅ MP3 (automatically converted to WAV)
- ✅ Other formats supported by torchaudio
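As an illustration of the MP3-to-WAV conversion mentioned above, here is a hedged sketch using pydub (one of this project's dependencies). The `convert_to_wav` helper is hypothetical, not part of this package's API, and pydub needs ffmpeg available to decode MP3.

```python
# Hypothetical helper illustrating MP3 -> WAV conversion with pydub.
# Not this package's actual API; requires ffmpeg on the PATH.
from pydub import AudioSegment

def convert_to_wav(mp3_path: str, wav_path: str, sample_rate: int = 16000) -> str:
    audio = AudioSegment.from_mp3(mp3_path)
    audio = audio.set_frame_rate(sample_rate).set_channels(1)  # mono, 16 kHz
    audio.export(wav_path, format='wav')
    return wav_path
```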
Installation
```bash
pip install torch torchaudio librosa pydub soundfile numpy
```
Or install from requirements.txt:
```bash
pip install -r requirements.txt
```
Training Details
- Dataset: GTZAN Music Genre Dataset (1000 tracks, 100 per genre)
- Preprocessing: 30-second clips, 16kHz sample rate, mel-spectrogram with 64 bands
- Training: Adam optimizer, CrossEntropyLoss, batch size 100
- Architecture: CNN with 3 convolutional layers + batch norm + dropout
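For orientation, here is a minimal training-loop sketch consistent with the setup above (Adam, CrossEntropyLoss, batch size 100). The learning rate, epoch count, and a `dataset` yielding (mel, label) pairs are assumptions for the sketch; the actual training code is in train.py.

```python
# Hedged training-loop sketch matching the stated setup. The DataLoader
# inputs, learning rate, and epoch count are illustrative assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, device, epochs: int = 30, lr: float = 1e-3):
    loader = DataLoader(dataset, batch_size=100, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        total_loss = 0.0
        for mels, labels in loader:                # mels: (B, 1, n_mels, time)
            mels, labels = mels.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(mels), labels)  # logits vs. class indices
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"epoch {epoch + 1}: mean loss {total_loss / len(loader):.4f}")
```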
Dataset Information
⚠️ Important: Due to copyright restrictions, the GTZAN dataset containing copyrighted music excerpts is not included in this repository.
How to get the GTZAN Dataset:
- Official Source: Download from Marsyas Website
- Kaggle: Download from Kaggle GTZAN Dataset
- Academic Access: Check your institution's access through academic databases
Dataset Setup:
After downloading, organize the files as:
```
kaggle_data/
├── blues/        # 100 blues tracks (*.wav)
├── classical/    # 100 classical tracks (*.wav)
├── country/      # 100 country tracks (*.wav)
├── disco/        # 100 disco tracks (*.wav)
├── hiphop/       # 100 hip-hop tracks (*.wav)
├── jazz/         # 100 jazz tracks (*.wav)
├── metal/        # 100 metal tracks (*.wav)
├── pop/          # 100 pop tracks (*.wav)
├── reggae/       # 100 reggae tracks (*.wav)
└── rock/         # 100 rock tracks (*.wav)
```
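Once the files are in place, a quick sanity check like the sketch below can confirm the layout. The genre names and `kaggle_data` path come from this section; the 100-tracks-per-genre expectation comes from the GTZAN description, and everything else is illustrative.

```python
# Sanity-check sketch for the kaggle_data/ layout described above:
# verifies each genre folder exists and reports its .wav count.
from pathlib import Path

GENRES = ['blues', 'classical', 'country', 'disco', 'hiphop',
          'jazz', 'metal', 'pop', 'reggae', 'rock']

def check_dataset(root: str = 'kaggle_data') -> None:
    base = Path(root)
    for genre in GENRES:
        folder = base / genre
        wavs = list(folder.glob('*.wav')) if folder.is_dir() else []
        status = 'OK' if len(wavs) == 100 else 'CHECK'
        print(f"{genre:>10}: {len(wavs):3d} tracks [{status}]")

check_dataset()
```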
For more information, see our dataset info repository: storylinez/gtzan-dataset-info
Performance
The model achieves competitive accuracy on the GTZAN test set; see genre_train.ipynb for the full training and evaluation results.
Files Structure
```
├── audio_model.py        # Main model architecture
├── model_utils.py        # Audio preprocessing utilities
├── inference.py          # Standalone inference script
├── demo.py               # Demo with sample files
├── train.py              # Training script
├── models/               # Model files
│   ├── model.pth         # Trained model weights
│   └── classes.json      # Genre class names
├── examples/             # Example notebooks and scripts
│   └── training_notebook.ipynb  # Jupyter training example
├── requirements.txt      # Python dependencies
└── README.md             # This file
```
Model Limitations
- ⚠️ Trained specifically on 30-second audio clips
- ⚠️ Limited to the 10 GTZAN genres
- ⚠️ Performance may vary on audio whose characteristics differ from the training data
- ⚠️ No data augmentation was applied during training
Contributing
Feel free to contribute by:
- Reporting issues
- Suggesting improvements
- Adding new features
- Extending to more genres
Citation
```bibtex
@misc{ranit-audio-genre-classifier,
  title={Audio Genre Classifier},
  author={Ranit},
  year={2025},
  url={https://huggingface.co/storylinez/audio-genre-classifier},
  repository={https://github.com/Kawai-Senpai/deep_audio_analysis}
}
```
License
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ for the music and AI community!