Audio Genre Classifier 🎵
A PyTorch-based convolutional neural network for music genre classification trained on the GTZAN dataset.
Model Description
This model classifies audio files into 10 different music genres:
- 🎵 Blues
- 🎼 Classical
- 🤠 Country
- 🕺 Disco
- 🎤 Hip-Hop
- 🎷 Jazz
- 🤘 Metal
- 🎤 Pop
- 🏝️ Reggae
- 🎸 Rock
The model uses a CNN architecture with mel-spectrogram features extracted from 30-second audio clips.
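For illustration, here is a minimal torchaudio sketch of the feature extraction described above (16 kHz audio, 64 mel bands). The `n_fft` and `hop_length` values are assumptions for the sketch; the repository's actual preprocessing lives in model_utils.py.

```python
# Illustrative sketch only: load a clip, resample to 16 kHz, mix to mono,
# and compute a 64-band log-mel spectrogram. n_fft and hop_length are
# assumed values, not taken from this repository.
import torch
import torchaudio

def extract_mel(path: str, sample_rate: int = 16000, n_mels: int = 64) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)           # (channels, samples)
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
    waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=n_mels
    )(waveform)
    return torchaudio.transforms.AmplitudeToDB()(mel)  # log scale for the CNN
```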
Model Architecture
- Input: Mel-spectrogram (64 mel bands, 16kHz sample rate)
- Architecture: 3-layer CNN with batch normalization and dropout
- Output: 10 genre classes with softmax probabilities
- Parameters: ~2.5M trainable parameters
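For reference, here is a hedged sketch of what a 3-layer CNN matching this description could look like. The channel widths, kernel sizes, and dropout rate are assumptions (so the parameter count will differ); the real architecture is defined in audio_model.py.

```python
# Hypothetical sketch of a 3-layer CNN with batch norm and dropout over
# log-mel spectrogram input. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SketchGenreCNN(nn.Module):
    def __init__(self, n_classes: int = 10, dropout: float = 0.3):
        super().__init__()

        def block(c_in: int, c_out: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Dropout(dropout),
            )

        self.features = nn.Sequential(block(1, 32), block(32, 64), block(64, 128))
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, n_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, time) log-mel spectrogram
        return self.classifier(self.features(x))  # raw logits; softmax at inference
```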
Quick Start
```python
import torch
import torch.nn.functional as F
import json
from audio_model import GenureClassifier

# Load the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GenureClassifier(device=device)
model.load('models/model.pth')
model.eval()

# Load class names
with open('models/classes.json', 'r') as f:
    classes = json.load(f)

# Classify an audio file
audio_paths = ['path/to/your/audio.wav']
with torch.no_grad():
    logits = model(audio_paths)
    probabilities = F.softmax(logits, dim=1)
    predicted_class = torch.argmax(probabilities, dim=1)

print(f"Predicted genre: {classes[predicted_class.item()]}")
```
Detailed Usage Example
```python
import torch
import torch.nn.functional as F
import json
from audio_model import GenureClassifier

# Initialize model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GenureClassifier(device=device)
model.load('models/model.pth')
model.eval()

# Load class names
with open('models/classes.json', 'r') as f:
    classes = json.load(f)

# Map each genre to a display emoji (defined once, outside the loop)
emoji_map = {
    'blues': '🎵', 'classical': '🎼', 'country': '🤠',
    'disco': '🕺', 'hiphop': '🎤', 'jazz': '🎷',
    'metal': '🤘', 'pop': '🎤', 'reggae': '🏝️', 'rock': '🎸'
}

# Predict on multiple files
audio_paths = ['song1.wav', 'song2.mp3', 'song3.wav']
with torch.no_grad():
    logits = model(audio_paths)
    probabilities = F.softmax(logits, dim=1)

# Display detailed results
for i, probs in enumerate(probabilities):
    print(f"\n📁 File: {audio_paths[i]}")
    print("🎯 Top 3 predictions:")

    # Get top 3 predictions in a single topk call
    top3_probs, top3_indices = torch.topk(probs, 3)
    for j, (idx, prob) in enumerate(zip(top3_indices, top3_probs)):
        genre = classes[idx.item()]
        emoji = emoji_map.get(genre, '🎵')
        print(f"  {j+1}. {emoji} {genre}: {prob.item():.3f}")
```
Supported Audio Formats
- ✅ WAV (recommended)
- ✅ MP3 (automatically converted to WAV)
- ✅ Other formats supported by torchaudio
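As an illustration of the MP3-to-WAV conversion mentioned above, here is a hedged sketch using pydub (one of this project's dependencies). The `convert_to_wav` helper is hypothetical, not part of this package's API, and pydub needs ffmpeg available to decode MP3.

```python
# Hypothetical helper illustrating MP3 -> WAV conversion with pydub.
# Not this package's actual API; requires ffmpeg on the PATH.
from pydub import AudioSegment

def convert_to_wav(mp3_path: str, wav_path: str, sample_rate: int = 16000) -> str:
    audio = AudioSegment.from_mp3(mp3_path)
    audio = audio.set_frame_rate(sample_rate).set_channels(1)  # mono, 16 kHz
    audio.export(wav_path, format='wav')
    return wav_path
```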
Installation
```bash
pip install torch torchaudio librosa pydub soundfile numpy
```
Or install from requirements.txt:
```bash
pip install -r requirements.txt
```
Training Details
- Dataset: GTZAN Music Genre Dataset (1000 tracks, 100 per genre)
- Preprocessing: 30-second clips, 16kHz sample rate, mel-spectrogram with 64 bands
- Training: Adam optimizer, CrossEntropyLoss, batch size 100
- Architecture: CNN with 3 convolutional layers + batch norm + dropout
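For orientation, here is a minimal training-loop sketch consistent with the setup above (Adam, CrossEntropyLoss, batch size 100). The learning rate, epoch count, and a `dataset` yielding (mel, label) pairs are assumptions for the sketch; the actual training code is in train.py.

```python
# Hedged training-loop sketch matching the stated setup. The DataLoader
# inputs, learning rate, and epoch count are illustrative assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, device, epochs: int = 30, lr: float = 1e-3):
    loader = DataLoader(dataset, batch_size=100, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        total_loss = 0.0
        for mels, labels in loader:                # mels: (B, 1, n_mels, time)
            mels, labels = mels.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(mels), labels)  # logits vs. class indices
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"epoch {epoch + 1}: mean loss {total_loss / len(loader):.4f}")
```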
Dataset Information
⚠️ Important: Due to copyright restrictions, the GTZAN dataset containing copyrighted music excerpts is not included in this repository.
How to get the GTZAN Dataset:
- Official Source: Download from Marsyas Website
- Kaggle: Download from Kaggle GTZAN Dataset
- Academic Access: Check your institution's access through academic databases
Dataset Setup:
After downloading, organize the files as:
```
kaggle_data/
├── blues/        # 100 blues tracks (*.wav)
├── classical/    # 100 classical tracks (*.wav)
├── country/      # 100 country tracks (*.wav)
├── disco/        # 100 disco tracks (*.wav)
├── hiphop/       # 100 hip-hop tracks (*.wav)
├── jazz/         # 100 jazz tracks (*.wav)
├── metal/        # 100 metal tracks (*.wav)
├── pop/          # 100 pop tracks (*.wav)
├── reggae/       # 100 reggae tracks (*.wav)
└── rock/         # 100 rock tracks (*.wav)
```
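Once the files are in place, a quick sanity check like the sketch below can confirm the layout. The genre names and `kaggle_data` path come from this section; the 100-tracks-per-genre expectation comes from the GTZAN description, and everything else is illustrative.

```python
# Sanity-check sketch for the kaggle_data/ layout described above:
# verifies each genre folder exists and reports its .wav count.
from pathlib import Path

GENRES = ['blues', 'classical', 'country', 'disco', 'hiphop',
          'jazz', 'metal', 'pop', 'reggae', 'rock']

def check_dataset(root: str = 'kaggle_data') -> None:
    base = Path(root)
    for genre in GENRES:
        folder = base / genre
        wavs = list(folder.glob('*.wav')) if folder.is_dir() else []
        status = 'OK' if len(wavs) == 100 else 'CHECK'
        print(f"{genre:>10}: {len(wavs):3d} tracks [{status}]")

check_dataset()
```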
For more information, see our dataset info repository: storylinez/gtzan-dataset-info
Performance
The model achieves competitive accuracy on the GTZAN test set; see genre_train.ipynb for the full training and evaluation results.
Files Structure
```
├── audio_model.py        # Main model architecture
├── model_utils.py        # Audio preprocessing utilities
├── inference.py          # Standalone inference script
├── demo.py               # Demo with sample files
├── train.py              # Training script
├── models/               # Model files
│   ├── model.pth         # Trained model weights
│   └── classes.json      # Genre class names
├── examples/             # Example notebooks and scripts
│   └── training_notebook.ipynb  # Jupyter training example
├── requirements.txt      # Python dependencies
└── README.md             # This file
```
Model Limitations
- ⚠️ Trained specifically on 30-second audio clips
- ⚠️ Limited to the 10 GTZAN genres
- ⚠️ Performance may vary on audio whose characteristics differ from the training data
- ⚠️ No data augmentation was applied during training
Contributing
Feel free to contribute by:
- Reporting issues
- Suggesting improvements
- Adding new features
- Extending to more genres
Citation
```bibtex
@misc{ranit-audio-genre-classifier,
  title={Audio Genre Classifier},
  author={Ranit},
  year={2025},
  url={https://huggingface.co/storylinez/audio-genre-classifier},
  repository={https://github.com/Kawai-Senpai/deep_audio_analysis}
}
```
License
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ for the music and AI community!