Audio Genre Classifier 🎵

A PyTorch-based convolutional neural network for music genre classification trained on the GTZAN dataset.

Model Description

This model classifies audio files into 10 different music genres:

  • 🎵 Blues
  • 🎼 Classical
  • 🤠 Country
  • 🕺 Disco
  • 🎤 Hip-Hop
  • 🎷 Jazz
  • 🤘 Metal
  • 🎤 Pop
  • 🏝️ Reggae
  • 🎸 Rock

The model uses a CNN architecture with mel-spectrogram features extracted from 30-second audio clips.

Model Architecture

  • Input: Mel-spectrogram (64 mel bands, 16kHz sample rate)
  • Architecture: 3-layer CNN with batch normalization and dropout
  • Output: 10 genre classes with softmax probabilities
  • Parameters: ~2.5M trainable
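
For orientation, here is a minimal sketch of what a network matching this description could look like. It is an illustrative approximation, not the exact configuration in audio_model.py; the channel counts, kernel sizes, and dropout rate are assumptions:

import torch.nn as nn

class SketchGenreCNN(nn.Module):
    """Illustrative 3-layer CNN over (batch, 1, n_mels, time) log-mel inputs."""

    def __init__(self, n_classes=10, dropout=0.3):
        super().__init__()

        def block(c_in, c_out):
            # Conv -> BatchNorm -> ReLU -> Pool, per the description above
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )

        self.features = nn.Sequential(block(1, 32), block(32, 64), block(64, 128))
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),    # collapse frequency/time dimensions
            nn.Flatten(),
            nn.Dropout(dropout),
            nn.Linear(128, n_classes),  # raw logits; softmax is applied downstream
        )

    def forward(self, x):
        return self.head(self.features(x))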

Quick Start

import torch
import torch.nn.functional as F
import json
from audio_model import GenureClassifier

# Load the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GenureClassifier(device=device)
model.load('models/model.pth')
model.eval()

# Load class names
with open('models/classes.json', 'r') as f:
    classes = json.load(f)

# Classify an audio file
audio_paths = ['path/to/your/audio.wav']
with torch.no_grad():
    logits = model(audio_paths)
    probabilities = F.softmax(logits, dim=1)
    predicted_class = torch.argmax(probabilities, dim=1)

print(f"Predicted genre: {classes[predicted_class.item()]}")

Detailed Usage Example

import torch
import torch.nn.functional as F
import json
from audio_model import GenureClassifier

# Initialize model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GenureClassifier(device=device)
model.load('models/model.pth')
model.eval()

# Load class names
with open('models/classes.json', 'r') as f:
    classes = json.load(f)

# Predict on multiple files
audio_paths = ['song1.wav', 'song2.mp3', 'song3.wav']
with torch.no_grad():
    logits = model(audio_paths)
    probabilities = F.softmax(logits, dim=1)

# Display detailed results
# Map each genre to a display emoji (shared across files, so defined once)
emoji_map = {
    'blues': '🎵', 'classical': '🎼', 'country': '🤠',
    'disco': '🕺', 'hiphop': '🎤', 'jazz': '🎷',
    'metal': '🤘', 'pop': '🎤', 'reggae': '🏝️', 'rock': '🎸'
}

# Display detailed results
for i, probs in enumerate(probabilities):
    print(f"\n📁 File: {audio_paths[i]}")
    print("🎯 Top 3 predictions:")

    # Get the top-3 probabilities and their class indices in a single call
    top3_probs, top3_indices = torch.topk(probs, 3)

    for j, (idx, prob) in enumerate(zip(top3_indices, top3_probs)):
        genre = classes[idx.item()]
        emoji = emoji_map.get(genre, '🎵')
        print(f"  {j+1}. {emoji} {genre}: {prob.item():.3f}")

Supported Audio Formats

  • ✅ WAV (recommended)
  • ✅ MP3 (automatically converted to WAV)
  • ✅ Other formats supported by torchaudio
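
If you prefer to convert an MP3 to WAV yourself before inference, a minimal sketch with pydub (already in the dependency list; MP3 decoding requires ffmpeg) looks like this, with placeholder file names:

from pydub import AudioSegment

# Decode the MP3 and re-export it as WAV (file names are placeholders)
AudioSegment.from_mp3("song.mp3").export("song.wav", format="wav")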

Installation

pip install torch torchaudio librosa pydub soundfile numpy

Or install from requirements.txt:

pip install -r requirements.txt

Training Details

  • Dataset: GTZAN Music Genre Dataset (1000 tracks, 100 per genre)
  • Preprocessing: 30-second clips, 16kHz sample rate, mel-spectrogram with 64 bands
  • Training: Adam optimizer, CrossEntropyLoss, batch size 100
  • Architecture: CNN with 3 convolutional layers + batch norm + dropout
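
As an illustration of the preprocessing described above, the log-mel features could be computed with torchaudio roughly as follows. The FFT size, hop length, and log scaling are assumed values; the repository's actual preprocessing lives in model_utils.py.

import torch
import torchaudio

# Load a clip and resample to the 16 kHz rate the model expects
waveform, sr = torchaudio.load("blues.00000.wav")  # placeholder path
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

# 64-band mel-spectrogram; n_fft and hop_length are assumptions
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_mels=64, n_fft=1024, hop_length=512
)
log_mel = torch.log(mel_transform(waveform) + 1e-6)  # log compression, a common choice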

Dataset Information

⚠️ Important: Due to copyright restrictions, the GTZAN dataset (which contains copyrighted music excerpts) is not included in this repository.

How to get the GTZAN Dataset:

  1. Official Source: Download from Marsyas Website
  2. Kaggle: Download from Kaggle GTZAN Dataset
  3. Academic Access: Check your institution's access through academic databases

Dataset Setup:

After downloading, organize the files as:

kaggle_data/
├── blues/      # 100 blues tracks (*.wav)
├── classical/  # 100 classical tracks (*.wav)
├── country/    # 100 country tracks (*.wav)
├── disco/      # 100 disco tracks (*.wav)
├── hiphop/     # 100 hip-hop tracks (*.wav)
├── jazz/       # 100 jazz tracks (*.wav)
├── metal/      # 100 metal tracks (*.wav)
├── pop/        # 100 pop tracks (*.wav)
├── reggae/     # 100 reggae tracks (*.wav)
└── rock/       # 100 rock tracks (*.wav)
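
Once the files are in place, a quick sanity check like the following sketch (which assumes only the layout above) verifies that each genre folder contains its 100 tracks:

from pathlib import Path

data_dir = Path("kaggle_data")
for genre_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
    n_tracks = len(list(genre_dir.glob("*.wav")))
    print(f"{genre_dir.name}: {n_tracks} tracks")  # expect 100 per genre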

For more information, see our dataset info repository: storylinez/gtzan-dataset-info

Performance

The model achieves competitive performance on the GTZAN test set. Training and evaluation details can be found in genre_train.ipynb.

Files Structure

├── audio_model.py          # Main model architecture
├── model_utils.py          # Audio preprocessing utilities
├── inference.py            # Standalone inference script
├── demo.py                 # Demo with sample files
├── train.py                # Training script
├── models/                 # Model files
│   ├── model.pth          # Trained model weights
│   └── classes.json       # Genre class names
├── examples/              # Example notebooks and scripts
│   └── training_notebook.ipynb  # Jupyter training example
├── requirements.txt       # Python dependencies
└── README.md             # This file

Model Limitations

  • ⚠️ Trained specifically on 30-second audio clips
  • ⚠️ Limited to the 10 GTZAN genres
  • ⚠️ Performance may vary on audio with different characteristics than training data
  • ⚠️ No data augmentation was applied during training
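
To work around the first limitation above, a longer recording can be split into 30-second segments, classified per segment, and averaged. The sketch below assumes the path-based model interface shown in Quick Start and uses torchaudio for I/O; the helper name is hypothetical:

import os
import tempfile
import torch
import torch.nn.functional as F
import torchaudio

def classify_long_file(model, path, clip_seconds=30, sample_rate=16000):
    """Hypothetical helper: chunk a long recording, classify each chunk, average."""
    waveform, sr = torchaudio.load(path)
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
    clip_len = clip_seconds * sample_rate

    # Write each full 30-second chunk to a temporary WAV the model can load
    tmpdir = tempfile.mkdtemp()
    chunk_paths = []
    for start in range(0, waveform.shape[1] - clip_len + 1, clip_len):
        chunk_path = os.path.join(tmpdir, f"chunk_{start}.wav")
        torchaudio.save(chunk_path, waveform[:, start:start + clip_len], sample_rate)
        chunk_paths.append(chunk_path)

    with torch.no_grad():
        probs = F.softmax(model(chunk_paths), dim=1)
    return probs.mean(dim=0)  # average class probabilities across chunks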

Contributing

Feel free to contribute by:

  • Reporting issues
  • Suggesting improvements
  • Adding new features
  • Extending to more genres

Citation

@misc{ranit-audio-genre-classifier,
  title={Audio Genre Classifier},
  author={Ranit},
  year={2025},
  url={https://huggingface.co/storylinez/audio-genre-classifier},
  repository={https://github.com/Kawai-Senpai/deep_audio_analysis}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.


Made with ❤️ for the music and AI community!
