# Beat Tracking Challenge

A challenge for detecting beats and downbeats in music audio, with a focus on handling the dynamic tempo changes common in rhythm game charts.

## Goal

The goal is to **detect and identify beats and downbeats** in audio to assist composers by providing a flexible timing grid when working with samples that have dynamic BPM changes.

- **Beat**: A regular pulse in music (e.g., quarter notes in 4/4 time)
- **Downbeat**: The first beat of each measure (the "1" in counting "1-2-3-4")

This is particularly useful for:

- Music production with samples of varying tempos
- Rhythm game chart creation and verification
- Audio analysis and music information retrieval (MIR)

---

## Dataset

The dataset is derived from Taiko no Tatsujin rhythm game charts, providing high-quality human-annotated beat and downbeat ground truth.

**Source**: [`JacobLinCool/taiko-1000-parsed`](https://huggingface.co/datasets/JacobLinCool/taiko-1000-parsed)

| Split | Tracks | Duration | Description |
|-------|--------|----------|-------------|
| `train` | ~1000 | 1-3 min each | Training data with beat/downbeat annotations |
| `test` | ~100 | 1-3 min each | Held-out test set for final evaluation |

### Data Features

Each example contains:

| Field | Type | Description |
|-------|------|-------------|
| `audio` | `Audio` | Audio waveform at 16kHz sample rate |
| `title` | `str` | Track title |
| `beats` | `list[float]` | Beat timestamps in seconds |
| `downbeats` | `list[float]` | Downbeat timestamps in seconds |

### Dataset Characteristics

- **Dynamic BPM**: Many tracks feature tempo changes mid-song
- **Variable Time Signatures**: Common patterns include 4/4, 3/4, 6/8, and more exotic meters
- **Diverse Genres**: Japanese pop, anime themes, classical arrangements, electronic music
- **High-Quality Annotations**: Derived from professional rhythm game charts

---

## Evaluation Metrics

The evaluation considers both **timing accuracy** and **metrical correctness**. Models are evaluated on both beat and downbeat detection tasks.

### Primary Metrics

#### 1. Weighted F1-Score (Main Ranking Metric)

F1-scores are calculated at multiple timing thresholds (3ms to 30ms), then combined with inverse-threshold weighting:

| Threshold | Weight | Rationale |
|-----------|--------|-----------|
| 3ms | 1.000 | Full weight for highest precision |
| 6ms | 0.500 | Half weight |
| 9ms | 0.333 | One-third weight |
| 12ms | 0.250 | ... |
| 15ms | 0.200 | |
| 18ms | 0.167 | |
| 21ms | 0.143 | |
| 24ms | 0.125 | |
| 27ms | 0.111 | |
| 30ms | 0.100 | Minimum weight for coarsest threshold |

**Formula:**

```
Weighted F1 = Σ(w_t × F1_t) / Σ(w_t)

where w_t = 3ms / t   (inverse threshold weighting)
```

This weighting scheme rewards models that achieve high precision at tight tolerances while still considering coarser thresholds.
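The challenge's own implementation lives in `exp/data/eval.py`; a minimal sketch of the weighting, assuming per-threshold F1 is computed with `mir_eval.beat.f_measure`, could look like this:

```python
import numpy as np
import mir_eval


def weighted_f1(reference: np.ndarray, estimated: np.ndarray) -> float:
    """Inverse-threshold-weighted F1 over 3ms..30ms tolerances (illustrative sketch)."""
    thresholds_ms = np.arange(3, 31, 3)   # 3, 6, ..., 30 ms
    weights = 3.0 / thresholds_ms         # w_t = 3ms / t
    f1s = np.array([
        mir_eval.beat.f_measure(reference, estimated, f_measure_threshold=t / 1000.0)
        for t in thresholds_ms
    ])
    return float(np.sum(weights * f1s) / np.sum(weights))
```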
#### 2. Continuity Metrics (CMLt, AMLt)

Based on the MIREX beat tracking evaluation protocol using `mir_eval`:

| Metric | Full Name | Description |
|--------|-----------|-------------|
| **CMLt** | Correct Metrical Level Total | Percentage of beats correctly tracked at the exact metrical level (±17.5% of the beat interval) |
| **AMLt** | Any Metrical Level Total | Same as CMLt, but allows for acceptable metrical variations (double/half tempo, off-beat) |
| **CMLc** | Correct Metrical Level Continuous | Longest continuous correctly-tracked segment at the exact metrical level |
| **AMLc** | Any Metrical Level Continuous | Longest continuous segment at any acceptable metrical level |

**Note:** Continuity metrics use a default `min_beat_time=5.0s` (skipping the first 5 seconds) to avoid evaluating potentially unstable tempo at the beginning of tracks.
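These scores map directly onto `mir_eval.beat` calls; a minimal sketch, assuming the repository's `exp/data/eval.py` wraps them roughly like this:

```python
import numpy as np
import mir_eval


def continuity_scores(reference: np.ndarray, estimated: np.ndarray,
                      min_beat_time: float = 5.0) -> dict[str, float]:
    """CMLc/CMLt/AMLc/AMLt via mir_eval, skipping the first few seconds (sketch)."""
    # Drop beats before `min_beat_time` to avoid the unstable intro region.
    reference = mir_eval.beat.trim_beats(reference, min_beat_time=min_beat_time)
    estimated = mir_eval.beat.trim_beats(estimated, min_beat_time=min_beat_time)
    # mir_eval uses a ±17.5% phase/period tolerance by default.
    cmlc, cmlt, amlc, amlt = mir_eval.beat.continuity(reference, estimated)
    return {"CMLc": cmlc, "CMLt": cmlt, "AMLc": amlc, "AMLt": amlt}
```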
### Metric Interpretation

| Metric | What it measures | Good Score |
|--------|------------------|------------|
| Weighted F1 | Precise timing accuracy | > 0.7 |
| CMLt | Correct tempo tracking | > 0.8 |
| AMLt | Tempo tracking (flexible) | > 0.9 |
| CMLc | Longest stable segment | > 0.5 |

### Evaluation Summary

For each model, we report:

```
Beat Detection:
  Weighted F1: X.XXXX
  CMLt: X.XXXX
  AMLt: X.XXXX
  CMLc: X.XXXX
  AMLc: X.XXXX

Downbeat Detection:
  Weighted F1: X.XXXX
  CMLt: X.XXXX
  AMLt: X.XXXX
  CMLc: X.XXXX
  AMLc: X.XXXX

Combined Weighted F1: X.XXXX (average of beat and downbeat)
```

### Benchmark Results

Results evaluated on 100 tracks from the test set:

| Model | Combined F1 | Beat F1 | Downbeat F1 | CMLt (Beat) | CMLt (Downbeat) |
|-------|-------------|---------|-------------|-------------|-----------------|
| **Baseline 1 (ODCNN)** | 0.0765 | 0.0861 | 0.0669 | 0.0731 | 0.0321 |
| **Baseline 2 (ResNet-SE)** | **0.2775** | **0.3292** | **0.2258** | **0.3287** | **0.1146** |

*Note: Baseline 2 (ResNet-SE) demonstrates significantly better performance due to its larger context window and deeper architecture.*

---

## Quick Start

### Setup

```bash
uv sync
```

### Train Models

```bash
# Train Baseline 1 (ODCNN)
uv run -m exp.baseline1.train

# Train Baseline 2 (ResNet-SE)
uv run -m exp.baseline2.train

# Train specific target only (e.g. for Baseline 2)
uv run -m exp.baseline2.train --target beats
uv run -m exp.baseline2.train --target downbeats
```

### Run Evaluation

```bash
# Evaluation (replace baseline1 with baseline2 to evaluate the new model)
uv run -m exp.baseline1.eval

# Full evaluation with visualization and audio
uv run -m exp.baseline1.eval --visualize --synthesize --summary-plot

# Evaluate on more samples with a custom output directory
uv run -m exp.baseline1.eval --num-samples 50 --output-dir outputs/eval_baseline1
```

### Evaluation Options

| Option | Description |
|--------|-------------|
| `--model-dir DIR` | Model directory (default: `outputs/baseline1`) |
| `--num-samples N` | Number of samples to evaluate (default: 20) |
| `--output-dir DIR` | Output directory (default: `outputs/eval`) |
| `--visualize` | Generate visualization plots for each track |
| `--synthesize` | Generate audio files with click tracks |
| `--viz-tracks N` | Number of tracks to visualize/synthesize (default: 5) |
| `--time-range START END` | Limit visualization time range (seconds) |
| `--click-volume FLOAT` | Click sound volume (0.0 to 1.0, default: 0.5) |
| `--summary-plot` | Generate summary evaluation bar charts |

---

## Visualization & Audio Tools

### Beat Visualization

Generate plots comparing predicted vs ground truth beats:

```bash
uv run -m exp.baseline1.eval --visualize --viz-tracks 10
```

Output: `outputs/eval/plots/track_XXX.png`

### Click Track Audio

Generate audio files with click sounds overlaid on the original music:

```bash
uv run -m exp.baseline1.eval --synthesize
```

Output files in `outputs/eval/audio/`:

- `track_XXX_pred.wav` - Original audio + predicted beat clicks (1000Hz beat, 1500Hz downbeat)
- `track_XXX_gt.wav` - Original audio + ground truth clicks (800Hz beat, 1200Hz downbeat)
- `track_XXX_both.wav` - Original audio + both prediction and ground truth clicks
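The actual synthesis code lives in `exp/data/audio.py`; the sketch below is an illustrative approximation using `librosa` and `soundfile`, where the function name, mixing, and normalization are assumptions:

```python
import numpy as np
import librosa
import soundfile as sf


def synthesize_click_track(audio: np.ndarray, sr: int, beats: np.ndarray,
                           downbeats: np.ndarray, beat_freq: float = 1000.0,
                           downbeat_freq: float = 1500.0, volume: float = 0.5,
                           out_path: str = "track_pred.wav") -> None:
    """Overlay click sounds on the original audio (sketch, not the repo implementation)."""
    # Render click layers at the beat/downbeat timestamps, matched to the audio length.
    beat_clicks = librosa.clicks(times=beats, sr=sr, click_freq=beat_freq, length=len(audio))
    downbeat_clicks = librosa.clicks(times=downbeats, sr=sr, click_freq=downbeat_freq, length=len(audio))
    mix = audio + volume * (beat_clicks + downbeat_clicks)
    mix = mix / max(1.0, float(np.max(np.abs(mix))))  # avoid clipping
    sf.write(out_path, mix, sr)
```

Swapping in 800 Hz / 1200 Hz clicks would give the ground-truth variant, and layering both click sets over the audio yields the `_both.wav` file.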
### Summary Plot

Generate bar charts summarizing F1 scores and continuity metrics:

```bash
uv run -m exp.baseline1.eval --summary-plot
```

Output: `outputs/eval/evaluation_summary.png`

---

## Models

### Baseline 1: ODCNN

A roughly decade-old baseline model. It implements the **Onset Detection CNN (ODCNN)** architecture:

#### Architecture

- **Input**: Multi-view mel spectrogram (3 window sizes: 23ms, 46ms, 93ms)
- **CNN Backbone**: 3 convolutional blocks with max pooling
- **Output**: Frame-level beat/downbeat probability
- **Inference**: ±7 frames context (±70ms)

### Baseline 2: ResNet-SE

Inspired by ResNet-SE: a modernized architecture designed to capture longer temporal context.

#### Architecture

- **Input**: Mel spectrogram with larger context
- **Backbone**: ResNet with Squeeze-and-Excitation (SE) blocks
- **Context**: **±50 frames (~1s)** window
- **Features**: Deeper network (4 stages) with effective channel attention
- **Parameters**: ~400k (small and efficient)

### Training Details

Both models use similar training loops:

- **Optimizer**: SGD (Baseline 1) / AdamW (Baseline 2)
- **LR Schedule**: Cosine annealing
- **Loss**: Binary Cross-Entropy
- **Epochs**: 50 (Baseline 1) / 3 (Baseline 2)
- **Batch Size**: 512 (Baseline 1) / 128 (Baseline 2)

---

## Project Structure

```
exp-onset/
├── exp/
│   ├── baseline1/          # Baseline 1 (ODCNN)
│   │   ├── model.py        # ODCNN architecture
│   │   ├── train.py
│   │   ├── eval.py
│   │   ├── data.py
│   │   └── utils.py
│   ├── baseline2/          # Baseline 2 (ResNet-SE)
│   │   ├── model.py        # ResNet-SE
│   │   ├── train.py
│   │   ├── eval.py
│   │   └── data.py
│   └── data/
│       ├── load.py         # Dataset loading & preprocessing
│       ├── eval.py         # Evaluation metrics (F1, CML, AML)
│       ├── audio.py        # Click track synthesis
│       └── viz.py          # Visualization utilities
├── outputs/
│   ├── baseline1/          # Trained models (Baseline 1)
│   ├── baseline2/          # Trained models (Baseline 2)
│   └── eval/               # Evaluation outputs
│       ├── plots/          # Visualization images
│       ├── audio/          # Click track audio files
│       └── evaluation_summary.png
├── README.md
├── DATASET.md              # Raw dataset specification
└── pyproject.toml
```

---

## License

This project is for research and educational purposes. The dataset is derived from publicly available rhythm game charts.