MLX Studio — the only app that natively supports JANG models with reasoning


IMO Gold Medal reasoning in 17 GB. Nemotron-Cascade-2 achieves 93% MMLU with reasoning at 17 GB (JANG_4M); the 10.3 GB JANG_2L build scores 88% and fits 16 GB MacBooks. Hybrid Mamba-2 SSM + MoE + Attention. Only 6 KV-cache attention layers = minimal memory at long context.

LM Studio, Ollama, and oMLX do NOT support the JANG format. Use MLX Studio or pip install "jang[mlx]>=2.1.5".



Nemotron-Cascade-2-30B-A3B — JANG_4M (4.1-bit, 8-bit attention) — Reasoning

JANG — Jang Adaptive N-bit Grading | The GGUF Equivalent for MLX


GitHub  PyPI  Website  X/Twitter

JANG is fully open-source. Quantization engine, research, and full commit history: github.com/jjang-ai/jangq. Created by Jinho Jang.

Key Features

  • 93.0% MMLU (200 questions, reasoning mode) — IMO Gold Medal model in 17 GB
  • 55 tok/s generation, 154 tok/s prefill
  • 10.3 GB on disk, 10.3 GB GPU RAM (peak 11.1 GB)
  • Reasoning mode: <think>...</think> step-by-step problem solving
  • Tiny KV cache: only 6 attention layers, 0.2 GB at 32K context
  • Hybrid architecture: Mamba-2 SSM + MoE (128 experts, top-6) + Attention
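For intuition on the "128 experts, top-6 active" routing above, here is a minimal sketch of softmax-gated top-k expert selection. This is an illustrative toy router, not Nemotron's actual implementation; the function name and gating details are assumptions.

```python
import math
import random

def top_k_route(logits, k=6):
    """Pick the top-k experts for one token and softmax-normalize their gates.
    Illustrative sketch only; not the model's actual router."""
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    m = max(logits[i] for i in top)                 # for numerical stability
    exps = [math.exp(logits[i] - m) for i in top]
    s = sum(exps)
    return [(i, e / s) for i, e in zip(top, exps)]  # (expert index, gate weight)

random.seed(0)
router_logits = [random.gauss(0, 1) for _ in range(128)]  # one logit per expert
routes = top_k_route(router_logits)
# Only these 6 experts' FFNs run for this token, which is why only ~3B of the
# 30B total parameters are active per step.
assert len(routes) == 6
assert abs(sum(w for _, w in routes) - 1.0) < 1e-9
```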

Results: JANG vs MLX (200-question MMLU)

Per-subject comparison. All models tested with and without reasoning.

| Subject | JANG_2L No-Think | JANG_2L Reasoning | JANG_4M No-Think | JANG_4M Reasoning | MLX 4-bit No-Think | MLX 4-bit Reasoning | MLX 6-bit No-Think | MLX 6-bit Reasoning |
|---|---|---|---|---|---|---|---|---|
| Abstract Algebra | 4/20 | 15/20 | 9/20 | 19/20 | 8/20 | 18/20 | 7/20 | 19/20 |
| Anatomy | 13/20 | 17/20 | 15/20 | 19/20 | 14/20 | 18/20 | 17/20 | 19/20 |
| Astronomy | 17/20 | 19/20 | 18/20 | 20/20 | 17/20 | 19/20 | 19/20 | 20/20 |
| College CS | 7/20 | 17/20 | 10/20 | 18/20 | 11/20 | 17/20 | 11/20 | 17/20 |
| College Physics | 13/20 | 20/20 | 14/20 | 19/20 | 15/20 | 20/20 | 14/20 | 20/20 |
| HS Biology | 16/20 | 19/20 | 18/20 | 20/20 | 18/20 | 20/20 | 18/20 | 20/20 |
| HS Chemistry | 12/20 | 19/20 | 14/20 | 19/20 | 13/20 | 19/20 | 17/20 | 19/20 |
| HS Mathematics | 8/20 | 15/20 | 8/20 | 18/20 | 10/20 | 19/20 | 8/20 | 20/20 |
| Logical Fallacies | 12/20 | 18/20 | 14/20 | 16/20 | 14/20 | 17/20 | 13/20 | 17/20 |
| World Religions | 16/20 | 17/20 | 18/20 | 18/20 | 18/20 | 18/20 | 18/20 | 18/20 |
| Total | 118/200 (59.0%) | 176/200 (88.0%) | 138/200 (69.0%) | 186/200 (93.0%) | 138/200 (69.0%) | 185/200 (92.5%) | 142/200 (71.0%) | 189/200 (94.5%) |
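The Total row can be re-derived from the per-subject scores; a quick sanity check with the data copied from the table:

```python
# Re-derive the Total row from the per-subject MMLU scores above.
# Column order: JANG_2L, JANG_4M, MLX 4-bit, MLX 6-bit (no-think, reasoning each).
scores = {
    "Abstract Algebra":  (4, 15, 9, 19, 8, 18, 7, 19),
    "Anatomy":           (13, 17, 15, 19, 14, 18, 17, 19),
    "Astronomy":         (17, 19, 18, 20, 17, 19, 19, 20),
    "College CS":        (7, 17, 10, 18, 11, 17, 11, 17),
    "College Physics":   (13, 20, 14, 19, 15, 20, 14, 20),
    "HS Biology":        (16, 19, 18, 20, 18, 20, 18, 20),
    "HS Chemistry":      (12, 19, 14, 19, 13, 19, 17, 19),
    "HS Mathematics":    (8, 15, 8, 18, 10, 19, 8, 20),
    "Logical Fallacies": (12, 18, 14, 16, 14, 17, 13, 17),
    "World Religions":   (16, 17, 18, 18, 18, 18, 18, 18),
}
totals = [sum(col) for col in zip(*scores.values())]
pcts = [100 * t / 200 for t in totals]
print(totals)  # [118, 176, 138, 186, 138, 185, 142, 189]
print(pcts)    # [59.0, 88.0, 69.0, 93.0, 69.0, 92.5, 71.0, 94.5]
```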

Summary

| | JANG_2L | JANG_4M | MLX 4-bit | MLX 6-bit |
|---|---|---|---|---|
| MMLU (no-think) | 59.0% | 69.0% | 69.0% | 71.0% |
| MMLU (reasoning) | 88.0% | 93.0% | 92.5% | 94.5% |
| Size | 10.3 GB | 17 GB | 16.6 GB | 23.9 GB |
| GPU RAM | 10.3 GB | 17 GB | ~17 GB | ~24 GB |
| Speed | 55 tok/s | | | |
| Fits 24 GB? | YES | NO | NO | NO |

JANG_2L is the only quantization here that fits 24 GB Macs while still delivering 88% MMLU with reasoning. JANG_4M beats MLX 4-bit (93.0% vs 92.5%) at roughly the same size (17 GB vs 16.6 GB).

Also see: JANG_2L (10 GB) — fits 16 GB Macs, 88% reasoning MMLU.

Specs

| Metric | Value |
|---|---|
| Source | Nemotron-Cascade-2-30B-A3B |
| Architecture | Hybrid Mamba-2 SSM + MoE + Dense Attention |
| Layers | 52 (Mamba-2 + MoE + 6 Attention) |
| Experts | 128 per MoE layer, top-6 active (3B active params) |
| KV cache | 6 attention layers, 2 KV heads, 128 dim — 0.2 GB at 32K context |
| Profile | JANG_2L (CRITICAL=8, IMPORTANT=6, COMPRESS=2) |
| Average bits | 4.12 bpw |
| Disk size | 10.3 GB |
| GPU RAM | 10.3 GB (peak 11.1 GB) |
| Speed | 55 tok/s generation, 154 tok/s prefill |
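The 0.2 GB KV-cache figure follows directly from the cache geometry above. A back-of-envelope check, assuming fp16 (2-byte) K/V entries, which is an assumption about the cache dtype:

```python
# KV-cache footprint: 6 attention layers x 2 KV heads x 128 head dim,
# separate K and V tensors, fp16 (2 bytes per entry) assumed, 32K context.
layers, kv_heads, head_dim, ctx_len = 6, 2, 128, 32 * 1024
kv_bytes = layers * kv_heads * head_dim * ctx_len * 2 * 2  # (K+V) * fp16
print(f"{kv_bytes / 1024**3:.2f} GiB")  # 0.19 GiB, i.e. the ~0.2 GB quoted
```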

Requirements

  • Apple Silicon Mac with 24+ GB unified memory
  • MLX Studio or pip install "jang[mlx]>=2.1.5"

Quick Start

pip install "jang[mlx]>=2.1.5"

from jang_tools.loader import load_jang_model
from mlx_lm import generate

model, tokenizer = load_jang_model("JANGQ-AI/Nemotron-Cascade-2-30B-A3B-JANG_4M")

# With reasoning (recommended)
messages = [{"role": "user", "content": "Solve: what is the integral of x^2 * e^x?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
    add_generation_prompt=True, enable_thinking=True)
result = generate(model, tokenizer, prompt=prompt, max_tokens=2048)

# Without reasoning (faster)
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
    add_generation_prompt=True, enable_thinking=False)
result = generate(model, tokenizer, prompt=prompt, max_tokens=100)
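With enable_thinking=True, the reasoning trace comes back inline inside <think>...</think> tags. A small helper to separate it from the final answer; this is a sketch assuming a single well-formed think block, not an official JANG/mlx-lm utility:

```python
import re

def split_reasoning(text):
    """Split generated text into (reasoning, answer).
    Assumes at most one well-formed <think>...</think> block."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return None, text.strip()
    return m.group(1).strip(), text[m.end():].strip()

reasoning, answer = split_reasoning(
    "<think>Integrate by parts twice.</think>\n(x^2 - 2x + 2) e^x + C"
)
print(answer)  # (x^2 - 2x + 2) e^x + C
```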

Technical Notes

  • Mamba-2 SSM: Most layers use state-space models, enabling efficient long-context with minimal KV cache.
  • Only 6 attention layers: KV cache is tiny (0.2 GB at 32K). Most transformer models put attention in 25-100% of their layers; here it is 6 of 52.
  • nemotron_h architecture: Requires JANG loader for proper weight mapping. Standard mlx-lm has incomplete support.
  • IMO Gold Medal: This model achieves competition-level mathematical reasoning at 30B scale.

JANG — Created by Jinho Jang (eric@jangq.ai) · @dealignai
GitHub · PyPI · HuggingFace
