MLX Studio — the only app that natively supports JANG models with reasoning
IMO Gold Medal reasoning in 17 GB. Nemotron-Cascade-2 achieves 93% MMLU with reasoning at just 17 GB, and the 10.3 GB JANG_2L variant fits 16 GB MacBooks. Hybrid Mamba-2 SSM + MoE + Attention. Only 6 KV-cache attention layers = minimal memory at long context.
LM Studio, Ollama, and oMLX do NOT support the JANG format. Use MLX Studio or `pip install "jang[mlx]>=2.1.5"`.
Nemotron-Cascade-2-30B-A3B — JANG_4M (4.1-bit, 8-bit attention) — Reasoning
JANG — Jang Adaptive N-bit Grading | The GGUF Equivalent for MLX
JANG is fully open-source. Quantization engine, research, and full commit history: github.com/jjang-ai/jangq. Created by Jinho Jang.
Key Features
- 93.0% MMLU (200 questions, reasoning mode) — IMO Gold Medal model in 17 GB
- 55 tok/s generation, 154 tok/s prefill
- 10.3 GB on disk, 10.3 GB GPU RAM (peak 11.1 GB)
- Reasoning mode: `<think>...</think>` step-by-step problem solving
- Tiny KV cache: only 6 attention layers, 0.2 GB at 32K context
- Hybrid architecture: Mamba-2 SSM + MoE (128 experts, top-6) + Attention
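The 0.2 GB KV cache figure follows directly from the spec numbers (6 attention layers, 2 KV heads, 128-dim heads, 32K context). A back-of-envelope check, assuming fp16 (2-byte) keys and values — the dtype is an assumption, the rest comes from the spec table:

```python
def kv_cache_bytes(layers=6, kv_heads=2, head_dim=128,
                   context=32_768, bytes_per_elem=2):
    # 2x for keys and values, per layer, per KV head, per token
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem

gb = kv_cache_bytes() / 1024**3
print(gb)  # 0.1875 GB, i.e. ~0.2 GB, matching the spec
```

A dense 52-attention-layer model with the same head geometry would need roughly 52/6 ≈ 9x this cache at the same context length.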
Results: JANG vs MLX (200-question MMLU)
Per-subject comparison. All models tested with and without reasoning.
| Subject | JANG_2L No-Think | JANG_2L Reasoning | JANG_4M No-Think | JANG_4M Reasoning | MLX 4-bit No-Think | MLX 4-bit Reasoning | MLX 6-bit No-Think | MLX 6-bit Reasoning |
|---|---|---|---|---|---|---|---|---|
| Abstract Algebra | 4/20 | 15/20 | 9/20 | 19/20 | 8/20 | 18/20 | 7/20 | 19/20 |
| Anatomy | 13/20 | 17/20 | 15/20 | 19/20 | 14/20 | 18/20 | 17/20 | 19/20 |
| Astronomy | 17/20 | 19/20 | 18/20 | 20/20 | 17/20 | 19/20 | 19/20 | 20/20 |
| College CS | 7/20 | 17/20 | 10/20 | 18/20 | 11/20 | 17/20 | 11/20 | 17/20 |
| College Physics | 13/20 | 20/20 | 14/20 | 19/20 | 15/20 | 20/20 | 14/20 | 20/20 |
| HS Biology | 16/20 | 19/20 | 18/20 | 20/20 | 18/20 | 20/20 | 18/20 | 20/20 |
| HS Chemistry | 12/20 | 19/20 | 14/20 | 19/20 | 13/20 | 19/20 | 17/20 | 19/20 |
| HS Mathematics | 8/20 | 15/20 | 8/20 | 18/20 | 10/20 | 19/20 | 8/20 | 20/20 |
| Logical Fallacies | 12/20 | 18/20 | 14/20 | 16/20 | 14/20 | 17/20 | 13/20 | 17/20 |
| World Religions | 16/20 | 17/20 | 18/20 | 18/20 | 18/20 | 18/20 | 18/20 | 18/20 |
| Total | 118/200 (59.0%) | 176/200 (88.0%) | 138/200 (69.0%) | 186/200 (93.0%) | 138/200 (69.0%) | 185/200 (92.5%) | 142/200 (71.0%) | 189/200 (94.5%) |
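The totals row can be sanity-checked by summing a column. For example, the JANG_4M reasoning scores from the ten subject rows above should add up to the reported 186/200 (93.0%):

```python
# Per-subject JANG_4M reasoning scores, copied from the table above
jang_4m_reasoning = [19, 19, 20, 18, 19, 20, 19, 18, 16, 18]

total = sum(jang_4m_reasoning)
print(total, f"{100 * total / 200:.1f}%")  # 186 93.0%
```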
Summary
| | JANG_2L | JANG_4M | MLX 4-bit | MLX 6-bit |
|---|---|---|---|---|
| MMLU (no-think) | 59.0% | 69.0% | 69.0% | 71.0% |
| MMLU (reasoning) | 88.0% | 93.0% | 92.5% | 94.5% |
| Size | 10.3 GB | 17 GB | 16.6 GB | 23.9 GB |
| GPU RAM | 10.3 GB | 17 GB | ~17 GB | ~24 GB |
| Speed | 55 tok/s | — | — | — |
| Fits 16 GB? | YES | NO | NO | NO |
JANG_2L is the only quantization that fits 16 GB Macs while delivering 88% MMLU with reasoning. JANG_4M edges out MLX 4-bit (93.0% vs 92.5%) at essentially the same size (17 GB vs 16.6 GB).
Also see: JANG_2L (10 GB) — fits 16 GB Macs, 88% reasoning MMLU.
Specs
| Metric | Value |
|---|---|
| Source | Nemotron-Cascade-2-30B-A3B |
| Architecture | Hybrid Mamba-2 SSM + MoE + Dense Attention |
| Layers | 52 (Mamba-2 + MoE + 6 Attention) |
| Experts | 128 per MoE layer, top-6 active (3B active params) |
| KV cache | 6 attention layers, 2 KV heads, 128 dim — 0.2 GB at 32K context |
| Profile | JANG_2L (CRITICAL=8, IMPORTANT=6, COMPRESS=2) |
| Average bits | 4.12 bpw |
| Disk size | 10.3 GB |
| GPU RAM | 10.3 GB (peak 11.1 GB) |
| Speed | 55 tok/s generation, 154 tok/s prefill |
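The "128 experts, top-6 active" row is what keeps only ~3B of the 30B parameters active per token: a router scores all experts and only the top-6 are run. A toy sketch of that routing step (random scores stand in for a learned linear router; this is an illustration, not the model's actual code):

```python
import math
import random

NUM_EXPERTS, TOP_K = 128, 6

def route(scores):
    # Pick the top-k experts by score, then softmax-normalize
    # their scores into mixing weights for the k expert outputs.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:TOP_K]
    exps = [math.exp(scores[i]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

random.seed(0)
scores = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
weights = route(scores)  # 6 (expert_index, weight) pairs summing to 1.0
```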
Requirements
- Apple Silicon Mac with 24+ GB unified memory
- MLX Studio or `pip install "jang[mlx]>=2.1.5"`
Quick Start
```bash
pip install "jang[mlx]>=2.1.5"
```

```python
from jang_tools.loader import load_jang_model
from mlx_lm import generate

model, tokenizer = load_jang_model("JANGQ-AI/Nemotron-Cascade-2-30B-A3B-JANG_4M")

# With reasoning (recommended)
messages = [{"role": "user", "content": "Solve: what is the integral of x^2 * e^x?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True, enable_thinking=True)
result = generate(model, tokenizer, prompt=prompt, max_tokens=2048)

# Without reasoning (faster)
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True, enable_thinking=False)
result = generate(model, tokenizer, prompt=prompt, max_tokens=100)
```
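In reasoning mode the output wraps its chain of thought in `<think>...</think>` tags, so it is often useful to separate the reasoning from the final answer. A small helper for that — this parser is our own sketch, not part of jang or mlx_lm:

```python
def split_reasoning(text):
    """Split generated text into (reasoning, answer) on <think> tags."""
    if "<think>" in text and "</think>" in text:
        before, rest = text.split("<think>", 1)
        thought, answer = rest.split("</think>", 1)
        return thought.strip(), (before + answer).strip()
    return "", text.strip()  # no-think mode: everything is the answer

thought, answer = split_reasoning(
    "<think>x^2 e^x: integrate by parts twice</think>(x^2 - 2x + 2)e^x + C"
)
print(answer)  # (x^2 - 2x + 2)e^x + C
```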
Technical Notes
- Mamba-2 SSM: Most layers use state-space models, enabling efficient long-context with minimal KV cache.
- Only 6 attention layers: the KV cache is tiny (0.2 GB at 32K). Most models use attention in 25-100% of their layers.
- nemotron_h architecture: Requires JANG loader for proper weight mapping. Standard mlx-lm has incomplete support.
- IMO Gold Medal: This model achieves competition-level mathematical reasoning at 30B scale.
JANG — Created by Jinho Jang (eric@jangq.ai) · @dealignai
GitHub · PyPI · HuggingFace
Quantized from base model: nvidia/Nemotron-Cascade-2-30B-A3B
