MLX Studio — the only app that natively supports JANG models with reasoning
IMO Gold Medal reasoning in 17 GB. Nemotron-Cascade-2 achieves 93% MMLU with reasoning at just 17 GB, and the 10.3 GB JANG_2L variant fits 16 GB MacBooks. Hybrid Mamba-2 SSM + MoE + Attention. Only 6 KV-cache attention layers = minimal memory at long context.
LM Studio, Ollama, and oMLX do NOT support the JANG format. Use MLX Studio or `pip install "jang[mlx]>=2.1.5"`.
Nemotron-Cascade-2-30B-A3B — JANG_4M (4.1-bit, 8-bit attention) — Reasoning
JANG — Jang Adaptive N-bit Grading | The GGUF Equivalent for MLX
JANG is fully open-source. Quantization engine, research, and full commit history: github.com/jjang-ai/jangq. Created by Jinho Jang.
Key Features
- 93.0% MMLU (200 questions, reasoning mode) — IMO Gold Medal model in 17 GB
- 55 tok/s generation, 154 tok/s prefill
- 10.3 GB on disk, 10.3 GB GPU RAM (peak 11.1 GB)
- Reasoning mode: `<think>...</think>` step-by-step problem solving
- Tiny KV cache: only 6 attention layers, 0.2 GB at 32K context
- Hybrid architecture: Mamba-2 SSM + MoE (128 experts, top-6) + Attention
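The 0.2 GB KV cache figure follows directly from the spec numbers (6 attention layers, 2 KV heads, 128-dim heads, 32K context). A back-of-envelope check, assuming fp16 (2-byte) keys and values — the dtype is an assumption, the rest comes from the spec table:

```python
def kv_cache_bytes(layers=6, kv_heads=2, head_dim=128,
                   context=32_768, bytes_per_elem=2):
    # 2x for keys and values, per layer, per KV head, per token
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem

gb = kv_cache_bytes() / 1024**3
print(gb)  # 0.1875 GB, i.e. ~0.2 GB, matching the spec
```

A dense 52-attention-layer model with the same head geometry would need roughly 52/6 ≈ 9x this cache at the same context length.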
Results: JANG vs MLX (200-question MMLU)
Per-subject comparison. All models tested with and without reasoning.
| Subject | JANG_2L No-Think | JANG_2L Reasoning | JANG_4M No-Think | JANG_4M Reasoning | MLX 4-bit No-Think | MLX 4-bit Reasoning | MLX 6-bit No-Think | MLX 6-bit Reasoning |
|---|---|---|---|---|---|---|---|---|
| Abstract Algebra | 4/20 | 15/20 | 9/20 | 19/20 | 8/20 | 18/20 | 7/20 | 19/20 |
| Anatomy | 13/20 | 17/20 | 15/20 | 19/20 | 14/20 | 18/20 | 17/20 | 19/20 |
| Astronomy | 17/20 | 19/20 | 18/20 | 20/20 | 17/20 | 19/20 | 19/20 | 20/20 |
| College CS | 7/20 | 17/20 | 10/20 | 18/20 | 11/20 | 17/20 | 11/20 | 17/20 |
| College Physics | 13/20 | 20/20 | 14/20 | 19/20 | 15/20 | 20/20 | 14/20 | 20/20 |
| HS Biology | 16/20 | 19/20 | 18/20 | 20/20 | 18/20 | 20/20 | 18/20 | 20/20 |
| HS Chemistry | 12/20 | 19/20 | 14/20 | 19/20 | 13/20 | 19/20 | 17/20 | 19/20 |
| HS Mathematics | 8/20 | 15/20 | 8/20 | 18/20 | 10/20 | 19/20 | 8/20 | 20/20 |
| Logical Fallacies | 12/20 | 18/20 | 14/20 | 16/20 | 14/20 | 17/20 | 13/20 | 17/20 |
| World Religions | 16/20 | 17/20 | 18/20 | 18/20 | 18/20 | 18/20 | 18/20 | 18/20 |
| Total | 118/200 (59.0%) | 176/200 (88.0%) | 138/200 (69.0%) | 186/200 (93.0%) | 138/200 (69.0%) | 185/200 (92.5%) | 142/200 (71.0%) | 189/200 (94.5%) |
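The totals row can be sanity-checked by summing a column. For example, the JANG_4M reasoning scores from the ten subject rows above should add up to the reported 186/200 (93.0%):

```python
# Per-subject JANG_4M reasoning scores, copied from the table above
jang_4m_reasoning = [19, 19, 20, 18, 19, 20, 19, 18, 16, 18]

total = sum(jang_4m_reasoning)
print(total, f"{100 * total / 200:.1f}%")  # 186 93.0%
```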
Summary
| | JANG_2L | JANG_4M | MLX 4-bit | MLX 6-bit |
|---|---|---|---|---|
| MMLU (no-think) | 59.0% | 69.0% | 69.0% | 71.0% |
| MMLU (reasoning) | 88.0% | 93.0% | 92.5% | 94.5% |
| Size | 10.3 GB | 17 GB | 16.6 GB | 23.9 GB |
| GPU RAM | 10.3 GB | 17 GB | ~17 GB | ~24 GB |
| Speed | 55 tok/s | — | — | — |
| Fits 16 GB? | YES | NO | NO | NO |
JANG_2L is the only quantization that fits 16 GB Macs while delivering 88% MMLU with reasoning. JANG_4M edges out MLX 4-bit (93.0% vs 92.5%) at essentially the same size (17 GB vs 16.6 GB).
Also see: JANG_2L (10 GB) — fits 16 GB Macs, 88% reasoning MMLU.
Specs
| Metric | Value |
|---|---|
| Source | Nemotron-Cascade-2-30B-A3B |
| Architecture | Hybrid Mamba-2 SSM + MoE + Dense Attention |
| Layers | 52 (Mamba-2 + MoE + 6 Attention) |
| Experts | 128 per MoE layer, top-6 active (3B active params) |
| KV cache | 6 attention layers, 2 KV heads, 128 dim — 0.2 GB at 32K context |
| Profile | JANG_2L (CRITICAL=8, IMPORTANT=6, COMPRESS=2) |
| Average bits | 4.12 bpw |
| Disk size | 10.3 GB |
| GPU RAM | 10.3 GB (peak 11.1 GB) |
| Speed | 55 tok/s generation, 154 tok/s prefill |
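The "128 experts, top-6 active" row is what keeps only ~3B of the 30B parameters active per token: a router scores all experts and only the top-6 are run. A toy sketch of that routing step (random scores stand in for a learned linear router; this is an illustration, not the model's actual code):

```python
import math
import random

NUM_EXPERTS, TOP_K = 128, 6

def route(scores):
    # Pick the top-k experts by score, then softmax-normalize
    # their scores into mixing weights for the k expert outputs.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:TOP_K]
    exps = [math.exp(scores[i]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

random.seed(0)
scores = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
weights = route(scores)  # 6 (expert_index, weight) pairs summing to 1.0
```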
Requirements
- Apple Silicon Mac with 24+ GB unified memory
- MLX Studio or `pip install "jang[mlx]>=2.1.5"`
Quick Start
```bash
pip install "jang[mlx]>=2.1.5"
```

```python
from jang_tools.loader import load_jang_model
from mlx_lm import generate

model, tokenizer = load_jang_model("JANGQ-AI/Nemotron-Cascade-2-30B-A3B-JANG_4M")

# With reasoning (recommended)
messages = [{"role": "user", "content": "Solve: what is the integral of x^2 * e^x?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True, enable_thinking=True)
result = generate(model, tokenizer, prompt=prompt, max_tokens=2048)

# Without reasoning (faster)
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True, enable_thinking=False)
result = generate(model, tokenizer, prompt=prompt, max_tokens=100)
```
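In reasoning mode the output wraps its chain of thought in `<think>...</think>` tags, so it is often useful to separate the reasoning from the final answer. A small helper for that — this parser is our own sketch, not part of jang or mlx_lm:

```python
def split_reasoning(text):
    """Split generated text into (reasoning, answer) on <think> tags."""
    if "<think>" in text and "</think>" in text:
        before, rest = text.split("<think>", 1)
        thought, answer = rest.split("</think>", 1)
        return thought.strip(), (before + answer).strip()
    return "", text.strip()  # no-think mode: everything is the answer

thought, answer = split_reasoning(
    "<think>x^2 e^x: integrate by parts twice</think>(x^2 - 2x + 2)e^x + C"
)
print(answer)  # (x^2 - 2x + 2)e^x + C
```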
Technical Notes
- Mamba-2 SSM: Most layers use state-space models, enabling efficient long-context with minimal KV cache.
- Only 6 attention layers: the KV cache is tiny (0.2 GB at 32K). Most models use attention in 25-100% of their layers.
- nemotron_h architecture: Requires JANG loader for proper weight mapping. Standard mlx-lm has incomplete support.
- IMO Gold Medal: This model achieves competition-level mathematical reasoning at 30B scale.
JANG — Created by Jinho Jang (eric@jangq.ai) · @dealignai
GitHub · PyPI · HuggingFace
Quantized from base model: nvidia/Nemotron-Cascade-2-30B-A3B
