For full information, check out the DR Tulu paper.

DR Tulu 8B - MLX

This is DR Tulu 8B converted to MLX format for efficient inference on Apple Silicon hardware.

MLX Model Variants

All variants are optimized for Apple Silicon with different memory/performance trade-offs:

| Model | Precision | Model Size | Bits/Weight | Memory Usage | Performance | Download |
|---|---|---|---|---|---|---|
| DR-Tulu-8B-MLX-4bit | 4-bit quantized | ~4.3GB | 4.500 | Lower | 78.2 tok/s | πŸ€— HF |
| DR-Tulu-8B-MLX-6bit | 6-bit quantized | ~6.2GB | 6.500 | Medium | 60.7 tok/s | πŸ€— HF |
| DR-Tulu-8B-MLX-8bit | 8-bit quantized | ~8.1GB | 8.500 | Medium-High | 59.8 tok/s | πŸ€— HF |
| DR-Tulu-8B-MLX-bf16 | bfloat16 (full) | ~15.3GB | 16.000 | High | 35.0 tok/s | πŸ€— HF |
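
The extra 0.5 bits/weight on the quantized variants is quantization overhead: mlx-lm stores a scale and bias per group of weights, and with the default 64-weight groups, two fp16 values add 32/64 = 0.5 bits per weight. A rough size check against the table, assuming Qwen3-8B's ~8.2B parameters:

```python
# Back-of-envelope model sizes from the Bits/Weight column.
# 8.2e9 is the approximate parameter count of Qwen3-8B.
params = 8.2e9
for name, bpw in [("4bit", 4.5), ("6bit", 6.5), ("8bit", 8.5), ("bf16", 16.0)]:
    size_gib = params * bpw / 8 / 1024**3
    print(f"{name}: ~{size_gib:.1f} GiB")  # ~4.3, 6.2, 8.1, 15.3 -- matching the table
```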

πŸ”₯ Key Features:

  • Original Model: rl-research/DR-Tulu-8B
  • Hardware Optimized: Apple Silicon (M1/M2/M3/M4/M5)
  • Conversion Framework: mlx-lm
  • Research-Grade Choice: bf16 preserves full precision for maximum quality and capability
  • All variants maintain core research reasoning capabilities

πŸ”₯ MLX Conversion Details:

  • Original Model: rl-research/DR-Tulu-8B
  • Conversion: MLX format with bfloat16 precision (research-grade full precision)
  • Model Size: ~15.3GB (down from 16.4GB original)
  • Hardware Used: Mac Studio with Apple M1 Ultra (20-core, 128GB unified memory)
  • Conversion Framework: mlx-lm
  • Performance: ~35 tokens/sec, 16.4GB memory usage

Hardware Requirements

| Variant | Minimum RAM | Recommended RAM | Storage |
|---|---|---|---|
| 4bit | 8GB | 16GB | 5GB |
| 6bit | 16GB | 24GB | 7GB |
| 8bit | 16GB | 32GB | 9GB |
| bf16 | 24GB | 32GB+ | 16GB |

Tested Hardware: Mac Studio with Apple M1 Ultra (20-core, 128GB unified memory)
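
To check how much unified memory a machine has before picking a variant, Python's os.sysconf works on macOS; the thresholds below simply mirror the Minimum RAM column of the table, and the selection loop is illustrative only:

```python
import os

# Total unified memory on macOS (page size x page count)
total_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
print(f"Unified memory: {total_gb:.0f} GB")

# Largest variant whose minimum-RAM figure (from the table above) fits
for variant, min_gb in [("bf16", 24), ("8bit", 16), ("6bit", 16), ("4bit", 8)]:
    if total_gb >= min_gb:
        print(f"Largest variant meeting the minimum: {variant}")
        break
```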

MLX Quick Start

Command Line Interface

Install and run with uvx:

```bash
# Generate text (replace {VARIANT} with 4bit, 6bit, 8bit, or bf16)
uvx --from mlx-lm mlx_lm.generate --model Plurigrid/DR-Tulu-8B-MLX-{VARIANT} --prompt "What is category theory and how does it apply to computer science?" --max-tokens 200

# Interactive chat
uvx --from mlx-lm mlx_lm.chat --model Plurigrid/DR-Tulu-8B-MLX-{VARIANT}
```
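
To avoid a pause on first use, you can pre-fetch a variant into the local Hugging Face cache with huggingface_hub (installed alongside mlx-lm); a minimal sketch, using the 4-bit variant as an example:

```python
from huggingface_hub import snapshot_download

# Download once; later mlx_lm calls reuse the cached copy.
# mlx_lm.load() also accepts the returned local path directly.
local_path = snapshot_download("Plurigrid/DR-Tulu-8B-MLX-4bit")
print(local_path)
```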

Python API

```python
from mlx_lm import load, generate

# Load model (replace {VARIANT} with 4bit, 6bit, 8bit, or bf16)
model, tokenizer = load("Plurigrid/DR-Tulu-8B-MLX-{VARIANT}")

prompt = "What is category theory and how does it apply to computer science?"

# Apply chat template if available
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

# Generate response
response = generate(model, tokenizer, prompt=prompt, verbose=True)
print(response)
```

Installation for Python API:

```bash
pip install mlx-lm
# or with uv
uv add mlx-lm
```

Advanced Usage:

```python
# Assumes model and tokenizer are already loaded as in the Quick Start above

# For research tasks with step-by-step reasoning
prompt = "Analyze the relationship between category theory and functional programming. Think step by step."

# Multi-turn conversation
messages = [
    {"role": "user", "content": "What is category theory?"},
    {"role": "assistant", "content": "Category theory is a mathematical framework..."},
    {"role": "user", "content": "How does it apply to computer science?"}
]

if tokenizer.chat_template is not None:
    formatted_prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    response = generate(model, tokenizer, prompt=formatted_prompt, max_tokens=500)
    print(response)
```
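
For long research-style answers it is often nicer to stream tokens as they arrive instead of waiting for the full completion. A sketch using mlx_lm's stream_generate (recent mlx-lm versions yield response objects with a .text field; older versions differ slightly):

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("Plurigrid/DR-Tulu-8B-MLX-4bit")  # any variant works

messages = [{"role": "user", "content": "What is category theory?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Print tokens incrementally as they are generated
for chunk in stream_generate(model, tokenizer, prompt=prompt, max_tokens=500):
    print(chunk.text, end="", flush=True)
print()
```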

About DR Tulu

This is the RL checkpoint of DR Tulu, an open deep research agent trained on top of rl-research/DR-Tulu-SFT-8B.

This model has undergone RL training on this dataset. For more details on DR Tulu, please read our paper!

Inference and Usage

Note: The original model was trained for tool-use using the dr-agent-lib framework. This MLX version provides general inference capabilities optimized for Apple Silicon.

For advanced tool-use functionality, see our GitHub or check out our demo!

Evaluation Results

Results from the original DR-Tulu-8B model:

| Model | SQAv2 | HealthBench | ResearchQA | DeepResearch Bench | SimpleQA | 2Wiki | WebWalker | Average |
|---|---|---|---|---|---|---|---|---|
| Qwen3-8B (naive RAG) | 40.4 | 16.5 | 56.1 | 33.3 | 52.6 | 18.9 | 8.8 | 32.4 |
| Qwen3-8B (our search pipeline) | 57.2 | 5.9 | 46.3 | 18.2 | 70.5 | 44.0 | 27.9 | 38.6 |
| DR-Tulu-SFT-8B | 72.3 | 38.1 | 68.5 | 39.0 | 75.5 | 66.5 | 31.9 | 56.0 |
| DR-Tulu-8B (original) | 86.7 | 43.7 | 71.1 | 41.8 | 80.1 | 68.0 | 39.1 | 61.5 |
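
The Average column is the unweighted mean of the seven benchmark scores, which is easy to verify for any row:

```python
# DR-Tulu-8B (original) row from the table above
scores = [86.7, 43.7, 71.1, 41.8, 80.1, 68.0, 39.1]
print(round(sum(scores) / len(scores), 1))  # 61.5
```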

For more baselines, explanations of this table, and analysis of results, check out the DR Tulu paper!

Intended uses & limitations

This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.

MLX-specific considerations:

  • Optimized for Apple Silicon hardware only
  • bf16 maintains full model quality (research-grade full precision)
  • Quantized variants (4/6/8-bit) incur only minimal quality loss in exchange for lower memory use
  • Core research reasoning capabilities are preserved across all variants

Training

The script used to train the original model can be found here.

For hyperparameter details, check out the DR Tulu paper.

Citation

```bibtex
@article{drtulu,
  title  = {{DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research}},
  author = {Rulin Shao and Akari Asai and Shannon Shen and Hamish Ivison and Varsha Kishore and Jingming Zhuo and Xinran Zhao and Molly Park and Sam Finlayson and David Sontag and Tyler Murray and Sewon Min and Pradeep Dasigi and Luca Soldaini and Faeze Brahman and Scott Yih and Sherry Tongshuang Wu and Luke Zettlemoyer and Yoon Kim and Hanna Hajishirzi and Pang Wei Koh},
  year   = {2025},
}
```

Conversion Details

  • Date: November 22, 2025
  • Converter: MLX community
  • Command: `uvx --from mlx-lm mlx_lm.convert --hf-path rl-research/DR-Tulu-8B --mlx-path ./DR-Tulu-8B-bf16`
  • Precision: bfloat16 (full-precision MLX conversion)
  • Hardware: Mac Studio, Apple M1 Ultra (20-core CPU, 128GB unified memory)
  • OS: macOS Sequoia 15.2 (Darwin 25.2.0)
  • Framework Version: mlx-lm latest (November 2025)
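
The quantized variants can be produced with the same tooling. A sketch using mlx_lm's Python conversion API (keyword names per current mlx-lm; check your installed version, as the signature has changed across releases):

```python
from mlx_lm import convert

# Convert the original weights to a 4-bit MLX model.
# quantize/q_bits/q_group_size are mlx-lm's quantization knobs;
# set q_bits to 6 or 8 for the other quantized variants.
convert(
    hf_path="rl-research/DR-Tulu-8B",
    mlx_path="./DR-Tulu-8B-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```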