DeepMath: A Lightweight Math Reasoning Agent

An LLM using a calculator to answer questions.

Model Description

DeepMath is a 4B parameter mathematical reasoning model that combines a fine-tuned LLM with a sandboxed Python executor. Built on Qwen3-4B Thinking and trained with GRPO (Group Relative Policy Optimization), DeepMath generates concise Python snippets for computational steps instead of verbose text explanations, significantly reducing errors and output length.

Key Features

  • Code-driven reasoning: Generates short Python snippets for intermediate computational steps
  • Sandboxed execution: No file I/O, no network calls, strict timeouts
  • Improved accuracy: Offloading computation reduces arithmetic errors
  • Reduced verbosity: Up to 66% shorter outputs compared to the baseline
  • Safe and auditable: Deterministic execution with readable code snippets

Model Architecture

DeepMath uses a LoRA adapter fine-tuned on top of Qwen3-4B Thinking with the following components:

  • Agent Interface: Outputs special tokens that trigger Python code execution during reasoning (a sketch of this loop follows Figure 1)
  • Executor: Sandboxed Python environment with allow-listed modules
  • Safety Constraints: Per-snippet timeouts, no file/network access
  • Training Method: GRPO with accuracy and code generation rewards

Figure 1: The TRL vLLM client and server were modified so that candidate generation runs through the DeepMath agent while still using the vLLM backend.
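
Conceptually, the agent alternates between text generation and tool calls: the model writes a short snippet between special tokens, the sandbox executes it, and the printed result is spliced back into the reasoning trace before generation resumes. The sketch below illustrates this loop; the tag strings, model_generate, and run_sandboxed are placeholder assumptions, not DeepMath's actual tokens or APIs.

# Illustrative sketch of the agent loop; tag strings and helpers are assumptions.
CODE_START, CODE_END = "<code>", "</code>"
OUTPUT_START, OUTPUT_END = "<output>", "</output>"

def agent_loop(model_generate, run_sandboxed, problem, max_steps=50):
    """Alternate model generation with sandboxed code execution."""
    trace = problem
    for _ in range(max_steps):
        # Generate until the model either finishes or closes a code block.
        continuation = model_generate(trace, stop=[CODE_END])
        trace += continuation
        if CODE_START not in continuation:
            break  # no tool call: the trace ends with the final answer
        snippet = continuation.split(CODE_START, 1)[1]
        result = run_sandboxed(snippet)  # timeouts and allow-list enforced here
        trace += f"{CODE_END}\n{OUTPUT_START}{result}{OUTPUT_END}\n"
    return trace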

Training Details

Training Data

  • Dataset: OpenMathReasoning (tool-usage subset)
  • Note: GRPO training uses only the problems, not the solutions
  • In-context Learning: 4 solved examples demonstrating agent call syntax and patterns (an illustrative prompt assembly is sketched after this list)
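
For illustration, the few-shot examples might be prepended to each problem as sketched below; the delimiter format and helper name are assumptions, and the actual solved examples ship with the repository (deep_math/fewshot.txt).

# Illustrative prompt assembly; the delimiters are assumptions, not the exact format.
def build_prompt(fewshot_path: str, problem: str) -> str:
    with open(fewshot_path) as f:
        solved_examples = f.read()  # 4 solved problems demonstrating agent call syntax
    return f"{solved_examples}\n\nProblem: {problem}\nSolution:"

prompt = build_prompt("deep_math/fewshot.txt", "What is 17**3 - 12**3?")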

Training Procedure

GRPO (Group Relative Policy Optimization) fine-tuning with the following settings (a minimal sketch of the rewards and temperature schedule follows this list):

  • Accuracy Reward: +1 for correct answers
  • Code Generation Reward: +1 for using code snippets (weighted 10:1 vs. accuracy)
  • Length Constraint: GRPO completions limited to 5k tokens
  • Temperature Scheduling: Linear schedule from T=1.2 → T=0.7 during training
  • Infrastructure: Modified TRL library's vLLM client and server
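
A minimal sketch of the two reward terms and the linear temperature schedule listed above; the helper names are assumptions, and only the +1 rewards, the 10:1 weighting, and the 1.2 to 0.7 schedule come from the training description.

def completion_reward(is_correct: bool, used_code: bool,
                      accuracy_weight: float = 1.0, code_weight: float = 1.0) -> float:
    # The card reports a 10:1 relative weighting between the two terms;
    # the concrete weight values here are placeholders.
    accuracy_reward = 1.0 if is_correct else 0.0
    code_reward = 1.0 if used_code else 0.0
    return accuracy_weight * accuracy_reward + code_weight * code_reward

def temperature(step: int, total_steps: int, t_start: float = 1.2, t_end: float = 0.7) -> float:
    """Linear temperature schedule from T=1.2 down to T=0.7 over training."""
    frac = step / max(total_steps - 1, 1)
    return t_start + frac * (t_end - t_start)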

Training Infrastructure

  • Base inference engine: vLLM
  • Agent framework: Based on SmolAgents
  • Training framework: Modified TRL GRPO trainer

Performance

Benchmark Results

We evaluated DeepMath on four mathematical reasoning datasets using majority@16 and mean output length metrics:

Main results table showing performance across MATH500, AIME, HMMT, and HLE datasets.
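
For reference, majority@16 takes the most frequent final answer among 16 sampled completions per problem and scores that single answer against the reference; a minimal sketch (answer normalization omitted):

from collections import Counter

def majority_at_k(sampled_answers, k=16):
    """Return the most frequent final answer among the first k samples."""
    votes = Counter(sampled_answers[:k])
    answer, _count = votes.most_common(1)[0]
    return answer

samples = ["204", "204", "210", "204"] * 4  # e.g. 16 sampled answers for one problem
print(majority_at_k(samples))  # -> "204"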

Key Findings:

  • Accuracy: Improved performance on challenging datasets (AIME, HMMT, HLE)
  • Efficiency: Up to 66% reduction in output length
  • Robustness: Consistent improvements when combining agent + GRPO training

Evaluation Datasets

  • MATH500: Subset of the MATH dataset
  • AIME: American Invitational Mathematics Examination problems
  • HMMT: Harvard-MIT Mathematics Tournament problems
  • HLE: Humanity's Last Exam problems

Figure 2: Example output where Python code is generated, evaluated, and the result is inserted into the reasoning trace.

Usage

Installation

# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone repository
git clone https://github.com/IntelLabs/DeepMath.git
cd DeepMath

# Install dependencies
uv pip install -r requirements.txt
uv pip install -e .

Basic Inference

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/deepmath-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example problem
problem = "What is the sum of the first 100 positive integers?"

inputs = tokenizer(problem, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=3000)
print(tokenizer.decode(outputs[0]))
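
Because the base model is a chat-style thinking model, you may prefer to wrap the problem with the tokenizer's chat template rather than passing raw text; a variant using the standard transformers API (generation settings here are illustrative):

# Optional: format the problem with the chat template before generating.
messages = [{"role": "user", "content": problem}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=3000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))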

Inference with Agent

For full agent capabilities with sandboxed Python execution:

python inference.py \
    +model.use_vllm=true \
    +model.math_agent=true \
    +model.examples=deep_math/fewshot.txt \
    model.generation.max_new_tokens=3000 \
    +model.max_agent_output=20000 \
    +model.max_steps=50 \
    model.model_name_or_path=Intel/deepmath-v1 \
    hf_tag=HuggingFaceH4/MATH-500 \
    generated_file=output.jsonl

See the repository for complete usage examples.

Limitations and Biases

Limitations

  • Scope: Optimized for mathematical reasoning tasks; may not generalize to other domains
  • Problem Types: Evaluated on contest-style math problems; performance on open-ended mathematical creativity or formal proofs is unknown
  • Model Size: 4B parameters may limit reasoning depth on extremely complex problems
  • Code Execution: Requires sandboxed environment for full agent capabilities

Safety Considerations

⚠️ Code Execution Risk: This model generates and executes Python code. While DeepMath uses strict sandboxing and resource limits, any deployment should:

  • Carefully manage attack surfaces
  • Enforce rate limits
  • Use proper isolation (containers, VMs; see the sketch after this list)
  • Monitor resource usage
  • Validate generated code before execution in production
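
As one illustration of the isolation and timeout constraints listed above (not DeepMath's actual executor), a generated snippet can be run in a child process with a hard timeout and an import allow-list:

import multiprocessing

ALLOWED_MODULES = {"math", "fractions", "itertools"}  # assumed allow-list, for illustration

def _worker(snippet, queue):
    import builtins
    safe_builtins = {name: getattr(builtins, name)
                     for name in ("abs", "min", "max", "range", "len", "sum", "print")}

    def safe_import(name, *args, **kwargs):
        if name.split(".")[0] not in ALLOWED_MODULES:
            raise ImportError(f"module {name!r} is not allow-listed")
        return __import__(name, *args, **kwargs)

    safe_builtins["__import__"] = safe_import
    try:
        # Note: restricting builtins is not a real security boundary;
        # containers or VMs are still needed for untrusted code.
        exec(snippet, {"__builtins__": safe_builtins})
        queue.put("ok")
    except Exception as exc:
        queue.put(f"error: {exc}")

def run_sandboxed(snippet, timeout_s=5.0):
    """Run a generated snippet in a child process with a hard timeout."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_worker, args=(snippet, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():  # snippet ran too long: kill it
        proc.terminate()
        return "error: timeout"
    return queue.get() if not queue.empty() else "error: no result"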

Ethical Considerations

  • The model is trained on mathematical problem-solving datasets and should not be used for decision-making in critical applications without human oversight
  • Generated code should be reviewed before execution in production environments
  • The model may reflect biases present in the training data

Citation

If you use DeepMath in your research, please cite:

@software{deepmath2025,
  author = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
  title = {DeepMath: A Lightweight Math Reasoning Agent for LLMs},
  year = {2025},
  publisher = {Intel AI Labs},
  url = {https://github.com/IntelLabs/DeepMath}
}

Model Card Contact

For questions or issues, please open an issue on the GitHub repository.
