DeepMath: A Lightweight Math Reasoning Agent

An LLM using a calculator to answer questions.

Model Description

DeepMath is a 4B parameter mathematical reasoning model that combines a fine-tuned LLM with a sandboxed Python executor. Built on Qwen3-4B Thinking and trained with GRPO (Group Relative Policy Optimization), DeepMath generates concise Python snippets for computational steps instead of verbose text explanations, significantly reducing errors and output length.

Key Features

  • Code-driven reasoning: Generates short Python snippets for intermediate computational steps
  • Sandboxed execution: No file I/O, no network calls, strict timeouts
  • Improved accuracy: Offloading computation reduces arithmetic errors
  • Reduced verbosity: Up to 66% shorter outputs compared to the baseline
  • Safe and auditable: Deterministic execution with readable code snippets

Model Architecture

DeepMath uses a LoRA adapter fine-tuned on top of Qwen3-4B Thinking with the following components:

  • Agent Interface: Outputs special tokens that trigger Python code execution during reasoning (a sketch of this loop follows Figure 1)
  • Executor: Sandboxed Python environment with allow-listed modules
  • Safety Constraints: Per-snippet timeouts, no file/network access
  • Training Method: GRPO with accuracy and code generation rewards

Figure 1: The TRL vLLM client and server were modified so that candidate generation runs through the DeepMath agent while still using the vLLM backend.
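
Conceptually, the agent alternates between text generation and tool calls: the model writes a short snippet between special tokens, the sandbox executes it, and the printed result is spliced back into the reasoning trace before generation resumes. The sketch below illustrates this loop; the tag strings, model_generate, and run_sandboxed are placeholder assumptions, not DeepMath's actual tokens or APIs.

# Illustrative sketch of the agent loop; tag strings and helpers are assumptions.
CODE_START, CODE_END = "<code>", "</code>"
OUTPUT_START, OUTPUT_END = "<output>", "</output>"

def agent_loop(model_generate, run_sandboxed, problem, max_steps=50):
    """Alternate model generation with sandboxed code execution."""
    trace = problem
    for _ in range(max_steps):
        # Generate until the model either finishes or closes a code block.
        continuation = model_generate(trace, stop=[CODE_END])
        trace += continuation
        if CODE_START not in continuation:
            break  # no tool call: the trace ends with the final answer
        snippet = continuation.split(CODE_START, 1)[1]
        result = run_sandboxed(snippet)  # timeouts and allow-list enforced here
        trace += f"{CODE_END}\n{OUTPUT_START}{result}{OUTPUT_END}\n"
    return trace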

Training Details

Training Data

  • Dataset: OpenMathReasoning (tool-usage subset)
  • Note: GRPO training uses only the problems, not the solutions
  • In-context Learning: 4 solved examples demonstrating agent call syntax and patterns (an illustrative prompt assembly is sketched after this list)
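
For illustration, the few-shot examples might be prepended to each problem as sketched below; the delimiter format and helper name are assumptions, and the actual solved examples ship with the repository (deep_math/fewshot.txt).

# Illustrative prompt assembly; the delimiters are assumptions, not the exact format.
def build_prompt(fewshot_path: str, problem: str) -> str:
    with open(fewshot_path) as f:
        solved_examples = f.read()  # 4 solved problems demonstrating agent call syntax
    return f"{solved_examples}\n\nProblem: {problem}\nSolution:"

prompt = build_prompt("deep_math/fewshot.txt", "What is 17**3 - 12**3?")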

Training Procedure

GRPO (Group Relative Policy Optimization) fine-tuning with the following settings (a minimal sketch of the rewards and temperature schedule follows this list):

  • Accuracy Reward: +1 for correct answers
  • Code Generation Reward: +1 for using code snippets (weighted 10:1 vs. accuracy)
  • Length Constraint: GRPO completions limited to 5k tokens
  • Temperature Scheduling: Linear schedule from T=1.2 → T=0.7 during training
  • Infrastructure: Modified TRL library's vLLM client and server
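
A minimal sketch of the two reward terms and the linear temperature schedule listed above; the helper names are assumptions, and only the +1 rewards, the 10:1 weighting, and the 1.2 to 0.7 schedule come from the training description.

def completion_reward(is_correct: bool, used_code: bool,
                      accuracy_weight: float = 1.0, code_weight: float = 1.0) -> float:
    # The card reports a 10:1 relative weighting between the two terms;
    # the concrete weight values here are placeholders.
    accuracy_reward = 1.0 if is_correct else 0.0
    code_reward = 1.0 if used_code else 0.0
    return accuracy_weight * accuracy_reward + code_weight * code_reward

def temperature(step: int, total_steps: int, t_start: float = 1.2, t_end: float = 0.7) -> float:
    """Linear temperature schedule from T=1.2 down to T=0.7 over training."""
    frac = step / max(total_steps - 1, 1)
    return t_start + frac * (t_end - t_start)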

Training Infrastructure

  • Base inference engine: vLLM
  • Agent framework: Based on SmolAgents
  • Training framework: Modified TRL GRPO trainer

Performance

Benchmark Results

We evaluated DeepMath on four mathematical reasoning datasets using majority@16 and mean output length metrics:

Main results table showing performance across MATH500, AIME, HMMT, and HLE datasets.
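
For reference, majority@16 takes the most frequent final answer among 16 sampled completions per problem and scores that single answer against the reference; a minimal sketch (answer normalization omitted):

from collections import Counter

def majority_at_k(sampled_answers, k=16):
    """Return the most frequent final answer among the first k samples."""
    votes = Counter(sampled_answers[:k])
    answer, _count = votes.most_common(1)[0]
    return answer

samples = ["204", "204", "210", "204"] * 4  # e.g. 16 sampled answers for one problem
print(majority_at_k(samples))  # -> "204"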

Key Findings:

  • Accuracy: Improved performance on challenging datasets (AIME, HMMT, HLE)
  • Efficiency: Up to 66% reduction in output length
  • Robustness: Consistent improvements when combining agent + GRPO training

Evaluation Datasets

  • MATH500: Subset of the MATH dataset
  • AIME: American Invitational Mathematics Examination problems
  • HMMT: Harvard-MIT Mathematics Tournament problems
  • HLE: Humanity's Last Exam problems

Figure 2: Example output where Python code is generated, evaluated, and the result is inserted into the reasoning trace.

Usage

Installation

# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone repository
git clone https://github.com/IntelLabs/DeepMath.git
cd DeepMath

# Install dependencies
uv pip install -r requirements.txt
uv pip install -e .

Basic Inference

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/deepmath-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example problem
problem = "What is the sum of the first 100 positive integers?"

inputs = tokenizer(problem, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=3000)
print(tokenizer.decode(outputs[0]))
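
Because the base model is a chat-style thinking model, you may prefer to wrap the problem with the tokenizer's chat template rather than passing raw text; a variant using the standard transformers API (generation settings here are illustrative):

# Optional: format the problem with the chat template before generating.
messages = [{"role": "user", "content": problem}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=3000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))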

Inference with Agent

For full agent capabilities with sandboxed Python execution:

python inference.py \
    +model.use_vllm=true \
    +model.math_agent=true \
    +model.examples=deep_math/fewshot.txt \
    model.generation.max_new_tokens=3000 \
    +model.max_agent_output=20000 \
    +model.max_steps=50 \
    model.model_name_or_path=Intel/deepmath-v1 \
    hf_tag=HuggingFaceH4/MATH-500 \
    generated_file=output.jsonl

See the repository for complete usage examples.

Limitations and Biases

Limitations

  • Scope: Optimized for mathematical reasoning tasks; may not generalize to other domains
  • Problem Types: Evaluated on contest-style math problems; performance on open-ended mathematical creativity or formal proofs is unknown
  • Model Size: 4B parameters may limit reasoning depth on extremely complex problems
  • Code Execution: Requires sandboxed environment for full agent capabilities

Safety Considerations

⚠️ Code Execution Risk: This model generates and executes Python code. While DeepMath uses strict sandboxing and resource limits, any deployment should:

  • Carefully manage attack surfaces
  • Enforce rate limits
  • Use proper isolation (containers, VMs; see the sketch after this list)
  • Monitor resource usage
  • Validate generated code before execution in production
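
As one illustration of the isolation and timeout constraints listed above (not DeepMath's actual executor), a generated snippet can be run in a child process with a hard timeout and an import allow-list:

import multiprocessing

ALLOWED_MODULES = {"math", "fractions", "itertools"}  # assumed allow-list, for illustration

def _worker(snippet, queue):
    import builtins
    safe_builtins = {name: getattr(builtins, name)
                     for name in ("abs", "min", "max", "range", "len", "sum", "print")}

    def safe_import(name, *args, **kwargs):
        if name.split(".")[0] not in ALLOWED_MODULES:
            raise ImportError(f"module {name!r} is not allow-listed")
        return __import__(name, *args, **kwargs)

    safe_builtins["__import__"] = safe_import
    try:
        # Note: restricting builtins is not a real security boundary;
        # containers or VMs are still needed for untrusted code.
        exec(snippet, {"__builtins__": safe_builtins})
        queue.put("ok")
    except Exception as exc:
        queue.put(f"error: {exc}")

def run_sandboxed(snippet, timeout_s=5.0):
    """Run a generated snippet in a child process with a hard timeout."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_worker, args=(snippet, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():  # snippet ran too long: kill it
        proc.terminate()
        return "error: timeout"
    return queue.get() if not queue.empty() else "error: no result"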

Ethical Considerations

  • The model is trained on mathematical problem-solving datasets and should not be used for decision-making in critical applications without human oversight
  • Generated code should be reviewed before execution in production environments
  • The model may reflect biases present in the training data

Citation

If you use DeepMath in your research, please cite:

@software{deepmath2025,
  author = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
  title = {DeepMath: A Lightweight Math Reasoning Agent for LLMs},
  year = {2025},
  publisher = {Intel AI Labs},
  url = {https://github.com/IntelLabs/DeepMath}
}

Model Card Contact

For questions or issues, please open an issue on the GitHub repository.
