DeepMath: A Lightweight Math Reasoning Agent
Model Description
DeepMath is a 4B parameter mathematical reasoning model that combines a fine-tuned LLM with a sandboxed Python executor. Built on Qwen3-4B Thinking and trained with GRPO (Group Relative Policy Optimization), DeepMath generates concise Python snippets for computational steps instead of verbose text explanations, significantly reducing errors and output length.
- Developed by: Intel AI Labs
- Model type: Causal language model with agent capabilities
- Language: English
- Base model: Qwen/Qwen3-4B-Thinking-2507
- License: Apache 2.0
- Repository: https://github.com/IntelLabs/DeepMath
Key Features
✅ Code-driven reasoning: Generates short Python snippets for intermediate computational steps
✅ Sandboxed execution: No file I/O, no network calls, strict timeouts
✅ Improved accuracy: Offloading computation reduces arithmetic errors
✅ Reduced verbosity: Up to 66% shorter outputs compared to baseline
✅ Safe and auditable: Deterministic execution with readable code snippets
Model Architecture
DeepMath uses a LoRA adapter fine-tuned on top of Qwen3-4B Thinking, with the following components (a minimal code sketch of the execution loop follows Figure 1):
- Agent Interface: Outputs special tokens for Python code execution during reasoning
- Executor: Sandboxed Python environment with allow-listed modules
- Safety Constraints: Per-snippet timeouts, no file/network access
- Training Method: GRPO with accuracy and code generation rewards
Figure 1: The TRL library's vLLM client and server are modified so that candidate completions are generated through the DeepMath agent while still running on the vLLM backend.
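To make the loop concrete, the sketch below shows one way a sandboxed executor and the splice-back step could be wired up. It is a minimal illustration, not the released implementation: the `<code>`/`<output>` tags, the `result` variable convention, the allow-list contents, and the 2-second timeout are all assumptions; DeepMath's actual executor is built on SmolAgents with its own special tokens and stricter isolation.

```python
import multiprocessing
import re

# Assumption: illustrative allow-list; the real executor's list differs.
ALLOWED_MODULES = {"math", "fractions", "itertools", "sympy"}

def _worker(snippet, queue):
    import builtins
    real_import = builtins.__import__

    def guarded_import(name, *args, **kwargs):
        if name.split(".")[0] not in ALLOWED_MODULES:
            raise ImportError(f"module {name!r} is not allow-listed")
        return real_import(name, *args, **kwargs)

    builtins.__import__ = guarded_import  # only affects this child process
    env = {}
    try:
        exec(snippet, env)
        # Convention assumed here: the snippet stores its answer in `result`.
        queue.put(str(env.get("result", "")))
    except Exception as exc:
        queue.put(f"error: {exc}")

def run_sandboxed(snippet, timeout_s=2.0):
    """Run one snippet in a child process with a hard per-snippet timeout.
    A production sandbox would also block file and network access."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_worker, args=(snippet, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return "error: timeout"
    return queue.get() if not queue.empty() else "error: no output"

# Assumption: tag names are illustrative; DeepMath uses its own special tokens.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def splice_tool_result(generated_text):
    """If the model emitted a code block, execute it and append the result
    so decoding can continue from the augmented trace."""
    match = CODE_RE.search(generated_text)
    if match is None:
        return generated_text  # no tool call in this chunk
    return generated_text + f"<output>{run_sandboxed(match.group(1))}</output>"
```

Running each snippet in a separate child process is what makes a hard per-snippet timeout enforceable, and it keeps the import guard from leaking into the parent process.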
Training Details
Training Data
- Dataset: OpenMathReasoning (tool-usage subset)
- Note: GRPO training uses only the problems, not the reference solutions
- In-context Learning: 4 solved examples demonstrating agent call syntax and patterns
Training Procedure
GRPO (Group Relative Policy Optimization) fine-tuning with the following components (a hedged sketch of the reward and temperature schedule follows this list):
- Accuracy Reward: +1 for correct answers
- Code Generation Reward: +1 for using code snippets, weighted 1:10 relative to the accuracy reward
- Length Constraint: GRPO completions limited to 5k tokens
- Temperature Scheduling: Linear schedule from T=1.2 → T=0.7 during training
- Infrastructure: Modified TRL library's vLLM client and server
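As a rough illustration of how the reward terms and the temperature schedule above could fit together, here is a hedged sketch. It reads the 10:1 ratio as accuracy:code, and the `is_correct` / `contains_code_snippet` helpers and the `\boxed{}` answer convention are hypothetical stand-ins for the actual training code.

```python
import re

def contains_code_snippet(completion: str) -> bool:
    # Hypothetical detector: looks for agent code tags in the trace.
    return bool(re.search(r"<code>.*?</code>", completion, re.DOTALL))

def is_correct(completion: str, reference_answer: str) -> bool:
    # Hypothetical checker: compares the last boxed answer to the reference.
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return bool(matches) and matches[-1].strip() == reference_answer.strip()

def reward(completion: str, reference_answer: str) -> float:
    """Combined GRPO reward; the accuracy term dominates the code bonus 10:1."""
    acc = 1.0 if is_correct(completion, reference_answer) else 0.0
    code = 1.0 if contains_code_snippet(completion) else 0.0
    return 10.0 * acc + 1.0 * code

def temperature(step: int, total_steps: int) -> float:
    """Linear sampling-temperature schedule: T=1.2 at the start, T=0.7 at the end."""
    frac = min(1.0, step / max(1, total_steps - 1))
    return 1.2 + (0.7 - 1.2) * frac
```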
Training Infrastructure
- Base inference engine: vLLM
- Agent framework: Based on SmolAgents
- Training framework: Modified TRL GRPO trainer
Performance
Benchmark Results
We evaluated DeepMath on four mathematical reasoning datasets, reporting majority@16 accuracy and mean output length (the metric is illustrated in the sketch after the key findings).
Key Findings:
- Accuracy: Improved performance on challenging datasets (AIME, HMMT, HLE)
- Efficiency: Up to 66% reduction in output length
- Robustness: Consistent improvements when combining agent + GRPO training
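For reference, majority@16 scores a problem as solved when the most frequent final answer among 16 sampled completions matches the reference. A minimal sketch, assuming answers appear in `\boxed{}` form (the `extract_final_answer` parser is a hypothetical helper):

```python
from collections import Counter
import re

def extract_final_answer(completion: str) -> str:
    # Hypothetical parser: take the last \boxed{...} value in the trace.
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else ""

def majority_at_k(completions: list[str], reference: str, k: int = 16) -> bool:
    """majority@k: the most common final answer among k samples must match."""
    answers = [extract_final_answer(c) for c in completions[:k]]
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == reference
```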
Evaluation Datasets
- MATH500: A 500-problem subset of the MATH dataset
- AIME: American Invitational Mathematics Examination problems
- HMMT: Harvard-MIT Mathematics Tournament problems
- HLE: Humanity's Last Exam problems
Figure 2: Example output in which Python code is generated, executed, and the result is spliced back into the reasoning trace.
Usage
Installation
```bash
# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone repository
git clone https://github.com/IntelLabs/DeepMath.git
cd DeepMath

# Install dependencies
uv pip install -r requirements.txt
uv pip install -e .
```
Basic Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/deepmath-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example problem
problem = "What is the sum of the first 100 positive integers?"

# Qwen3-style chat models expect the chat template rather than raw text
messages = [{"role": "user", "content": problem}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=3000)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
Inference with Agent
For full agent capabilities with sandboxed Python execution:
```bash
python inference.py \
    +model.use_vllm=true \
    +model.math_agent=true \
    +model.examples=deep_math/fewshot.txt \
    model.generation.max_new_tokens=3000 \
    +model.max_agent_output=20000 \
    +model.max_steps=50 \
    model.model_name_or_path=Intel/deepmath-v1 \
    hf_tag=HuggingFaceH4/MATH-500 \
    generated_file=output.jsonl
```
See the repository for complete usage examples.
Limitations and Biases
Limitations
- Scope: Optimized for mathematical reasoning tasks; may not generalize to other domains
- Problem Types: Evaluated on contest-style math problems; performance on open-ended mathematical creativity or formal proofs is unknown
- Model Size: 4B parameters may limit reasoning depth on extremely complex problems
- Code Execution: Requires sandboxed environment for full agent capabilities
Safety Considerations
⚠️ Code Execution Risk: This model generates and executes Python code. While DeepMath uses strict sandboxing and resource limits, any deployment should:
- Carefully manage attack surfaces
- Enforce rate limits
- Use proper isolation (containers, VMs)
- Monitor resource usage
- Validate generated code before execution in production
Ethical Considerations
- The model is trained on mathematical problem-solving datasets and should not be used for decision-making in critical applications without human oversight
- Generated code should be reviewed before execution in production environments
- The model may reflect biases present in the training data
Citation
If you use DeepMath in your research, please cite:
```bibtex
@software{deepmath2025,
  author    = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
  title     = {DeepMath: A Lightweight Math Reasoning Agent for LLMs},
  year      = {2025},
  publisher = {Intel AI Labs},
  url       = {https://github.com/IntelLabs/DeepMath}
}
```
Model Card Contact
For questions or issues, please open an issue on the GitHub repository.