# RLM-Qwen3.5-35B-A3B
A natively recursive language model based on Qwen3.5-35B-A3B, trained with Rejection Sampling SFT (RS-SFT) to solve long-context tasks by writing Python code in a persistent REPL.
## Model Description
This model is a LoRA fine-tune (rank 32) of Qwen/Qwen3.5-35B-A3B trained on 3,644 mined correct trajectories to generate code that decomposes long-context tasks into manageable sub-problems.
- **Architecture:** Mixture-of-Experts, 35B total parameters, 3B active per token
- **Training:** RS-SFT (Rejection Sampling SFT) on iteratively mined trajectories
- **Training API:** Tinker (remote GPU training)
## How It Works
The model operates within an RLM (Recursive Language Model) scaffold:
- The full input is stored as a `context` variable in a Python REPL; it never enters the model's context window
- The model sees only metadata about the context (length, prefix, available functions)
- The model writes Python code to chunk, search, and aggregate over the context
- `llm_query(text)` recursively invokes the model on substrings; `FINAL(answer)` terminates with the answer
This enables processing of millions of tokens with a 32K context window.
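The loop above can be sketched in a few lines. This is an illustrative toy, not the repository's scaffold: the names `run_rlm_sketch`, `model_step`, `ns`, and `max_steps` are assumptions, and only the `context` / `llm_query` / `FINAL` contract comes from the description above.

```python
# Toy sketch of an RLM-style scaffold loop (hypothetical API; the real
# scaffold lives in the repository). The model never sees `context` itself,
# only metadata about it, and responds with Python code to execute.

def run_rlm_sketch(model_step, question, context, max_steps=10):
    """model_step(metadata) -> Python source; calling FINAL(x) ends the loop."""
    result = {}

    def FINAL(answer):          # terminates the loop with the answer
        result["answer"] = answer

    def llm_query(text):        # recursive call on a substring of the context
        return model_step({"question": question,
                           "length": len(text),
                           "prefix": text[:100]})

    # Persistent REPL namespace: `context` lives here, not in the prompt.
    ns = {"context": context, "FINAL": FINAL, "llm_query": llm_query}
    metadata = {"question": question,
                "length": len(context),
                "prefix": context[:100]}
    for _ in range(max_steps):
        code = model_step(metadata)   # model writes code, not a direct answer
        exec(code, ns)                # run it in the persistent namespace
        if "answer" in result:
            return result["answer"]
    return None
```

A stub `model_step` that returns a string of Python code is enough to exercise the loop end to end.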
## Training Details

### RS-SFT (Rejection Sampling SFT)
Standard GRPO for RLM training creates a zero-sum tradeoff: gains on some benchmarks cause regressions on others due to negative gradients suppressing useful base-model patterns. RS-SFT avoids this entirely by training only on correct, high-quality trajectories with standard cross-entropy loss — no negative signal, no "push away from" anything.
**Key finding:** RS-SFT outperforms Strategy-Conditioned GRPO by +10.4pp on average, and applying GRPO on top of RS-SFT actively hurts performance.
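The filter-then-cross-entropy structure of RS-SFT can be illustrated with a toy loss function. This is not the training code (the actual loss is computed token-wise via the Tinker API); `rs_sft_loss` and its tuple format are assumptions made for illustration. The point is structural: trajectories below the reward threshold are simply discarded, so no negative gradient ever suppresses base-model behavior.

```python
import math

# Toy illustration of the RS-SFT objective (hypothetical helper, not the
# training code): keep only correct trajectories, then apply standard
# cross-entropy. There is no negative-advantage term, so nothing is ever
# "pushed away from".

def rs_sft_loss(trajectories, threshold=0.9):
    """trajectories: list of (token_logprobs, reward) pairs.
    Returns mean NLL over trajectories passing the rejection filter."""
    kept = [logprobs for logprobs, reward in trajectories
            if reward >= threshold]          # rejection sampling: drop failures
    if not kept:
        return 0.0
    # Standard cross-entropy: negative mean log-prob of the sampled tokens.
    per_traj_nll = [-sum(lps) / len(lps) for lps in kept]
    return sum(per_traj_nll) / len(per_traj_nll)
```

Contrast with GRPO, where low-reward trajectories receive negative advantages and contribute gradients that lower the probability of their tokens.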
### Training Data
3,644 trajectories mined from evaluation runs across 14 benchmarks:
- Iterative mining: each RS-SFT round produces a better model that generates better trajectories for the next round
- Content-based deduplication + task-balanced batching
- Quality filtering: correct answer (score >= 0.9), proper `FINAL()` termination, uses `llm_query()`, clean code
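The quality filter above amounts to a simple predicate per trajectory. A minimal sketch, assuming hypothetical field names (`score`, `code`, `exec_error`) that may differ from the mining pipeline's actual schema:

```python
# Sketch of the trajectory quality filter described above (field names are
# assumptions; the real mining pipeline may structure trajectories differently).

def keep_trajectory(traj):
    """Apply the four filters: correct answer, FINAL() termination,
    recursive llm_query() usage, and code that ran without error."""
    return (
        traj["score"] >= 0.9                   # correct answer
        and "FINAL(" in traj["code"]           # proper termination
        and "llm_query(" in traj["code"]       # actually uses recursion
        and traj.get("exec_error") is None     # clean code: no runtime error
    )
```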
### Training Configuration
- Base model: Qwen/Qwen3.5-35B-A3B
- LoRA rank: 32 (applied to MLP, attention, and unembedding layers)
- Learning rate: 2e-5 (cosine decay, 10% warmup)
- Epochs: 3 over the filtered dataset
- Batch size: 4 with gradient accumulation of 4 = effective batch 16
- Training time: ~2.8 hours on Tinker API
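The hyperparameters above can be collected into an illustrative config dict (the key names are assumptions, not the Tinker API's actual schema), which also makes the batch-size arithmetic explicit:

```python
# Illustrative config mirroring the hyperparameters above (key names are
# hypothetical; the actual Tinker training call lives in the repo's scripts).
config = {
    "base_model": "Qwen/Qwen3.5-35B-A3B",
    "lora_rank": 32,
    "lora_targets": ["mlp", "attention", "unembed"],
    "learning_rate": 2e-5,
    "lr_schedule": "cosine",
    "warmup_fraction": 0.10,
    "epochs": 3,
    "micro_batch_size": 4,
    "grad_accum_steps": 4,
}

# Effective batch = micro-batch size x gradient-accumulation steps.
effective_batch = config["micro_batch_size"] * config["grad_accum_steps"]

# Optimizer steps per epoch over the 3,644 filtered trajectories.
steps_per_epoch = 3644 // effective_batch
```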
## Evaluation Results
Evaluated on 14 benchmarks spanning search, extraction, comparison, reasoning, and counting.
| Benchmark (N) | Base | RLM-V17 | Delta |
|---|---|---|---|
| NIAH (20) | 65.0% | 75.0% | +10.0 |
| Multi-NIAH (20) | 99.4% | 95.1% | -4.3 |
| Hard NIAH (15) | 83.3% | 96.7% | +13.4 |
| Doc-Classify (20) | 56.3% | 80.2% | +23.9 |
| DataFrame QA (20) | 75.0% | 93.0% | +18.0 |
| Code Debug (15) | 50.0% | 60.0% | +10.0 |
| Multi-Hop QA (20) | 55.0% | 80.0% | +25.0 |
| Hard Multi-Hop (10) | 30.0% | 60.0% | +30.0 |
| Notebook QA (15) | 46.7% | 70.0% | +23.3 |
| Event Counting (20) | 46.4% | 73.0% | +26.6 |
| Cross-Doc Compare (12) | 42.2% | 43.4% | +1.2 |
| Key-Value Retrieval (12) | 29.2% | 85.4% | +56.2 |
| Verbatim Copy (10) | 20.0% | 80.0% | +60.0 |
| OOLONG (10) | 0.0% | 10.0% | +10.0 |
| **Average (14)** | **49.9%** | **71.6%** | **+21.7** |
**Record:** 13 wins, 1 loss vs. base. The only regression is Multi-NIAH (99.4% → 95.1%, still very high).
## Key Findings
- +21.7pp average improvement across 14 benchmarks
- Massive gains on retrieval: Key-Value +56pp, Verbatim Copy +60pp, Hard Multi-Hop +30pp
- RS-SFT > GRPO: +10.4pp average over Strategy-Conditioned GRPO (V10-s40)
- Iterative mining works: growing the dataset from V16 (2,589 samples) to V17 (3,644 samples) gained a further +1.2pp
## Limitations
- **Requires RLM scaffold:** The model is designed to run within the RLM loop with REPL access; direct prompting will not produce RLM behavior
- **Temperature sensitivity:** Evaluated at temperature 0.7; results may vary at other temperatures
- **REPL security:** The model executes arbitrary Python code in a sandboxed REPL; it is not suitable for untrusted inputs without additional sandboxing
- **Small evaluation sets:** N=10-20 per benchmark; individual benchmark deltas may not be statistically significant
## Usage

```python
from scaffold.rlm import run_rlm
from scaffold.llm_query import TinkerModel

model = TinkerModel("Qwen/Qwen3.5-35B-A3B", model_path="path/to/checkpoint")
result = run_rlm(model, question="Find the secret code", context=long_document)
```
See the repository for full setup instructions.
## Citation

```bibtex
@misc{abulhassan2026rlm35b,
  title={Training Natively Recursive Language Models: RS-SFT for Long-Context Code Generation},
  author={Omar Abul-Hassan and Miguel Villanueva and Josh Bowden},
  year={2026},
  howpublished={CS234 Final Project, Stanford University}
}
```
## References
- Zhang, Kraska, Khattab (2026). Recursive Language Models. arXiv:2502.14155
- Guo et al. (2025). DeepSeek-R1. arXiv:2501.12948