RLM-Qwen3.5-35B-A3B

A natively recursive language model based on Qwen3.5-35B-A3B, trained with Rejection Sampling SFT (RS-SFT) to solve long-context tasks by writing Python code in a persistent REPL.

Model Description

This model is a LoRA fine-tune (rank 32) of Qwen/Qwen3.5-35B-A3B trained on 3,644 mined correct trajectories to generate code that decomposes long-context tasks into manageable sub-problems.

  • Architecture: Mixture-of-Experts, 35B total parameters, 3B active per token
  • Training: RS-SFT (Rejection Sampling SFT) on iteratively mined trajectories
  • Training API: Tinker (remote GPU training)

How It Works

The model operates within an RLM (Recursive Language Model) scaffold:

  1. The full input is stored as a context variable in a Python REPL — it never enters the model's context window
  2. The model sees only metadata about the context (length, prefix, available functions)
  3. The model writes Python code to chunk, search, and aggregate over the context
  4. llm_query(text) recursively invokes the model on substrings
  5. FINAL(answer) terminates with the answer

This enables processing of millions of tokens with a 32K context window.
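As an illustration, the loop above can be sketched in a few lines. This is a minimal, hypothetical version — `run_rlm_sketch`, `model_step`, and the history format are assumptions for the sketch, not the actual `scaffold.rlm` implementation from the repository:

```python
class FinalAnswer(Exception):
    """Raised by FINAL() to terminate the loop with an answer."""
    def __init__(self, answer):
        self.answer = answer

def run_rlm_sketch(model_step, question, context, max_turns=10):
    # The full context lives only in the REPL namespace;
    # the model is shown just metadata about it.
    def llm_query(text):
        # Recursive call on a substring (stubbed as a plain model call here).
        return model_step(f"QUERY: {text[:1000]}")

    def FINAL(answer):
        raise FinalAnswer(answer)

    namespace = {"context": context, "llm_query": llm_query, "FINAL": FINAL}
    metadata = f"context length: {len(context)} chars; prefix: {context[:200]!r}"
    history = [f"Question: {question}", metadata]
    for _ in range(max_turns):
        code = model_step("\n".join(history))  # model writes Python code
        history.append(code)
        try:
            exec(code, namespace)              # persistent REPL state
        except FinalAnswer as f:
            return f.answer
        except Exception as e:
            history.append(f"Error: {e}")      # errors are fed back to the model
    return None
```

The key property is that `context` is only ever touched by generated code inside the namespace, so its size is bounded by the REPL's memory, not the model's context window.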

Training Details

RS-SFT (Rejection Sampling SFT)

Standard GRPO for RLM training creates a zero-sum tradeoff: gains on some benchmarks cause regressions on others due to negative gradients suppressing useful base-model patterns. RS-SFT avoids this entirely by training only on correct, high-quality trajectories with standard cross-entropy loss — no negative signal, no "push away from" anything.

Key finding: RS-SFT outperforms Strategy-Conditioned GRPO by +10.4pp average. GRPO on top of RS-SFT actively hurts performance.
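In code terms, the positives-only recipe reduces to sampling several trajectories per task and keeping only those that score above threshold. The sketch below is illustrative — `policy.sample` and `task.score` are hypothetical interfaces, not the training code:

```python
def mine_trajectories(policy, tasks, k=8, threshold=0.9):
    """Rejection sampling: keep only correct trajectories.

    Unlike GRPO, there is no negative gradient term: rejected
    trajectories are simply discarded, never pushed away from.
    """
    kept = []
    for task in tasks:
        for _ in range(k):
            traj = policy.sample(task)           # one full REPL rollout
            if task.score(traj) >= threshold:    # correctness check
                kept.append((task, traj))        # positives only
    return kept  # then: standard cross-entropy SFT on `kept`
```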

Training Data

3,644 trajectories mined from evaluation runs across 14 benchmarks:

  • Iterative mining: each RS-SFT round produces a better model that generates better trajectories for the next round
  • Content-based deduplication + task-balanced batching
  • Quality filtering: correct answer (score >= 0.9), proper FINAL() termination, uses llm_query(), clean code
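A filter implementing the criteria above might look like the following (the trajectory fields `score` and `code` are assumed names for this sketch):

```python
import hashlib

def passes_quality_filter(traj):
    # Criteria listed above: correct answer, proper FINAL() termination,
    # and actual use of the recursive llm_query() call.
    return (traj["score"] >= 0.9
            and "FINAL(" in traj["code"]
            and "llm_query(" in traj["code"])

def dedupe_by_content(trajs):
    # Content-based deduplication: hash the generated code itself,
    # so identical trajectories mined in different runs collapse to one.
    seen, unique = set(), []
    for t in trajs:
        digest = hashlib.sha256(t["code"].encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(t)
    return unique
```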

Training Configuration

  • Base model: Qwen/Qwen3.5-35B-A3B
  • LoRA rank: 32 (train MLP + attention + unembed)
  • Learning rate: 2e-5 (cosine decay, 10% warmup)
  • Epochs: 3 over the filtered dataset
  • Batch size: 4 with gradient accumulation of 4 = effective batch 16
  • Training time: ~2.8 hours on Tinker API
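Collected as a plain config dict for reference (the key names are illustrative, not the actual Tinker API schema):

```python
# Hyperparameters from the bullets above.
train_config = {
    "base_model": "Qwen/Qwen3.5-35B-A3B",
    "lora_rank": 32,
    "lora_targets": ["mlp", "attention", "unembed"],
    "learning_rate": 2e-5,
    "lr_schedule": "cosine",
    "warmup_ratio": 0.10,
    "epochs": 3,
    "micro_batch_size": 4,
    "grad_accum_steps": 4,   # effective batch = 4 * 4 = 16
}
```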

Evaluation Results

Evaluated on 14 benchmarks spanning search, extraction, comparison, reasoning, and counting.

| Benchmark (N) | Base | RLM-V17 | Delta |
|---|---|---|---|
| NIAH (20) | 65.0% | 75.0% | +10.0 |
| Multi-NIAH (20) | 99.4% | 95.1% | -4.3 |
| Hard NIAH (15) | 83.3% | 96.7% | +13.4 |
| Doc-Classify (20) | 56.3% | 80.2% | +23.9 |
| DataFrame QA (20) | 75.0% | 93.0% | +18.0 |
| Code Debug (15) | 50.0% | 60.0% | +10.0 |
| Multi-Hop QA (20) | 55.0% | 80.0% | +25.0 |
| Hard Multi-Hop (10) | 30.0% | 60.0% | +30.0 |
| Notebook QA (15) | 46.7% | 70.0% | +23.3 |
| Event Counting (20) | 46.4% | 73.0% | +26.6 |
| Cross-Doc Compare (12) | 42.2% | 43.4% | +1.2 |
| Key-Value Retrieval (12) | 29.2% | 85.4% | +56.2 |
| Verbatim Copy (10) | 20.0% | 80.0% | +60.0 |
| OOLONG (10) | 0.0% | 10.0% | +10.0 |
| Average (14) | 49.9% | 71.6% | +21.7 |

Record: 13 wins, 1 loss vs. the base model. The only regression is Multi-NIAH (99.4% → 95.1%), which remains very high.

Key Findings

  1. +21.7pp average improvement across 14 benchmarks
  2. Massive gains on retrieval: Key-Value +56pp, Verbatim Copy +60pp, Hard Multi-Hop +30pp
  3. RS-SFT > GRPO: +10.4pp average over Strategy-Conditioned GRPO (V10-s40)
  4. Iterative mining works: growing the dataset from V16 (2,589 samples) to V17 (3,644 samples) added a further +1.2pp

Limitations

  • Requires RLM scaffold: The model is designed to run within the RLM loop with REPL access — direct prompting will not produce RLM behavior
  • Temperature sensitivity: Evaluated at temperature 0.7; results may vary at other temperatures
  • REPL security: The model executes arbitrary Python code in a sandboxed REPL — not suitable for untrusted inputs without additional sandboxing
  • Small evaluation sets: N=10-20 per benchmark; individual benchmark deltas may not be statistically significant

Usage

from scaffold.rlm import run_rlm
from scaffold.llm_query import TinkerModel

# Load the fine-tuned LoRA checkpoint on top of the base model
model = TinkerModel("Qwen/Qwen3.5-35B-A3B", model_path="path/to/checkpoint")

# Run the RLM loop: the full document stays in the REPL, outside the model's context window
result = run_rlm(model, question="Find the secret code", context=long_document)

See the repository for full setup instructions.

Citation

@misc{abulhassan2026rlm35b,
  title={Training Natively Recursive Language Models: RS-SFT for Long-Context Code Generation},
  author={Omar Abul-Hassan and Miguel Villanueva and Josh Bowden},
  year={2026},
  howpublished={CS234 Final Project, Stanford University}
}
