# RLM-Qwen3.5-35B-A3B
A natively recursive language model based on Qwen3.5-35B-A3B, trained with Rejection Sampling SFT (RS-SFT) to solve long-context tasks by writing Python code in a persistent REPL.
## Model Description
This model is a LoRA fine-tune (rank 32) of Qwen/Qwen3.5-35B-A3B trained on 3,644 mined correct trajectories to generate code that decomposes long-context tasks into manageable sub-problems.
- **Architecture:** Mixture-of-Experts, 35B total parameters, 3B active per token
- **Training:** RS-SFT (Rejection Sampling SFT) on iteratively mined trajectories
- **Training API:** Tinker (remote GPU training)
## How It Works
The model operates within an RLM (Recursive Language Model) scaffold:
- The full input is stored as a `context` variable in a Python REPL; it never enters the model's context window
- The model sees only metadata about the context (length, prefix, available functions)
- The model writes Python code to chunk, search, and aggregate over the context
- `llm_query(text)` recursively invokes the model on substrings; `FINAL(answer)` terminates with the answer
This enables processing of millions of tokens with a 32K context window.
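The loop above can be sketched in a few lines. This is an illustrative toy, not the repository's scaffold: the names `run_rlm_sketch`, `model_step`, `ns`, and `max_steps` are assumptions, and only the `context` / `llm_query` / `FINAL` contract comes from the description above.

```python
# Toy sketch of an RLM-style scaffold loop (hypothetical API; the real
# scaffold lives in the repository). The model never sees `context` itself,
# only metadata about it, and responds with Python code to execute.

def run_rlm_sketch(model_step, question, context, max_steps=10):
    """model_step(metadata) -> Python source; calling FINAL(x) ends the loop."""
    result = {}

    def FINAL(answer):          # terminates the loop with the answer
        result["answer"] = answer

    def llm_query(text):        # recursive call on a substring of the context
        return model_step({"question": question,
                           "length": len(text),
                           "prefix": text[:100]})

    # Persistent REPL namespace: `context` lives here, not in the prompt.
    ns = {"context": context, "FINAL": FINAL, "llm_query": llm_query}
    metadata = {"question": question,
                "length": len(context),
                "prefix": context[:100]}
    for _ in range(max_steps):
        code = model_step(metadata)   # model writes code, not a direct answer
        exec(code, ns)                # run it in the persistent namespace
        if "answer" in result:
            return result["answer"]
    return None
```

A stub `model_step` that returns a string of Python code is enough to exercise the loop end to end.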
## Training Details

### RS-SFT (Rejection Sampling SFT)
Standard GRPO for RLM training creates a zero-sum tradeoff: gains on some benchmarks cause regressions on others due to negative gradients suppressing useful base-model patterns. RS-SFT avoids this entirely by training only on correct, high-quality trajectories with standard cross-entropy loss — no negative signal, no "push away from" anything.
**Key finding:** RS-SFT outperforms Strategy-Conditioned GRPO by +10.4pp on average, and applying GRPO on top of RS-SFT actively hurts performance.
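The filter-then-cross-entropy structure of RS-SFT can be illustrated with a toy loss function. This is not the training code (the actual loss is computed token-wise via the Tinker API); `rs_sft_loss` and its tuple format are assumptions made for illustration. The point is structural: trajectories below the reward threshold are simply discarded, so no negative gradient ever suppresses base-model behavior.

```python
import math

# Toy illustration of the RS-SFT objective (hypothetical helper, not the
# training code): keep only correct trajectories, then apply standard
# cross-entropy. There is no negative-advantage term, so nothing is ever
# "pushed away from".

def rs_sft_loss(trajectories, threshold=0.9):
    """trajectories: list of (token_logprobs, reward) pairs.
    Returns mean NLL over trajectories passing the rejection filter."""
    kept = [logprobs for logprobs, reward in trajectories
            if reward >= threshold]          # rejection sampling: drop failures
    if not kept:
        return 0.0
    # Standard cross-entropy: negative mean log-prob of the sampled tokens.
    per_traj_nll = [-sum(lps) / len(lps) for lps in kept]
    return sum(per_traj_nll) / len(per_traj_nll)
```

Contrast with GRPO, where low-reward trajectories receive negative advantages and contribute gradients that lower the probability of their tokens.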
### Training Data
3,644 trajectories mined from evaluation runs across 14 benchmarks:
- Iterative mining: each RS-SFT round produces a better model that generates better trajectories for the next round
- Content-based deduplication + task-balanced batching
- Quality filtering: correct answer (score >= 0.9), proper `FINAL()` termination, uses `llm_query()`, clean code
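The quality filter above amounts to a simple predicate per trajectory. A minimal sketch, assuming hypothetical field names (`score`, `code`, `exec_error`) that may differ from the mining pipeline's actual schema:

```python
# Sketch of the trajectory quality filter described above (field names are
# assumptions; the real mining pipeline may structure trajectories differently).

def keep_trajectory(traj):
    """Apply the four filters: correct answer, FINAL() termination,
    recursive llm_query() usage, and code that ran without error."""
    return (
        traj["score"] >= 0.9                   # correct answer
        and "FINAL(" in traj["code"]           # proper termination
        and "llm_query(" in traj["code"]       # actually uses recursion
        and traj.get("exec_error") is None     # clean code: no runtime error
    )
```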
### Training Configuration
- Base model: Qwen/Qwen3.5-35B-A3B
- LoRA rank: 32 (applied to MLP, attention, and unembedding layers)
- Learning rate: 2e-5 (cosine decay, 10% warmup)
- Epochs: 3 over the filtered dataset
- Batch size: 4 with gradient accumulation of 4 = effective batch 16
- Training time: ~2.8 hours on Tinker API
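The hyperparameters above can be collected into an illustrative config dict (the key names are assumptions, not the Tinker API's actual schema), which also makes the batch-size arithmetic explicit:

```python
# Illustrative config mirroring the hyperparameters above (key names are
# hypothetical; the actual Tinker training call lives in the repo's scripts).
config = {
    "base_model": "Qwen/Qwen3.5-35B-A3B",
    "lora_rank": 32,
    "lora_targets": ["mlp", "attention", "unembed"],
    "learning_rate": 2e-5,
    "lr_schedule": "cosine",
    "warmup_fraction": 0.10,
    "epochs": 3,
    "micro_batch_size": 4,
    "grad_accum_steps": 4,
}

# Effective batch = micro-batch size x gradient-accumulation steps.
effective_batch = config["micro_batch_size"] * config["grad_accum_steps"]

# Optimizer steps per epoch over the 3,644 filtered trajectories.
steps_per_epoch = 3644 // effective_batch
```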
## Evaluation Results
Evaluated on 14 benchmarks spanning search, extraction, comparison, reasoning, and counting.
| Benchmark (N) | Base | RLM-V17 | Delta |
|---|---|---|---|
| NIAH (20) | 65.0% | 75.0% | +10.0 |
| Multi-NIAH (20) | 99.4% | 95.1% | -4.3 |
| Hard NIAH (15) | 83.3% | 96.7% | +13.4 |
| Doc-Classify (20) | 56.3% | 80.2% | +23.9 |
| DataFrame QA (20) | 75.0% | 93.0% | +18.0 |
| Code Debug (15) | 50.0% | 60.0% | +10.0 |
| Multi-Hop QA (20) | 55.0% | 80.0% | +25.0 |
| Hard Multi-Hop (10) | 30.0% | 60.0% | +30.0 |
| Notebook QA (15) | 46.7% | 70.0% | +23.3 |
| Event Counting (20) | 46.4% | 73.0% | +26.6 |
| Cross-Doc Compare (12) | 42.2% | 43.4% | +1.2 |
| Key-Value Retrieval (12) | 29.2% | 85.4% | +56.2 |
| Verbatim Copy (10) | 20.0% | 80.0% | +60.0 |
| OOLONG (10) | 0.0% | 10.0% | +10.0 |
| **Average (14)** | **49.9%** | **71.6%** | **+21.7** |
**Record:** 13 wins, 1 loss vs. base. The only regression is Multi-NIAH (99.4% → 95.1%, still very high).
## Key Findings
- +21.7pp average improvement across 14 benchmarks
- Massive gains on retrieval: Key-Value +56pp, Verbatim Copy +60pp, Hard Multi-Hop +30pp
- RS-SFT > GRPO: +10.4pp average over Strategy-Conditioned GRPO (V10-s40)
- Iterative mining works: growing the dataset from V16 (2,589 samples) to V17 (3,644 samples) gained a further +1.2pp
## Limitations
- **Requires RLM scaffold:** The model is designed to run within the RLM loop with REPL access; direct prompting will not produce RLM behavior
- **Temperature sensitivity:** Evaluated at temperature 0.7; results may vary at other temperatures
- **REPL security:** The model executes arbitrary Python code in a sandboxed REPL; it is not suitable for untrusted inputs without additional sandboxing
- **Small evaluation sets:** N=10-20 per benchmark; individual benchmark deltas may not be statistically significant
## Usage

```python
from scaffold.rlm import run_rlm
from scaffold.llm_query import TinkerModel

model = TinkerModel("Qwen/Qwen3.5-35B-A3B", model_path="path/to/checkpoint")
result = run_rlm(model, question="Find the secret code", context=long_document)
```
See the repository for full setup instructions.
## Citation

```bibtex
@misc{abulhassan2026rlm35b,
  title={Training Natively Recursive Language Models: RS-SFT for Long-Context Code Generation},
  author={Omar Abul-Hassan and Miguel Villanueva and Josh Bowden},
  year={2026},
  howpublished={CS234 Final Project, Stanford University}
}
```
## References
- Zhang, Kraska, Khattab (2026). Recursive Language Models. arXiv:2502.14155
- Guo et al. (2025). DeepSeek-R1. arXiv:2501.12948