---
license: mit
pipeline_tag: text-generation
tags:
- chemistry
- molecular-generation
- qwen3
- mtp
- selfies
- cheminformatics
- sabrlo
---
# 🧬 ChemMiniQ3-SAbRLo (Synthetic Accessibility with Bioaware RL – Optimized)
ChemMiniQ3-SAbRLo is a lightweight experimental generative model for chemistry, built on a mini Qwen3 backbone. It is designed for rapid prototyping of HuggingFace `AutoModel`/`AutoTokenizer` compatibility and for fast iteration on Multi-Token Prediction (MTP) and RL fine-tuning algorithms and rewards.

It introduces a new reinforcement learning framework as the next iteration of ChemMiniQ3-HoriFIE, combining:
- 🧩 Synthetic Accessibility (SA) Rewards – guiding generation with a classifier (`gbyuvd/synthaccess-chemselfies`) to favor molecules that are easier to synthesize.
- Cyclical Gradual Generation – a curriculum learning strategy that gradually increases molecule length up to 25 tokens, then resets and repeats, enabling faster RL convergence and stable prototyping.
Prototype research code – not production-ready. Built for speed, not scale (yet).
Example of a generated molecule, for which no identical molecule was found in PubChem:

`O=C(O)CC=1CCCCC=1C2=CC=CC(=C2)NC(=O)CC3=CC=CC=C3CCC`
## ⚙️ Core Features
- ✅ Qwen3 Mini Backbone – efficient causal LM architecture, compatible with `transformers.AutoModelForCausalLM`
- ✅ Multi-Token Prediction (MTP Head) – parallel prediction of 1–3 future tokens, implemented as a plug-and-play head compatible with `AutoModel` (see the sketch after this list)
- ✅ Horizon Loss – weighted multi-horizon objectives for long-term coherence
- ✅ SELFIES-native Tokenizer – robust encoding with FastChemTokenizer
- ✅ Ranger21 Optimizer – warmup/warmdown scheduling for stable training
- ✅ Gradient Checkpointing & Streaming Dataset Loader – lightweight, hardware-friendly, optimized for rapid RL prototyping
## 🧪 Reinforcement Learning Enhancements
### 1️⃣ SA-Guided PPO-KL Fine-Tuning
- Uses `gbyuvd/synthaccess-chemselfies` as a reward model
- Rewards molecules predicted as "Easy" to synthesize
- Penalizes molecules predicted as "Hard"
- Designed for rapid reward ablation: SA-only, ChemQ3-only, or mixed modes
- Aims to stay compatible with the HuggingFace `Trainer` and `PPOTrainer` for easy RL experimentation (a reward sketch follows this list)
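A rough sketch of what the SA reward could look like, assuming the classifier loads through a standard `text-classification` pipeline; the label strings ("Easy"/"Hard") and the ± reward scale are assumptions based on the description above:

```python
from transformers import pipeline

# Hypothetical SA reward: classify generated SELFIES and reward "Easy",
# penalize "Hard". Label strings and the +/- score scale are assumptions.
sa_classifier = pipeline("text-classification", model="gbyuvd/synthaccess-chemselfies")

def sa_reward(selfies_batch: list[str]) -> list[float]:
    rewards = []
    for pred in sa_classifier(selfies_batch):
        sign = 1.0 if "easy" in pred["label"].lower() else -1.0
        rewards.append(sign * pred["score"])
    return rewards
```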
### 2️⃣ Symmetric Curriculum with Normalized Rewards
- Generation length increases and decreases smoothly: 10 → 15 → 20 → 25 → 20 → 15 → 10 → …
- Avoids sharp resets by cycling symmetrically instead of jumping from the maximum back to the minimum
- The previous cyclical approach (gradually increasing the max generation length, then resetting) is retained, but now uses symmetric cycling to avoid sharp transitions
- Rewards are normalized by sequence length (default: √len) to stabilize training across different rollout sizes
- KL and entropy controllers are reset and recalibrated at each curriculum phase change
- Entropy targets scale with sequence length, encouraging consistent exploration at both short and long contexts
- Why 25? Faster RL training requires shorter sequences for rapid iteration; 25 tokens could strike a good balance between structural complexity and training speed, allowing 2–3x more gradient steps per epoch than 30+ token sequences (see the curriculum sketch below)
💡 Note: The average SELFIES sequence length in our ~3M dataset is 33.41 ± 1.80 tokens – but for RL prototyping, we cap at 25 to accelerate training cycles and improve signal-to-noise in reward gradients.
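A minimal sketch of the symmetric length schedule and √len reward normalization described above; the phase list and step size are assumptions:

```python
import math
from itertools import cycle

# Symmetric length curriculum: 10 -> 15 -> 20 -> 25 -> 20 -> 15 -> 10 -> ...
# (repeats without a sharp reset; the phase list below is an assumption)
LENGTH_PHASES = [10, 15, 20, 25, 20, 15]
_phase_iter = cycle(LENGTH_PHASES)

def next_max_length() -> int:
    """Return the generation cap for the next curriculum phase."""
    return next(_phase_iter)

def normalize_reward(raw_reward: float, seq_len: int) -> float:
    """Normalize a rollout reward by sqrt(length) so that short and long
    sequences contribute comparable gradient magnitudes."""
    return raw_reward / math.sqrt(max(seq_len, 1))
```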
### 3️⃣ PPO, KL, and Entropy Stabilization
- PPO loss uses advantage clipping scaled with sequence length to prevent gradient spikes (sketched below)
- The KL controller adapts β more quickly and resets at each curriculum update
- The entropy controller adjusts targets based on sequence length to balance exploration
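These controllers could be sketched roughly as follows; the update rule, target KL, horizon, and scaling constants are illustrative assumptions, not the repository's exact values:

```python
import math
import torch

class AdaptiveKLController:
    """Sketch of an adaptive KL coefficient (beta): beta grows when the
    observed KL exceeds the target and shrinks when it falls below.
    The target, horizon, and reset value are assumptions."""

    def __init__(self, init_beta: float = 0.1, target_kl: float = 0.05, horizon: int = 128):
        self.init_beta = init_beta
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Proportional error, clipped to +/-20% per update for stability
        error = min(max(observed_kl / self.target_kl - 1.0, -0.2), 0.2)
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta

    def reset(self) -> None:
        # Recalibrate at each curriculum phase change
        self.beta = self.init_beta


def clip_advantages(advantages: torch.Tensor, seq_len: int, base_clip: float = 2.0) -> torch.Tensor:
    """Clip advantages to a band that widens with sqrt(sequence length),
    damping gradient spikes on longer rollouts (the scaling rule is assumed)."""
    bound = base_clip * math.sqrt(seq_len)
    return advantages.clamp(-bound, bound)
```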
## Why ChemMiniQ3-SAbRLo?
- Prior approaches optimized only validity or physicochemical rules (Lipinski, etc.)
- Our method extends beyond validity and rule-based rewards by explicitly biasing generation toward molecules that are not just valid but also easier to synthesize
- The symmetric curriculum + reward normalization improves stability across varying sequence lengths
- The cyclical gradual curriculum + 25-token cap potentially keeps training dynamic, avoids overfitting, and enables <1 hr RL policy iterations on a single GPU
- Shorter capped lengths (≤25 tokens) allow faster iteration, enabling more frequent updates and practical RL prototyping
- Built from the ground up to target HuggingFace `AutoModel`/`AutoTokenizer` compatibility (still a work in progress)
💡 Target domain: molecular generation (SELFIES).
🔬 Goal: molecules that are valid, bioaware, and synthetically accessible.
Core innovation: fast, modular prototyping of MTP + RL fine-tuning pipelines using standard HuggingFace components.
## Usage
- See `demo_usage.ipynb` or download it to use (I am still learning about the HF API, so please be patient; a minimal loading/generation sketch also follows this list.)
- For training, clone this repo:
  - Customize `config.json`, then run `train_withmtp.py` for NTP-to-MTP training
  - Run `train_ppokl_withsa.py` with either "chemq3" (bioaware-only, no SA), "sa" (SA-only, no bioaware), or "mix" (combined rewards)
- The dataset for NTP/MTP training can be fetched here
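Once `AutoModel`/`AutoTokenizer` compatibility is fully in place, loading and generation should look roughly like the sketch below; the repo id, the `trust_remote_code` flag, and the SELFIES post-processing are assumptions, so prefer `demo_usage.ipynb` as the reference:

```python
import selfies as sf
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id and trust_remote_code requirement -- check demo_usage.ipynb
repo_id = "gbyuvd/ChemMiniQ3-SAbRLo"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# Prompt with a short SELFIES fragment and sample up to the 25-token RL cap
inputs = tokenizer("[C][C]", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=25, do_sample=True, top_p=0.9)
generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Convert SELFIES to SMILES for downstream cheminformatics checks
# (assumes the decoded string is whitespace-separated SELFIES tokens)
smiles = sf.decoder(generated.replace(" ", ""))
print(generated, "->", smiles)
```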
## 🔮 Planned Experiments & Next Steps
We are actively working on scaling up ChemMiniQ3-SAbRLo with more ambitious experiments, all designed for rapid iteration:
- Pretraining on a larger dataset – up to 2.9M SELFIES molecules
- ⏱ RL fine-tuning with extended steps – test reward alignment speed under the 25-token constraint
- 🔬 Comparative evaluation – SA-only vs ChemQ3 vs Mix reward modes
- 🧪 Benchmarking – validity, novelty, drug-likeness, and synthetic accessibility metrics
- AutoModel/AutoTokenizer integration – verify full compatibility with the HF ecosystem (e.g., `pipeline()`, `generate()`, `Trainer`)
- 🧩 Plug-and-play reward modules – allow users to swap reward functions without touching model code
## ❤️ Support the Project
Training and scaling require significant computational resources.
If youโd like to support this research (e.g., helping us rent compute servers for rapid RL prototyping and MTP validation), you can contribute here:
Every bit of support helps us push ChemMiniQ3-SAbRLo further! 🧬
## To-Do
- [ongoing] Review, clean, and test training with the existing code
- [ongoing] Warm up training on the 163K dataset for MTP
- [ongoing] Warm up PPO-RL with only the Bioaware reward on, for 7000 steps
- Test and observe the stability of Mixed Rewards for 7000 steps
- Warm up PPO-RL with only the SA reward on, for 7000 steps
- Upload both warm-up MTP and PPO-RL models to the HF repo
- [ongoing] Write demo blocks and a demo Jupyter notebook on training from scratch and on generating with the pretrained model(s)
- Ablation studies
- [priority] Implement and validate HF `AutoModel` and `AutoTokenizer` compatibility
- Complete pretraining on the full ~1M dataset (when possible)
  - Chunk I
  - [ongoing] Chunk II
  - Chunk III
  - Chunk IV
  - Chunk V
  - Chunk VI
- Publish complete pretraining on GitHub and HF (if compatible)
- Complete RL fine-tuning on the verified rewards system
## References

BibTeX

Qwen3

```bibtex
@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388},
}
```

COCONUTDB

```bibtex
@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}
```

ChEMBL 34

```bibtex
@article{zdrazil2023chembl,
  title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
  author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
  journal={Nucleic Acids Research},
  year={2023},
  volume={gkad1004},
  doi={10.1093/nar/gkad1004}
}

@misc{chembl34,
  title={ChemBL34},
  year={2023},
  doi={10.6019/CHEMBL.database.34}
}
```

SuperNatural3

```bibtex
@article{Gallo2023,
  author={Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
  title={{SuperNatural 3.0 - a database of natural products and natural product-based derivatives}},
  journal={Nucleic Acids Research},
  year={2023},
  month=jan,
  day={6},
  volume={51},
  number={D1},
  pages={D654-D659},
  doi={10.1093/nar/gkac1008}
}
```

Ranger21 Optimizer

```bibtex
@article{wright2021ranger21,
  title={Ranger21: a synergistic deep learning optimizer},
  author={Wright, Less and Demeure, Nestor},
  year={2021},
  journal={arXiv preprint arXiv:2106.13731},
}
```
