RLinf: Reinforcement Learning Infrastructure for Agentic AI
RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.
Model Description
This OpenVLA-OFT model is based on Haozhan72/Openvla-oft-SFT-libero10-trajall with an additional LoRA SFT checkpoint (RLinf/RLinf-OpenVLAOFT-ManiSkill-Base-Lora), and is fine-tuned with Proximal Policy Optimization (PPO) in the ManiSkill simulator.
Full OOD Evaluation and Results
Overall Eval Results
Note: rl4vla refers to the paper VLA-RL-Study: What Can RL Bring to VLA Generalization? An Empirical Study.
| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| Avg results | 0.7915 | 0.6064 | 0.7705 | 0.8193 | 0.7515 |
Training Setting Eval
| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| Avg results | 0.9375 | 0.9414 | 0.9766 | 0.9609 | 0.8438 |
OOD Eval on Vision
| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| vision avg | 0.8047 | 0.8469 | 0.9211 | 0.8203 | 0.7469 |
| unseen table | 0.9063 | 0.9141 | 0.9648 | 0.9570 | 0.8984 |
| dynamic texture (weak) | 0.8516 | 0.9102 | 0.9492 | 0.8555 | 0.7891 |
| dynamic texture (strong) | 0.7500 | 0.7734 | 0.8633 | 0.7227 | 0.6563 |
| dynamic noise (weak) | 0.8281 | 0.8945 | 0.9805 | 0.8711 | 0.7969 |
| dynamic noise (strong) | 0.6875 | 0.7422 | 0.8477 | 0.6953 | 0.5938 |
OOD Eval on Semantic
| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| object avg | 0.7500 | 0.4553 | 0.6484 | 0.7835 | 0.7299 |
| unseen objects | 0.8281 | 0.8047 | 0.8594 | 0.8164 | 0.7656 |
| unseen receptacles | 0.6875 | 0.7422 | 0.8750 | 0.8125 | 0.7344 |
| unseen instructions | 0.8203 | 0.6797 | 0.7109 | 0.9453 | 0.8906 |
| multi-object (both seen) | 0.7891 | 0.3516 | 0.6055 | 0.8438 | 0.7578 |
| multi-object (both unseen) | 0.5703 | 0.3047 | 0.5508 | 0.6289 | 0.5781 |
| distractive receptacle | 0.8047 | 0.1875 | 0.6133 | 0.8281 | 0.7813 |
| multi-receptacle (both unseen) | 0.7500 | 0.3242 | 0.2383 | 0.6094 | 0.6016 |
OOD Eval on Position
| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| position avg | 0.8177 | 0.4466 | 0.7357 | 0.8542 | 0.7786 |
| unseen position (object & receptacle) | 0.7344 | 0.4023 | 0.6992 | 0.8633 | 0.7500 |
| unseen robot init pose | 0.8359 | 0.4805 | 0.7188 | 0.7773 | 0.7031 |
| mid-episode object reposition | 0.8828 | 0.4570 | 0.7891 | 0.9212 | 0.8828 |
How to Use
Please integrate the provided model with the RLinf codebase. To do so, modify the following parameters in the configuration file `examples/embodiment/config/maniskill_ppo_openvlaoft.yaml`:
- Set `actor.checkpoint_load_path`, `actor.tokenizer.tokenizer_model`, and `rollout.model_dir` to the path of the model checkpoint (see the sketch below).
Note: If you intend to evaluate the model directly, make sure to set `actor.model.is_lora` to false.
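For illustration, here is a minimal sketch of the fields to edit in `examples/embodiment/config/maniskill_ppo_openvlaoft.yaml`. The checkpoint path is a placeholder, and the nesting of surrounding keys in the actual config may differ; only the parameter names listed above are taken from this card.

```yaml
# Sketch of the fields to edit in maniskill_ppo_openvlaoft.yaml.
# Replace the placeholder path with the directory of the downloaded checkpoint.
actor:
  checkpoint_load_path: /path/to/model-checkpoint
  tokenizer:
    tokenizer_model: /path/to/model-checkpoint
  model:
    is_lora: false   # set to false when evaluating this checkpoint directly

rollout:
  model_dir: /path/to/model-checkpoint
```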
License
This code repository and the model weights are licensed under the MIT License.
Evaluation results
- accuracy on maniskill-train (self-reported): 97.66
- accuracy on maniskill-vision (self-reported): 92.11
- accuracy on maniskill-semantic (self-reported): 64.84
- accuracy on maniskill-position (self-reported): 73.57