RLinf: Reinforcement Learning Infrastructure for Agentic AI
RLinf is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.
Model Description
This OpenVLA-OFT model is based on Haozhan72/Openvla-oft-SFT-libero10-trajall with an additional LoRA SFT checkpoint (RLinf/RLinf-OpenVLAOFT-ManiSkill-Base-Lora), and is fine-tuned with Proximal Policy Optimization (PPO) in the ManiSkill simulator.
Full OOD Evaluation and Results
Overall Eval Results
Note: rl4vla refers to the paper VLA-RL-Study: What Can RL Bring to VLA Generalization? An Empirical Study.
| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| Avg results | 0.7915 | 0.6064 | 0.7705 | 0.8193 | 0.7515 |
Training Setting Eval
| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| Avg results | 0.9375 | 0.9414 | 0.9766 | 0.9609 | 0.8438 |
OOD Eval on Vision
| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| vision avg | 0.8047 | 0.8469 | 0.9211 | 0.8203 | 0.7469 |
| unseen table | 0.9063 | 0.9141 | 0.9648 | 0.9570 | 0.8984 |
| dynamic texture (weak) | 0.8516 | 0.9102 | 0.9492 | 0.8555 | 0.7891 |
| dynamic texture (strong) | 0.7500 | 0.7734 | 0.8633 | 0.7227 | 0.6563 |
| dynamic noise (weak) | 0.8281 | 0.8945 | 0.9805 | 0.8711 | 0.7969 |
| dynamic noise (strong) | 0.6875 | 0.7422 | 0.8477 | 0.6953 | 0.5938 |
OOD Eval on Semantic
| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| object avg | 0.7500 | 0.4553 | 0.6484 | 0.7835 | 0.7299 |
| unseen objects | 0.8281 | 0.8047 | 0.8594 | 0.8164 | 0.7656 |
| unseen receptacles | 0.6875 | 0.7422 | 0.8750 | 0.8125 | 0.7344 |
| unseen instructions | 0.8203 | 0.6797 | 0.7109 | 0.9453 | 0.8906 |
| multi-object (both seen) | 0.7891 | 0.3516 | 0.6055 | 0.8438 | 0.7578 |
| multi-object (both unseen) | 0.5703 | 0.3047 | 0.5508 | 0.6289 | 0.5781 |
| distractive receptacle | 0.8047 | 0.1875 | 0.6133 | 0.8281 | 0.7813 |
| multi-receptacle (both unseen) | 0.7500 | 0.3242 | 0.2383 | 0.6094 | 0.6016 |
OOD Eval on Position
| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | PPO-openvla | GRPO-openvla |
|---|---|---|---|---|---|
| position avg | 0.8177 | 0.4466 | 0.7357 | 0.8542 | 0.7786 |
| unseen position (object & receptacle) | 0.7344 | 0.4023 | 0.6992 | 0.8633 | 0.7500 |
| unseen robot init pose | 0.8359 | 0.4805 | 0.7188 | 0.7773 | 0.7031 |
| mid-episode object reposition | 0.8828 | 0.4570 | 0.7891 | 0.9212 | 0.8828 |
How to Use
Please integrate the provided model with the RLinf codebase. To do so, modify the following parameters in the configuration file `examples/embodiment/config/maniskill_ppo_openvlaoft.yaml`:
- Set `actor.checkpoint_load_path`, `actor.tokenizer.tokenizer_model`, and `rollout.model_dir` to the path of the model checkpoint (see the sketch below).
Note: If you intend to evaluate the model directly, make sure to set `actor.model.is_lora` to false.
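For illustration, here is a minimal sketch of the fields to edit in `examples/embodiment/config/maniskill_ppo_openvlaoft.yaml`. The checkpoint path is a placeholder, and the nesting of surrounding keys in the actual config may differ; only the parameter names listed above are taken from this card.

```yaml
# Sketch of the fields to edit in maniskill_ppo_openvlaoft.yaml.
# Replace the placeholder path with the directory of the downloaded checkpoint.
actor:
  checkpoint_load_path: /path/to/model-checkpoint
  tokenizer:
    tokenizer_model: /path/to/model-checkpoint
  model:
    is_lora: false   # set to false when evaluating this checkpoint directly

rollout:
  model_dir: /path/to/model-checkpoint
```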
License
This code repository and the model weights are licensed under the MIT License.
Evaluation results
- accuracy on maniskill-train (self-reported): 97.66
- accuracy on maniskill-vision (self-reported): 92.11
- accuracy on maniskill-semantic (self-reported): 64.84
- accuracy on maniskill-position (self-reported): 73.57