# SmolVLA Base (ONNX Export)
This repository contains an ONNX export of the SmolVLA base policy model from the LeRobot ecosystem.
SmolVLA is Hugging Face’s lightweight vision‑language‑action model for robotics. The original model has roughly 450 million parameters and is designed to be fine‑tuned on robot datasets collected with LeRobot.
The ONNX export in this repo preserves the weights and behavior of the PyTorch model `lerobot/smolvla_base`, but packages the policy as several smaller ONNX graphs to enable hardware‑agnostic inference via ONNX Runtime.
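As a minimal sketch of getting started, the graphs can be fetched from this repo with `huggingface_hub` and loaded with ONNX Runtime. The repo id is taken from this model card; the execution provider and file choice here are just illustrative:

```python
# Minimal sketch: download one exported graph and open an ONNX Runtime session.
# Assumes the huggingface_hub and onnxruntime packages are installed.
from huggingface_hub import hf_hub_download
import onnxruntime as ort

# Fetch the vision encoder graph from this repository.
vision_path = hf_hub_download(
    repo_id="ainekko/smolvla_base_onnx",
    filename="smolvlm_vision.onnx",
)

# Any ONNX Runtime execution provider works; CPU is the portable default.
session = ort.InferenceSession(vision_path, providers=["CPUExecutionProvider"])
print([inp.name for inp in session.get_inputs()])
```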
## Contents
The export splits the SmolVLA architecture into multiple components. Each .onnx file corresponds to a specific part of the model:
| File | Role in the SmolVLA architecture |
|---|---|
| `smolvlm_vision.onnx` | Vision encoder; processes RGB camera frames and produces visual embeddings. |
| `smolvlm_text.onnx` | Text encoder; converts tokenized instructions into language embeddings. |
| `smolvlm_expert_prefill.onnx` | “Prefill” stage of the action expert; conditions on vision and language context. |
| `smolvlm_expert_decode.onnx` | “Decode” stage of the action expert; autoregressively generates action tokens. |
| `state_projector.onnx` | Projects the robot’s sensorimotor state into the model’s latent space. |
| `time_in_projector.onnx` | Projects the current timestep into the latent space. |
| `time_out_projector.onnx` | Projects internal time features back into the expert. |
| `action_in_projector.onnx` | Projects previous action chunks into the latent space (for chunked generation). |
| `action_out_projector.onnx` | Projects the model’s output back into continuous control actions. |
All graphs are exported at ONNX opset 17 and use static input shapes.
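Because the shapes are static, the exact tensor names and shapes each component expects can be read straight out of the graphs. The sketch below iterates over the files listed in the table above (it assumes they have already been downloaded to the working directory) and prints each graph’s input/output signature:

```python
# Inspect the static input/output signatures of every exported graph.
# File names come from the table above; paths assume local copies.
import onnxruntime as ort

GRAPHS = [
    "smolvlm_vision.onnx",
    "smolvlm_text.onnx",
    "smolvlm_expert_prefill.onnx",
    "smolvlm_expert_decode.onnx",
    "state_projector.onnx",
    "time_in_projector.onnx",
    "time_out_projector.onnx",
    "action_in_projector.onnx",
    "action_out_projector.onnx",
]

for path in GRAPHS:
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    print(path)
    for t in sess.get_inputs():
        print(f"  in : {t.name} {t.shape} {t.type}")
    for t in sess.get_outputs():
        print(f"  out: {t.name} {t.shape} {t.type}")
```

Matching these signatures up is how the components are wired together at inference time (vision and text embeddings feed the prefill stage, the projectors map robot state, time, and actions in and out of the expert’s latent space).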