SmolVLA Base (ONNX Export)

This repository contains an ONNX export of the SmolVLA base policy model from the LeRobot ecosystem.
SmolVLA is Hugging Face’s lightweight vision‑language‑action model for robotics. The original model has roughly 450 million parameters and is designed to be fine‑tuned on robot datasets collected with LeRobot.

The ONNX export in this repo preserves the same weights and behavior as the PyTorch model lerobot/smolvla_base, but packages the policy as several smaller ONNX graphs to enable hardware‑agnostic inference via ONNX Runtime.
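For example, a single component graph can be fetched and its interface inspected with huggingface_hub and ONNX Runtime. This is a minimal sketch: the repo id and filename come from this card, but the printed input names and shapes depend on the export itself.

```python
# Sketch: download one exported graph and inspect its I/O signature.
from huggingface_hub import hf_hub_download
import onnxruntime as ort

vision_path = hf_hub_download(
    repo_id="ainekko/smolvla_base_onnx",
    filename="smolvlm_vision.onnx",
)
session = ort.InferenceSession(vision_path, providers=["CPUExecutionProvider"])

# Print each declared input: name, static shape, and element type.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```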

Contents

The export splits the SmolVLA architecture into multiple components. Each .onnx file corresponds to a specific part of the model:

| File | Role in the SmolVLA architecture |
|---|---|
| smolvlm_vision.onnx | Vision encoder; processes RGB camera frames and produces visual embeddings. |
| smolvlm_text.onnx | Text encoder; converts tokenized instructions into language embeddings. |
| smolvlm_expert_prefill.onnx | “Prefill” stage of the action expert; conditions on vision and language context. |
| smolvlm_expert_decode.onnx | “Decode” stage of the action expert; autoregressively generates action tokens. |
| state_projector.onnx | Projects the robot’s sensorimotor state into the model’s latent space. |
| time_in_projector.onnx | Projects the current timestep into the latent space. |
| time_out_projector.onnx | Projects internal time features back into the expert. |
| action_in_projector.onnx | Projects previous action chunks into the latent space (for chunked generation). |
| action_out_projector.onnx | Projects the model’s output back into continuous control actions. |

All files are exported at opset 17 and use static shapes.
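Because every graph carries static shapes, one quick way to validate the export is to zero‑fill each graph’s declared inputs and run it in isolation. The sketch below does exactly that; the file list mirrors the table above, the files are assumed to sit in the current directory, and the zero‑filled tensors are placeholders. It does not reproduce the real inter‑graph data flow, which is not documented on this card.

```python
# Smoke test: run each exported graph on all-zeros inputs built from its
# static input signature. This checks loadability and shape consistency only.
import numpy as np
import onnxruntime as ort

GRAPHS = [
    "smolvlm_vision.onnx",
    "smolvlm_text.onnx",
    "smolvlm_expert_prefill.onnx",
    "smolvlm_expert_decode.onnx",
    "state_projector.onnx",
    "time_in_projector.onnx",
    "time_out_projector.onnx",
    "action_in_projector.onnx",
    "action_out_projector.onnx",
]

def zero_feeds(session: ort.InferenceSession) -> dict:
    """Build an all-zeros feed dict from the session's static input shapes."""
    feeds = {}
    for inp in session.get_inputs():
        # Rough dtype heuristic: e.g. "tensor(int64)" vs "tensor(float)".
        dtype = np.int64 if "int" in inp.type else np.float32
        feeds[inp.name] = np.zeros(inp.shape, dtype=dtype)
    return feeds

for name in GRAPHS:
    sess = ort.InferenceSession(name, providers=["CPUExecutionProvider"])
    outputs = sess.run(None, zero_feeds(sess))
    print(f"{name}: {len(outputs)} output(s), first output shape {outputs[0].shape}")
```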
