
Meta-World

Meta-World is a well-designed, open-source simulation benchmark for multi-task and meta reinforcement learning in continuous-control robotic manipulation. It gives researchers a shared, realistic playground to test whether algorithms can learn many different tasks and generalize quickly to new ones — two central challenges for real-world robotics.

Meta-World MT10 demo (video)

Why Meta-World matters

  • Diverse, realistic tasks. Meta-World bundles a large suite of simulated manipulation tasks (50 in the MT50 suite) using everyday objects and a common tabletop Sawyer arm. This diversity exposes algorithms to a wide variety of dynamics, contacts and goal specifications while keeping a consistent control and observation structure.
  • Focus on generalization and multi-task learning. By evaluating across task distributions that share structure but differ in goals and objects, Meta-World reveals whether an agent truly learns transferable skills rather than overfitting to a narrow task.
  • Standardized evaluation protocol. It provides clear evaluation modes and difficulty splits, so different methods can be compared fairly across easy, medium, hard and very-hard regimes.
  • Empirical insight. Past evaluations on Meta-World show impressive progress on some fronts, but also highlight that current multi-task and meta-RL methods still struggle with large, diverse task sets. That gap points to important research directions.

What it enables in LeRobot

In LeRobot, you can evaluate any policy or vision-language-action (VLA) model on Meta-World tasks and get a clear success-rate measure. The integration is designed to be straightforward:

  • We provide a LeRobot-ready dataset for Meta-World (MT50) on the HF Hub: https://huggingface.co/datasets/lerobot/metaworld_mt50.

    • This dataset is formatted for the MT50 evaluation that uses all 50 tasks (the most challenging multi-task setting).
    • MT50 gives the policy a one-hot task vector and uses fixed object/goal positions for consistency.
  • Task descriptions and the exact keys required for evaluation ship with the dataset; use them to make sure your policy reports the expected success signals (see the sketch below).
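
As a quick sanity check before training, you can load the dataset and inspect its task descriptions and feature keys. The snippet below is a minimal sketch assuming a recent LeRobot install; the exact import path and metadata attributes may differ between versions.

# Minimal sketch: inspect the Meta-World MT50 dataset from the Hub.
# Note: the import path and attribute names may vary across LeRobot versions.
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/metaworld_mt50")

print(dataset.meta.tasks)        # task descriptions stored with the dataset
print(dataset.features.keys())   # observation/action/task keys used at evaluation time

frame = dataset[0]               # a single frame as a dict of tensors
print({k: getattr(v, "shape", v) for k, v in frame.items()})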

Quick start: train a SmolVLA policy on Meta-World

Example command to train a SmolVLA policy on a subset of tasks:

lerobot-train \
  --policy.type=smolvla \
  --policy.repo_id=${HF_USER}/metaworld-test \
  --policy.load_vlm_weights=true \
  --dataset.repo_id=lerobot/metaworld_mt50 \
  --env.type=metaworld \
  --env.task=assembly-v3,dial-turn-v3,handle-press-side-v3 \
  --output_dir=./outputs/ \
  --steps=100000 \
  --batch_size=4 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
  --eval_freq=1000

Notes:

  • --env.task accepts an explicit comma-separated task list or a difficulty group (e.g., --env.task="hard").
  • Adjust batch_size, steps, and eval_freq to match your compute budget.
  • Gymnasium assertion error: if you encounter an error like AssertionError: ['human', 'rgb_array', 'depth_array'] when running Meta-World environments, it comes from a mismatch between Meta-World and your Gymnasium version. We recommend pinning Gymnasium to ensure compatibility:

    pip install "gymnasium==1.1.0"

Quick start: evaluate a trained policy

To evaluate a trained policy on the Meta-World medium difficulty split:

lerobot-eval \
  --policy.path="your-policy-id" \
  --env.type=metaworld \
  --env.task=medium \
  --eval.batch_size=1 \
  --eval.n_episodes=2

This will run episodes and return per-task success rates using the standard Meta-World evaluation keys.
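
To make those numbers concrete, here is a small illustration (with made-up episode outcomes, not real output) of how per-task success rates are computed: each evaluated episode contributes a boolean success flag, and the rate for a task is the mean over its episodes.

from collections import defaultdict

# Illustrative episode outcomes; a real run yields one entry per evaluated episode.
episodes = [
    {"task": "assembly-v3", "is_success": True},
    {"task": "assembly-v3", "is_success": False},
    {"task": "dial-turn-v3", "is_success": True},
    {"task": "dial-turn-v3", "is_success": True},
]

per_task = defaultdict(list)
for ep in episodes:
    per_task[ep["task"]].append(ep["is_success"])

# Success rate per task = fraction of successful episodes.
success_rates = {task: sum(flags) / len(flags) for task, flags in per_task.items()}
print(success_rates)  # {'assembly-v3': 0.5, 'dial-turn-v3': 1.0}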

Practical tips

  • If you care about generalization, run on the full MT50 suite — it’s intentionally challenging and reveals strengths/weaknesses better than a few narrow tasks.
  • Use the one-hot task conditioning for multi-task training (MT10 / MT50 conventions) so policies have explicit task context.
  • Inspect the dataset task descriptions and the info["is_success"] keys when writing post-processing or logging so your success metrics line up with the benchmark; a minimal rollout sketch follows below.
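
As a rough illustration of the last two tips, the sketch below runs one episode of a single task with an MT-style one-hot task vector and reads the success flag from the step info. The gymnasium registration name, the random policy stub, and the exact info key are assumptions here; adapt them to your Meta-World version and your policy's interface.

import gymnasium as gym
import numpy as np
import metaworld  # assumed to register Meta-World environments with gymnasium (v3-style API)

TASKS = ["assembly-v3", "dial-turn-v3", "handle-press-side-v3"]  # subset used in the training example
task_name = TASKS[0]

# MT-style one-hot task conditioning: one slot per task in the suite.
task_vec = np.zeros(len(TASKS), dtype=np.float32)
task_vec[TASKS.index(task_name)] = 1.0

# The "Meta-World/MT1" registration name follows the v3 gymnasium API and may differ in other releases.
env = gym.make("Meta-World/MT1", env_name=task_name)
obs, info = env.reset(seed=0)

def policy(obs, task_vec):
    # Placeholder: replace with your trained policy's action selection.
    return env.action_space.sample()

success, done = False, False
while not done:
    obs, reward, terminated, truncated, info = env.step(policy(obs, task_vec))
    # LeRobot-style success flag; raw Meta-World may expose it as info["success"] instead.
    success = success or bool(info.get("is_success", info.get("success", 0)))
    done = terminated or truncated

print(task_name, "success:", success)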