Description:

GR00T-N1-2B-tuned-Exhaust-Pipe-Sorting-task is a domain-adapted version of the GR00T-N1-2B model. It is fine-tuned with supervised learning to specialize in exhaust pipe sorting instructions and reasoning.

This model is ready for commercial/non-commercial use.

License/Terms of Use:

NSCL V1 License
NVIDIA OneWay Noncommercial License_22Mar2022

Deployment Geography:

Global

Use Case:

Researchers, Academics, Open-Source Community: AI-driven robotics research and algorithm development. Developers: Integrate and customize AI for various robotic applications. Startups & Companies: Accelerate robotics development and reduce training costs.

Release Date:

07/17/2025: Hugging Face via [https://huggingface.co/nvidia/GR00T-N1-2B-tuned-Exhaust-Pipe-Sorting-task]

References:

https://huggingface.co/nvidia/GR00T-N1-2B
[1] Mandlekar, A., Nasiriany, S., Wen, B., Akinola, I., Narang, Y., Fan, L., Zhu, Y., and Fox, D. "MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations." 7th Annual Conference on Robot Learning, 2023.

Model Architecture:

Architecture Type: Vision Transformer, Multilayer Perceptron, Flow matching Transformer

Isaac GR00T N1-2B uses vision and text transformers to encode the robot's image observations and text instructions. The architecture handles a varying number of views per embodiment by concatenating image token embeddings from all frames into a sequence, followed by language token embeddings.
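
The token ordering described above can be illustrated with a minimal sketch. It is not the model's actual code: the function name `build_vl_sequence`, the toy embedding size, and the use of plain Python lists are assumptions made for illustration only.

```python
from typing import List

Embedding = List[float]  # one token embedding (toy dimensionality)


def build_vl_sequence(
    view_tokens: List[List[Embedding]],
    language_tokens: List[Embedding],
) -> List[Embedding]:
    """Concatenate image tokens from every view/frame, then language tokens."""
    sequence: List[Embedding] = []
    for tokens in view_tokens:        # any number of camera views/frames
        sequence.extend(tokens)
    sequence.extend(language_tokens)  # language tokens always come last
    return sequence


# Toy example: two views with 3 and 2 image tokens, then 4 language tokens.
dim = 4
view_a = [[0.1] * dim for _ in range(3)]
view_b = [[0.2] * dim for _ in range(2)]
text = [[0.9] * dim for _ in range(4)]

seq = build_vl_sequence([view_a, view_b], text)
print(len(seq))  # 3 + 2 + 4 = 9 tokens
```

Because the views are simply appended, an embodiment with more or fewer cameras only changes the sequence length, not the interface.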

To model proprioception and a sequence of actions conditioned on observations, Isaac GR00T N1-2B uses a flow matching transformer. The flow matching transformer interleaves self-attention over proprioception and actions with cross-attention to the vision and language embeddings. During training, the input actions are corrupted by interpolating, at a randomly sampled ratio, between the clean action vector and a Gaussian noise vector. At inference time, the policy first samples a Gaussian noise vector and then iteratively reconstructs a continuous-value action using its velocity prediction.
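
The training-time corruption and inference-time reconstruction can be sketched as follows. This is a toy illustration under stated assumptions, not the model's implementation: the convention that tau = 0 is pure noise and tau = 1 is the clean action, the forward Euler integrator, and all names (`corrupt_action`, `sample_action`, `oracle_velocity`) are assumptions. The trained network is replaced by a stand-in velocity function that points at a known action.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def corrupt_action(action: Vector, rng: random.Random) -> Tuple[Vector, Vector, float]:
    """Training-time corruption: blend the clean action with Gaussian noise.

    Returns the noisy action, the velocity target (action - noise), and tau.
    Assumed convention: tau = 0 is pure noise, tau = 1 is the clean action.
    """
    tau = rng.random()
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    noisy = [tau * a + (1.0 - tau) * n for a, n in zip(action, noise)]
    target_velocity = [a - n for a, n in zip(action, noise)]
    return noisy, target_velocity, tau


def sample_action(
    velocity_fn: Callable[[Vector, float], Vector],
    dim: int,
    num_steps: int,
    rng: random.Random,
) -> Vector:
    """Inference: start from Gaussian noise and integrate the predicted velocity."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        tau = k * dt
        v = velocity_fn(x, tau)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # forward Euler step
    return x


# Stand-in for the trained network: the exact velocity toward one known action.
goal = [0.5, -0.25, 1.0]


def oracle_velocity(x: Vector, tau: float) -> Vector:
    return [(g - xi) / (1.0 - tau) for g, xi in zip(goal, x)]


rng = random.Random(0)
noisy, target, tau = corrupt_action(goal, rng)
action = sample_action(oracle_velocity, dim=len(goal), num_steps=4, rng=rng)
print([round(a, 6) for a in action])
```

In the real model the velocity function would be the flow matching transformer conditioned on the vision-language embeddings and proprioception; the stand-in only shows that integrating a correct velocity field carries a noise sample to the action.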

This model is based on GR00T-N1-2B.

Network Architecture: RGB camera frames and text are jointly encoded using a pre-trained vision language model (VLM), NVIDIA EAGLE. This VLM backbone produces a variable-length sequence of image and language token embeddings. The action head is a flow matching transformer model of proprioception and action that cross-attends to the vision-language embeddings. Robot proprioception is encoded using a multi-layer perceptron (MLP) indexed by the embodiment ID. To handle variable-dimension proprioception, inputs are padded to a configurable maximum length before being fed into the MLP. Continuous actions are decoded by an MLP, one per unique embodiment. The flow matching transformer is implemented as a diffusion transformer (DiT), in which the diffusion step conditioning is implemented using adaptive layer normalization (AdaLN).
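
The padding and embodiment-indexed encoding of proprioception can be sketched as below. This is a simplified illustration, not the model's code: `MAX_STATE_DIM`, `HIDDEN_DIM`, the `StateEncoder` class, the zero-padding choice, the bias-free two-layer MLP, and the dictionary lookup by embodiment ID are all assumptions.

```python
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]

MAX_STATE_DIM = 8  # hypothetical configurable maximum length
HIDDEN_DIM = 4     # hypothetical hidden width


def pad_state(state: Vector, max_dim: int = MAX_STATE_DIM) -> Vector:
    """Zero-pad a variable-length proprioception vector to max_dim."""
    if len(state) > max_dim:
        raise ValueError(f"state has {len(state)} dims, max is {max_dim}")
    return state + [0.0] * (max_dim - len(state))


def linear(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class StateEncoder:
    """Two-layer MLP (no biases, for brevity) with randomly initialised weights."""

    def __init__(self, rng: random.Random) -> None:
        self.w1 = [[rng.uniform(-1, 1) for _ in range(MAX_STATE_DIM)]
                   for _ in range(HIDDEN_DIM)]
        self.w2 = [[rng.uniform(-1, 1) for _ in range(HIDDEN_DIM)]
                   for _ in range(HIDDEN_DIM)]

    def __call__(self, state: Vector) -> Vector:
        hidden = [max(0.0, h) for h in linear(self.w1, pad_state(state))]  # ReLU
        return linear(self.w2, hidden)


rng = random.Random(0)
# One encoder per embodiment ID; IDs 0 and 1 are placeholders.
encoders: Dict[int, StateEncoder] = {0: StateEncoder(rng), 1: StateEncoder(rng)}

embedding_a = encoders[0]([0.1, 0.2, 0.3])            # 3-dimensional state
embedding_b = encoders[1]([0.1, 0.2, 0.3, 0.4, 0.5])  # 5-dimensional state
print(len(embedding_a), len(embedding_b))  # 4 4
```

Padding lets robots with different state sizes share one input width, while separate weights per embodiment ID keep their encodings independent.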


Input:

Input Type:

Vision: Image Frames
State: Robot Proprioception
Language Instruction: Text
Embodiment ID: Integer

Input Format:

Vision: Variable number of 224x224 uint8 image frames, coming from robot cameras
State: Floating Point
Language Instruction: String
Embodiment ID: Integer indicating which of the training embodiments is observed

Input Parameters:

Vision: 2D - RGB image, square
State: 1D - Floating-point vector
Language Instruction: 1D - String
Embodiment ID: 1D - Integer
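
The input specification above can be summarized as a small validation sketch. The `Observation` class and its field names are hypothetical and are not the model's API. Frames are represented here as raw uint8 RGB byte buffers, which is an assumption about layout, and the instruction text and embodiment ID are placeholders.

```python
from dataclasses import dataclass
from typing import List

FRAME_SIDE = 224
FRAME_BYTES = FRAME_SIDE * FRAME_SIDE * 3  # square RGB frame, one byte per channel


@dataclass
class Observation:
    """Hypothetical container for one model input; field names are illustrative."""
    frames: List[bytes]   # one raw uint8 RGB buffer per camera view
    state: List[float]    # robot proprioception
    instruction: str      # language instruction
    embodiment_id: int    # which training embodiment is observed

    def validate(self) -> None:
        for i, frame in enumerate(self.frames):
            if len(frame) != FRAME_BYTES:
                raise ValueError(
                    f"frame {i}: expected {FRAME_BYTES} bytes, got {len(frame)}"
                )
        if not all(isinstance(v, float) for v in self.state):
            raise TypeError("state must be a vector of floats")
        if not isinstance(self.instruction, str):
            raise TypeError("instruction must be a string")
        if not isinstance(self.embodiment_id, int):
            raise TypeError("embodiment_id must be an integer")


obs = Observation(
    frames=[bytes(FRAME_BYTES), bytes(FRAME_BYTES)],  # two all-black views
    state=[0.0, 0.1, -0.2],
    instruction="sort the exhaust pipe",  # placeholder instruction text
    embodiment_id=0,                      # placeholder ID
)
obs.validate()
```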

Output:

Output Type(s): Actions

Output Format: Continuous-value vectors that correspond to different motor controls on a robot.

Output Parameters: 1D floating-point vector

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s): PyTorch

Supported Hardware Microarchitecture Compatibility: All of the below:

NVIDIA Ampere
NVIDIA Blackwell
NVIDIA Jetson
NVIDIA Hopper
NVIDIA Lovelace

Supported Operating System(s):

Linux

Model Version(s):

This is the initial version of the model, version 1.0.

Training and Evaluation Datasets:

Training Dataset:

Dataset: Exhaust Pipe Task [https://huggingface.co/datasets/nvidia/PhysicalAI-GR00T-Tuned-Tasks]

Data Collection Method by Dataset: Hybrid: Automated, Automatic/Sensors, Synthetic

Labeling Collection Method by Dataset: Not Applicable

Properties: Five human-teleoperated demonstrations were collected with Apple Vision Pro in Isaac Lab. All 1,000 demos are generated automatically using MimicGen [1], a synthetic motion trajectory generation framework. Each demo is generated at 20 Hz.

How To Use:

For information on how to use the model, please visit this repository.

Inference:

Engine: PyTorch
Test Hardware: A6000 Ada

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns here.

Model Limitations:

This model is not tested or intended for use in mission critical applications that require functional safety. The use of the model in those applications is at the user's own risk and sole responsibility, including taking the necessary steps to add needed guardrails or safety mechanisms.

Risk: Model underperformance in highly dynamic environments. Mitigation: Enhance dataset with dynamic obstacle scenarios and fine-tune models accordingly.
Risk: Integration challenges in specific customer environments. Mitigation: Provide detailed integration guides and support, leveraging NVIDIA's ecosystem.
Risk: Limited initial support for certain robot embodiments. Mitigation: Expand testing and validation across a wider range of robot platforms.

Bias

Field	Response
Participation considerations from adversely impacted groups (protected classes) in model design and testing:	Not Applicable
Measures taken to mitigate against unwanted bias:	Not Applicable

Explainability

Field	Response
Intended Application & Domain:	Open foundation model for generalized humanoid robot reasoning and skills
Model Type:	Robot VLA model
Intended Users:	This model is intended for developers and community members who build and fine-tune robot foundation models.
Output:	The model outputs actions as floating-point values. This is referred to as the "robot action policy." Actions consist of continuous-value vectors that correspond to different motor controls on a robot.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of:	Not Applicable
Describe how the model works:	Accepts vision, language and proprioception, outputs robot action policy.
Technical Limitations:	- Model underperformance in highly dynamic environments. - Integration challenges in specific customer environments. - Limited initial support for certain robot embodiments.
Verified to have met prescribed NVIDIA quality standards:	Yes
Performance Metrics:	Success rate, as well as the following: 1) whether the trajectory is smooth and does not jitter; 2) whether the robot hits any other objects; 3) whether the trajectory is natural.
Potential Known Risks:	This model is not tested or intended for use in mission critical applications that require functional safety. The use of the model in those applications is at the user's own risk and sole responsibility, including taking the necessary steps to add needed guardrails or safety mechanisms.
Licensing:	NSCL V1 License

Privacy

Field	Response
Generatable or reverse engineerable personal information?	None Known
Personal data used to create this model?	No
How often is dataset reviewed?	Before Release
Is there provenance for all datasets used in training?	Yes
Does data labeling (annotation, metadata) comply with privacy laws?	Yes
Is data compliant with data subject requests for data correction or removal, if such a request was made?	Yes
Applicable Privacy Policy	NVIDIA Privacy Policy

Safety

Field	Response
Model Application(s):	Robot VLA - single-arm manipulation, bimanual grippers, bi-manual dex hands manipulation and humanoid dexterous manipulation
List types of specific high-risk AI systems, if any, in which the model can be integrated: Select from the following: [Biometrics] OR [Critical infrastructure] OR [Machinery and Robotics] OR [Medical Devices] OR [Vehicles] OR [Aviation] OR [Education and vocational training] OR [Employment and Workers Management]	Machinery and Robotics
Describe the life critical impact (if present).	This model is not tested or intended for use in mission critical applications that require functional safety. The use of the model in those applications is at the user's own risk and sole responsibility, including taking the necessary steps to add needed guardrails or safety mechanisms.
Use Case Restrictions:	Abide by NSCL V1 License
Model and dataset restrictions:	The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.

Safetensors: Model size 2B params, tensor type F32