Description:
GR00T-N1-2B-tuned-Exhaust-Pipe-Sorting-task is a domain-adapted version of the GR00T-N1-2B model, fine-tuned with supervised learning to follow exhaust pipe sorting instructions.
This model is ready for commercial/non-commercial use.
License/Terms of Use
- NSCL V1 License
- NVIDIA OneWay Noncommercial License_22Mar2022
Deployment Geography:
Global
Use Case:
- Researchers, academics, and the open-source community: AI-driven robotics research and algorithm development.
- Developers: Integrate and customize AI for various robotic applications.
- Startups and companies: Accelerate robotics development and reduce training costs.
Release Date:
07/17/2025: Hugging Face via [https://huggingface.co/nvidia/GR00T-N1-2B-tuned-Exhaust-Pipe-Sorting-task]
References:
- https://huggingface.co/nvidia/GR00T-N1-2B
- [1] @inproceedings{mandlekar2023mimicgen, title={MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations}, author={Mandlekar, Ajay and Nasiriany, Soroush and Wen, Bowen and Akinola, Iretiayo and Narang, Yashraj and Fan, Linxi and Zhu, Yuke and Fox, Dieter}, booktitle={7th Annual Conference on Robot Learning}, year={2023} }
Model Architecture:
Architecture Type: Vision Transformer, Multilayer Perceptron, Flow matching Transformer
Isaac GR00T N1-2B uses vision and text transformers to encode the robot's image observations and text instructions. The architecture handles a varying number of views per embodiment by concatenating image token embeddings from all frames into a sequence, followed by language token embeddings.
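The sequence construction described above can be sketched in a few lines. This is a minimal illustration, not the model's actual code: the function name `build_vl_sequence`, the toy embedding width, and the token counts are all hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List

Token = List[float]  # one embedding vector (toy stand-in for a tensor row)


def build_vl_sequence(frame_tokens: List[List[Token]], text_tokens: List[Token]) -> List[Token]:
    """Concatenate image tokens from every frame, then append the language tokens."""
    sequence: List[Token] = []
    for tokens in frame_tokens:  # one entry per camera view
        sequence.extend(tokens)
    sequence.extend(text_tokens)  # language tokens always come last
    return sequence


# Hypothetical sizes: 3 image tokens per view, 5 text tokens, embedding width 4.
DIM = 4
one_view = [[[0.0] * DIM for _ in range(3)]]
two_views = [[[0.0] * DIM for _ in range(3)] for _ in range(2)]
text = [[1.0] * DIM for _ in range(5)]

seq_a = build_vl_sequence(one_view, text)   # embodiment with one camera
seq_b = build_vl_sequence(two_views, text)  # embodiment with two cameras
print(len(seq_a), len(seq_b))
```

Because the views are simply concatenated, embodiments with different camera counts yield sequences of different lengths without any change to the encoder interface.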
To model proprioception and a sequence of actions conditioned on observations, Isaac GR00T N1-2B uses a flow matching transformer. The flow matching transformer interleaves self-attention over proprioception and actions with cross-attention to the vision and language embeddings. During training, the input actions are corrupted by randomly interpolating between the clean action vector and a Gaussian noise vector. At inference time, the policy first samples a Gaussian noise vector and iteratively reconstructs a continuous-valued action using its velocity prediction.
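The training-time corruption and inference-time reconstruction can be illustrated with a small sketch. It assumes one common flow matching convention (noise at `tau = 0`, clean action at `tau = 1`, Euler integration with a fixed step count); the card does not state which convention or step count the model uses. All function names are hypothetical, and an oracle velocity function stands in for the learned transformer.

```python
import random
from typing import Callable, List

Vector = List[float]


def corrupt_action(action: Vector, noise: Vector, tau: float) -> Vector:
    """Interpolate between noise (tau=0) and the clean action (tau=1)."""
    return [(1.0 - tau) * n + tau * a for a, n in zip(action, noise)]


def target_velocity(action: Vector, noise: Vector) -> Vector:
    """Velocity of the straight noise-to-action path (the regression target here)."""
    return [a - n for a, n in zip(action, noise)]


def sample_action(velocity_fn: Callable[[Vector, float], Vector],
                  dim: int, steps: int, rng: random.Random) -> Vector:
    """Start from Gaussian noise and integrate the predicted velocity with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Oracle stand-in for the trained network: it knows the clean action.
clean = [0.5, -0.2, 0.1]


def oracle_velocity(x: Vector, tau: float) -> Vector:
    # On the straight path, (clean - x) / (1 - tau) equals (clean - noise).
    return [(a - xi) / (1.0 - tau) for a, xi in zip(clean, x)]


recovered = sample_action(oracle_velocity, dim=3, steps=4, rng=random.Random(0))
print(recovered)
```

With a perfect velocity prediction the Euler steps land on the clean action; a trained network only approximates this field, which is why several integration steps are typically used.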
This model is based on GR00T-N1-2B.
Network Architecture: RGB camera frames and text are jointly encoded using a pre-trained vision language model (VLM), NVIDIA EAGLE. This VLM backbone produces a variable-length sequence of image and language token embeddings. The action head is a flow matching transformer model of proprioception and action that cross-attends to the vision-language embeddings. Robot proprioception is encoded using a multi-layer perceptron (MLP) indexed by the embodiment ID. To handle variable-dimension proprioception, inputs are padded to a configurable max length before feeding into the MLP. Continuous actions are decoded by an MLP, one per unique embodiment. The flow matching transformer is implemented as a diffusion transformer (DiT), in which the diffusion step conditioning is implemented using adaptive layer normalization (AdaLN).
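The padding and embodiment-indexed encoding of proprioception might look like the sketch below. It is a toy stand-in under stated assumptions: a single linear layer with ReLU replaces the MLP, the weights are fixed constants rather than learned, and `MAX_STATE_DIM`, `pad_state`, and `EmbodimentEncoder` are hypothetical names that do not come from the GR00T codebase.

```python
from typing import Dict, List

MAX_STATE_DIM = 8  # hypothetical "configurable max length"


def pad_state(state: List[float], max_dim: int = MAX_STATE_DIM) -> List[float]:
    """Right-pad a proprioception vector with zeros up to max_dim."""
    if len(state) > max_dim:
        raise ValueError(f"state has {len(state)} dims, max is {max_dim}")
    return list(state) + [0.0] * (max_dim - len(state))


def linear(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


class EmbodimentEncoder:
    """Keeps one set of layer parameters per embodiment ID."""

    def __init__(self, num_embodiments: int, max_dim: int, hidden_dim: int) -> None:
        self.max_dim = max_dim
        # Toy deterministic weights; a real model would learn these.
        self.weights: Dict[int, List[List[float]]] = {
            e: [[0.01 * (e + 1)] * max_dim for _ in range(hidden_dim)]
            for e in range(num_embodiments)
        }
        self.biases: Dict[int, List[float]] = {
            e: [0.0] * hidden_dim for e in range(num_embodiments)
        }

    def encode(self, state: List[float], embodiment_id: int) -> List[float]:
        x = pad_state(state, self.max_dim)
        h = linear(x, self.weights[embodiment_id], self.biases[embodiment_id])
        return [max(0.0, v) for v in h]  # ReLU


encoder = EmbodimentEncoder(num_embodiments=2, max_dim=MAX_STATE_DIM, hidden_dim=4)
print(encoder.encode([1.0, 1.0, 1.0], embodiment_id=0))
print(encoder.encode([1.0, 1.0, 1.0], embodiment_id=1))
```

Padding gives every embodiment the same input width, while the ID lookup lets each embodiment keep its own parameters.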
Input:
Input Type:
- Vision: Image Frames
- State: Robot Proprioception
- Language Instruction: Text
- Embodiment ID: Integer
Input Format:
- Vision: Variable number of 224x224 uint8 image frames, coming from robot cameras
- State: Floating Point
- Language Instruction: String
- Embodiment ID: Integer indicating which of the training embodiments is observed
Input Parameters:
- Vision: 2D - RGB image, square
- State: 1D - Floating number vector
- Language Instruction: 1D - String
- Embodiment ID: 1D - Integer
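As a rough illustration of the inputs listed above, the sketch below bundles them into one container and checks the stated shapes and types. The `Observation` class, its field names, and the example instruction are hypothetical and are not the model's real input schema; raw `bytes` buffers stand in for uint8 image arrays, and requiring at least one frame is an assumption.

```python
from dataclasses import dataclass
from typing import List

IMAGE_SIZE = 224                           # from the card: 224x224 frames
FRAME_BYTES = IMAGE_SIZE * IMAGE_SIZE * 3  # RGB, one uint8 per channel


@dataclass
class Observation:
    frames: List[bytes]   # one raw RGB buffer per camera view
    state: List[float]    # robot proprioception
    instruction: str      # language instruction
    embodiment_id: int    # which training embodiment is observed


def validate_observation(obs: Observation) -> None:
    """Raise if the observation does not match the documented input format."""
    if not obs.frames:
        raise ValueError("at least one camera frame is required")
    for i, frame in enumerate(obs.frames):
        if len(frame) != FRAME_BYTES:
            raise ValueError(f"frame {i} has {len(frame)} bytes, expected {FRAME_BYTES}")
    if not all(isinstance(v, float) for v in obs.state):
        raise TypeError("state must be a 1D vector of floats")
    if not isinstance(obs.instruction, str):
        raise TypeError("instruction must be a string")
    if not isinstance(obs.embodiment_id, int) or obs.embodiment_id < 0:
        raise ValueError("embodiment_id must be a non-negative integer")


example = Observation(
    frames=[bytes(FRAME_BYTES), bytes(FRAME_BYTES)],  # two black frames
    state=[0.0] * 7,                                  # hypothetical 7-dim state
    instruction="sort the exhaust pipe",              # hypothetical instruction
    embodiment_id=0,
)
validate_observation(example)
```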
Output:
Output Type(s): Actions
Output Format: Continuous-value vectors that correspond to different motor controls on a robot.
Output Parameters: 1D floating number vector
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine(s): PyTorch
Supported Hardware Microarchitecture Compatibility: All of the below:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Hopper
- NVIDIA Lovelace
Supported Operating System(s):
- Linux
Model Version(s):
This is the initial version of the model, version 1.0.
Training and Evaluation Datasets:
Training Dataset:
Dataset: Exhaust Pipe Task [https://huggingface.co/datasets/nvidia/PhysicalAI-GR00T-Tuned-Tasks]
Data Collection Method by Dataset: Hybrid: Automated, Automatic/Sensors, Synthetic
Labeling Collection Method by Dataset: Not Applicable
Properties: Five human-teleoperated demonstrations are collected through Apple Vision Pro in Isaac Lab. All 1,000 demos are generated automatically using MimicGen [1], a synthetic motion trajectory generation framework. Each demo is generated at 20 Hz.
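The figures above imply some simple arithmetic, sketched here for convenience. The constants come from the card; the helper `step_timestamps` and the four-step example are hypothetical, and the ratio assumes the 1,000 generated demos all derive from the five human demonstrations.

```python
from typing import List

CONTROL_HZ = 20         # each demo is generated at 20 Hz
HUMAN_DEMOS = 5         # teleoperated seed demonstrations
GENERATED_DEMOS = 1000  # demos produced by MimicGen


def step_timestamps(num_steps: int, hz: int = CONTROL_HZ) -> List[float]:
    """Timestamps in seconds for the first num_steps samples of a demo."""
    return [i / hz for i in range(num_steps)]


expansion_ratio = GENERATED_DEMOS // HUMAN_DEMOS  # generated demos per human demo
step_period_s = 1 / CONTROL_HZ                    # seconds between samples
timestamps = step_timestamps(4)
print(expansion_ratio, step_period_s, timestamps)
```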
How To Use:
For information on how to use the model, please visit this repository.
Inference:
Engine: PyTorch
Test Hardware: A6000 Ada
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
Please report security vulnerabilities or NVIDIA AI Concerns here.
Model Limitations:
This model is not tested or intended for use in mission critical applications that require functional safety. The use of the model in those applications is at the user's own risk and sole responsibility, including taking the necessary steps to add needed guardrails or safety mechanisms.
Risk: Model underperformance in highly dynamic environments. Mitigation: Enhance dataset with dynamic obstacle scenarios and fine-tune models accordingly.
Risk: Integration challenges in specific customer environments. Mitigation: Provide detailed integration guides and support, leveraging NVIDIA's ecosystem.
Risk: Limited initial support for certain robot embodiments. Mitigation: Expand testing and validation across a wider range of robot platforms.
Bias
| Field | Response |
|---|---|
| Participation considerations from adversely impacted groups (protected classes) in model design and testing: | Not Applicable |
| Measures taken to mitigate against unwanted bias: | Not Applicable |
Explainability
| Field | Response |
|---|---|
| Intended Application & Domain: | Open foundation model for generalized humanoid robot reasoning and skills |
| Model Type: | Robot VLA model |
| Intended Users: | This model is intended for developers and community members who build and fine-tune robot foundation models. |
| Output: | The model outputs actions as floating-point values. This is referred to as the "robot action policy." Actions consist of continuous-value vectors that correspond to different motor controls on a robot. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable |
| Describe how the model works: | Accepts vision, language and proprioception, outputs robot action policy. |
| Technical Limitations: | Model underperformance in highly dynamic environments; integration challenges in specific customer environments; limited initial support for certain robot embodiments. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Success rate, as well as the following: 1) whether the trajectory is smooth and does not jitter; 2) whether the robot hits any other objects; 3) whether the trajectory is natural |
| Potential Known Risks: | This model is not tested or intended for use in mission critical applications that require functional safety. The use of the model in those applications is at the user's own risk and sole responsibility, including taking the necessary steps to add needed guardrails or safety mechanisms. |
| Licensing: | NSCL V1 License |
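
The performance metrics named in the table above could be computed along these lines. This is only a sketch: the card does not define how smoothness or jitter is measured, so the mean absolute second difference used here is an assumed proxy, and both function names are hypothetical.

```python
from typing import List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of evaluation episodes that succeeded."""
    if not outcomes:
        raise ValueError("no episodes to evaluate")
    return sum(outcomes) / len(outcomes)


def mean_abs_second_difference(trajectory: List[float]) -> float:
    """Assumed jitter proxy: average |x[i+1] - 2*x[i] + x[i-1]| over one joint."""
    if len(trajectory) < 3:
        raise ValueError("need at least three samples")
    diffs = [
        trajectory[i + 1] - 2 * trajectory[i] + trajectory[i - 1]
        for i in range(1, len(trajectory) - 1)
    ]
    return sum(abs(d) for d in diffs) / len(diffs)


rate = success_rate([True, True, False, True])
smooth = mean_abs_second_difference([0.0, 1.0, 2.0, 3.0])  # constant velocity
jittery = mean_abs_second_difference([0.0, 1.0, 0.0, 1.0])  # oscillating
print(rate, smooth, jittery)
```

A lower second-difference value indicates a smoother trajectory; collisions and naturalness would need separate checks that this sketch does not cover.
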
Privacy
| Field | Response |
|---|---|
| Generatable or reverse engineerable personal information? | None Known |
| Personal data used to create this model? | No |
| How often is dataset reviewed? | Before Release |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | Yes |
| Applicable Privacy Policy | NVIDIA Privacy Policy |
Safety
| Field | Response |
|---|---|
| Model Application(s): | Robot VLA - single-arm manipulation, bimanual gripper manipulation, bimanual dexterous-hand manipulation, and humanoid dexterous manipulation |
| List types of specific high-risk AI systems, if any, in which the model can be integrated: Select from the following: [Biometrics] OR [Critical infrastructure] OR [Machinery and Robotics] OR [Medical Devices] OR [Vehicles] OR [Aviation] OR [Education and vocational training] OR [Employment and Workers Management] | Machinery and Robotics |
| Describe the life critical impact (if present). | This model is not tested or intended for use in mission critical applications that require functional safety. The use of the model in those applications is at the user's own risk and sole responsibility, including taking the necessary steps to add needed guardrails or safety mechanisms. |
| Use Case Restrictions: | Abide by NSCL V1 License |
| Model and dataset restrictions: | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions on dataset access are enforced during training, and dataset license constraints are adhered to. |