---
library_name: ml-agents
tags:
- Pyramids
- deep-reinforcement-learning
- reinforcement-learning
- ML-Agents-Pyramids
---
# PPO-Pyramids Unity ML-Agents Model
## Model Description
This model is a Proximal Policy Optimization (PPO) agent trained to solve the Pyramids environment from Unity ML-Agents. Pyramids is a sparse-reward 3D navigation and puzzle-solving task in which the agent must press a switch, knock over the pyramid of blocks that it spawns, and reach the gold brick that falls from the top.
## Model Details
### Model Architecture
- **Algorithm**: Proximal Policy Optimization (PPO)
- **Framework**: Unity ML-Agents with PyTorch backend
- **Policy Type**: Actor-Critic with shared feature extraction
- **Network Architecture**:
  - Hidden Units: 512 per layer
  - Number of Layers: 2
  - Activation: ReLU (default)
  - Normalization: Disabled
  - Visual Encoding: `simple` CNN encoder (used only when camera observations are present)
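
As a rough illustration of this architecture, here is a minimal PyTorch sketch of a comparable actor-critic MLP with a shared 2Γ—512 trunk and separate policy and value heads. It is not the actual ML-Agents network (which adds observation encoders, action distributions, and normalization options); all names and sizes below are illustrative.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Illustrative actor-critic MLP: shared 2x512 trunk, separate policy/value heads."""

    def __init__(self, obs_size: int, num_actions: int, hidden_units: int = 512, num_layers: int = 2):
        super().__init__()
        layers, in_size = [], obs_size
        for _ in range(num_layers):
            layers += [nn.Linear(in_size, hidden_units), nn.ReLU()]
            in_size = hidden_units
        self.trunk = nn.Sequential(*layers)                       # shared feature extractor
        self.policy_head = nn.Linear(hidden_units, num_actions)  # action logits
        self.value_head = nn.Linear(hidden_units, 1)              # state-value estimate

    def forward(self, obs: torch.Tensor):
        features = self.trunk(obs)
        return self.policy_head(features), self.value_head(features)

# Dummy sizes for illustration: a batch of 8 observations with 148 features
logits, values = ActorCritic(obs_size=148, num_actions=5)(torch.randn(8, 148))
```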
### Environment: Pyramids
The Pyramids environment is one of Unity ML-Agents' example environments featuring:
- **Objective**: Press a switch to spawn a pyramid of blocks, knock the pyramid over, and reach the gold brick on top
- **Setting**: A walled 3D arena containing the switch, randomly placed pyramid structures, and the agent
- **Complexity**: Very sparse rewards, which makes exploration the central challenge; many identical training areas run in parallel to speed up learning
- **Observations**: Ray-cast (vector) observations in the standard variant, with an optional camera-based VisualPyramids variant
## Training Configuration
### PPO Hyperparameters
```yaml
batch_size: 128
buffer_size: 2048
learning_rate: 0.0003
beta: 0.01 # Entropy regularization
epsilon: 0.2 # PPO clipping parameter
lambd: 0.95 # GAE parameter (Ξ»); the key is spelled 'lambd' in ML-Agents configs
num_epoch: 3 # Training epochs per update
learning_rate_schedule: linear
```
### Network Settings
```yaml
normalize: false # Input normalization
hidden_units: 512 # Units per hidden layer
num_layers: 2 # Number of hidden layers
vis_encode_type: simple # Visual encoder type
```
### Reward Structure
- **Extrinsic Rewards**:
  - Gamma: 0.99 (discount factor)
  - Strength: 1.0
  - Sparse reward (+2) for reaching the gold brick
  - Small per-step penalty (-0.001) to encourage efficiency
- **Intrinsic Rewards (RND)**:
  - Random Network Distillation for exploration
  - Gamma: 0.99
  - Strength: 0.01
  - Separate network: 64 units, 3 layers
  - Learning rate: 0.0001
### Training Process
- **Max Steps**: 1,000,000 training steps
- **Time Horizon**: 128 steps per trajectory
- **Checkpoints**: `keep_checkpoints: 5` (the five most recent checkpoints are retained)
- **Summary Frequency**: Every 30,000 steps
- **Training Time**: Roughly 4-8 hours on a modern GPU, depending on hardware and the number of parallel environment areas
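
All of the settings in this section, from the PPO hyperparameters to the reward signals and checkpointing, nest under a single `behaviors` entry in the ML-Agents trainer configuration file. The sketch below reconstructs that layout from the values listed above and writes it out with PyYAML (which ships with the `mlagents` package); exact keys can vary slightly between ML-Agents releases.

```python
import yaml  # PyYAML, installed alongside mlagents

# Trainer configuration reconstructed from the values listed above
config = {
    "behaviors": {
        "Pyramids": {
            "trainer_type": "ppo",
            "hyperparameters": {
                "batch_size": 128,
                "buffer_size": 2048,
                "learning_rate": 0.0003,
                "beta": 0.01,
                "epsilon": 0.2,
                "lambd": 0.95,  # GAE Ξ»
                "num_epoch": 3,
                "learning_rate_schedule": "linear",
            },
            "network_settings": {
                "normalize": False,
                "hidden_units": 512,
                "num_layers": 2,
                "vis_encode_type": "simple",
            },
            "reward_signals": {
                "extrinsic": {"gamma": 0.99, "strength": 1.0},
                "rnd": {
                    "gamma": 0.99,
                    "strength": 0.01,
                    "network_settings": {"hidden_units": 64, "num_layers": 3},
                    "learning_rate": 0.0001,
                },
            },
            "max_steps": 1000000,
            "time_horizon": 128,
            "summary_freq": 30000,
            "keep_checkpoints": 5,
        }
    }
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```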
## Observation Space
In the standard (ray-cast) Pyramids variant, the agent receives:
- **Ray-Cast Observations**: A vector of local ray-cast results that detect the switch, the blocks, the gold brick, and walls
- **Switch State**: A flag indicating whether the switch has been pressed
- **Visual Variant**: The VisualPyramids variant replaces the ray-casts with camera observations (e.g. 84x84 RGB), which is where the `simple` visual encoder applies
## Action Space
- **Action Type**: Discrete (standard Pyramids variant)
- **Action Branches**: A single discrete branch whose actions correspond to forward/backward movement and left/right rotation
- **Verification**: The exact action spec (and observation shapes) exposed by a given build can be printed with the sketch below
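
This is a minimal sketch using the low-level `mlagents_envs` API, assuming a local Pyramids executable is available (the `file_name` value is a placeholder):

```python
from mlagents_envs.environment import UnityEnvironment

# Connect to a local Pyramids build (path is a placeholder) and reset it
env = UnityEnvironment(file_name="Pyramids", no_graphics=True)
env.reset()

behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]

# Print the observation shapes and action spec actually exposed by the build
for i, obs_spec in enumerate(spec.observation_specs):
    print(f"observation {i}: shape={obs_spec.shape}")
print("continuous actions:", spec.action_spec.continuous_size)
print("discrete branches:", spec.action_spec.discrete_branches)

env.close()
```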
## Performance Metrics
### Expected Performance
These are rough expectations for a fully trained agent; exact numbers depend on the run:
- **Goal Reaching Success Rate**: Roughly 80-95% of episodes end at the gold brick
- **Average Episode Length**: Decreases steadily as the agent learns shorter routes to the goal
- **Training Convergence**: Stable improvement in cumulative reward over the 1M training steps
- **Exploration Efficiency**: The RND bonus keeps exploration active early without dominating the extrinsic reward later
### Key Metrics Tracked
- **Cumulative Reward**: Total reward per episode
- **Success Rate**: Percentage of episodes reaching goal
- **Episode Length**: Steps to complete episode
- **Policy Entropy**: Measure of action diversity
- **Value Function Accuracy**: Critic network performance
## Technical Implementation
### PPO Algorithm Features
- **Policy Clipping**: Prevents destructive policy updates (Ξ΅=0.2)
- **Generalized Advantage Estimation**: GAE with Ξ»=0.95
- **Entropy Regularization**: Encourages exploration (Ξ²=0.01)
- **Value Function Learning**: Shared network with policy
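
For readers who want to see these pieces concretely, here is a small NumPy sketch of GAE and the clipped surrogate objective using the Ξ³, Ξ», and Ξ΅ values from this card. It is didactic only and omits details such as terminal-state masking and minibatching.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` must contain one extra bootstrap value for the state after the
    last step, so len(values) == len(rewards) + 1.
    """
    advantages = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # discounted sum of TD errors
        advantages[t] = running
    return advantages

def clipped_surrogate(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """PPO clipped policy objective (to be maximized)."""
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))

# Tiny worked example with made-up numbers
rewards = np.array([-0.001, -0.001, 2.0])
values = np.array([0.1, 0.2, 0.5, 0.0])  # includes the bootstrap value
adv = gae(rewards, values)
print(clipped_surrogate(np.log([0.3, 0.4, 0.6]), np.log([0.25, 0.5, 0.5]), adv))
```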
### Random Network Distillation (RND)
- **Purpose**: Intrinsic motivation for exploration
- **Implementation**: Separate predictor and target networks
- **Benefit**: Encourages visiting novel states
- **Balance**: Low strength (0.01) to avoid overwhelming extrinsic rewards
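
Below is a minimal PyTorch sketch of the RND mechanism described above: a frozen, randomly initialized target network and a trained predictor, with the per-state prediction error acting as the intrinsic reward. It is a conceptual illustration (observation size and batch are dummies), not the ML-Agents RND module; sizes follow the 64-unit, 3-layer setting from this card.

```python
import torch
import torch.nn as nn

def make_net(obs_size=148, hidden=64, layers=3, out=64):
    mods, in_size = [], obs_size
    for _ in range(layers):
        mods += [nn.Linear(in_size, hidden), nn.ReLU()]
        in_size = hidden
    mods.append(nn.Linear(in_size, out))
    return nn.Sequential(*mods)

target = make_net()     # fixed, randomly initialized network
predictor = make_net()  # trained to imitate the target
for p in target.parameters():
    p.requires_grad_(False)
optim = torch.optim.Adam(predictor.parameters(), lr=1e-4)

obs = torch.randn(32, 148)  # batch of dummy observations

# Intrinsic reward: per-state prediction error; novel states are poorly predicted
with torch.no_grad():
    intrinsic_reward = (predictor(obs) - target(obs)).pow(2).mean(dim=1)

# Train the predictor so familiar states yield low intrinsic reward over time
loss = (predictor(obs) - target(obs).detach()).pow(2).mean()
optim.zero_grad()
loss.backward()
optim.step()

# The scaled bonus would be added to the extrinsic reward (strength = 0.01)
total_bonus = 0.01 * intrinsic_reward
```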
### Unity ML-Agents Integration
- **Training Interface**: Python mlagents-learn command
- **Environment Communication**: Unity-Python API
- **Parallel Training**: Multiple environment instances
- **Real-time Monitoring**: TensorBoard integration
## Files and Structure
```
β”œβ”€β”€ Pyramids.onnx                  # Trained policy network
β”œβ”€β”€ Pyramids/
β”‚   β”œβ”€β”€ checkpoint-{step}.onnx     # Training checkpoints
β”‚   β”œβ”€β”€ configuration.yaml         # Training configuration
β”‚   └── run_logs/                  # Training metrics
└── results/
    β”œβ”€β”€ training_summary.json      # Training statistics
    └── tensorboard_logs/          # TensorBoard data
```
## Usage
### Loading the Model
```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

# Configure the engine (time scale, resolution, ...) and open the Pyramids build
channel = EngineConfigurationChannel()
env = UnityEnvironment(file_name="Pyramids", side_channels=[channel])
env.reset()

# When training with mlagents-learn, the environment and model are managed
# automatically; the low-level API is only needed for custom evaluation.
```
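
From here, the environment can be stepped directly with the low-level API, for example with random actions to sanity-check the setup. This sketch continues from the `env` created above.

```python
# Continues from the `env` created above: step the environment with random actions
behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]

for _ in range(100):
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    if len(decision_steps) > 0:
        # Sample a random action for every agent currently requesting a decision
        actions = spec.action_spec.random_action(len(decision_steps))
        env.set_actions(behavior_name, actions)
    env.step()

env.close()
```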
### Training Command
```bash
mlagents-learn config.yaml --env=Pyramids --run-id=pyramids_run_01
```
### Resume the training
```bash
mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume
```
### Inference
The exported `Pyramids.onnx` can be used directly inside Unity by assigning it to the agent's Behavior Parameters > Model field, or evaluated from Python as sketched below.
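
For evaluation outside Unity, the exported ONNX policy can be loaded with an ONNX runtime. The sketch below assumes the `onnxruntime` package is installed (it is not in the dependency list further down) and deliberately inspects input/output names instead of hard-coding them, since they vary between ML-Agents versions.

```python
import onnxruntime as ort

# Load the exported policy and inspect its interface before wiring up observations
session = ort.InferenceSession("Pyramids.onnx")

for inp in session.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)

# To run the policy, build a feed dict mapping each input name to a NumPy array
# of the matching shape (observations, plus an action mask for discrete actions),
# then call session.run(None, feed).
```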
## Limitations and Considerations
1. **Environment Specific**: Trained specifically for the Pyramids environment layout
2. **Observation Dependency**: Performance is tied to the observation setup used during training (ray-casts or camera)
3. **Exploration Balance**: RND parameters may need tuning for different scenarios
4. **Computational Requirements**: Requires GPU for efficient training
5. **Generalization**: May not transfer well to significantly different navigation tasks
## Optimization Suggestions
For improved performance, consider the following configuration changes (a sketch of applying them follows the list):
- **Enable normalization**: `normalize: true`
- **Increase network capacity**: `hidden_units: 768`
- **Longer time horizon**: `time_horizon: 256`
- **Higher batch size**: `batch_size: 256`
- **More training steps**: `max_steps: 2000000`
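
These overrides can be applied to an existing `config.yaml` programmatically. A small sketch, assuming the standard `behaviors` β†’ `Pyramids` layout shown earlier and PyYAML:

```python
import yaml

# Load the existing trainer config and apply the suggested overrides
with open("config.yaml") as f:
    config = yaml.safe_load(f)

pyramids = config["behaviors"]["Pyramids"]
pyramids["network_settings"]["normalize"] = True
pyramids["network_settings"]["hidden_units"] = 768
pyramids["hyperparameters"]["batch_size"] = 256
pyramids["time_horizon"] = 256
pyramids["max_steps"] = 2000000

with open("config_improved.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```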
## Applications
- **Game AI**: Intelligent NPC navigation in 3D games
- **Robotics Research**: Transfer learning for robot navigation
- **Pathfinding**: Advanced pathfinding algorithm development
- **Educational**: Demonstration of RL in complex 3D environments
## Ethical Considerations
This model represents a benign navigation task with no ethical concerns:
- **Content**: Abstract geometric environment
- **Purpose**: Educational and research applications
- **Safety**: No real-world safety implications
## System Requirements
### Training
- **OS**: Windows 10+, macOS 10.14+, Ubuntu 18.04+
- **GPU**: NVIDIA GPU with CUDA support (recommended)
- **RAM**: 8GB minimum, 16GB recommended
- **Storage**: 2GB for environment and model files
### Dependencies
```
mlagents>=0.28.0
mlagents-envs>=0.28.0
torch>=1.8.0
tensorboard
numpy
```
## Citation
If you use this model, please cite:
```bibtex
@misc{ppo-pyramids-2024,
  title={PPO-Pyramids: Navigation Agent for Unity ML-Agents},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face Hub},
  url={https://huggingface.co/Adilbai/ppo-pyramids}
}
```
## References
- Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
- Burda, Y., et al. (2018). Exploration by Random Network Distillation. arXiv:1810.12894
- Unity Technologies. ML-Agents Toolkit Documentation
- Juliani, A., et al. (2018). Unity: A General Platform for Intelligent Agents. arXiv:1809.02627
## Training Logs and Monitoring
Monitor training progress through:
- **TensorBoard**: Real-time training metrics
- **Console Output**: Episode rewards and statistics
- **Checkpoint Analysis**: Model performance over time
- **Success Rate Tracking**: Goal completion percentage
---
*For optimal results, consider using the improved configuration with normalization enabled and increased network capacity. πŸ—οΈπŸŽ―*