---
library_name: ml-agents
tags:
- Pyramids
- deep-reinforcement-learning
- reinforcement-learning
- ML-Agents-Pyramids
---

# PPO-Pyramids Unity ML-Agents Model

## Model Description

This model is a Proximal Policy Optimization (PPO) agent trained on the Pyramids environment from Unity ML-Agents. Pyramids is a sparse-reward 3D navigation and puzzle-solving task: the agent must press a button to spawn a pyramid of blocks, knock the pyramid over, and then reach the gold brick that was sitting on top. Because the extrinsic reward is sparse, training also uses an intrinsic exploration signal (Random Network Distillation).

## Model Details

### Model Architecture

- **Algorithm**: Proximal Policy Optimization (PPO)
- **Framework**: Unity ML-Agents with PyTorch backend
- **Policy Type**: Actor-critic with shared feature extraction
- **Network Architecture** (sketched below):
  - Hidden units: 512 per layer
  - Number of layers: 2
  - Activation: ReLU (default)
  - Normalization: disabled
  - Visual encoder: `simple` (a small CNN, only used when camera observations are present)

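For illustration, the sketch below shows roughly what a two-layer, 512-unit actor-critic network for vector observations looks like in PyTorch. It is a simplified stand-in rather than the exact network ML-Agents constructs internally; `obs_size` and `n_actions` are placeholder values.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Illustrative actor-critic with a shared 2 x 512 MLP body."""

    def __init__(self, obs_size: int, n_actions: int, hidden_units: int = 512):
        super().__init__()
        # Shared feature extractor: two hidden layers of 512 units with ReLU.
        self.body = nn.Sequential(
            nn.Linear(obs_size, hidden_units), nn.ReLU(),
            nn.Linear(hidden_units, hidden_units), nn.ReLU(),
        )
        # Policy head: logits over a single discrete action branch.
        self.policy_head = nn.Linear(hidden_units, n_actions)
        # Value head: scalar state-value estimate for the critic.
        self.value_head = nn.Linear(hidden_units, 1)

    def forward(self, obs: torch.Tensor):
        features = self.body(obs)
        return self.policy_head(features), self.value_head(features)

# Placeholder sizes: 148 ray-cast observations, 4 discrete actions.
model = ActorCritic(obs_size=148, n_actions=4)
logits, value = model(torch.zeros(1, 148))
```
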
### Environment: Pyramids

Pyramids is one of Unity ML-Agents' example environments, featuring:

- **Objective**: Press a button to spawn a pyramid of blocks, knock the pyramid over, and reach the gold brick that was on top
- **Setting**: A 3D arena with walls, stone obstacles, and randomly placed buttons and pyramids
- **Complexity**: Sparse rewards that make exploration hard; the training scene contains multiple agent areas so experience is collected in parallel
- **Observations**: Ray-cast perception of the surroundings by default; a visual variant of the environment provides camera observations instead

## Training Configuration

### PPO Hyperparameters

```yaml
batch_size: 128
buffer_size: 2048
learning_rate: 0.0003
beta: 0.01       # Entropy regularization
epsilon: 0.2     # PPO clipping parameter
lambd: 0.95      # GAE parameter
num_epoch: 3     # Training epochs per update
learning_rate_schedule: linear
```

### Network Settings

```yaml
normalize: false        # Input normalization
hidden_units: 512       # Units per hidden layer
num_layers: 2           # Number of hidden layers
vis_encode_type: simple # Visual encoder type
```

### Reward Structure

- **Extrinsic Rewards**:
  - Gamma: 0.99 (discount factor)
  - Strength: 1.0
  - Sparse reward (+2) for reaching the gold brick
  - Small per-step penalty (-0.001) to encourage efficiency

- **Intrinsic Rewards (RND)**:
  - Random Network Distillation for exploration
  - Gamma: 0.99
  - Strength: 0.01
  - Separate network: 64 hidden units, 3 layers
  - Learning rate: 0.0001
  - The corresponding `reward_signals` configuration is sketched after this list

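In an ML-Agents trainer configuration both signals live under `reward_signals`. A sketch consistent with the values above (the exact file used for this run may differ):

```yaml
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
  rnd:
    gamma: 0.99
    strength: 0.01
    network_settings:
      hidden_units: 64
      num_layers: 3
    learning_rate: 0.0001
```
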
### Training Process

- **Max Steps**: 1,000,000 training steps
- **Time Horizon**: 128 steps per trajectory
- **Checkpoints**: the 5 most recent checkpoints are kept (`keep_checkpoints: 5`)
- **Summary Frequency**: every 30,000 steps
- **Training Time**: approximately 4-8 hours on a modern GPU
- A skeleton of the full trainer configuration is shown below

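Put together, the trainer configuration for this run has roughly the following shape. The section bodies are abbreviated; see the blocks above for the full values, and note this is a reconstruction from them rather than the exact file:

```yaml
behaviors:
  Pyramids:
    trainer_type: ppo
    hyperparameters:
      # ... values from "PPO Hyperparameters" above ...
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
    network_settings:
      # ... values from "Network Settings" above ...
      hidden_units: 512
      num_layers: 2
    reward_signals:
      # ... values from "Reward Structure" above ...
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 1000000
    time_horizon: 128
    keep_checkpoints: 5
    summary_freq: 30000
```
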
## Observation Space

In the standard Pyramids environment the agent receives:

- **Vector Observations**: 148 values from ray-casts detecting the button, blocks, the gold brick, and walls, plus a variable indicating whether the button has been pressed
- **Visual Observations**: used only in the visual variant of the environment, which substitutes camera images (for example 84x84 RGB) for the ray-casts

## Action Space

- **Action Type**: Discrete (a single action branch)
- **Actions**:
  - Move forward / backward
  - Rotate left / right
- The behavior spec can be inspected programmatically, as shown below

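The exact observation and action shapes can be read from the environment itself through the `mlagents_envs` low-level API. A short sketch; the `file_name` path is a placeholder for your local Pyramids build:

```python
from mlagents_envs.environment import UnityEnvironment

# Connect to a Pyramids build (the path is a placeholder).
env = UnityEnvironment(file_name="./Pyramids/Pyramids.x86_64")
env.reset()

for behavior_name, spec in env.behavior_specs.items():
    print("Behavior:", behavior_name)
    # One entry per sensor (ray-casts, cameras, ...).
    for obs_spec in spec.observation_specs:
        print("  observation shape:", obs_spec.shape)
    # Sizes of the discrete branches and the continuous action vector.
    print("  discrete branches:", spec.action_spec.discrete_branches)
    print("  continuous size:", spec.action_spec.continuous_size)

env.close()
```
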
## Performance Metrics

### Expected Performance

- **Goal Reaching Success Rate**: 80-95%
- **Average Episode Length**: decreases over training as the agent learns more direct routes to the button and the gold brick
- **Training Convergence**: Stable improvement over 1M steps
- **Exploration Efficiency**: Balanced exploration vs. exploitation

### Key Metrics Tracked

- **Cumulative Reward**: Total reward per episode
- **Success Rate**: Percentage of episodes reaching the goal
- **Episode Length**: Steps to complete an episode
- **Policy Entropy**: Measure of action diversity
- **Value Loss**: Accuracy of the critic's value estimates

## Technical Implementation

### PPO Algorithm Features

- **Policy Clipping**: Prevents destructive policy updates (ε = 0.2)
- **Generalized Advantage Estimation**: GAE with λ = 0.95
- **Entropy Regularization**: Encourages exploration (β = 0.01)
- **Value Function Learning**: Shared network with policy

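As a concrete reference for these settings, the clipped surrogate objective and GAE can be written in a few lines of PyTorch. This is a textbook sketch of the math PPO optimizes, not ML-Agents' internal implementation:

```python
import torch

def ppo_policy_loss(new_logp, old_logp, advantages, epsilon=0.2):
    """Clipped PPO surrogate objective (returned as a loss to minimize)."""
    ratio = torch.exp(new_logp - old_logp)                    # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single trajectory."""
    advantages = torch.zeros_like(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        last_adv = delta + gamma * lam * not_done * last_adv
        advantages[t] = last_adv
    return advantages
```
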
### Random Network Distillation (RND)

- **Purpose**: Intrinsic motivation for exploration
- **Implementation**: Separate predictor and target networks
- **Benefit**: Encourages visiting novel states
- **Balance**: Low strength (0.01) to avoid overwhelming extrinsic rewards

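The mechanism is compact enough to sketch: a fixed, randomly initialized target network defines features that a predictor network learns to reproduce, and the prediction error on a state serves as its novelty bonus. An illustrative sketch with placeholder sizes, not the ML-Agents implementation:

```python
import torch
import torch.nn as nn

def make_net(obs_size: int, out_size: int = 64, hidden: int = 64, layers: int = 3):
    """Small MLP matching the 64-unit, 3-layer RND setting above."""
    mods, in_size = [], obs_size
    for _ in range(layers):
        mods += [nn.Linear(in_size, hidden), nn.ReLU()]
        in_size = hidden
    mods.append(nn.Linear(in_size, out_size))
    return nn.Sequential(*mods)

obs_size = 148                      # placeholder observation size
target = make_net(obs_size)         # fixed random network, never trained
predictor = make_net(obs_size)      # trained to imitate the target
for p in target.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    """Novelty bonus = predictor's error on the random target features."""
    error = (predictor(obs) - target(obs)).pow(2).mean(dim=-1)
    # Training the predictor on the same error makes familiar states lose their bonus.
    optimizer.zero_grad()
    error.mean().backward()
    optimizer.step()
    return error.detach()
```
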
### Unity ML-Agents Integration

- **Training Interface**: the `mlagents-learn` command-line tool
- **Environment Communication**: Unity-Python API (`mlagents_envs`)
- **Parallel Training**: multiple environment instances via `--num-envs`
- **Real-time Monitoring**: TensorBoard integration

## Files and Structure

```
├── Pyramids.onnx                  # Trained policy network
├── Pyramids/
│   ├── checkpoint-{step}.onnx     # Training checkpoints
│   ├── configuration.yaml         # Training configuration
│   └── run_logs/                  # Training metrics
└── results/
    ├── training_summary.json      # Training statistics
    └── tensorboard_logs/          # TensorBoard data
```

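If you want to use the exported policy outside Unity, `Pyramids.onnx` can be opened with `onnxruntime`. Input and output tensor names vary between ML-Agents versions, so the sketch below simply lists them instead of assuming any:

```python
import onnxruntime as ort

# Open the exported policy network.
session = ort.InferenceSession("Pyramids.onnx")

# ML-Agents ONNX exports name their tensors differently across versions,
# so inspect what this particular model expects before wiring up inference.
for inp in session.get_inputs():
    print("input :", inp.name, inp.shape)
for out in session.get_outputs():
    print("output:", out.name, out.shape)
```
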
## Usage

### Loading the Model

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

# Configure and connect to a Pyramids build.
channel = EngineConfigurationChannel()
env = UnityEnvironment(file_name="Pyramids", side_channels=[channel])
env.reset()

# When training with mlagents-learn, the policy network itself is created and
# loaded by the trainer; this snippet only opens the environment.
```

### Training Command

```bash
mlagents-learn config.yaml --env=Pyramids --run-id=pyramids_run_01
```

### Resuming Training

```bash
mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume
```

### Inference

```python
# The trained model can be used directly in Unity builds (assign the .onnx file
# in the agent's Behavior Parameters component) or evaluated from Python through
# the ML-Agents low-level API, as sketched below.
```

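For a Python-side evaluation loop, the `mlagents_envs` low-level API can step the environment directly. The sketch below drives the agent with random actions purely to show the decide/act/step cycle; substitute your own policy (for example the ONNX model) where the actions are generated. The build path is a placeholder:

```python
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="./Pyramids/Pyramids.x86_64")  # placeholder path
env.reset()
behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]

for _ in range(500):
    # Agents waiting for an action this step.
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    if len(decision_steps) > 0:
        # Random actions stand in for a real policy in this sketch.
        action = spec.action_spec.random_action(len(decision_steps))
        env.set_actions(behavior_name, action)
    env.step()

env.close()
```
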
## Limitations and Considerations

1. **Environment Specific**: Trained specifically for the Pyramids environment layout
2. **Observation Dependency**: Performance is tied to the observation setup (ray-casts or camera) used during training
3. **Exploration Balance**: RND parameters may need tuning for different scenarios
4. **Computational Requirements**: Requires a GPU for efficient training
5. **Generalization**: May not transfer well to significantly different navigation tasks

## Optimization Suggestions

For improved performance, consider the following changes (sketched as YAML below):

- **Enable normalization**: `normalize: true`
- **Increase network capacity**: `hidden_units: 768`
- **Longer time horizon**: `time_horizon: 256`
- **Higher batch size**: `batch_size: 256`
- **More training steps**: `max_steps: 2000000`

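These overrides map directly onto trainer configuration keys; one way they could appear in the config (other values left at the settings documented above):

```yaml
behaviors:
  Pyramids:
    trainer_type: ppo
    hyperparameters:
      batch_size: 256
      buffer_size: 2048
      learning_rate: 0.0003
    network_settings:
      normalize: true
      hidden_units: 768
      num_layers: 2
    max_steps: 2000000
    time_horizon: 256
```
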
## Applications

- **Game AI**: Intelligent NPC navigation in 3D games
- **Robotics Research**: Transfer learning for robot navigation
- **Pathfinding**: Advanced pathfinding algorithm development
- **Educational**: Demonstration of RL in complex 3D environments

## Ethical Considerations

This model addresses a benign navigation task and raises no notable ethical concerns:

- **Content**: Abstract geometric environment
- **Purpose**: Educational and research applications
- **Safety**: No real-world safety implications

## System Requirements

### Training

- **OS**: Windows 10+, macOS 10.14+, Ubuntu 18.04+
- **GPU**: NVIDIA GPU with CUDA support (recommended)
- **RAM**: 8 GB minimum, 16 GB recommended
- **Storage**: 2 GB for environment and model files

### Dependencies

```
mlagents>=0.28.0
torch>=1.8.0
tensorboard
numpy
```

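A typical installation into a fresh virtual environment might look like the following; the package name is `mlagents` on PyPI, and the version pin mirrors the requirement above:

```bash
python -m venv .venv && source .venv/bin/activate
pip install "mlagents>=0.28.0" torch tensorboard numpy
```
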
## Citation

If you use this model, please cite:

```bibtex
@misc{ppo-pyramids-2024,
  title={PPO-Pyramids: Navigation Agent for Unity ML-Agents},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face Hub},
  url={https://huggingface.co/Adilbai/ppo-pyramids}
}
```

## References

- Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
- Burda, Y., et al. (2018). Exploration by Random Network Distillation. arXiv:1810.12894
- Unity Technologies. ML-Agents Toolkit Documentation
- Juliani, A., et al. (2018). Unity: A General Platform for Intelligent Agents. arXiv:1809.02627

## Training Logs and Monitoring

Monitor training progress through (the TensorBoard command is shown below):

- **TensorBoard**: Real-time training metrics
- **Console Output**: Episode rewards and statistics
- **Checkpoint Analysis**: Model performance over time
- **Success Rate Tracking**: Goal completion percentage

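ML-Agents writes TensorBoard summaries under its results directory by default, so monitoring is usually just the following (adjust the path if you changed the results directory):

```bash
tensorboard --logdir results --port 6006
```
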
---

*For optimal results, consider using the improved configuration with normalization enabled and increased network capacity.*