---
library_name: ml-agents
tags:
- Pyramids
- deep-reinforcement-learning
- reinforcement-learning
- ML-Agents-Pyramids
---

# PPO-Pyramids Unity ML-Agents Model

## Model Description

This model is a Proximal Policy Optimization (PPO) agent trained on the Pyramids environment from Unity ML-Agents. Pyramids is a sparse-reward 3D navigation and puzzle-solving task: the agent must press a button to spawn a pyramid of blocks, knock the pyramid over, and then reach the gold brick that was sitting on top. Because the extrinsic reward is sparse, training also uses an intrinsic exploration signal (Random Network Distillation).

## Model Details

### Model Architecture

- **Algorithm**: Proximal Policy Optimization (PPO)
- **Framework**: Unity ML-Agents with PyTorch backend
- **Policy Type**: Actor-critic with shared feature extraction
- **Network Architecture** (sketched below):
  - Hidden units: 512 per layer
  - Number of layers: 2
  - Activation: ReLU (default)
  - Normalization: disabled
  - Visual encoder: `simple` (a small CNN, only used when camera observations are present)

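For illustration, the sketch below shows roughly what a two-layer, 512-unit actor-critic network for vector observations looks like in PyTorch. It is a simplified stand-in rather than the exact network ML-Agents constructs internally; `obs_size` and `n_actions` are placeholder values.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Illustrative actor-critic with a shared 2 x 512 MLP body."""

    def __init__(self, obs_size: int, n_actions: int, hidden_units: int = 512):
        super().__init__()
        # Shared feature extractor: two hidden layers of 512 units with ReLU.
        self.body = nn.Sequential(
            nn.Linear(obs_size, hidden_units), nn.ReLU(),
            nn.Linear(hidden_units, hidden_units), nn.ReLU(),
        )
        # Policy head: logits over a single discrete action branch.
        self.policy_head = nn.Linear(hidden_units, n_actions)
        # Value head: scalar state-value estimate for the critic.
        self.value_head = nn.Linear(hidden_units, 1)

    def forward(self, obs: torch.Tensor):
        features = self.body(obs)
        return self.policy_head(features), self.value_head(features)

# Placeholder sizes: 148 ray-cast observations, 4 discrete actions.
model = ActorCritic(obs_size=148, n_actions=4)
logits, value = model(torch.zeros(1, 148))
```
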
### Environment: Pyramids

Pyramids is one of Unity ML-Agents' example environments, featuring:

- **Objective**: Press a button to spawn a pyramid of blocks, knock the pyramid over, and reach the gold brick that was on top
- **Setting**: A 3D arena with walls, stone obstacles, and randomly placed buttons and pyramids
- **Complexity**: Sparse rewards that make exploration hard; the training scene contains multiple agent areas so experience is collected in parallel
- **Observations**: Ray-cast perception of the surroundings by default; a visual variant of the environment provides camera observations instead

## Training Configuration

### PPO Hyperparameters

```yaml
batch_size: 128
buffer_size: 2048
learning_rate: 0.0003
beta: 0.01       # Entropy regularization
epsilon: 0.2     # PPO clipping parameter
lambd: 0.95      # GAE parameter
num_epoch: 3     # Training epochs per update
learning_rate_schedule: linear
```

### Network Settings

```yaml
normalize: false        # Input normalization
hidden_units: 512       # Units per hidden layer
num_layers: 2           # Number of hidden layers
vis_encode_type: simple # Visual encoder type
```

### Reward Structure

- **Extrinsic Rewards**:
  - Gamma: 0.99 (discount factor)
  - Strength: 1.0
  - Sparse reward (+2) for reaching the gold brick
  - Small per-step penalty (-0.001) to encourage efficiency

- **Intrinsic Rewards (RND)**:
  - Random Network Distillation for exploration
  - Gamma: 0.99
  - Strength: 0.01
  - Separate network: 64 hidden units, 3 layers
  - Learning rate: 0.0001
  - The corresponding `reward_signals` configuration is sketched after this list

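In an ML-Agents trainer configuration both signals live under `reward_signals`. A sketch consistent with the values above (the exact file used for this run may differ):

```yaml
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0
  rnd:
    gamma: 0.99
    strength: 0.01
    network_settings:
      hidden_units: 64
      num_layers: 3
    learning_rate: 0.0001
```
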
### Training Process

- **Max Steps**: 1,000,000 training steps
- **Time Horizon**: 128 steps per trajectory
- **Checkpoints**: the 5 most recent checkpoints are kept (`keep_checkpoints: 5`)
- **Summary Frequency**: every 30,000 steps
- **Training Time**: approximately 4-8 hours on a modern GPU
- A skeleton of the full trainer configuration is shown below

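Put together, the trainer configuration for this run has roughly the following shape. The section bodies are abbreviated; see the blocks above for the full values, and note this is a reconstruction from them rather than the exact file:

```yaml
behaviors:
  Pyramids:
    trainer_type: ppo
    hyperparameters:
      # ... values from "PPO Hyperparameters" above ...
      batch_size: 128
      buffer_size: 2048
      learning_rate: 0.0003
    network_settings:
      # ... values from "Network Settings" above ...
      hidden_units: 512
      num_layers: 2
    reward_signals:
      # ... values from "Reward Structure" above ...
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 1000000
    time_horizon: 128
    keep_checkpoints: 5
    summary_freq: 30000
```
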
## Observation Space

In the standard Pyramids environment the agent receives:

- **Vector Observations**: 148 values from ray-casts detecting the button, blocks, the gold brick, and walls, plus a variable indicating whether the button has been pressed
- **Visual Observations**: used only in the visual variant of the environment, which substitutes camera images (for example 84x84 RGB) for the ray-casts

## Action Space

- **Action Type**: Discrete (a single action branch)
- **Actions**:
  - Move forward / backward
  - Rotate left / right
- The behavior spec can be inspected programmatically, as shown below

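The exact observation and action shapes can be read from the environment itself through the `mlagents_envs` low-level API. A short sketch; the `file_name` path is a placeholder for your local Pyramids build:

```python
from mlagents_envs.environment import UnityEnvironment

# Connect to a Pyramids build (the path is a placeholder).
env = UnityEnvironment(file_name="./Pyramids/Pyramids.x86_64")
env.reset()

for behavior_name, spec in env.behavior_specs.items():
    print("Behavior:", behavior_name)
    # One entry per sensor (ray-casts, cameras, ...).
    for obs_spec in spec.observation_specs:
        print("  observation shape:", obs_spec.shape)
    # Sizes of the discrete branches and the continuous action vector.
    print("  discrete branches:", spec.action_spec.discrete_branches)
    print("  continuous size:", spec.action_spec.continuous_size)

env.close()
```
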
## Performance Metrics

### Expected Performance

- **Goal Reaching Success Rate**: 80-95%
- **Average Episode Length**: decreases over training as the agent learns more direct routes to the button and the gold brick
- **Training Convergence**: Stable improvement over 1M steps
- **Exploration Efficiency**: Balanced exploration vs. exploitation

### Key Metrics Tracked

- **Cumulative Reward**: Total reward per episode
- **Success Rate**: Percentage of episodes reaching the goal
- **Episode Length**: Steps to complete an episode
- **Policy Entropy**: Measure of action diversity
- **Value Loss**: Accuracy of the critic's value estimates

## Technical Implementation

### PPO Algorithm Features

- **Policy Clipping**: Prevents destructive policy updates (ε = 0.2)
- **Generalized Advantage Estimation**: GAE with λ = 0.95
- **Entropy Regularization**: Encourages exploration (β = 0.01)
- **Value Function Learning**: Shared network with policy

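As a concrete reference for these settings, the clipped surrogate objective and GAE can be written in a few lines of PyTorch. This is a textbook sketch of the math PPO optimizes, not ML-Agents' internal implementation:

```python
import torch

def ppo_policy_loss(new_logp, old_logp, advantages, epsilon=0.2):
    """Clipped PPO surrogate objective (returned as a loss to minimize)."""
    ratio = torch.exp(new_logp - old_logp)                    # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single trajectory."""
    advantages = torch.zeros_like(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        last_adv = delta + gamma * lam * not_done * last_adv
        advantages[t] = last_adv
    return advantages
```
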
### Random Network Distillation (RND)

- **Purpose**: Intrinsic motivation for exploration
- **Implementation**: Separate predictor and target networks
- **Benefit**: Encourages visiting novel states
- **Balance**: Low strength (0.01) to avoid overwhelming extrinsic rewards

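The mechanism is compact enough to sketch: a fixed, randomly initialized target network defines features that a predictor network learns to reproduce, and the prediction error on a state serves as its novelty bonus. An illustrative sketch with placeholder sizes, not the ML-Agents implementation:

```python
import torch
import torch.nn as nn

def make_net(obs_size: int, out_size: int = 64, hidden: int = 64, layers: int = 3):
    """Small MLP matching the 64-unit, 3-layer RND setting above."""
    mods, in_size = [], obs_size
    for _ in range(layers):
        mods += [nn.Linear(in_size, hidden), nn.ReLU()]
        in_size = hidden
    mods.append(nn.Linear(in_size, out_size))
    return nn.Sequential(*mods)

obs_size = 148                      # placeholder observation size
target = make_net(obs_size)         # fixed random network, never trained
predictor = make_net(obs_size)      # trained to imitate the target
for p in target.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    """Novelty bonus = predictor's error on the random target features."""
    error = (predictor(obs) - target(obs)).pow(2).mean(dim=-1)
    # Training the predictor on the same error makes familiar states lose their bonus.
    optimizer.zero_grad()
    error.mean().backward()
    optimizer.step()
    return error.detach()
```
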
### Unity ML-Agents Integration

- **Training Interface**: the `mlagents-learn` command-line tool
- **Environment Communication**: Unity-Python API (`mlagents_envs`)
- **Parallel Training**: multiple environment instances via `--num-envs`
- **Real-time Monitoring**: TensorBoard integration

## Files and Structure

```
├── Pyramids.onnx                  # Trained policy network
├── Pyramids/
│   ├── checkpoint-{step}.onnx     # Training checkpoints
│   ├── configuration.yaml         # Training configuration
│   └── run_logs/                  # Training metrics
└── results/
    ├── training_summary.json      # Training statistics
    └── tensorboard_logs/          # TensorBoard data
```

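If you want to use the exported policy outside Unity, `Pyramids.onnx` can be opened with `onnxruntime`. Input and output tensor names vary between ML-Agents versions, so the sketch below simply lists them instead of assuming any:

```python
import onnxruntime as ort

# Open the exported policy network.
session = ort.InferenceSession("Pyramids.onnx")

# ML-Agents ONNX exports name their tensors differently across versions,
# so inspect what this particular model expects before wiring up inference.
for inp in session.get_inputs():
    print("input :", inp.name, inp.shape)
for out in session.get_outputs():
    print("output:", out.name, out.shape)
```
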
## Usage

### Loading the Model

```python
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import EngineConfigurationChannel

# Configure and connect to a Pyramids build.
channel = EngineConfigurationChannel()
env = UnityEnvironment(file_name="Pyramids", side_channels=[channel])
env.reset()

# When training with mlagents-learn, the policy network itself is created and
# loaded by the trainer; this snippet only opens the environment.
```

### Training Command

```bash
mlagents-learn config.yaml --env=Pyramids --run-id=pyramids_run_01
```

### Resuming Training

```bash
mlagents-learn <your_configuration_file_path.yaml> --run-id=<run_id> --resume
```

### Inference

```python
# The trained model can be used directly in Unity builds (assign the .onnx file
# in the agent's Behavior Parameters component) or evaluated from Python through
# the ML-Agents low-level API, as sketched below.
```

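For a Python-side evaluation loop, the `mlagents_envs` low-level API can step the environment directly. The sketch below drives the agent with random actions purely to show the decide/act/step cycle; substitute your own policy (for example the ONNX model) where the actions are generated. The build path is a placeholder:

```python
from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(file_name="./Pyramids/Pyramids.x86_64")  # placeholder path
env.reset()
behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]

for _ in range(500):
    # Agents waiting for an action this step.
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    if len(decision_steps) > 0:
        # Random actions stand in for a real policy in this sketch.
        action = spec.action_spec.random_action(len(decision_steps))
        env.set_actions(behavior_name, action)
    env.step()

env.close()
```
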
## Limitations and Considerations

1. **Environment Specific**: Trained specifically for the Pyramids environment layout
2. **Observation Dependency**: Performance is tied to the observation setup (ray-casts or camera) used during training
3. **Exploration Balance**: RND parameters may need tuning for different scenarios
4. **Computational Requirements**: Requires a GPU for efficient training
5. **Generalization**: May not transfer well to significantly different navigation tasks

## Optimization Suggestions

For improved performance, consider the following changes (sketched as YAML below):

- **Enable normalization**: `normalize: true`
- **Increase network capacity**: `hidden_units: 768`
- **Longer time horizon**: `time_horizon: 256`
- **Higher batch size**: `batch_size: 256`
- **More training steps**: `max_steps: 2000000`

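These overrides map directly onto trainer configuration keys; one way they could appear in the config (other values left at the settings documented above):

```yaml
behaviors:
  Pyramids:
    trainer_type: ppo
    hyperparameters:
      batch_size: 256
      buffer_size: 2048
      learning_rate: 0.0003
    network_settings:
      normalize: true
      hidden_units: 768
      num_layers: 2
    max_steps: 2000000
    time_horizon: 256
```
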
## Applications

- **Game AI**: Intelligent NPC navigation in 3D games
- **Robotics Research**: Transfer learning for robot navigation
- **Pathfinding**: Advanced pathfinding algorithm development
- **Educational**: Demonstration of RL in complex 3D environments

## Ethical Considerations

This model addresses a benign navigation task and raises no notable ethical concerns:

- **Content**: Abstract geometric environment
- **Purpose**: Educational and research applications
- **Safety**: No real-world safety implications

## System Requirements

### Training

- **OS**: Windows 10+, macOS 10.14+, Ubuntu 18.04+
- **GPU**: NVIDIA GPU with CUDA support (recommended)
- **RAM**: 8 GB minimum, 16 GB recommended
- **Storage**: 2 GB for environment and model files

### Dependencies

```
mlagents>=0.28.0
torch>=1.8.0
tensorboard
numpy
```

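A typical installation into a fresh virtual environment might look like the following; the package name is `mlagents` on PyPI, and the version pin mirrors the requirement above:

```bash
python -m venv .venv && source .venv/bin/activate
pip install "mlagents>=0.28.0" torch tensorboard numpy
```
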
## Citation

If you use this model, please cite:

```bibtex
@misc{ppo-pyramids-2024,
  title={PPO-Pyramids: Navigation Agent for Unity ML-Agents},
  author={Adilbai},
  year={2024},
  publisher={Hugging Face Hub},
  url={https://huggingface.co/Adilbai/ppo-pyramids}
}
```

## References

- Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
- Burda, Y., et al. (2018). Exploration by Random Network Distillation. arXiv:1810.12894
- Unity Technologies. ML-Agents Toolkit Documentation
- Juliani, A., et al. (2018). Unity: A General Platform for Intelligent Agents. arXiv:1809.02627

## Training Logs and Monitoring

Monitor training progress through (the TensorBoard command is shown below):

- **TensorBoard**: Real-time training metrics
- **Console Output**: Episode rewards and statistics
- **Checkpoint Analysis**: Model performance over time
- **Success Rate Tracking**: Goal completion percentage

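ML-Agents writes TensorBoard summaries under its results directory by default, so monitoring is usually just the following (adjust the path if you changed the results directory):

```bash
tensorboard --logdir results --port 6006
```
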
---

*For optimal results, consider using the improved configuration with normalization enabled and increased network capacity.*