---
license: mit
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- infrastructure-as-code
- terraform
- kubernetes
- docker
- devops
- iac
- dapo
- reinforcement-learning
- fine-tuned
base_model: srallabandi0225/inframind-0.5b-grpo
datasets:
- custom
model-index:
- name: inframind-dapo
results:
- task:
type: text-generation
name: IaC Generation
dataset:
name: InfraMind-Bench
type: custom
metrics:
- type: accuracy
value: 96.4
name: DAPO Accuracy
---
# InfraMind-DAPO: Infrastructure-as-Code Model with Direct Advantage Policy Optimization
**InfraMind-DAPO** is a 0.5B-parameter language model fine-tuned for Infrastructure-as-Code (IaC) generation using **DAPO (Direct Advantage Policy Optimization)**, a reinforcement learning technique that builds on GRPO.
## Model Description
| Attribute | Value |
|-----------|-------|
| **Base Model** | [inframind-0.5b-grpo](https://huggingface.co/srallabandi0225/inframind-0.5b-grpo) |
| **Original Base** | Qwen/Qwen2.5-0.5B-Instruct |
| **Parameters** | 500M |
| **Training Method** | DAPO (Direct Advantage Policy Optimization) |
| **Domain** | Infrastructure-as-Code |
| **License** | MIT |
### Training Pipeline
```
Qwen2.5-0.5B-Instruct → GRPO Training → inframind-grpo → DAPO Training → inframind-dapo
                        (Stage 1)                        (Stage 2 - this model)
```
This model is the **second stage** of InfraMind training: it starts from the GRPO-trained checkpoint and applies the DAPO techniques described below.
## What is DAPO?
**Direct Advantage Policy Optimization (DAPO)** is an advanced RL algorithm that improves upon GRPO with four key innovations:
| Innovation | Description | Benefit |
|------------|-------------|---------|
| **Clip-Higher** | Asymmetric clipping (ε_low=0.2, ε_high=0.28) | Allows high-advantage tokens to be reinforced more strongly |
| **Dynamic Sampling** | Skip batches with uniform rewards | Prevents entropy collapse, maintains exploration |
| **Token-Level Loss** | Per-token policy gradient | Finer-grained credit assignment |
| **Overlong Punishment** | Soft length penalty | Prevents verbose, repetitive outputs |
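
The Clip-Higher and token-level-loss rows above combine into a PPO-style objective with asymmetric clipping applied per token. A minimal sketch of that objective, assuming standard PyTorch tensors of shape `(batch, seq_len)`; the function and argument names are illustrative, not the actual training code:

```python
import torch

def clip_higher_loss(log_probs, old_log_probs, advantages,
                     eps_low=0.2, eps_high=0.28):
    """Token-level policy loss with asymmetric (Clip-Higher) bounds.

    Illustrative sketch: all tensors are (batch, seq_len); `advantages`
    holds the group-relative advantage broadcast to each token.
    """
    ratio = torch.exp(log_probs - old_log_probs)             # per-token importance ratio
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic bound as in PPO/GRPO, but the upper clip is looser (0.28),
    # so tokens with positive advantage can be reinforced more strongly.
    per_token = torch.minimum(ratio * advantages, clipped * advantages)
    return -per_token.mean()                                  # token-level mean, not per-sequence
```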
### Why DAPO After GRPO?
| Stage | Method | Purpose |
|-------|--------|---------|
| Stage 1 | GRPO | Establish IaC generation capability from base model |
| Stage 2 | DAPO | Refine with advanced techniques for quality improvement |
## Evaluation Results
| Model | Training Method | Accuracy | Pass Threshold |
|-------|-----------------|----------|----------------|
| **inframind-grpo** | GRPO | **97.3%** | 0.6 |
| **inframind-dapo** | DAPO | **96.4%** | 0.6 |
| Base (Qwen2.5-0.5B) | None | ~30% | 0.6 |
Evaluated on **InfraMind-Bench** (110 held-out test samples) across:
- Terraform (AWS, GCP, Azure)
- Kubernetes (Deployments, Services, Ingress)
- Docker (Dockerfile, docker-compose)
- CI/CD (GitHub Actions, GitLab CI)
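
Accuracy here is a pass rate: a sample counts as correct when its scalar reward clears the 0.6 threshold. A minimal sketch of that computation (names are illustrative):

```python
def benchmark_accuracy(rewards: list[float], threshold: float = 0.6) -> float:
    """Share of held-out samples whose reward meets or exceeds the pass threshold."""
    return sum(r >= threshold for r in rewards) / len(rewards)

# Illustration only: 106 of 110 samples passing rounds to the 96.4% reported above.
print(f"{benchmark_accuracy([1.0] * 106 + [0.0] * 4):.1%}")  # -> 96.4%
```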
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load DAPO model
model = AutoModelForCausalLM.from_pretrained("srallabandi0225/inframind-0.5b-dapo")
tokenizer = AutoTokenizer.from_pretrained("srallabandi0225/inframind-0.5b-dapo")

# Generate Terraform from an Alpaca-style prompt
prompt = """### Instruction:
Create Terraform for AWS EC2 instance
### Input:
t3.micro instance type
### Response:
"""

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.7,
do_sample=True,
pad_token_id=tokenizer.pad_token_id
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Example Output
```hcl
resource "aws_instance" "web" {
ami = "ami-0c55b159cbfafe1f0"
instance_type = "t3.micro"
tags = {
Name = "web-server"
}
}
```
## Supported IaC Categories
| Category | Examples | Coverage |
|----------|----------|----------|
| **Terraform** | EC2, S3, VPC, RDS, EKS, Lambda, IAM | AWS, GCP, Azure |
| **Kubernetes** | Deployment, Service, Ingress, ConfigMap, RBAC | All K8s resources |
| **Docker** | Dockerfile, docker-compose | Multi-stage builds |
| **CI/CD** | GitHub Actions, GitLab CI, Jenkins | Workflows, pipelines |
| **Ansible** | Playbooks, roles | Server configuration |
| **Helm** | Charts, values.yaml | K8s package management |
## Training Details
### DAPO Configuration
```yaml
Training:
  epochs: 2
  batch_size: 16              # effective batch size
  learning_rate: 5e-6
  beta: 0.0                   # KL penalty - pure DAPO, no KL term
  generations_per_prompt: 8

DAPO Innovations:
  clip_higher:
    epsilon_low: 0.2
    epsilon_high: 0.28
  dynamic_sampling: true
  token_level_loss: true
  overlong_punishment:
    enabled: true
    soft_penalty: true

LoRA:
  r: 16
  alpha: 32
  target_modules: [q_proj, k_proj, v_proj, o_proj]
```
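
The LoRA block maps directly onto a PEFT adapter config; a minimal sketch, assuming the adapter was built with the `peft` library (an assumption, not stated in the training repo):

```python
from peft import LoraConfig, TaskType

# Adapter hyperparameters matching the LoRA section of the config above.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```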
### Reward Function
Domain-specific reward for IaC quality:
```
Reward = α × Syntax + β × Correctness + γ × Format
Where:
- Syntax (α=0.4): Valid resource declarations
- Correctness (β=0.3): Correct resource types
- Format (γ=0.3): Proper structure
```
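
A hedged sketch of how such a weighted reward could be assembled; the three component scorers below are crude placeholders (a Terraform-flavoured regex, a substring check, and brace balancing), not the actual reward implementation:

```python
import re

def syntax_score(code: str) -> float:
    """Placeholder syntax check: at least one well-formed Terraform resource block."""
    return 1.0 if re.search(r'resource\s+"\w+"\s+"\w+"\s*\{', code) else 0.0

def correctness_score(code: str, expected_resource: str) -> float:
    """Placeholder correctness check: the requested resource type appears."""
    return 1.0 if expected_resource in code else 0.0

def format_score(code: str) -> float:
    """Placeholder format check: balanced braces as a crude structural proxy."""
    return 1.0 if code.count("{") == code.count("}") else 0.0

def iac_reward(code: str, expected_resource: str,
               alpha: float = 0.4, beta: float = 0.3, gamma: float = 0.3) -> float:
    """Reward = alpha*Syntax + beta*Correctness + gamma*Format, each in [0, 1]."""
    return (alpha * syntax_score(code)
            + beta * correctness_score(code, expected_resource)
            + gamma * format_score(code))
```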
## GRPO vs DAPO Comparison
| Aspect | GRPO | DAPO |
|--------|------|------|
| KL Penalty | β=0.04 | β=0.0 (none) |
| Clipping | Symmetric | Asymmetric (Clip-Higher) |
| Loss Granularity | Sequence-level | Token-level |
| Sampling | All batches | Dynamic (skip uniform) |
| Length Control | None | Overlong punishment |
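
The dynamic-sampling row is the simplest to illustrate: when every completion in a prompt's group receives the same reward, the group-relative advantages are all zero and the batch carries no learning signal, so it is skipped or resampled. A minimal sketch with illustrative names:

```python
def keep_prompt_group(rewards: list[float], eps: float = 1e-6) -> bool:
    """Dynamic sampling: keep a group only if its rewards are not (near-)uniform."""
    return max(rewards) - min(rewards) > eps

# A group where every sample scored 1.0 contributes nothing and is dropped.
assert keep_prompt_group([1.0, 1.0, 1.0, 1.0]) is False
assert keep_prompt_group([0.3, 0.7, 1.0, 0.6]) is True
```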
## Hardware Requirements
| Deployment | Memory | GPU |
|------------|--------|-----|
| Training | 16GB+ | A100/A10G |
| Inference | 2GB | Optional |
| Edge (Raspberry Pi 5) | 4GB | None |
The 0.5B model is small enough to run on edge devices, making it suitable for:
- Air-gapped environments
- Local development
- CI/CD pipelines
- IoT/Edge infrastructure
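
As a rough illustration of the small inference footprint listed above, the model can be loaded in half precision; a hedged sketch (`device_map="auto"` additionally requires the `accelerate` package and can be dropped for CPU-only use):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Half-precision weights keep the 0.5B model well under the ~2GB inference budget.
model = AutoModelForCausalLM.from_pretrained(
    "srallabandi0225/inframind-0.5b-dapo",
    torch_dtype=torch.float16,
    device_map="auto",  # requires accelerate; omit for CPU-only environments
)
tokenizer = AutoTokenizer.from_pretrained("srallabandi0225/inframind-0.5b-dapo")
```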
## Limitations
- **IaC-specific**: Optimized for infrastructure tasks, not general conversation
- **English only**: Training data is in English
- **No execution**: Generates code, does not execute or validate against real infrastructure
- **Version-sensitive**: Generated code may use older API versions
- **Security**: Always review generated code for security best practices
### Out-of-Scope Uses
- Legal or medical advice
- General-purpose chatbot
- Executing infrastructure changes without human review
- Production deployment without validation
## Intended Use
### Primary Use Cases
- Generating Terraform configurations
- Creating Kubernetes manifests
- Writing Dockerfiles and docker-compose
- Building CI/CD pipelines
- Infrastructure automation scripting
### Users
- DevOps engineers
- Platform engineers
- SREs
- Cloud architects
- Infrastructure developers
## Training Data
**InfraMind-Bench**: 2000+ IaC tasks in Alpaca format
| Category | Tasks |
|----------|-------|
| Terraform | 500+ |
| Kubernetes | 400+ |
| Docker | 300+ |
| CI/CD | 300+ |
| Ansible | 200+ |
| Helm | 150+ |
| Monitoring | 150+ |
## Ethical Considerations
- Model may generate insecure configurations if not prompted for security
- Generated infrastructure code should always be reviewed before deployment
- Model does not have access to real infrastructure or credentials
- Users are responsible for validating generated code against their security policies
## Citation
```bibtex
@misc{rallabandi2024inframind,
title={InfraMind: Fine-tuning Small Language Models for Infrastructure-as-Code Generation with Reinforcement Learning},
author={Rallabandi, Sai Kiran},
year={2024},
publisher={HuggingFace},
url={https://huggingface.co/srallabandi0225/inframind-0.5b-dapo}
}
```
## Links
- **GitHub**: [github.com/saikiranrallabandi/inframind](https://github.com/saikiranrallabandi/inframind)
- **GRPO Model**: [srallabandi0225/inframind-0.5b-grpo](https://huggingface.co/srallabandi0225/inframind-0.5b-grpo)
- **DAPO Model**: [srallabandi0225/inframind-0.5b-dapo](https://huggingface.co/srallabandi0225/inframind-0.5b-dapo)
## Acknowledgments
- [Qwen Team](https://github.com/QwenLM/Qwen) for the base model
- [DeepSeek](https://github.com/deepseek-ai) for GRPO
- [NVIDIA NeMo](https://docs.nvidia.com/nemo) for DAPO reference
- [TRL](https://github.com/huggingface/trl) for training infrastructure
## Model Card Contact
- **Author**: Sai Kiran Rallabandi
- **GitHub**: [@saikiranrallabandi](https://github.com/saikiranrallabandi)