---
license: mit
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- infrastructure-as-code
- terraform
- kubernetes
- docker
- devops
- iac
- dapo
- reinforcement-learning
- fine-tuned
base_model: srallabandi0225/inframind-0.5b-grpo
datasets:
- custom
model-index:
- name: inframind-dapo
results:
- task:
type: text-generation
name: IaC Generation
dataset:
name: InfraMind-Bench
type: custom
metrics:
- type: accuracy
value: 96.4
name: DAPO Accuracy
---
# InfraMind-DAPO: Infrastructure-as-Code Model with Direct Advantage Policy Optimization
**InfraMind-DAPO** is a 0.5B-parameter language model fine-tuned for Infrastructure-as-Code (IaC) generation using **DAPO (Direct Advantage Policy Optimization)**, a reinforcement learning technique that builds on GRPO.
## Model Description
| Attribute | Value |
|-----------|-------|
| **Base Model** | [inframind-0.5b-grpo](https://huggingface.co/srallabandi0225/inframind-0.5b-grpo) |
| **Original Base** | Qwen/Qwen2.5-0.5B-Instruct |
| **Parameters** | 500M |
| **Training Method** | DAPO (Direct Advantage Policy Optimization) |
| **Domain** | Infrastructure-as-Code |
| **License** | MIT |
### Training Pipeline
```
Qwen2.5-0.5B-Instruct → GRPO Training → inframind-grpo → DAPO Training → inframind-dapo
                        (Stage 1)                        (Stage 2 - this model)
```
This model is the **second stage** of InfraMind training: it starts from the GRPO-trained checkpoint and applies the DAPO techniques described below.
## What is DAPO?
**Direct Advantage Policy Optimization (DAPO)** is an advanced RL algorithm that improves upon GRPO with four key innovations:
| Innovation | Description | Benefit |
|------------|-------------|---------|
| **Clip-Higher** | Asymmetric clipping (ε_low=0.2, ε_high=0.28) | Allows high-advantage tokens to be reinforced more strongly |
| **Dynamic Sampling** | Skip batches with uniform rewards | Prevents entropy collapse, maintains exploration |
| **Token-Level Loss** | Per-token policy gradient | Finer-grained credit assignment |
| **Overlong Punishment** | Soft length penalty | Prevents verbose, repetitive outputs |
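
The Clip-Higher and token-level-loss rows above combine into a PPO-style objective with asymmetric clipping applied per token. A minimal sketch of that objective, assuming standard PyTorch tensors of shape `(batch, seq_len)`; the function and argument names are illustrative, not the actual training code:

```python
import torch

def clip_higher_loss(log_probs, old_log_probs, advantages,
                     eps_low=0.2, eps_high=0.28):
    """Token-level policy loss with asymmetric (Clip-Higher) bounds.

    Illustrative sketch: all tensors are (batch, seq_len); `advantages`
    holds the group-relative advantage broadcast to each token.
    """
    ratio = torch.exp(log_probs - old_log_probs)             # per-token importance ratio
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic bound as in PPO/GRPO, but the upper clip is looser (0.28),
    # so tokens with positive advantage can be reinforced more strongly.
    per_token = torch.minimum(ratio * advantages, clipped * advantages)
    return -per_token.mean()                                  # token-level mean, not per-sequence
```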
### Why DAPO After GRPO?
| Stage | Method | Purpose |
|-------|--------|---------|
| Stage 1 | GRPO | Establish IaC generation capability from base model |
| Stage 2 | DAPO | Refine with advanced techniques for quality improvement |
## Evaluation Results
| Model | Training Method | Accuracy | Pass Threshold |
|-------|-----------------|----------|----------------|
| **inframind-grpo** | GRPO | **97.3%** | 0.6 |
| **inframind-dapo** | DAPO | **96.4%** | 0.6 |
| Base (Qwen2.5-0.5B) | None | ~30% | 0.6 |
Evaluated on **InfraMind-Bench** (110 held-out test samples) across:
- Terraform (AWS, GCP, Azure)
- Kubernetes (Deployments, Services, Ingress)
- Docker (Dockerfile, docker-compose)
- CI/CD (GitHub Actions, GitLab CI)
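
Accuracy here is a pass rate: a sample counts as correct when its scalar reward clears the 0.6 threshold. A minimal sketch of that computation (names are illustrative):

```python
def benchmark_accuracy(rewards: list[float], threshold: float = 0.6) -> float:
    """Share of held-out samples whose reward meets or exceeds the pass threshold."""
    return sum(r >= threshold for r in rewards) / len(rewards)

# Illustration only: 106 of 110 samples passing rounds to the 96.4% reported above.
print(f"{benchmark_accuracy([1.0] * 106 + [0.0] * 4):.1%}")  # -> 96.4%
```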
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load DAPO model
model = AutoModelForCausalLM.from_pretrained("srallabandi0225/inframind-0.5b-dapo")
tokenizer = AutoTokenizer.from_pretrained("srallabandi0225/inframind-0.5b-dapo")

# Generate Terraform from an Alpaca-style prompt
prompt = """### Instruction:
Create Terraform for AWS EC2 instance
### Input:
t3.micro instance type
### Response:
"""

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.7,
do_sample=True,
pad_token_id=tokenizer.pad_token_id
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Example Output
```hcl
resource "aws_instance" "web" {
ami = "ami-0c55b159cbfafe1f0"
instance_type = "t3.micro"
tags = {
Name = "web-server"
}
}
```
## Supported IaC Categories
| Category | Examples | Coverage |
|----------|----------|----------|
| **Terraform** | EC2, S3, VPC, RDS, EKS, Lambda, IAM | AWS, GCP, Azure |
| **Kubernetes** | Deployment, Service, Ingress, ConfigMap, RBAC | All K8s resources |
| **Docker** | Dockerfile, docker-compose | Multi-stage builds |
| **CI/CD** | GitHub Actions, GitLab CI, Jenkins | Workflows, pipelines |
| **Ansible** | Playbooks, roles | Server configuration |
| **Helm** | Charts, values.yaml | K8s package management |
## Training Details
### DAPO Configuration
```yaml
Training:
  epochs: 2
  batch_size: 16              # effective batch size
  learning_rate: 5e-6
  beta: 0.0                   # KL penalty - pure DAPO, no KL term
  generations_per_prompt: 8

DAPO Innovations:
  clip_higher:
    epsilon_low: 0.2
    epsilon_high: 0.28
  dynamic_sampling: true
  token_level_loss: true
  overlong_punishment:
    enabled: true
    soft_penalty: true

LoRA:
  r: 16
  alpha: 32
  target_modules: [q_proj, k_proj, v_proj, o_proj]
```
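
The LoRA block maps directly onto a PEFT adapter config; a minimal sketch, assuming the adapter was built with the `peft` library (an assumption, not stated in the training repo):

```python
from peft import LoraConfig, TaskType

# Adapter hyperparameters matching the LoRA section of the config above.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```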
### Reward Function
Domain-specific reward for IaC quality:
```
Reward = α × Syntax + β × Correctness + γ × Format
Where:
- Syntax (α=0.4): Valid resource declarations
- Correctness (β=0.3): Correct resource types
- Format (γ=0.3): Proper structure
```
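
A hedged sketch of how such a weighted reward could be assembled; the three component scorers below are crude placeholders (a Terraform-flavoured regex, a substring check, and brace balancing), not the actual reward implementation:

```python
import re

def syntax_score(code: str) -> float:
    """Placeholder syntax check: at least one well-formed Terraform resource block."""
    return 1.0 if re.search(r'resource\s+"\w+"\s+"\w+"\s*\{', code) else 0.0

def correctness_score(code: str, expected_resource: str) -> float:
    """Placeholder correctness check: the requested resource type appears."""
    return 1.0 if expected_resource in code else 0.0

def format_score(code: str) -> float:
    """Placeholder format check: balanced braces as a crude structural proxy."""
    return 1.0 if code.count("{") == code.count("}") else 0.0

def iac_reward(code: str, expected_resource: str,
               alpha: float = 0.4, beta: float = 0.3, gamma: float = 0.3) -> float:
    """Reward = alpha*Syntax + beta*Correctness + gamma*Format, each in [0, 1]."""
    return (alpha * syntax_score(code)
            + beta * correctness_score(code, expected_resource)
            + gamma * format_score(code))
```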
## GRPO vs DAPO Comparison
| Aspect | GRPO | DAPO |
|--------|------|------|
| KL Penalty | β=0.04 | β=0.0 (none) |
| Clipping | Symmetric | Asymmetric (Clip-Higher) |
| Loss Granularity | Sequence-level | Token-level |
| Sampling | All batches | Dynamic (skip uniform) |
| Length Control | None | Overlong punishment |
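
The dynamic-sampling row is the simplest to illustrate: when every completion in a prompt's group receives the same reward, the group-relative advantages are all zero and the batch carries no learning signal, so it is skipped or resampled. A minimal sketch with illustrative names:

```python
def keep_prompt_group(rewards: list[float], eps: float = 1e-6) -> bool:
    """Dynamic sampling: keep a group only if its rewards are not (near-)uniform."""
    return max(rewards) - min(rewards) > eps

# A group where every sample scored 1.0 contributes nothing and is dropped.
assert keep_prompt_group([1.0, 1.0, 1.0, 1.0]) is False
assert keep_prompt_group([0.3, 0.7, 1.0, 0.6]) is True
```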
## Hardware Requirements
| Deployment | Memory | GPU |
|------------|--------|-----|
| Training | 16GB+ | A100/A10G |
| Inference | 2GB | Optional |
| Edge (Raspberry Pi 5) | 4GB | None |
The 0.5B model is small enough to run on edge devices, making it suitable for:
- Air-gapped environments
- Local development
- CI/CD pipelines
- IoT/Edge infrastructure
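
As a rough illustration of the small inference footprint listed above, the model can be loaded in half precision; a hedged sketch (`device_map="auto"` additionally requires the `accelerate` package and can be dropped for CPU-only use):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Half-precision weights keep the 0.5B model well under the ~2GB inference budget.
model = AutoModelForCausalLM.from_pretrained(
    "srallabandi0225/inframind-0.5b-dapo",
    torch_dtype=torch.float16,
    device_map="auto",  # requires accelerate; omit for CPU-only environments
)
tokenizer = AutoTokenizer.from_pretrained("srallabandi0225/inframind-0.5b-dapo")
```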
## Limitations
- **IaC-specific**: Optimized for infrastructure tasks, not general conversation
- **English only**: Training data is in English
- **No execution**: Generates code, does not execute or validate against real infrastructure
- **Version-sensitive**: Generated code may use older API versions
- **Security**: Always review generated code for security best practices
### Out-of-Scope Uses
- Legal or medical advice
- General-purpose chatbot
- Executing infrastructure changes without human review
- Production deployment without validation
## Intended Use
### Primary Use Cases
- Generating Terraform configurations
- Creating Kubernetes manifests
- Writing Dockerfiles and docker-compose
- Building CI/CD pipelines
- Infrastructure automation scripting
### Users
- DevOps engineers
- Platform engineers
- SREs
- Cloud architects
- Infrastructure developers
## Training Data
**InfraMind-Bench**: 2000+ IaC tasks in Alpaca format
| Category | Tasks |
|----------|-------|
| Terraform | 500+ |
| Kubernetes | 400+ |
| Docker | 300+ |
| CI/CD | 300+ |
| Ansible | 200+ |
| Helm | 150+ |
| Monitoring | 150+ |
## Ethical Considerations
- Model may generate insecure configurations if not prompted for security
- Generated infrastructure code should always be reviewed before deployment
- Model does not have access to real infrastructure or credentials
- Users are responsible for validating generated code against their security policies
## Citation
```bibtex
@misc{rallabandi2024inframind,
title={InfraMind: Fine-tuning Small Language Models for Infrastructure-as-Code Generation with Reinforcement Learning},
author={Rallabandi, Sai Kiran},
year={2024},
publisher={HuggingFace},
url={https://huggingface.co/srallabandi0225/inframind-0.5b-dapo}
}
```
## Links
- **GitHub**: [github.com/saikiranrallabandi/inframind](https://github.com/saikiranrallabandi/inframind)
- **GRPO Model**: [srallabandi0225/inframind-0.5b-grpo](https://huggingface.co/srallabandi0225/inframind-0.5b-grpo)
- **DAPO Model**: [srallabandi0225/inframind-0.5b-dapo](https://huggingface.co/srallabandi0225/inframind-0.5b-dapo)
## Acknowledgments
- [Qwen Team](https://github.com/QwenLM/Qwen) for the base model
- [DeepSeek](https://github.com/deepseek-ai) for GRPO
- [NVIDIA NeMo](https://docs.nvidia.com/nemo) for DAPO reference
- [TRL](https://github.com/huggingface/trl) for training infrastructure
## Model Card Contact
- **Author**: Sai Kiran Rallabandi
- **GitHub**: [@saikiranrallabandi](https://github.com/saikiranrallabandi)