---
license: llama3.1
datasets:
- nvidia/Aegis-AI-Content-Safety-Dataset-2.0
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
- nvidia/llama-3.1-nemoguard-8b-content-safety
---

# 🦙 Llama-3.1-NemoGuard-8B Content Safety — Merged FP8 Dynamic

This repository contains an **FP8-quantized** version of the `llama-3.1-nemoguard-8b-content-safety` model, produced by merging the LoRA adapter into its base model and then applying post-training quantization, for optimized inference with **vLLM**.

---

## 🔧 Model Overview

**Model Name:** `GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-FP8-Dynamic`
**Base Model:** [`meta-llama/Llama-3.1-8B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
**Adapter:** [`nvidia/llama-3.1-nemoguard-8b-content-safety`](https://huggingface.co/nvidia/llama-3.1-nemoguard-8b-content-safety)
**Quantization:** FP8 Dynamic (W8A8)
**Target Use:** Fast and memory-efficient inference with vLLM on Hopper/Ada GPUs

---

## 🧩 Model Merging

The model was created by merging the **Meta Llama 3.1 8B Instruct** base model with **NVIDIA NemoGuard 8B Content Safety**, a LoRA adapter designed to improve moderation and content-safety performance.

**Merging pipeline:**

1. **Base:** `meta-llama/Llama-3.1-8B-Instruct`
2. **Adapter:** `nvidia/llama-3.1-nemoguard-8b-content-safety`
3. **Merge Method:** LoRA merge (adapter weights folded into the base model)

This results in a single, consolidated model that retains the instruction-following ability of Llama 3.1 together with the content-moderation capabilities of NemoGuard. A minimal sketch of such a merge is shown below.
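The exact merge script is not reproduced in this card; the following is a minimal sketch of a standard LoRA merge using the 🤗 `peft` library (the output directory name is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "meta-llama/Llama-3.1-8B-Instruct"
ADAPTER_ID = "nvidia/llama-3.1-nemoguard-8b-content-safety"
SAVE_DIR = "llama-3.1-nemoguard-8b-content-safety-merged"  # illustrative path

# Load the base model and attach the LoRA adapter on top of it
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER_ID)

# Fold the adapter weights into the base weights, leaving a plain
# Llama checkpoint with no runtime PEFT dependency
merged = model.merge_and_unload()

# Save the consolidated model and its tokenizer
merged.save_pretrained(SAVE_DIR)
AutoTokenizer.from_pretrained(BASE_ID).save_pretrained(SAVE_DIR)
```

Merging ahead of time avoids applying the adapter at every forward pass and yields a single checkpoint that can be quantized as one model.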
---

## ⚙️ FP8 Quantization Details

After merging, the model was quantized to **FP8** using **post-training quantization (PTQ)** with the `llmcompressor` library. The **FP8_DYNAMIC** scheme (static per-channel FP8 weights, dynamic per-token FP8 activations) was applied to all linear layers except the language modeling head (`lm_head`).

**Quantization code example:**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the merged model to be quantized (path is illustrative)
MODEL_ID = "llama-3.1-nemoguard-8b-content-safety-merged"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure simple PTQ: FP8 on all Linear layers except the LM head
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"]
)

# Apply the quantization algorithm (FP8_DYNAMIC needs no calibration data)
oneshot(model=model, recipe=recipe)

# Save the quantized model
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```

---

## ⚡ Hardware Compatibility

vLLM supports hardware-accelerated **FP8 (W8A8)** inference only on:

- **NVIDIA Hopper** GPUs (e.g., H100)
- **AMD MI300x** GPUs
- **NVIDIA Ada Lovelace** GPUs (experimental)

Other GPUs may fall back to slower software implementations.

---

## 💻 Usage Example (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading this FP8 checkpoint requires the `compressed-tensors` package
model_id = "GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-FP8-Dynamic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Explain why ethical AI moderation is important."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 🚀 Example with vLLM

To leverage the FP8 quantization efficiently, load and run the model with **vLLM**:

```bash
pip install vllm
```

```python
from vllm import LLM, SamplingParams

# Load the FP8-quantized model with vLLM
model_id = "GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-FP8-Dynamic"
llm = LLM(model=model_id)

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

prompt = "List three key principles for responsible AI deployment."
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

> **Note:** Ensure you are running on supported FP8 hardware (e.g., NVIDIA H100) for optimal speed and accuracy.

---

## 📚 References

- [Meta Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- [NVIDIA NemoGuard Content Safety](https://huggingface.co/nvidia/llama-3.1-nemoguard-8b-content-safety)
- [llmcompressor Documentation](https://docs.vllm.ai/projects/llm-compressor/en/latest)
- [vLLM Official Repository](https://github.com/vllm-project/vllm)

---

## 🏷️ License

This model follows the licenses and terms of use of:

- Meta Llama 3.1
- NVIDIA NemoGuard

Please ensure compliance with all applicable licenses and usage restrictions.

---

✨ *Maintained by [GaleneAI](https://huggingface.co/GaleneAI)*