---
license: llama3.1
datasets:
- nvidia/Aegis-AI-Content-Safety-Dataset-2.0
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
- nvidia/llama-3.1-nemoguard-8b-content-safety
---

# 🦙 Llama-3.1-NemoGuard-8B Content Safety — Merged FP8 Dynamic

This repository contains an **FP8-quantized** version of the `llama-3.1-nemoguard-8b-content-safety` model, produced by merging the LoRA adapter into its base model and then applying post-training quantization, for optimized inference with **vLLM**.

---

## 🔧 Model Overview

**Model Name:** `GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-FP8-Dynamic`
**Base Model:** [`meta-llama/Llama-3.1-8B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
**Adapter:** [`nvidia/llama-3.1-nemoguard-8b-content-safety`](https://huggingface.co/nvidia/llama-3.1-nemoguard-8b-content-safety)
**Quantization:** FP8 Dynamic (W8A8)
**Target Use:** Fast and memory-efficient inference with vLLM on Hopper/Ada GPUs

---

## 🧩 Model Merging

The model was created by merging the **Meta Llama 3.1 8B Instruct** base model with **NVIDIA NemoGuard 8B Content Safety**, a LoRA adapter designed to improve moderation and content-safety performance.

**Merging pipeline:**

1. **Base:** `meta-llama/Llama-3.1-8B-Instruct`
2. **Adapter:** `nvidia/llama-3.1-nemoguard-8b-content-safety`
3. **Merge Method:** LoRA merge (adapter weights folded into the base model)

This results in a single, consolidated model that retains the instruction-following ability of Llama 3.1 together with the content-moderation capabilities of NemoGuard. A minimal sketch of such a merge is shown below.
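The exact merge script is not reproduced in this card; the following is a minimal sketch of a standard LoRA merge using the 🤗 `peft` library (the output directory name is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "meta-llama/Llama-3.1-8B-Instruct"
ADAPTER_ID = "nvidia/llama-3.1-nemoguard-8b-content-safety"
SAVE_DIR = "llama-3.1-nemoguard-8b-content-safety-merged"  # illustrative path

# Load the base model and attach the LoRA adapter on top of it
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER_ID)

# Fold the adapter weights into the base weights, leaving a plain
# Llama checkpoint with no runtime PEFT dependency
merged = model.merge_and_unload()

# Save the consolidated model and its tokenizer
merged.save_pretrained(SAVE_DIR)
AutoTokenizer.from_pretrained(BASE_ID).save_pretrained(SAVE_DIR)
```

Merging ahead of time avoids applying the adapter at every forward pass and yields a single checkpoint that can be quantized as one model.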
---

## ⚙️ FP8 Quantization Details

After merging, the model was quantized to **FP8** using **post-training quantization (PTQ)** with the `llmcompressor` library. The **FP8_DYNAMIC** scheme (static per-channel FP8 weights, dynamic per-token FP8 activations) was applied to all linear layers except the language modeling head (`lm_head`).

**Quantization code example:**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the merged model to be quantized (path is illustrative)
MODEL_ID = "llama-3.1-nemoguard-8b-content-safety-merged"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure simple PTQ: FP8 on all Linear layers except the LM head
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"]
)

# Apply the quantization algorithm (FP8_DYNAMIC needs no calibration data)
oneshot(model=model, recipe=recipe)

# Save the quantized model
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```

---

## ⚡ Hardware Compatibility

vLLM supports hardware-accelerated **FP8 (W8A8)** inference only on:

- **NVIDIA Hopper** GPUs (e.g., H100)
- **AMD MI300x** GPUs
- **NVIDIA Ada Lovelace** GPUs (experimental)

Other GPUs may fall back to slower software implementations.

---

## 💻 Usage Example (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading this FP8 checkpoint requires the `compressed-tensors` package
model_id = "GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-FP8-Dynamic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Explain why ethical AI moderation is important."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 🚀 Example with vLLM

To leverage the FP8 quantization efficiently, load and run the model with **vLLM**:

```bash
pip install vllm
```

```python
from vllm import LLM, SamplingParams

# Load the FP8-quantized model with vLLM
model_id = "GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-FP8-Dynamic"
llm = LLM(model=model_id)

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

prompt = "List three key principles for responsible AI deployment."
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

> **Note:** Ensure you are running on supported FP8 hardware (e.g., NVIDIA H100) for optimal speed and accuracy.

---

## 📚 References

- [Meta Llama 3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- [NVIDIA NemoGuard Content Safety](https://huggingface.co/nvidia/llama-3.1-nemoguard-8b-content-safety)
- [llmcompressor Documentation](https://docs.vllm.ai/projects/llm-compressor/en/latest)
- [vLLM Official Repository](https://github.com/vllm-project/vllm)

---

## 🏷️ License

This model follows the licenses and terms of use of:

- Meta Llama 3.1
- NVIDIA NemoGuard

Please ensure compliance with all applicable licenses and usage restrictions.

---

✨ *Maintained by [GaleneAI](https://huggingface.co/GaleneAI)*