🦙 Llama-3.1-NemoGuard-8B Content Safety – Merged FP8 Dynamic

This repository contains an FP8-quantized version of the llama-3.1-nemoguard-8b-content-safety model, obtained by merging the LoRA adapter into its Llama 3.1 8B Instruct base model and then applying post-training quantization for optimized inference with vLLM.


🔧 Model Overview

Model Name: GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-FP8-Dynamic
Base Model: meta-llama/Llama-3.1-8B-Instruct
Adapter: nvidia/llama-3.1-nemoguard-8b-content-safety
Quantization: FP8 Dynamic (W8A8)
Target Use: Fast and memory-efficient inference with vLLM on Hopper/Ada GPUs


🧩 Model Merging

The model was created by merging the Meta Llama 3.1 8B Instruct base model with NVIDIA NemoGuard 8B Content Safety, a LoRA adapter designed to improve moderation and content safety performance.

Merging pipeline:

  1. Base: meta-llama/Llama-3.1-8B-Instruct
  2. Adapter: nvidia/llama-3.1-nemoguard-8b-content-safety
  3. Merge Method: LoRA merge (adapter weights merged into base model)

This results in a single, consolidated model that retains the instruction-following ability of Llama 3.1 with the content moderation features of NemoGuard.
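
A minimal sketch of this merge step using the peft library, assuming the adapter is distributed in standard PEFT format and that a plain merge_and_unload was used (the exact merge settings for this model are not documented here):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "meta-llama/Llama-3.1-8B-Instruct"
ADAPTER_ID = "nvidia/llama-3.1-nemoguard-8b-content-safety"

# Load the base model and attach the NemoGuard LoRA adapter
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, ADAPTER_ID)

# Fold the adapter weights into the base weights, yielding a single consolidated model
merged = model.merge_and_unload()

# Save the merged checkpoint for the quantization step described below
merged.save_pretrained("llama-3.1-nemoguard-8b-content-safety-merged")
AutoTokenizer.from_pretrained(BASE_ID).save_pretrained("llama-3.1-nemoguard-8b-content-safety-merged")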


βš™οΈ FP8 Quantization Details

After merging, the model was quantized to FP8 using Post-Training Quantization (PTQ) with the llmcompressor library.
The FP8_DYNAMIC quantization scheme was applied to all linear layers except the language modeling head (lm_head).

Quantization code example:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the merged model to be quantized ("org/merged-model" is a placeholder for the
# merged checkpoint produced in the previous step)
MODEL_ID = "org/merged-model"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure simple PTQ quantization: FP8 dynamic weights and activations (W8A8) on all
# Linear layers, keeping the language modeling head in higher precision
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"]
)

# Apply the quantization algorithm (FP8_DYNAMIC requires no calibration data)
oneshot(model=model, recipe=recipe)

# Save the quantized model and tokenizer
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

⚡ Hardware Compatibility

vLLM supports hardware-accelerated FP8 (W8A8) quantization only on:

  • NVIDIA Hopper (H100) GPUs
  • AMD MI300x GPUs
  • (Experimental) Ada Lovelace architecture

Other GPUs may fall back to slower software implementations.
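
A rough way to check whether the local NVIDIA GPU has native FP8 support, based on CUDA compute capability (Hopper is 9.0, Ada Lovelace is 8.9; this heuristic does not cover AMD MI300x, which runs through ROCm):

import torch

if torch.cuda.is_available():
    # Hopper (H100) reports (9, 0); Ada Lovelace reports (8, 9)
    major, minor = torch.cuda.get_device_capability()
    has_fp8 = (major, minor) >= (8, 9)
    print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor}, native FP8: {has_fp8}")
else:
    print("No CUDA GPU detected; FP8 will not be hardware-accelerated.")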


💻 Usage Example (Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-FP8-Dynamic"

# Loading the FP8 checkpoint needs the compressed-tensors package installed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Explain why ethical AI moderation is important."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

🚀 Example with vLLM

To leverage the FP8 quantization efficiently, install vLLM (pip install vllm), then load and run the model:

from vllm import LLM, SamplingParams

# Load the FP8-quantized model with vLLM
model_id = "GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-FP8-Dynamic"
llm = LLM(model=model_id)

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

prompt = "List three key principles for responsible AI deployment."

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)

Note: Ensure you are running on supported FP8 hardware (e.g., NVIDIA H100) for optimal speed and accuracy.
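
The model can also be exposed through vLLM's OpenAI-compatible API server; a minimal invocation (all flags left at their defaults) might look like:

vllm serve GaleneAI/llama-3.1-nemoguard-8b-content-safety-merged-FP8-Dynamic

Once the server is up (port 8000 by default), any OpenAI-compatible client can query the /v1/chat/completions or /v1/completions endpoints.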



🏷️ License

This model is subject to the licenses and terms of use of:

  • Meta Llama 3.1
  • NVIDIA NemoGuard

Please ensure compliance with all applicable licenses and usage restrictions.


✨ Maintained by GaleneAI
