Llama-3.2-1B-FP8-Neuron

This is an FP8-quantized version of Meta's Llama 3.2 1B model, specifically optimized for efficient inference on AWS Neuron accelerators (Inferentia2). The model has been compiled and quantized using AWS Neuron SDK to leverage the specialized AI acceleration capabilities of AWS Neuron chips.

Model Details

Model Description

  • This model is a deployment-optimized version of Llama 3.2 1B that has been quantized to FP8 precision and compiled for AWS Neuron devices. AWS Neuron is the SDK used to run deep learning workloads on AWS Inferentia and Trainium chips, which are purpose-built machine learning accelerators.
  • Note: For better performance on inf2.24xlarge, set tp_degree=8 (total token throughput ≈ 2.5k tokens/sec).

Key Features

  • Reduced memory footprint through FP8 quantization (~50% reduction from FP16)
  • Optimized for AWS Inferentia2 instances
  • Pre-compiled for tensor parallelism across 2 NeuronCores
  • Maintains instruction-following capabilities of the base model
  • Cost-effective LLM serving with improved throughput

Model Specifications

| Specification | Value |
| --- | --- |
| Base Model | meta-llama/Llama-3.2-1B |
| Quantization | FP8 E4M3 (IEEE-754 FP8_EXP4 format) |
| Optimization Target | AWS Inferentia2 NeuronCores |
| Tensor Parallelism Degree | 2 |
| Recommended Hardware | AWS inf2.8xlarge |
| Max Sequence Length | 8192 tokens |
| Developed by | Fraser Sequeira |

Quick Start

Prerequisites

  1. Launch an inf2.8xlarge Ubuntu EC2 instance on AWS
  2. Select the 'Deep Learning AMI Neuron (Ubuntu 22.04)' AMI

Installation & Setup

1. Launch Docker Container

docker run \
  -it \
  --device=/dev/neuron0 \
  --cap-add SYS_ADMIN \
  --cap-add IPC_LOCK \
  -p 8080:8080 \
  --name llama3-2-1B \
  public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py311-sdk2.26.0-ubuntu22.04 \
  bash

On Inf2.24xlarge [Optional]

docker run \
  -it \
  --device=/dev/neuron0 \
  --device=/dev/neuron1 \
  --device=/dev/neuron2 \
  --device=/dev/neuron3 \
  --device=/dev/neuron4 \
  --device=/dev/neuron5 \
  --cap-add SYS_ADMIN \
  --cap-add IPC_LOCK \
  -p 8080:8080 \
  --name llama3-2-1B \
  public.ecr.aws/neuron/pytorch-inference-vllm-neuronx:0.9.1-neuronx-py311-sdk2.26.0-ubuntu22.04 \
  bash

2. Install Dependencies

Install required dependencies

pip install -U "huggingface_hub[cli]"

Optional dependencies for benchmarking

pip install pandas datasets

3. Configure Hugging Face Access

export HF_TOKEN=<your-huggingface-token>

4. Download the Model

hf download fraseque/llama-3.2-1B-FP8-Neuron

5. Set Model Path

The model is typically saved to:

/root/.cache/huggingface/hub/models--fraseque--llama-3.2-1B-FP8-Neuron/snapshots/{{uuid}}
  • Replace {{uuid}} with the actual snapshot ID
export MODEL_PATH=/root/.cache/huggingface/hub/models--fraseque--llama-3.2-1B-FP8-Neuron/snapshots/{{uuid}}
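
Alternatively, a minimal Python sketch (assuming huggingface_hub is installed, as in step 2, and HF_TOKEN is set) can resolve the snapshot directory without copying the UUID by hand:

from huggingface_hub import snapshot_download

# Resolves (and, if needed, downloads) the local snapshot directory for this repo.
# Reuses the cache populated by "hf download" above.
model_path = snapshot_download("fraseque/llama-3.2-1B-FP8-Neuron")
print(model_path)  # export this value as MODEL_PATH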

6. (Optional) Use Pre-compiled Artifacts

Compilation takes 5-10 minutes; to skip it, you can use the pre-compiled artifacts included in this repository:

export NEURON_COMPILED_ARTIFACTS=$MODEL_PATH/neuron-compiled-artifacts/0a7a59fd2142874207e2f96474f27309

7. Serve the Model

VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference' python -m vllm.entrypoints.openai.api_server \
--model "$MODEL_PATH" \
--device "neuron" \
--tensor-parallel-size 2 \
--max-num-seqs 16 \
--max-model-len 8192 \
--port 8080 \
--override-neuron-config "{\"enable_bucketing\": true, \"context_encoding_buckets\": [128,512,1024,2048,4096,8192], \"token_generation_buckets\": [128,512,1024,2048,4096,8192], \"max_context_length\": 8192, \"use-v2-block-manager\": true, \"modules_to_not_convert\": [\"lm_head\", \"embed_tokens\"], \"seq_len\": 8192, \"quantization_dtype\":\"f8e4m3\", \"quantization_type\": \"per_channel_symmetric\", \"quantized_checkpoints_path\":\"$MODEL_PATH\", \"quantized\": true, \"batch_size\": 1, \"ctx_batch_size\": 1, \"tkg_batch_size\": 1, \"attn_kernel_enabled\": true, \"sequence_parallel_enabled\": true, \"is_continuous_batching\": true}"
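
The --override-neuron-config value is a single JSON string, which is easy to mangle when editing by hand. As a convenience (not part of the official workflow), the following Python sketch rebuilds the same settings as a dict and prints a shell-safe string to paste into the command above:

import json
import os
import shlex

model_path = os.environ["MODEL_PATH"]

# Same settings as the serve command above; edit the dict instead of the raw JSON string.
neuron_config = {
    "enable_bucketing": True,
    "context_encoding_buckets": [128, 512, 1024, 2048, 4096, 8192],
    "token_generation_buckets": [128, 512, 1024, 2048, 4096, 8192],
    "max_context_length": 8192,
    "use-v2-block-manager": True,
    "modules_to_not_convert": ["lm_head", "embed_tokens"],
    "seq_len": 8192,
    "quantization_dtype": "f8e4m3",
    "quantization_type": "per_channel_symmetric",
    "quantized_checkpoints_path": model_path,
    "quantized": True,
    "batch_size": 1,
    "ctx_batch_size": 1,
    "tkg_batch_size": 1,
    "attn_kernel_enabled": True,
    "sequence_parallel_enabled": True,
    "is_continuous_batching": True,
}

# Quoted so it can be passed directly to --override-neuron-config in a shell.
print(shlex.quote(json.dumps(neuron_config)))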

Making Inference Requests

Once the server is running on port 8080, open another terminal and send the following curl request:

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
  "prompt": "<|system|>You are a helpful AI assistant.<|user|>What is the capital of France?<|assistant|>",
  "max_tokens": 100,
  "temperature": 0.1,
  "top_p": 0.9,
  "stop": ["<|system|>", "<|user|>", "<|assistant|>", "<|end|>", "\n\n"]
}'
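
The server exposes an OpenAI-compatible API, so the same request can be made from Python with the openai client (pip install openai). This is a minimal sketch; it assumes the server from the previous step is listening on localhost:8080 and that the model is registered under the path it was served from (vLLM's default when --served-model-name is not set):

import os

from openai import OpenAI

# The API key is not checked by this local server; any placeholder string works.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.completions.create(
    model=os.environ["MODEL_PATH"],  # model name defaults to the --model path used at serve time
    prompt="<|system|>You are a helpful AI assistant.<|user|>What is the capital of France?<|assistant|>",
    max_tokens=100,
    temperature=0.1,
    top_p=0.9,
    stop=["<|system|>", "<|user|>", "<|assistant|>", "<|end|>"],
)
print(response.choices[0].text)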

Benchmarking Performance

Open another terminal, set MODEL_PATH, and run the benchmark command below:

cd /opt/vllm/benchmarks
python3 benchmark_serving.py --backend vllm --base-url http://127.0.0.1:8080 --dataset-name=random --model $MODEL_PATH --num-prompts 20 --max-concurrency 5 --request-rate inf --random-input-len 4000 --random-output-len 500 --seed 12345

Benchmark results (screenshot available in the repository).

Results on Inf2.24xlarge [6 Neuron cores]

Benchmark results (screenshot available in the repository).

Quantization Details

| Specification | Value |
| --- | --- |
| Quantization Format | FP8 E4M3 (8-bit floating point) |
| Quantization Type | Per-channel symmetric |
| Tensor Parallelism (TP) | 2 |
| Target Accelerator | AWS Inferentia2 |
| Instance Type | inf2.8xlarge |
| Sequence Length | 8192 tokens |

Use Cases

Intended Use

This model is optimized for:

✅ Production inference deployments on AWS Inferentia2 instances
✅ Cost-effective LLM serving with reduced computational requirements
✅ Conversational AI applications requiring instruction-following
✅ Text generation tasks (Q&A, summarization, creative writing)
✅ Low-latency inference requirements

Benefits of FP8 Quantization

  • ~50% memory reduction compared to FP16
  • Improved throughput on Neuron accelerators
  • Lower inference costs on AWS infrastructure
  • Maintained accuracy with minimal degradation
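
For intuition only, the PyTorch sketch below (requires a recent PyTorch with float8 support, roughly 2.1+) shows what per-channel symmetric FP8 E4M3 quantization of a weight matrix looks like and where the ~50% memory saving comes from. The actual quantization of this checkpoint was performed by the AWS Neuron SDK, not by this code:

import torch

# Toy weight matrix standing in for a linear layer's FP16 weights.
w = torch.randn(4096, 4096, dtype=torch.float16)

# Per-channel symmetric scale: one scale per output channel (row), chosen so the
# largest magnitude in each row maps to the E4M3 maximum (~448).
f8_max = torch.finfo(torch.float8_e4m3fn).max
scale = w.abs().amax(dim=1, keepdim=True).float() / f8_max

# Quantize to FP8, then dequantize to inspect the error.
w_f8 = (w.float() / scale).clamp(-f8_max, f8_max).to(torch.float8_e4m3fn)
w_deq = w_f8.float() * scale

print("FP16 bytes:", w.numel() * w.element_size())        # 2 bytes/param
print("FP8 bytes: ", w_f8.numel() * w_f8.element_size())  # 1 byte/param (~50% smaller)
print("mean abs error:", (w.float() - w_deq).abs().mean().item())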

Out-of-Scope Use

This model is NOT suitable for:

❌ Deployment on non-Neuron hardware (GPUs, CPUs) without recompilation

Limitations and Considerations

  • Quantization artifacts: FP8 quantization may introduce minor accuracy degradation compared to full-precision models
  • Hardware dependency: Compiled specifically for Neuron devices; requires recompilation for other hardware
  • Max sequence length: 8192 tokens

Citation

@misc{llama32-1b-fp8-neuron,
  author = {Sequeira, Fraser},
  title = {Llama-3.2-1B-FP8-Neuron},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/fraseque/llama-3.2-1B-FP8-Neuron}}
}

Model Card Authors

  • Fraser Sequeira

Acknowledgments

  • Base model: Meta's Llama 3.2 1B
  • Quantization and compilation: AWS Neuron SDK [NEURONX_DISTRIBUTED_INFERENCE]
  • Inference framework: vLLM with Neuron support

License

This model inherits the Llama 3.2 license from Meta. Please refer to the official license for terms and conditions.

