Radipro Chatbot - Llama 3.2 1B Instruct (MLC Quantized)
Model Details
Model Description
This is a quantized version of the Llama 3.2 1B Instruct model, optimized for deployment using Machine Learning Compilation (MLC). The model has been quantized to 4-bit precision (q4f16_1) to reduce memory footprint while maintaining reasonable performance.
- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Quantization: q4f16_1 (4-bit weights with float16 scales)
- Format: MLC (Machine Learning Compilation)
- Model Type: Decoder-only Transformer
- Architecture: Llama
Model Specifications
| Parameter | Value |
|---|---|
| Parameters | 1.63B (quantized) |
| Hidden Size | 2,048 |
| Intermediate Size | 8,192 |
| Number of Layers | 16 |
| Number of Attention Heads | 32 |
| Number of Key-Value Heads | 8 (GQA) |
| Head Dimension | 64 |
| Vocabulary Size | 128,256 |
| Context Window | 131,072 tokens |
| Max Position Embeddings | 8,192 (with RoPE scaling factor: 32) |
| RMS Norm Epsilon | 1e-5 |
| Model Size (Quantized) | ~695 MB |
Quantization Details
- Quantization Method: q4f16_1
- Bits per Parameter: ~4.5 bits
- Weight Format: uint32 (packed 4-bit weights)
- Scale Format: float16
- Memory Reduction: ~75% compared to FP16
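For illustration only, the sketch below shows how a packed 4-bit / float16-scale layout of this kind can be dequantized. The group size, nibble order, and zero-point are assumptions and may not match MLC's actual q4f16_1 kernels:

import numpy as np

# Hypothetical q4f16_1-style layout: eight 4-bit weights packed into each
# uint32 word, with one float16 scale per group of weights.
# NOTE: group size, nibble order, and zero-point below are assumptions.
GROUP_SIZE = 32

def dequantize(packed: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """packed: uint32 words; scales: float16, one per GROUP_SIZE weights."""
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = (packed[:, None] >> shifts) & np.uint32(0xF)   # shape (n_words, 8)
    q = nibbles.reshape(-1).astype(np.float32) - 8.0          # recentre 0..15 -> -8..7
    scale_per_weight = np.repeat(scales.astype(np.float32), GROUP_SIZE)[: q.size]
    return (q * scale_per_weight).astype(np.float16)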
Intended Use
Primary Use Cases
- RadiPro AI assistant
- Built for demonstration purposes
Training Data
This model is based on Meta's Llama 3.2 1B Instruct model. It was further trained on a small set of synthetic data: 49 training Q/A pairs and 4 validation pairs.
How to Use
Installation
First, install the MLC Chat package:
# For CPU (macOS/Linux)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu
# For CUDA (if you have NVIDIA GPU with CUDA 12.2)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122
# For Metal (macOS with Apple Silicon - M1/M2/M3)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-metal mlc-ai-nightly-metal
Verify Installation
After installation, verify that the package is correctly installed:
# Check if mlc_llm is available
python -c "import mlc_llm; print('mlc_llm installed successfully')"
# Verify the CLI command works
mlc_llm --help
For more installation options, see the MLC-LLM installation guide.
Using MLC Runtime (Python)
Note: The Python API for MLC-LLM is primarily designed for serving. For interactive use, the command-line interface (mlc_llm chat) is recommended.
For programmatic access, you can use the MLCEngine API from mlc_llm's serving stack:
from mlc_llm import MLCEngine
# Load the model
model_path = "./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model_path, mode="local")
# Note: MLCEngine exposes an OpenAI-style chat completions interface (see the sketch below)
# For interactive chat, use: mlc_llm chat <model-path>
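A minimal end-to-end sketch using the OpenAI-style chat completions interface exposed by MLCEngine could look like the following (the prompt text is a placeholder, and exact response fields may vary between nightly builds):

from mlc_llm import MLCEngine

model_path = "./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model_path)

# Stream a single chat completion (OpenAI-style request/response objects)
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is RadiPro?"}],  # placeholder prompt
    model=model_path,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)
print()

engine.terminate()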
For more details on the Python API, see the MLC-LLM Python API documentation.
Using Command Line
The simplest way to use the model is via the mlc_llm chat command:
# Interactive chat mode
mlc_llm chat radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC

# If the mlc_llm command is not on your PATH, invoke the module directly:
python -m mlc_llm chat radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC
Conversation Template
The model uses the Llama 3 conversation template:
<|start_header_id|>system<|end_header_id|>
{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{assistant_message}<|eot_id|>
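For illustration, a single turn can be rendered by hand as below. The system and user strings are placeholders, and in practice the MLC runtime applies this template for you (typically also prepending <|begin_of_text|> and a blank line after each header):

# Placeholder messages; the MLC runtime normally builds this prompt automatically.
system_message = "You are RadiPro, a helpful assistant."   # hypothetical
user_message = "What can you help me with?"                 # hypothetical

prompt = (
    "<|begin_of_text|>"
    "<|start_header_id|>system<|end_header_id|>\n\n"
    f"{system_message}<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    f"{user_message}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
print(prompt)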
Default Generation Parameters
- Temperature: 0.6
- Top-p: 0.9
- Repetition Penalty: 1.0
- Presence Penalty: 0.0
- Frequency Penalty: 0.0
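These defaults come from the bundled chat configuration and can be overridden per request. Continuing the MLCEngine sketch above (engine and model_path reused; OpenAI-style parameter names assumed):

response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Summarize RadiPro in one sentence."}],  # placeholder prompt
    model=model_path,
    temperature=0.6,
    top_p=0.9,
    frequency_penalty=0.0,
    presence_penalty=0.0,
)
print(response.choices[0].message.content)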
Technical Details
Architecture
- Attention Mechanism: Grouped Query Attention (GQA) with 8 KV heads
- Position Encoding: RoPE (Rotary Position Embedding) with scaling
- Normalization: RMSNorm
- Activation: SwiGLU (in MLP layers)
- Tied Embeddings: Word embeddings are tied with the output layer
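As a rough illustration of what GQA saves, the sketch below estimates the KV cache for the configuration above, assuming an fp16 cache and an arbitrary 4,096-token working context:

# Rough KV-cache estimate for the configuration listed above
# (assumptions: fp16 cache, no KV-cache quantization or paging)
num_layers   = 16
num_kv_heads = 8       # GQA: 32 query heads share 8 KV heads
head_dim     = 64
bytes_fp16   = 2
context_len  = 4096    # arbitrary working context for this estimate

kv_bytes = 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_fp16  # K and V
print(f"KV cache @ {context_len} tokens: {kv_bytes / 2**20:.0f} MiB")  # ~128 MiB
# With standard multi-head attention (32 KV heads) the cache would be ~4x larger.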
Special Tokens
- <|begin_of_text|> (BOS): 128000
- <|end_of_text|> (EOS): 128001
- <|eot_id|> (End of Turn): 128009
- <|start_header_id|>: 128006
- <|end_header_id|>: 128007
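As a sanity check, these IDs can be looked up from the tokenizer files shipped in this repository, assuming the bundled tokenizer loads with Hugging Face transformers:

from transformers import AutoTokenizer

# Load the tokenizer files shipped alongside the MLC weights (local directory)
tok = AutoTokenizer.from_pretrained("./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC")
for token in ("<|begin_of_text|>", "<|end_of_text|>", "<|eot_id|>",
              "<|start_header_id|>", "<|end_header_id|>"):
    print(token, tok.convert_tokens_to_ids(token))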
File Structure
.
├── mlc-chat-config.json     # MLC configuration
├── tokenizer.json           # Tokenizer model
├── tokenizer_config.json    # Tokenizer configuration
├── tensor-cache.json        # Tensor metadata
└── params_shard_*.bin       # Model weights (22 shards)
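The runtime settings live in mlc-chat-config.json; a quick way to inspect them is sketched below (field names can differ between MLC versions):

import json
from pathlib import Path

cfg_path = Path("radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC") / "mlc-chat-config.json"
cfg = json.loads(cfg_path.read_text())

# Fields commonly present in MLC chat configs (exact keys may vary by version)
for key in ("model_type", "quantization", "context_window_size", "conv_template"):
    print(key, "=", cfg.get(key))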
Ethical Considerations
Bias and Fairness
- The model may reflect biases present in the training data
- Users should evaluate outputs for potential biases
- Consider implementing bias detection and mitigation strategies
Safety
- The model may generate content that is inaccurate, offensive, or harmful
- Implement appropriate content filtering and safety measures
- Do not use for generating misleading or harmful content
Citation
If you use this model, please cite the original Llama 3.2 model:
@misc{llama3.2,
title={Llama 3.2},
author={Meta AI},
year={2024},
howpublished={\url{https://ai.meta.com/llama/}}
}
License
Please refer to the license of the base Llama 3.2 model. This quantized version follows the same licensing terms.
Acknowledgments
- Meta AI for the original Llama 3.2 model
- MLC team for the compilation and quantization tools