Radipro Chatbot - Llama 3.2 1B Instruct (MLC Quantized)
Model Details
Model Description
This is a quantized version of the Llama 3.2 1B Instruct model, optimized for deployment using Machine Learning Compilation (MLC). The model has been quantized to 4-bit precision (q4f16_1) to reduce memory footprint while maintaining reasonable performance.
- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Quantization: q4f16_1 (4-bit weights with float16 scales)
- Format: MLC (Machine Learning Compilation)
- Model Type: Decoder-only Transformer
- Architecture: Llama
Model Specifications
| Parameter | Value |
|---|---|
| Parameters | 1.63B (quantized) |
| Hidden Size | 2,048 |
| Intermediate Size | 8,192 |
| Number of Layers | 16 |
| Number of Attention Heads | 32 |
| Number of Key-Value Heads | 8 (GQA) |
| Head Dimension | 64 |
| Vocabulary Size | 128,256 |
| Context Window | 131,072 tokens |
| Max Position Embeddings | 8,192 (with RoPE scaling factor: 32) |
| RMS Norm Epsilon | 1e-5 |
| Model Size (Quantized) | ~695 MB |
Quantization Details
- Quantization Method: q4f16_1
- Bits per Parameter: ~4.5 bits
- Weight Format: uint32 (packed 4-bit weights)
- Scale Format: float16
- Memory Reduction: ~75% compared to FP16
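For illustration only, the sketch below shows how a packed 4-bit / float16-scale layout of this kind can be dequantized. The group size, nibble order, and zero-point are assumptions and may not match MLC's actual q4f16_1 kernels:

import numpy as np

# Hypothetical q4f16_1-style layout: eight 4-bit weights packed into each
# uint32 word, with one float16 scale per group of weights.
# NOTE: group size, nibble order, and zero-point below are assumptions.
GROUP_SIZE = 32

def dequantize(packed: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """packed: uint32 words; scales: float16, one per GROUP_SIZE weights."""
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = (packed[:, None] >> shifts) & np.uint32(0xF)   # shape (n_words, 8)
    q = nibbles.reshape(-1).astype(np.float32) - 8.0          # recentre 0..15 -> -8..7
    scale_per_weight = np.repeat(scales.astype(np.float32), GROUP_SIZE)[: q.size]
    return (q * scale_per_weight).astype(np.float16)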
Intended Use
Primary Use Cases
- RadiPro AI assistant
- Built for demonstration purposes
Training Data
This model is based on Meta's Llama 3.2 1B Instruct model. It was further trained on a small set of synthetic data: 49 training Q/A pairs and 4 validation pairs.
How to Use
Installation
First, install the MLC Chat package:
# For CPU (macOS/Linux)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu
# For CUDA (if you have NVIDIA GPU with CUDA 12.2)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122
# For Metal (macOS with Apple Silicon - M1/M2/M3)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-metal mlc-ai-nightly-metal
Verify Installation
After installation, verify that the package is correctly installed:
# Check if mlc_llm is available
python -c "import mlc_llm; print('mlc_llm installed successfully')"
# Verify the CLI command works
mlc_llm --help
For more installation options, see the MLC-LLM installation guide.
Using MLC Runtime (Python)
Note: The Python API for MLC-LLM is primarily designed for serving. For interactive use, the command-line interface (mlc_llm chat) is recommended.
For programmatic access, you can use the MLCEngine API from mlc_llm's serving stack:
from mlc_llm import MLCEngine
# Load the model
model_path = "./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model_path, mode="local")
# Note: MLCEngine exposes an OpenAI-style chat completions interface (see the sketch below)
# For interactive chat, use: mlc_llm chat <model-path>
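A minimal end-to-end sketch using the OpenAI-style chat completions interface exposed by MLCEngine could look like the following (the prompt text is a placeholder, and exact response fields may vary between nightly builds):

from mlc_llm import MLCEngine

model_path = "./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model_path)

# Stream a single chat completion (OpenAI-style request/response objects)
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is RadiPro?"}],  # placeholder prompt
    model=model_path,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)
print()

engine.terminate()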
For more details on the Python API, see the MLC-LLM Python API documentation.
Using Command Line
The simplest way to use the model is via the mlc_llm chat command:
# Interactive chat mode
mlc_llm chat radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC

# If the mlc_llm command is not on your PATH, invoke the module directly:
python -m mlc_llm chat radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC
Conversation Template
The model uses the Llama 3 conversation template:
<|start_header_id|>system<|end_header_id|>
{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>
{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{assistant_message}<|eot_id|>
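For illustration, a single turn can be rendered by hand as below. The system and user strings are placeholders, and in practice the MLC runtime applies this template for you (typically also prepending <|begin_of_text|> and a blank line after each header):

# Placeholder messages; the MLC runtime normally builds this prompt automatically.
system_message = "You are RadiPro, a helpful assistant."   # hypothetical
user_message = "What can you help me with?"                 # hypothetical

prompt = (
    "<|begin_of_text|>"
    "<|start_header_id|>system<|end_header_id|>\n\n"
    f"{system_message}<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    f"{user_message}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
print(prompt)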
Default Generation Parameters
- Temperature: 0.6
- Top-p: 0.9
- Repetition Penalty: 1.0
- Presence Penalty: 0.0
- Frequency Penalty: 0.0
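These defaults come from the bundled chat configuration and can be overridden per request. Continuing the MLCEngine sketch above (engine and model_path reused; OpenAI-style parameter names assumed):

response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Summarize RadiPro in one sentence."}],  # placeholder prompt
    model=model_path,
    temperature=0.6,
    top_p=0.9,
    frequency_penalty=0.0,
    presence_penalty=0.0,
)
print(response.choices[0].message.content)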
Technical Details
Architecture
- Attention Mechanism: Grouped Query Attention (GQA) with 8 KV heads
- Position Encoding: RoPE (Rotary Position Embedding) with scaling
- Normalization: RMSNorm
- Activation: SwiGLU (in MLP layers)
- Tied Embeddings: Word embeddings are tied with the output layer
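As a rough illustration of what GQA saves, the sketch below estimates the KV cache for the configuration above, assuming an fp16 cache and an arbitrary 4,096-token working context:

# Rough KV-cache estimate for the configuration listed above
# (assumptions: fp16 cache, no KV-cache quantization or paging)
num_layers   = 16
num_kv_heads = 8       # GQA: 32 query heads share 8 KV heads
head_dim     = 64
bytes_fp16   = 2
context_len  = 4096    # arbitrary working context for this estimate

kv_bytes = 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_fp16  # K and V
print(f"KV cache @ {context_len} tokens: {kv_bytes / 2**20:.0f} MiB")  # ~128 MiB
# With standard multi-head attention (32 KV heads) the cache would be ~4x larger.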
Special Tokens
- <|begin_of_text|> (BOS): 128000
- <|end_of_text|> (EOS): 128001
- <|eot_id|> (End of Turn): 128009
- <|start_header_id|>: 128006
- <|end_header_id|>: 128007
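As a sanity check, these IDs can be looked up from the tokenizer files shipped in this repository, assuming the bundled tokenizer loads with Hugging Face transformers:

from transformers import AutoTokenizer

# Load the tokenizer files shipped alongside the MLC weights (local directory)
tok = AutoTokenizer.from_pretrained("./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC")
for token in ("<|begin_of_text|>", "<|end_of_text|>", "<|eot_id|>",
              "<|start_header_id|>", "<|end_header_id|>"):
    print(token, tok.convert_tokens_to_ids(token))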
File Structure
.
├── mlc-chat-config.json     # MLC configuration
├── tokenizer.json           # Tokenizer model
├── tokenizer_config.json    # Tokenizer configuration
├── tensor-cache.json        # Tensor metadata
└── params_shard_*.bin       # Model weights (22 shards)
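The runtime settings live in mlc-chat-config.json; a quick way to inspect them is sketched below (field names can differ between MLC versions):

import json
from pathlib import Path

cfg_path = Path("radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC") / "mlc-chat-config.json"
cfg = json.loads(cfg_path.read_text())

# Fields commonly present in MLC chat configs (exact keys may vary by version)
for key in ("model_type", "quantization", "context_window_size", "conv_template"):
    print(key, "=", cfg.get(key))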
Ethical Considerations
Bias and Fairness
- The model may reflect biases present in the training data
- Users should evaluate outputs for potential biases
- Consider implementing bias detection and mitigation strategies
Safety
- The model may generate content that is inaccurate, offensive, or harmful
- Implement appropriate content filtering and safety measures
- Do not use for generating misleading or harmful content
Citation
If you use this model, please cite the original Llama 3.2 model:
@misc{llama3.2,
title={Llama 3.2},
author={Meta AI},
year={2024},
howpublished={\url{https://ai.meta.com/llama/}}
}
License
Please refer to the license of the base Llama 3.2 model. This quantized version follows the same licensing terms.
Acknowledgments
- Meta AI for the original Llama 3.2 model
- MLC team for the compilation and quantization tools