Radipro Chatbot - Llama 3.2 1B Instruct (MLC Quantized)

Model Details

Model Description

This is a quantized version of the Llama 3.2 1B Instruct model, optimized for deployment using Machine Learning Compilation (MLC). The model has been quantized to 4-bit precision (q4f16_1) to reduce memory footprint while maintaining reasonable performance.

  • Base Model: Llama 3.2 1B Instruct
  • Quantization: q4f16_1 (4-bit weights with float16 scales)
  • Format: MLC (Machine Learning Compilation)
  • Model Type: Decoder-only Transformer
  • Architecture: Llama

Model Specifications

Parameter                          Value
Parameters                         1.63B (quantized)
Hidden Size                        2,048
Intermediate Size                  8,192
Number of Layers                   16
Number of Attention Heads          32
Number of Key-Value Heads          8 (GQA)
Head Dimension                     64
Vocabulary Size                    128,256
Context Window                     131,072 tokens
Original Max Position Embeddings   8,192 (extended via RoPE scaling, factor 32)
RMS Norm Epsilon                   1e-5
Model Size (Quantized)             ~695 MB

Quantization Details

  • Quantization Method: q4f16_1
  • Bits per Parameter: ~4.5 bits (see the breakdown after this list)
  • Weight Format: uint32 (packed 4-bit weights)
  • Scale Format: float16
  • Memory Reduction: ~75% compared to FP16
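
The ~4.5 bits per parameter figure follows directly from the packing layout. A minimal arithmetic sketch, assuming a group size of 32 weights per float16 scale, which is typical for q4f16_1 (the exact group size is an assumption here):

# Back-of-envelope bit cost for q4f16_1.
# Assumption: 32 quantized weights share one float16 scale per group.
GROUP_SIZE = 32
WEIGHT_BITS = 4
SCALE_BITS = 16

bits_per_group = GROUP_SIZE * WEIGHT_BITS + SCALE_BITS   # 128 + 16 = 144
bits_per_param = bits_per_group / GROUP_SIZE             # 4.5
print(f"{bits_per_param} bits per parameter")

# Eight 4-bit weights fit into one packed uint32 word:
print(32 // WEIGHT_BITS, "weights per uint32")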

Intended Use

Primary Use Cases

  • Powering the RadiPro AI assistant
  • Built for demonstration purposes

Training Data

This model is based on Meta's Llama 3.2 1B Instruct model. It was fine-tuned on a small set of synthetic data: 49 training Q/A pairs and 4 validation pairs.

How to Use

Installation

First, install the MLC Chat package:

# For CPU (macOS/Linux)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu

# For CUDA (if you have NVIDIA GPU with CUDA 12.2)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122

# For Metal (macOS with Apple Silicon - M1/M2/M3)
python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-metal mlc-ai-nightly-metal

Verify Installation

After installation, verify that the package is correctly installed:

# Check if mlc_llm is available
python -c "import mlc_llm; print('mlc_llm installed successfully')"

# Verify the CLI command works
mlc_llm --help

For more installation options, see the MLC-LLM installation guide.

Using MLC Runtime (Python)

Note: MLC-LLM's Python API is built around MLCEngine, an OpenAI-compatible engine. For quick interactive use, the command-line interface (mlc_llm chat) is the simplest option.

For programmatic access, you can use the OpenAI-compatible MLCEngine API:

from mlc_llm import MLCEngine

# Load the quantized model from the local directory
model_path = "./radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model_path, mode="local")

# Run a single chat completion through the OpenAI-style interface
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is RadiPro?"}],
    model=model_path,
)
print(response.choices[0].message.content)

# Release resources when done
engine.terminate()

For more details on the Python API, see the MLC-LLM Python API documentation.

Using Command Line

The simplest way to use the model is via the mlc_llm chat command:

# Interactive chat mode
mlc_llm chat radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC

# If the command is not found, invoke the module directly:
python -m mlc_llm chat radipro-chatbot-Llama-3.2-1B-Instruct-q4f16_1-MLC

Conversation Template

The model uses the Llama 3 conversation template:

<|start_header_id|>system<|end_header_id|>

{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{assistant_message}<|eot_id|>
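
For illustration, here is a minimal sketch of assembling this template by hand in Python. The MLC runtime applies the template automatically, so this is only for reference; build_prompt is a hypothetical helper, and the <|begin_of_text|> BOS token (see Special Tokens below) is prepended as is standard for Llama 3 prompts:

# Hypothetical helper that renders the Llama 3 template by hand.
# In normal use, the MLC runtime applies this template for you.
def build_prompt(system_message: str, user_message: str) -> str:
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_message}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(build_prompt("You are the RadiPro assistant.", "Hello!"))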

Default Generation Parameters

  • Temperature: 0.6
  • Top-p: 0.9
  • Repetition Penalty: 1.0
  • Presence Penalty: 0.0
  • Frequency Penalty: 0.0
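
The defaults above can be overridden per request through the OpenAI-compatible Python API. A minimal sketch, assuming an engine and model_path created as in the Python API example earlier:

# Override sampling parameters for a single request.
# Assumes `engine` and `model_path` as in the Python API example above.
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is RadiPro?"}],
    model=model_path,
    temperature=0.6,       # matches the default listed above
    top_p=0.9,
    frequency_penalty=0.0,
    presence_penalty=0.0,
)
print(response.choices[0].message.content)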

Technical Details

Architecture

  • Attention Mechanism: Grouped Query Attention (GQA) with 8 KV heads (see the dimension check after this list)
  • Position Encoding: RoPE (Rotary Position Embedding) with scaling
  • Normalization: RMSNorm
  • Activation: SwiGLU (in MLP layers)
  • Tied Embeddings: Word embeddings are tied with output layer
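
As a quick sanity check, the dimensions in the specification table fit together as follows (plain arithmetic, no MLC dependencies):

# Consistency check of the architecture numbers above.
hidden_size = 2048
num_heads = 32
num_kv_heads = 8
head_dim = 64

assert num_heads * head_dim == hidden_size       # 32 * 64 = 2048
print(num_heads // num_kv_heads, "query heads share each KV head (GQA)")

# GQA shrinks the KV cache: per layer and token, K and V each store
# num_kv_heads * head_dim values instead of num_heads * head_dim.
print(2 * num_kv_heads * head_dim, "fp16 values per token per layer")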

Special Tokens

  • <|begin_of_text|> (BOS): 128000
  • <|end_of_text|> (EOS): 128001
  • <|eot_id|> (End of Turn): 128009
  • <|start_header_id|>: 128006
  • <|end_header_id|>: 128007
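
To confirm these IDs against the shipped tokenizer, the tokenizer.json file can be loaded with the Hugging Face tokenizers package (assuming pip install tokenizers; run from the model directory):

from tokenizers import Tokenizer

# Load the bundled tokenizer and look up each special token ID.
tok = Tokenizer.from_file("tokenizer.json")
for token in ("<|begin_of_text|>", "<|end_of_text|>", "<|eot_id|>",
              "<|start_header_id|>", "<|end_header_id|>"):
    print(token, tok.token_to_id(token))
# Expected IDs: 128000, 128001, 128009, 128006, 128007 (as listed above)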

File Structure

.
β”œβ”€β”€ mlc-chat-config.json      # MLC configuration
β”œβ”€β”€ tokenizer.json            # Tokenizer model
β”œβ”€β”€ tokenizer_config.json     # Tokenizer configuration
β”œβ”€β”€ tensor-cache.json         # Tensor metadata
└── params_shard_*.bin        # Model weights (22 shards)
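
Since mlc-chat-config.json is plain JSON, the packaged settings are easy to inspect. A minimal sketch (the field names quantization, context_window_size, and conv_template are assumptions based on typical MLC chat configs and may differ):

import json

# Read the MLC chat configuration shipped with the model.
with open("mlc-chat-config.json") as f:
    cfg = json.load(f)

# Field names are assumptions based on typical mlc-chat-config.json files.
print("quantization:   ", cfg.get("quantization"))
print("context window: ", cfg.get("context_window_size"))
print("conv template:  ", cfg.get("conv_template"))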

Ethical Considerations

Bias and Fairness

  • The model may reflect biases present in the training data
  • Users should evaluate outputs for potential biases
  • Consider implementing bias detection and mitigation strategies

Safety

  • The model may generate content that is inaccurate, offensive, or harmful
  • Implement appropriate content filtering and safety measures
  • Do not use for generating misleading or harmful content

Citation

If you use this model, please cite the original Llama 3.2 model:

@misc{llama3.2,
  title={Llama 3.2},
  author={Meta AI},
  year={2024},
  howpublished={\url{https://ai.meta.com/llama/}}
}

License

Please refer to the license of the base Llama 3.2 model. This quantized version follows the same licensing terms.

Acknowledgments

  • Meta AI for the original Llama 3.2 model
  • MLC team for the compilation and quantization tools