SmolVLM: Redefining small and efficient multimodal models
Paper: arXiv:2504.05299
This is an INT4 quantized version of SmolVLM-Instruct using OpenVINO, designed for efficient multimodal inference on edge devices and CPUs.
# Quantized using OpenVINO NNCF
# INT4 symmetric quantization
# Applied to both vision encoder and language decoder
⚠️ Note: This is an experimental quantization. Formal benchmarks pending.
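The exact export recipe for this checkpoint is not documented; below is a minimal sketch of how a comparable INT4 symmetric export can be produced with optimum-intel and NNCF. The parameters shown are assumptions, not the verified settings.

# Hedged sketch: one plausible way to reproduce an INT4 symmetric export of
# SmolVLM-Instruct with optimum-intel / NNCF; settings are assumptions
from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig

quant_config = OVWeightQuantizationConfig(bits=4, sym=True)  # INT4, symmetric
model = OVModelForVisualCausalLM.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    export=True,  # convert the PyTorch weights to OpenVINO IR
    quantization_config=quant_config,
)
model.save_pretrained("smolvlm-int4-ov")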
Expected benefits of INT4 quantization:
- Roughly 4x smaller weights than FP16, reducing disk and memory footprint
- Faster, less memory-bound inference on CPUs and edge devices
- Possible minor accuracy loss (formal benchmarks pending, per the note above)
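As a rough back-of-envelope estimate of the size benefit (assuming ~1.7B language-model parameters, and ignoring the vision encoder, quantization scales, and runtime overhead):

# Back-of-envelope weight-size estimate; the ~1.7B parameter count is an
# assumption based on the SmolLM2-1.7B backbone
params = 1.7e9
fp16_gb = params * 2 / 1e9    # 2 bytes per FP16 weight  -> ~3.4 GB
int4_gb = params * 0.5 / 1e9  # 0.5 bytes per INT4 weight -> ~0.85 GB
print(f"FP16 ~ {fp16_gb:.2f} GB, INT4 ~ {int4_gb:.2f} GB")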
Install the required dependencies:
pip install optimum[openvino] transformers pillow
from optimum.intel import OVModelForVisualCausalLM
from transformers import AutoProcessor
from PIL import Image
import requests
# Load model and processor
model_id = "dev-bjoern/smolvlm-int4-ov"
processor = AutoProcessor.from_pretrained(model_id)
model = OVModelForVisualCausalLM.from_pretrained(model_id)
# Load an image
url = "https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Create conversation
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What do you see in this image?"}
]
}
]
# Process and generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=200)
output = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(output[0])
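Note that batch_decode returns the prompt together with the completion. If you want only the model's answer, you can slice off the prompt tokens first; this is an optional tweak, not part of the original example:

# Optional: decode only the newly generated tokens, dropping the echoed prompt
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])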
# Load multiple images
image1 = Image.open("path/to/image1.jpg")
image2 = Image.open("path/to/image2.jpg")
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image"},
{"type": "text", "text": "Compare these two images"}
]
}
]
# Build a new prompt for the two-image conversation, then generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
Tip: you can trade quality against speed by adjusting the input image resolution, passing size={"longest_edge": N*384} to the processor; N=3 or 4 offers a good balance.
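A minimal sketch of applying that setting when loading the processor (N=3 here is an assumption; tune it for your hardware):

# Assumed tweak: cap the longest image edge at N*384 pixels (N=3 here)
processor = AutoProcessor.from_pretrained(model_id, size={"longest_edge": 3 * 384})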
This is my first experiment with OpenVINO INT4 quantization for vision-language models. Feedback welcome! Have suggestions or found issues? Please open a discussion!
If you use this model, please cite both works:
@misc{smolvlm-int4-ov,
  author       = {Bjoern Bethge},
  title        = {SmolVLM INT4 OpenVINO},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dev-bjoern/smolvlm-int4-ov}}
}

@article{marafioti2025smolvlm,
  title   = {SmolVLM: Redefining small and efficient multimodal models},
  author  = {Andrés Marafioti and others},
  journal = {arXiv preprint arXiv:2504.05299},
  year    = {2025}
}
Status: 🧪 Experimental | Model Type: Vision-Language | License: Apache 2.0
Base model: HuggingFaceTB/SmolLM2-1.7B