SmolVLM INT4 OpenVINO
Optimized Vision-Language Model for Edge Deployment
This is an INT4 quantized version of SmolVLM-Instruct using OpenVINO, designed for efficient multimodal inference on edge devices and CPUs.
Model Overview
- Base Model: SmolVLM-Instruct (2.25B parameters)
- Quantization: INT4 via OpenVINO
- Model Type: Vision-Language Model (VLM)
- Capabilities: Image captioning, visual Q&A, multimodal reasoning
- Target Hardware: CPUs, Intel GPUs, NPUs
- Use Cases: On-device multimodal AI, edge vision applications
Technical Details
Quantization Process
# Quantized using OpenVINO NNCF
# INT4 symmetric quantization
# Applied to both vision encoder and language decoder
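To make the idea of symmetric INT4 quantization concrete, here is a minimal pure-Python sketch that quantizes one small group of weights to signed 4-bit integers with a single shared scale. It is an illustration only and not NNCF's actual implementation; the function names and the toy weights are assumptions made for this example.

```python
def quantize_int4_symmetric(weights):
    """Map a group of floats to integers in [-8, 7] using one shared scale.

    Hypothetical helper for illustration; real toolkits pack the values,
    quantize per channel or per group, and may choose the scale differently.
    """
    max_abs = max(abs(w) for w in weights)
    # Symmetric scheme: zero maps to zero, so no zero point is stored.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 4-bit integers."""
    return [q * scale for q in quantized]


weights = [0.7, -0.3, 0.1, 0.0]  # toy values, not taken from the model
quantized, scale = quantize_int4_symmetric(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

The rounding error per weight is bounded by half the scale, which is why weight groups with a few large outliers tend to lose more precision under this kind of scheme.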
Model Architecture
- Vision Encoder: Shape-optimized SigLIP (INT4)
- Text Decoder: SmolLM2 (INT4)
- Visual tokens: 81 per 384×384 patch
- Supports arbitrary image-text interleaving
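The figure of 81 visual tokens per 384×384 patch allows a rough estimate of how many visual tokens an image contributes to the decoder. The sketch below assumes the image is tiled into 384×384 patches and, optionally, that one downscaled global view is encoded alongside the tiles; that global view, the function name, and the fact that separator tokens are ignored are all assumptions of this example rather than statements about this model's exact preprocessing.

```python
import math


def estimate_visual_tokens(width, height, patch=384, tokens_per_patch=81,
                           include_global=True):
    """Estimate visual tokens for an image of the given pixel size.

    Hypothetical helper: counts 384x384 tiles (rounding up at the edges)
    and, if include_global is True, one extra global view of the image.
    """
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    patches = rows * cols
    if include_global:
        patches += 1  # assumed extra downscaled view of the whole image
    return patches * tokens_per_patch


for side in (384, 768, 1152, 1536):
    print(side, estimate_visual_tokens(side, side))
```

Because the tile count grows with the square of the edge length, halving the longest edge cuts the visual token count by roughly a factor of four under these assumptions.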
Performance (Experimental)
Note: This is an experimental quantization. Formal benchmarks are pending.
Expected benefits of INT4 quantization:
- Significantly reduced model size
- Faster inference on CPU/edge devices
- Lower memory requirements for multimodal tasks
- Visual understanding capabilities expected to be largely preserved (not yet measured)
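As a back-of-the-envelope check on the size claim, the weight footprint can be estimated from the parameter count and the bits per weight. This is arithmetic, not a measurement: it assumes every one of the 2.25B parameters is stored at the stated precision and ignores quantization scales and any layers kept at higher precision, so the real files will be somewhat larger. The function name is hypothetical.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 2.25e9  # parameter count stated in the model overview

fp16_gb = weight_footprint_gb(NUM_PARAMS, 16)
int4_gb = weight_footprint_gb(NUM_PARAMS, 4)
print(f"FP16: {fp16_gb:.3f} GB, INT4: {int4_gb:.3f} GB, "
      f"ratio: {fp16_gb / int4_gb:.1f}x")
```

Under these idealized assumptions, INT4 weights take a quarter of the FP16 footprint; the ratio actually achieved should be checked against the exported model files.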
How to Use
Installation
pip install "optimum[openvino]" transformers pillow
Basic Usage
from optimum.intel import OVModelForVision2Seq
from transformers import AutoProcessor
from PIL import Image
import requests
# Load model and processor
model_id = "dev-bjoern/smolvlm-int4-ov"
processor = AutoProcessor.from_pretrained(model_id)
model = OVModelForVision2Seq.from_pretrained(model_id)
# Load an image
url = "https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Create conversation
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do you see in this image?"}
        ]
    }
]
# Process and generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=200)
output = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(output[0])
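The decoded string typically contains the prompt as well as the model's reply. A small helper can isolate the reply; this sketch assumes the chat template marks the model turn with the literal text "Assistant:", which should be verified against the actual decoded output. The helper name and the sample string are made up for illustration.

```python
def extract_reply(decoded_text, marker="Assistant:"):
    """Return the text after the last occurrence of the assistant marker.

    Hypothetical helper: if the marker is absent, the full text is
    returned unchanged (only stripped of surrounding whitespace).
    """
    if marker not in decoded_text:
        return decoded_text.strip()
    return decoded_text.rsplit(marker, 1)[1].strip()


# Made-up decoded output, used only to exercise the helper.
sample = "User: What do you see in this image?\nAssistant: A bee on a flower."
reply = extract_reply(sample)
print(reply)
```

With the basic usage code, this would be called as extract_reply(output[0]).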
Multiple Images
# Load multiple images
image1 = Image.open("path/to/image1.jpg")
image2 = Image.open("path/to/image2.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Compare these two images"}
        ]
    }
]
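# Rebuild the prompt from the two-image conversation, then process both images
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")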
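The conversation needs one {"type": "image"} entry for each image passed to the processor, so it can help to generate the message structure instead of writing it by hand. The helper below is a hypothetical convenience function, not part of transformers or optimum; it only builds the same dictionary layout used in the examples above.

```python
def build_messages(question, num_images):
    """Build a single-turn user message with num_images image placeholders."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


messages = build_messages("Compare these two images", num_images=2)
print(messages)
```

The result can be passed to processor.apply_chat_template in the same way as the hand-written messages list.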
Intended Use
- Edge AI vision applications
- Local multimodal assistants
- Privacy-focused image analysis
- Resource-constrained deployment
- Real-time visual understanding
Optimization Tips
- Image Resolution: Set size={"longest_edge": N*384} when loading the processor; N=3 or 4 is a reasonable balance between detail and speed
- Batch Processing: Process multiple images together when possible
- CPU Inference: Leverage OpenVINO runtime optimizations
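The image resolution tip can be sketched as a small helper that builds the size setting for a given N and shows what an image would be scaled to if its longest edge were resized to N*384 with the aspect ratio preserved. Both function names are hypothetical, and the real processor may round or pad dimensions differently, so treat the computed sizes as approximate.

```python
PATCH = 384  # patch edge length stated in the model architecture section


def processor_size_kwargs(n):
    """Return keyword arguments carrying size={"longest_edge": n * 384}."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return {"size": {"longest_edge": n * PATCH}}


def scaled_size(width, height, n):
    """Approximate (width, height) after resizing the longest edge to n * 384."""
    target = n * PATCH
    longest = max(width, height)
    return round(width * target / longest), round(height * target / longest)


kwargs = processor_size_kwargs(3)
print(kwargs, scaled_size(3000, 2000, 3))
```

Assuming the processor accepts the size override at load time, as the tip suggests, the dictionary would be passed as AutoProcessor.from_pretrained(model_id, **kwargs).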
Experimental Status
This is my first experiment with OpenVINO INT4 quantization for vision-language models. Feedback welcome!
Known Limitations
- No formal benchmarks yet
- Visual quality degradation not measured
- Optimal quantization settings still being explored
Future Improvements
- Benchmark on standard VLM tasks
- Compare with original model performance
- Experiment with mixed precision
- Test on various hardware configurations
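Until formal benchmarks exist, a simple latency measurement can give a first impression on a given machine. The sketch below times any zero-argument callable with a warmup phase and reports the median; the helper name and defaults are assumptions of this example, and it measures wall-clock time only, not output quality.

```python
import statistics
import time


def median_latency(fn, warmup=1, runs=5):
    """Return the median wall-clock time in seconds of calling fn().

    Warmup calls are executed first and not timed, since the first
    inference often includes one-off compilation or caching costs.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload so the sketch runs without the model.
calls = []
latency = median_latency(lambda: calls.append(sum(range(1000))), warmup=2, runs=3)
print(f"median latency: {latency:.6f} s")
```

With the model loaded as in Basic Usage, the callable could be lambda: model.generate(**inputs, max_new_tokens=200).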
Contributing
Have suggestions or found issues? Please open a discussion!
Acknowledgments
- HuggingFace team for SmolVLM
- Intel OpenVINO team for quantization tools
- Vision-language model community
Citation
If you use this model, please cite both works:
@misc{smolvlm-int4-ov,
  author = {Bjoern Bethge},
  title = {SmolVLM INT4 OpenVINO},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dev-bjoern/smolvlm-int4-ov}}
}
@article{marafioti2025smolvlm,
  title={SmolVLM: Redefining small and efficient multimodal models},
  author={Andrés Marafioti and others},
  journal={arXiv preprint arXiv:2504.05299},
  year={2025}
}
Status: Experimental | Model Type: Vision-Language | License: Apache 2.0