SmolVLM: Redefining small and efficient multimodal models
Paper: arXiv:2504.05299
This is an INT4 quantized version of SmolVLM-Instruct using OpenVINO, designed for efficient multimodal inference on edge devices and CPUs.
# Quantized using OpenVINO NNCF
# INT4 symmetric quantization
# Applied to both vision encoder and language decoder
⚠️ Note: This is an experimental quantization. Formal benchmarks pending.
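The exact export recipe for this checkpoint is not documented; below is a minimal sketch of how a comparable INT4 symmetric export can be produced with optimum-intel and NNCF. The parameters shown are assumptions, not the verified settings.

# Hedged sketch: one plausible way to reproduce an INT4 symmetric export of
# SmolVLM-Instruct with optimum-intel / NNCF; settings are assumptions
from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig

quant_config = OVWeightQuantizationConfig(bits=4, sym=True)  # INT4, symmetric
model = OVModelForVisualCausalLM.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    export=True,  # convert the PyTorch weights to OpenVINO IR
    quantization_config=quant_config,
)
model.save_pretrained("smolvlm-int4-ov")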
Expected benefits of INT4 quantization:
- Roughly 4x smaller weights than FP16, reducing disk and memory footprint
- Faster, less memory-bound inference on CPUs and edge devices
- Possible minor accuracy loss (formal benchmarks pending, per the note above)
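As a rough back-of-envelope estimate of the size benefit (assuming ~1.7B language-model parameters, and ignoring the vision encoder, quantization scales, and runtime overhead):

# Back-of-envelope weight-size estimate; the ~1.7B parameter count is an
# assumption based on the SmolLM2-1.7B backbone
params = 1.7e9
fp16_gb = params * 2 / 1e9    # 2 bytes per FP16 weight  -> ~3.4 GB
int4_gb = params * 0.5 / 1e9  # 0.5 bytes per INT4 weight -> ~0.85 GB
print(f"FP16 ~ {fp16_gb:.2f} GB, INT4 ~ {int4_gb:.2f} GB")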
Install the required dependencies:
pip install optimum[openvino] transformers pillow
from optimum.intel import OVModelForVisualCausalLM
from transformers import AutoProcessor
from PIL import Image
import requests
# Load model and processor
model_id = "dev-bjoern/smolvlm-int4-ov"
processor = AutoProcessor.from_pretrained(model_id)
model = OVModelForVisualCausalLM.from_pretrained(model_id)
# Load an image
url = "https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Create conversation
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What do you see in this image?"}
]
}
]
# Process and generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=200)
output = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(output[0])
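Note that batch_decode returns the prompt together with the completion. If you want only the model's answer, you can slice off the prompt tokens first; this is an optional tweak, not part of the original example:

# Optional: decode only the newly generated tokens, dropping the echoed prompt
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])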
# Load multiple images
image1 = Image.open("path/to/image1.jpg")
image2 = Image.open("path/to/image2.jpg")
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image"},
{"type": "text", "text": "Compare these two images"}
]
}
]
# Build a new prompt for the two-image conversation, then generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
Tip: you can trade quality against speed by adjusting the input image resolution, passing size={"longest_edge": N*384} to the processor; N=3 or 4 offers a good balance.
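A minimal sketch of applying that setting when loading the processor (N=3 here is an assumption; tune it for your hardware):

# Assumed tweak: cap the longest image edge at N*384 pixels (N=3 here)
processor = AutoProcessor.from_pretrained(model_id, size={"longest_edge": 3 * 384})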
This is my first experiment with OpenVINO INT4 quantization for vision-language models. Feedback welcome! Have suggestions or found issues? Please open a discussion!
If you use this model, please cite both works:
@misc{smolvlm-int4-ov,
  author       = {Bjoern Bethge},
  title        = {SmolVLM INT4 OpenVINO},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dev-bjoern/smolvlm-int4-ov}}
}

@article{marafioti2025smolvlm,
  title   = {SmolVLM: Redefining small and efficient multimodal models},
  author  = {Andrés Marafioti and others},
  journal = {arXiv preprint arXiv:2504.05299},
  year    = {2025}
}
Status: 🧪 Experimental | Model Type: Vision-Language | License: Apache 2.0
Base model: HuggingFaceTB/SmolLM2-1.7B