LFM2-VL-3B

LFM2-VL-3B is the newest and most capable model in Liquid AI's multimodal LFM2-VL series, designed to process text and images with variable resolutions.
Built on the LFM2 backbone, it extends the architecture for higher-capacity reasoning and stronger visual understanding while retaining efficiency.

We are releasing the weights of the new 3B checkpoint, offering higher performance across benchmarks while remaining optimized for scalable deployment.

  • Competitive multimodal performance among lightweight open models
  • Enhanced visual understanding and reasoning, particularly on fine-grained perception tasks
  • Retains efficient inference with the same flexible architecture and user-tunable speed-quality tradeoffs
  • Processes native resolutions up to 512×512 with intelligent patch-based handling for larger inputs

For more details, see the LFM2-VL-3B post and the LFM2 blog post.

📄 Model details

Due to their small size, we recommend fine-tuning LFM2-VL models on narrow use cases to maximize performance. They were trained for instruction following and lightweight agentic flows. They are not intended for safety-critical decisions.

| Property | LFM2-VL-450M | LFM2-VL-1.6B | LFM2-VL-3B |
|---|---|---|---|
| Parameters (LM only) | 350M | 1.2B | 2.6B |
| Vision encoder | SigLIP2 NaFlex base (86M) | SigLIP2 NaFlex shape-optimized (400M) | SigLIP2 NaFlex large (400M) |
| Backbone layers | hybrid conv+attention | hybrid conv+attention | hybrid conv+attention |
| Context (text) | 32,768 tokens | 32,768 tokens | 32,768 tokens |
| Image tokens | dynamic, user-tunable | dynamic, user-tunable | dynamic, user-tunable |
| Vocab size | 65,536 | 65,536 | 65,536 |
| Precision | bfloat16 | bfloat16 | bfloat16 |
| License | LFM Open License v1.0 | LFM Open License v1.0 | LFM Open License v1.0 |

Supported languages: English

Generation parameters: We recommend the following settings (a usage sketch follows the list):

  • Text: temperature=0.1, min_p=0.15, repetition_penalty=1.05
  • Vision: min_image_tokens=64, max_image_tokens=256, do_image_splitting=True
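
As a usage sketch (not an official snippet), the settings above could be passed at inference time roughly as follows. It assumes that `model`, `processor`, and `conversation` are set up as in the inference example further below, that `generate()` accepts `min_p` (available in recent transformers versions), and that the image-token arguments are forwarded to the image processor; if they are not, they can instead be passed when calling the processor directly or set at load time.

# Sketch: applying the recommended text and vision settings at inference time.
# Assumes `model`, `processor`, and `conversation` exist as in the example below.
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
    min_image_tokens=64,       # vision: lower bound on tokens per image (assumed forwarded)
    max_image_tokens=256,      # vision: upper bound on tokens per image (assumed forwarded)
    do_image_splitting=True,   # vision: tile large images into patches (assumed forwarded)
).to(model.device)

outputs = model.generate(
    **inputs,
    do_sample=True,            # sampling must be enabled for temperature/min_p to apply
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
    max_new_tokens=128,
)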

Chat template: LFM2-VL uses a ChatML-like chat template as follows:

<|startoftext|><|im_start|>system
You are a helpful multimodal assistant by Liquid AI.<|im_end|>
<|im_start|>user
<image>Describe this image.<|im_end|>
<|im_start|>assistant
This image shows a Caenorhabditis elegans (C. elegans) nematode.<|im_end|>

Images are referenced with a sentinel (<image>), which is automatically replaced with the image tokens by the processor.

You can apply it using the dedicated .apply_chat_template() function from Hugging Face transformers.
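
As a quick illustration (a sketch, not part of the official examples), rendering the template with `tokenize=False` returns the formatted prompt string, which is a convenient way to inspect where the image sentinel and special tokens end up. This assumes the template accepts an image entry without attached pixel data when no tokenization is requested.

from transformers import AutoProcessor

# Sketch: inspect the rendered chat template without tokenizing.
processor = AutoProcessor.from_pretrained("LiquidAI/LFM2-VL-3B")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},                               # becomes the <image> sentinel
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

prompt = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,   # return the prompt string instead of input tensors
)
print(prompt)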

Architecture

  • Hybrid backbone: Language model tower (LFM2-2.6B) paired with a SigLIP2 NaFlex vision encoder (400M shape-optimized)
  • Native resolution processing: Handles images up to 512×512 pixels without upscaling and preserves non-standard aspect ratios without distortion
  • Tiling strategy: Splits large images into non-overlapping 512×512 patches and includes thumbnail encoding for global context
  • Efficient token mapping: 2-layer MLP connector with pixel unshuffle reduces image tokens (e.g., 256×384 image → 96 tokens, 1000×3000 → 1,020 tokens); see the sketch after this list
  • Inference-time flexibility: User-tunable maximum image tokens and patch count for speed/quality tradeoff without retraining
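
To make the token-mapping arithmetic concrete, here is a back-of-the-envelope sketch for a single unsplit image. The 16-pixel patch size and 2×2 pixel-unshuffle factor are assumptions inferred from the 256×384 → 96 example above, not taken from the released implementation, and tiled images additionally include thumbnail and per-patch tokens.

# Sketch: rough image-token estimate for one unsplit image.
# Assumes a 16-pixel vision patch and a 2x2 pixel unshuffle in the connector
# (inferred from the 256x384 -> 96 token example, not from released code).
def estimate_image_tokens(width: int, height: int, patch: int = 16, unshuffle: int = 2) -> int:
    patches = (width // patch) * (height // patch)    # encoder patches
    return patches // (unshuffle * unshuffle)         # tokens after pixel unshuffle

print(estimate_image_tokens(256, 384))  # 96, matching the example above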

Training approach

  • Builds on the LFM2 base model with joint mid-training that fuses vision and language capabilities using a gradually adjusted text-to-image ratio
  • Applies joint SFT with emphasis on image understanding and vision tasks
  • Leverages large-scale open-source datasets combined with in-house synthetic vision data, selected for balanced task coverage
  • Follows a progressive training strategy: base model β†’ joint mid-training β†’ supervised fine-tuning

πŸƒ How to run LFM2-VL

You can run LFM2-VL with Hugging Face transformers by installing Transformers from source as follows:

pip install git+https://github.com/huggingface/transformers.git@87be5595081364ef99393feeaa60d71db3652679 pillow

Here is an example of how to generate an answer with transformers in Python:

from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

# Load model and processor
model_id = "LiquidAI/LFM2-VL-3B"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load image and create conversation
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = load_image(url)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is in this image?"},
        ],
    },
]

# Generate Answer
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])

# This image captures a vibrant street scene in a Chinatown area. The focal point is a large red Chinese archway with gold and black accents, adorned with Chinese characters. Flanking the archway are two white stone lion statues, which are traditional guardians in Chinese culture.

You can directly run and test the model with this Colab notebook.

🔧 How to fine-tune

We recommend fine-tuning LFM2-VL models on your use cases to maximize performance.

| Notebook | Description | Link |
|---|---|---|
| SFT (TRL) | Supervised Fine-Tuning (SFT) notebook with a LoRA adapter using TRL. | Colab link |
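
For orientation only, here is a minimal sketch of what a LoRA-based SFT run with TRL can look like; it is not the content of the linked notebook. The dataset, target modules, collator details, and hyperparameters are assumptions that would need to be adapted to your own data and hardware.

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForImageTextToText, AutoProcessor
from trl import SFTConfig, SFTTrainer

model_id = "LiquidAI/LFM2-VL-3B"
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="bfloat16")
processor = AutoProcessor.from_pretrained(model_id)

# Example vision-chat dataset with "messages" and "images" columns
# (assumption: swap in your own data prepared in the same format).
dataset = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft", split="train[:1%]")

def collate_fn(examples):
    # Render each conversation with the chat template, tokenize it together
    # with its images, and mask padding tokens out of the loss.
    texts = [processor.apply_chat_template(ex["messages"], tokenize=False) for ex in examples]
    images = [ex["images"] for ex in examples]
    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch

peft_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM")

training_args = SFTConfig(
    output_dir="lfm2-vl-3b-sft-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    bf16=True,
    remove_unused_columns=False,
    dataset_kwargs={"skip_prepare_dataset": True},  # batches come fully prepared from collate_fn
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=collate_fn,
    peft_config=peft_config,
)
trainer.train()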

📈 Performance

| Model | Average | MMStar | RealWorldQA | MM-IFEval | BLINK | MMBench (dev en) | OCRBench | POPE |
|---|---|---|---|---|---|---|---|---|
| InternVL3_5-2B | 66.50 | 57.67 | 60.78 | 47.31 | 50.97 | 78.18 | 834.00 | 87.17 |
| Qwen2.5-VL-3B | 65.42 | 56.13 | 65.23 | 38.62 | 48.97 | 80.41 | 824.00 | 86.17 |
| InternVL3-2B | 67.44 | 61.10 | 65.10 | 38.49 | 53.10 | 81.10 | 831.00 | 90.10 |
| SmolVLM2-2.2B | 56.01 | 46.00 | 57.50 | 19.42 | 42.30 | 69.24 | 725.00 | 85.10 |
| LFM2-VL-3B | 69.00 | 57.73 | 71.37 | 51.83 | 51.03 | 79.81 | 822.00 | 89.01 |

More benchmark scores are reported in our LFM2-VL-3B post. We obtained the scores for competing models using VLMEvalKit. Qwen3-VL-2B is not included in the table because it was released only a day before this model.

📬 Contact

If you are interested in custom solutions with edge deployment, please contact our sales team.
