---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---
# From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
| [Paper](https://arxiv.org/abs/2510.14979) | [Code](https://github.com/EvolvingLMMs-Lab/NEO) |
## 🌟🌟 Motivation
**Two lingering clouds cast shadows over the widespread exploration and promotion of native VLMs:**
- What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome?
- How can research on native VLMs be made more accessible and democratized, thereby accelerating progress in the field?
**We construct native VLMs from first principles, whose primitives should:**
- effectively align pixel and word representations within a shared semantic space;
- seamlessly integrate the strengths of separate vision and language modules;
- inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning.
## 🚀🚀 Highlight
- With only 390M image-text examples, NEO develops strong visual perception from scratch inside a dense, monolithic model via carefully designed primitives.
- NEO serves as a cornerstone for scalable and powerful native VLMs, paired with reusable components that foster a cost-effective and extensible ecosystem.
## 🧑‍🎨🧑‍🎨 Model Overview
**NEO1_0-2B** has the following features:
- Model Type: Native Vision-Language Models
- Model Mode: Mixed Native-Attn & Native-RoPE
- Parameters per Layer: 56M (vs. 50M for Qwen3-1.7B)
- Model Parameters: 2.2B (Non-Embedding)
- Number of Layers: 40 (12 for Pre-Buffer & 28 for Post-LLM)
- Number of Heads: 16 for Q and 8 for KV (GQA)
- Head Dimensions: 128 * 2 for QK and 128 for V
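The head layout above (16 query heads sharing 8 KV heads, with a wider QK dimension than V) can be illustrated with a generic grouped-query attention sketch. This is a minimal NumPy mock of the GQA mechanism using the listed dimensions, not NEO's actual implementation:

```python
import numpy as np

n_q, n_kv = 16, 8                 # 16 Q heads, 8 KV heads (GQA)
qk_dim, v_dim = 128 * 2, 128      # QK head dim is twice the V head dim
seq, group = 10, n_q // n_kv      # each KV head serves 2 query heads

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q, seq, qk_dim))
k = rng.standard_normal((n_kv, seq, qk_dim))
v = rng.standard_normal((n_kv, seq, v_dim))

# Expand each KV head across its group of query heads.
k = np.repeat(k, group, axis=0)   # (16, seq, qk_dim)
v = np.repeat(v, group, axis=0)   # (16, seq, v_dim)

# Scaled dot-product attention per head.
scores = q @ k.transpose(0, 2, 1) / np.sqrt(qk_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v

print(out.shape)  # (16, 10, 128)
```

Note how the KV cache stores only 8 heads while 16 query heads attend, halving cache memory relative to full multi-head attention.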
## 🔥🔥 Model Performance
## 📚📚 Model Weights
We release the 2B weights of **NEO1_0** at three stages: Pre-Training (PT), Mid-Training (MT), and Supervised Fine-Tuning (SFT).
| Model name | Weight |
| ---------- | ------------------------------------------------------- |
| **NEO-2B-PT** | [🤗 NEO-2B-PT HF link](https://huggingface.co/Paranioar/NEO1_0-2B-PT) |
| **NEO-2B-MT** | [🤗 NEO-2B-MT HF link](https://huggingface.co/Paranioar/NEO1_0-2B-MT) |
| **NEO-2B-SFT** | [🤗 NEO-2B-SFT HF link](https://huggingface.co/Paranioar/NEO1_0-2B-SFT) |
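A minimal loading sketch for the SFT checkpoint, assuming it exposes the standard `transformers` image-text-to-text interface; the exact auto classes and whether `trust_remote_code` is required are assumptions, so check the repository for the supported entry point:

```python
def load_neo(repo_id: str = "Paranioar/NEO1_0-2B-SFT"):
    """Fetch the processor and model weights from the Hugging Face Hub.

    Imports are deferred so the module can be inspected without
    transformers installed; the class names are assumed, not confirmed
    by the NEO repository.
    """
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
    model = AutoModelForImageTextToText.from_pretrained(
        repo_id, trust_remote_code=True
    )
    return processor, model


if __name__ == "__main__":
    processor, model = load_neo()
```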
## ✒️✒️ Citation
If **NEO** is helpful for your research, please consider giving it a **star** ⭐ and a **citation** 📝:
```bibtex
@article{Diao2025NEO,
title = {From Pixels to Words--Towards Native Vision-Language Primitives at Scale},
author = {Diao, Haiwen and Li, Mingxuan and Wu, Silei and Dai, Linjun and Wang, Xiaohua and Deng, Hanming and Lu, Lewei and Lin, Dahua and Liu, Ziwei},
journal = {arXiv preprint arXiv:2510.14979},
year = {2025}
}
```