---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

[Paper](https://arxiv.org/abs/2510.14979) | [Code](https://github.com/EvolvingLMMs-Lab/NEO)

## 🌟🌟 Motivation

**Two lingering clouds cast shadows over the widespread exploration and promotion of native VLMs:**

- What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome?
- How can research on native VLMs be made more accessible and democratized, thereby accelerating progress in the field?

**We construct native VLMs from first principles, whose primitives should:**

- effectively align pixel and word representations within a shared semantic space;
- seamlessly integrate the strengths of separate vision and language modules;
- inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning.

## 🚀🚀 Highlight

- With only 390M image-text examples, NEO develops strong visual perception from scratch inside a dense, monolithic model via elaborate primitives.
- NEO serves as a cornerstone for scalable and powerful native VLMs, paired with reusable components that foster a cost-effective and extensible ecosystem.

## 🧑‍🎨🧑‍🎨 Model Overview

**NEO1_0-2B** has the following features:

- Model Type: Native Vision-Language Model
- Model Mode: Mixed Native-Attn & Native-RoPE
- Layer Parameters: 56M vs. 50M (Qwen3-1.7B)
- Model Parameters: 2.2B (Non-Embedding)
- Number of Layers: 40 (12 for Pre-Buffer & 28 for Post-LLM)
- Number of Heads: 16 for Q and 8 for KV (GQA)
- Head Dimensions: 128 * 2 for QK and 128 for V

## 🔥🔥 Model Performance

## 📚📚 Model Weights

We release the 2B weights of **NEO1_0** at three stages: Pre-Training (PT), Mid-Training (MT), and Supervised Fine-Tuning (SFT).
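As a shape-level illustration of the attention configuration listed under Model Overview (16 query heads, 8 KV heads, 256-dim QK heads, 128-dim V heads), here is a minimal NumPy sketch of grouped-query attention. The sequence length and random inputs are illustrative assumptions; this is not NEO's actual implementation.

```python
import numpy as np

# Head layout taken from the NEO1_0-2B model card; everything else
# (sequence length, random inputs) is an illustrative assumption.
num_q_heads = 16        # query heads
num_kv_heads = 8        # key/value heads (GQA)
qk_head_dim = 128 * 2   # per-head dimension for Q and K
v_head_dim = 128        # per-head dimension for V

group_size = num_q_heads // num_kv_heads  # query heads sharing each KV head

seq_len = 4
rng = np.random.default_rng(0)
q = rng.standard_normal((num_q_heads, seq_len, qk_head_dim))
k = rng.standard_normal((num_kv_heads, seq_len, qk_head_dim))
v = rng.standard_normal((num_kv_heads, seq_len, v_head_dim))

# Expand K/V so each KV head serves its group of query heads.
k_exp = np.repeat(k, group_size, axis=0)  # (16, 4, 256)
v_exp = np.repeat(v, group_size, axis=0)  # (16, 4, 128)

# Scaled dot-product attention with a numerically stable softmax.
scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(qk_head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
out = weights @ v_exp

print(group_size, out.shape)  # 2 (16, 4, 128)
```

Note how GQA halves the KV cache relative to standard multi-head attention here: 8 KV heads serve 16 query heads, with each KV head shared by a group of 2.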
| Model name | Weight |
| ---------- | ------ |
| **NEO-2B-PT** | [🤗 NEO-2B-PT HF link](https://huggingface.co/Paranioar/NEO1_0-2B-PT) |
| **NEO-2B-MT** | [🤗 NEO-2B-MT HF link](https://huggingface.co/Paranioar/NEO1_0-2B-MT) |
| **NEO-2B-SFT** | [🤗 NEO-2B-SFT HF link](https://huggingface.co/Paranioar/NEO1_0-2B-SFT) |

## ✒️✒️ Citation

If **NEO** is helpful for your research, please consider giving it a **star** ⭐ and a **citation** 📝:

```bibtex
@article{Diao2025NEO,
  title   = {From Pixels to Words--Towards Native Vision-Language Primitives at Scale},
  author  = {Diao, Haiwen and Li, Mingxuan and Wu, Silei and Dai, Linjun and Wang, Xiaohua and Deng, Hanming and Lu, Lewei and Lin, Dahua and Liu, Ziwei},
  journal = {arXiv preprint arXiv:2510.14979},
  year    = {2025}
}
```