Near, far: Patch-ordering enhances vision foundation models' scene understanding
Welcome to the Hugging Face repository for NeCo, an adapted vision encoder that captures the fine-grained details and structural information essential for key-point matching, semantic segmentation, and more. This repository hosts pretrained NeCo checkpoints for easy integration into your projects.
Our paper discussing our work:
"Near, far: Patch-ordering enhances vision foundation models' scene understanding"
Valentinos Pariza, Mohammadreza Salehi, Gertjan J. Burghouts, Francesco Locatello, Yuki M. Asano
Project Page | GitHub Repository | Read the Paper on arXiv
Model Details
Model Description
NeCo introduces a new self-supervised learning technique for enhancing spatial representations in vision transformers. By leveraging Patch Neighbor Consistency, NeCo captures fine-grained details and structural information that are crucial for various downstream tasks, such as semantic segmentation.
- Model type: Vision Encoder (Dino, Dinov2, ...)
- Language(s) (NLP): Python
- License: MIT
- Finetuned from model: Dinov2, Dinov2R, Dino, ...
How to Get Started with the Model
To use NeCo models on downstream dense prediction tasks, you only need to install timm and torch. Depending on which checkpoint you use, you can load it as shown in the examples that follow.
The models can be downloaded from our NeCo Hugging Face repo.
Models after post-training dinov2 (following dinov2 architecture)
NeCo on Dinov2
import torch
# change to dinov2_vitb14 for base as described in:
#    https://github.com/facebookresearch/dinov2
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
path_to_checkpoint = "<your path to downloaded ckpt>"
# map_location='cpu' lets the checkpoint load on machines without a GPU
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
NeCo on Dinov2 with Registers
import torch
# change to dinov2_vitb14_reg for base as described in:
#    https://github.com/facebookresearch/dinov2
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
path_to_checkpoint = "<your path to downloaded ckpt>"
# map_location='cpu' lets the checkpoint load on machines without a GPU
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
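For dense prediction it helps to know how many patch tokens the encoder produces for a given input size. The sketch below is a minimal illustration, not part of the NeCo code base: `patch_grid` and `round_to_patch_multiple` are hypothetical helper names, and the default patch size of 14 is taken from the `dinov2_vits14` model names above. To our understanding, the DINOv2 patch embedding expects image height and width to be multiples of the patch size, so the second helper rounds a side length down to a valid value.

```python
from typing import Tuple


def patch_grid(height: int, width: int, patch_size: int = 14) -> Tuple[int, int]:
    """Return (rows, cols) of the patch grid for an image of the given size."""
    if height % patch_size or width % patch_size:
        raise ValueError(
            f"image size {height}x{width} is not a multiple of patch size {patch_size}"
        )
    return height // patch_size, width // patch_size


def round_to_patch_multiple(size: int, patch_size: int = 14) -> int:
    """Round a side length down to a multiple of patch_size (at least one patch)."""
    return max(patch_size, (size // patch_size) * patch_size)


rows, cols = patch_grid(224, 224)      # grid for a 224x224 input with 14-pixel patches
num_patch_tokens = rows * cols         # number of patch tokens per image
print(rows, cols, num_patch_tokens)
```

With a DINOv2 hub model loaded as above, `model.forward_features(x)["x_norm_patchtokens"]` should then have shape `(batch, rows * cols, channels)` and can be reshaped into a `(rows, cols)` feature map. For the register variants, the register tokens are returned under a separate key, so the patch-token count is unchanged. Treat both statements as assumptions to verify against the DINOv2 version you have installed.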
Models after post-training dino or similar (following dino architecture)
timm vit-small and vit-base architectures
import torch
from timm.models.vision_transformer import vit_small_patch16_224, vit_base_patch16_224
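# For our larger model, use the base constructor whose patch size matches your
# checkpoint: vit_base_patch16_224 is imported above, and timm also provides
# vit_base_patch8_224
model = vit_small_patch16_224()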
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
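The loading snippets above pass `strict=False`, which means PyTorch skips mismatched keys silently. `load_state_dict` returns an object with `missing_keys` and `unexpected_keys` attributes, so it is worth inspecting that result before trusting the loaded weights. The following is a small sketch under stated assumptions: the helper names are hypothetical, and the idea that only the classification head (`head.` prefix in timm ViTs) may legitimately be missing is our assumption, not something the checkpoints guarantee. A `SimpleNamespace` stands in for the real return value so the example runs without a checkpoint.

```python
from types import SimpleNamespace
from typing import Iterable, List


def unexplained_missing_keys(result, allowed_prefixes: Iterable[str] = ("head.",)) -> List[str]:
    """Return missing keys that do not start with any allowed prefix."""
    allowed = tuple(allowed_prefixes)
    return [key for key in result.missing_keys if not key.startswith(allowed)]


def summarize_load(result) -> str:
    """One-line summary of what load_state_dict(strict=False) skipped."""
    return (
        f"missing keys: {len(result.missing_keys)}, "
        f"unexpected keys: {len(result.unexpected_keys)}"
    )


# Stand-in for the object returned by model.load_state_dict(state_dict, strict=False)
demo_result = SimpleNamespace(
    missing_keys=["head.weight", "head.bias"],
    unexpected_keys=[],
)
print(summarize_load(demo_result))
leftover = unexplained_missing_keys(demo_result)  # empty when only the head is missing
```

In practice you would write `result = model.load_state_dict(state_dict, strict=False)` and pass `result` to these helpers. A non-empty `leftover` list or a long list of unexpected keys usually indicates that the architecture and the checkpoint do not match.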
Note: To load the model weights directly from a Hugging Face URL, run:
import torch
state_dict = torch.hub.load_state_dict_from_url("<url to the hugging face checkpoint>")
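If you are unsure what the checkpoint URL looks like, Hugging Face serves repository files under a `resolve` path. The sketch below builds such a URL; `hf_resolve_url` is a hypothetical helper, `FunAILab/NeCo` is assumed to be the repository id, and the file name is a placeholder that you must replace with the actual checkpoint file listed in the repository.

```python
from urllib.parse import quote


def hf_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL for a file in a Hugging Face model repository."""
    return (
        f"https://huggingface.co/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


# Placeholder file name: replace it with a real checkpoint file from the repository.
url = hf_resolve_url("FunAILab/NeCo", "path/to/checkpoint.ckpt")
print(url)
```

The resulting string can be passed to `torch.hub.load_state_dict_from_url(url, map_location="cpu")`, which downloads the file once and caches it locally.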
Training Details
Training Data
- We have post-trained our models on the COCO Dataset.
Training Procedure
Please see our repository and read our paper for more details.
Environmental Impact
- Hardware Type: NVIDIA A100 GPU
- Hours used: 18 (per model)
- Cloud Provider: Helma NHR FAU (Germany), Snellius (The Netherlands)
- Compute Region: Europe/Germany & Netherlands
Citation
BibTeX:
@inproceedings{
   pariza2025near,
   title={Near, far: Patch-ordering enhances vision foundation models' scene understanding},
   author={Valentinos Pariza and Mohammadreza Salehi and Gertjan J. Burghouts and Francesco Locatello and Yuki M Asano},
   booktitle={The Thirteenth International Conference on Learning Representations},
   year={2025},
   url={https://openreview.net/forum?id=Qro97zWC29}
}