FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding


Paper: FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding
Code: https://github.com/NJU-LHRS/FarSLIP

Introduction

We introduce FarSLIP, a vision-language foundation model for remote sensing (RS) that achieves fine-grained vision-language alignment. FarSLIP demonstrates state-of-the-art performance on both fine-grained and image-level tasks, including open-vocabulary semantic segmentation, zero-shot classification, and image-text retrieval. We also construct MGRS-200k, the first multi-granularity image-text dataset for RS. Each image is annotated with both short and long global-level captions, along with multiple object-category pairs.

Checkpoints

You can download all our checkpoints from Hugging Face, or download individual checkpoints via the links below.

| Model name | Architecture | OVSS mIoU (%) | ZSC top-1 accuracy (%) | Download |
| --- | --- | --- | --- | --- |
| FarSLIP-s1 | ViT-B-32 | 29.87 | 58.64 | FarSLIP1_ViT-B-32 |
| FarSLIP-s2 | ViT-B-32 | 30.49 | 60.12 | FarSLIP2_ViT-B-32 |
| FarSLIP-s1 | ViT-B-16 | 35.44 | 61.89 | FarSLIP1_ViT-B-16 |
| FarSLIP-s2 | ViT-B-16 | 35.41 | 62.24 | FarSLIP2_ViT-B-16 |
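
The checkpoints appear to be OpenCLIP-style weights (the test command further below passes the OpenCLIP option --force-quick-gelu), so loading one should look roughly like the sketch below. The architecture name and checkpoint path come from the table above; treat this as an assumption rather than the repository's official loading code.

```python
# Minimal loading sketch, assuming the released checkpoints are OpenCLIP-compatible
# (the test command below uses the OpenCLIP-style --force-quick-gelu flag).
# The checkpoint path is a placeholder for a file downloaded from the table above.
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32",                                     # architecture from the table
    pretrained="checkpoints/FarSLIP1_ViT-B-32.pt",  # downloaded checkpoint (placeholder path)
    force_quick_gelu=True,                          # matches --force-quick-gelu below
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()
```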

Dataset

FarSLIP is trained in two stages.

  • In the first stage, we use the RS5M dataset.
  • In the second stage, we use the proposed MGRS-200k dataset, which is available on Hugging Face (a hedged loading sketch follows this list).
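
As a hypothetical illustration, MGRS-200k could be loaded with the Hugging Face datasets library roughly as follows; the repository id and field names are assumptions, so check the dataset card for the actual identifiers.

```python
# Hypothetical loading sketch for MGRS-200k via the Hugging Face `datasets` library.
# The repository id below is an assumption; check the dataset card for the real one.
from datasets import load_dataset

ds = load_dataset("ZhenShiL/MGRS-200k", split="train")  # repo id is a guess
sample = ds[0]
print(sample.keys())  # inspect the actual fields (captions, object-category pairs, ...)
```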


Figure: Examples from MGRS-200k.

Usage / Testing

Below is a usage example for zero-shot scene classification, taken directly from the official GitHub repository.

Zero-shot scene classification

  • Please refer to SkyScript for scene classification dataset preparation, including 'SkyScript_cls', 'aid', 'eurosat', 'fmow', 'millionaid', 'patternnet', 'rsicb', 'nwpu'.

  • Replace BENCHMARK_DATASET_ROOT_DIR in tests/test_scene_classification.py with your own dataset path.

  • Run testing (e.g. FarSLIP-s1 with ViT-B-32):

python -m tests.test_scene_classification --model-arch ViT-B-32 --model-name FarSLIP1 --force-quick-gelu --pretrained checkpoints/FarSLIP1_ViT-B-32.pt

Figure: Comparison of zero-shot classification accuracies (Top-1 acc., %) of different RS-specific CLIP variants across multiple benchmarks.
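
For programmatic use rather than the test script, zero-shot inference should follow the standard OpenCLIP pattern, sketched below under the same OpenCLIP-compatibility assumption. The class names, prompt template, and image path are placeholders, not the exact ones used in tests/test_scene_classification.py.

```python
# Illustrative zero-shot inference sketch, assuming OpenCLIP-compatible checkpoints.
# Class names, the prompt template, and the image path are placeholders.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32",
    pretrained="checkpoints/FarSLIP1_ViT-B-32.pt",  # placeholder path
    force_quick_gelu=True,
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

classes = ["airport", "beach", "farmland", "forest", "residential area"]
text = tokenizer([f"a satellite image of a {c}" for c in classes])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each class prompt
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("predicted class:", classes[probs.argmax().item()])
```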

Citation

If you find our work useful, please give us a ⭐ on GitHub and consider citing our paper:

@article{li2025farslip,
  title={FarSLIP: Discovering Effective CLIP Adaptation for Fine-Grained Remote Sensing Understanding},
  author={Zhenshi Li and Weikang Yu and Dilxat Muhtar and Xueliang Zhang and Pengfeng Xiao and Pedram Ghamisi and Xiao Xiang Zhu},
  journal={arXiv preprint arXiv:2511.14901},
  year={2025}
}