# GAP-CLIP: Guaranteed Attribute Positioning in CLIP Embeddings

A multimodal fashion search model that structures CLIP's 512-D embedding into dedicated color, category, and semantic subspaces through direct alignment with frozen-CLIP specialist models.
## Quick Start

### Installation

```bash
git clone https://github.com/Leacb4/gap-clip.git
cd gap-clip
pip install -e .
```
### Load from Hugging Face

```python
import torch
import torch.nn.functional as F

from example_usage import load_models_from_hf

models = load_models_from_hf("Leacb4/gap-clip")
processor = models['processor']
main_model = models['main_model']
device = models['device']

# Extract structured embeddings from text
text_inputs = processor(text=["red summer dress"], padding=True, return_tensors="pt")
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}
with torch.no_grad():
    text_outputs = main_model.text_model(**text_inputs)
    text_features = main_model.text_projection(text_outputs.pooler_output)
text_features = F.normalize(text_features, dim=-1)

color_emb    = text_features[:, :16]    # dims 0-15   → color
category_emb = text_features[:, 16:80]  # dims 16-79  → category
general_emb  = text_features[:, 80:]    # dims 80-511 → general CLIP
```
## Architecture

GAP-CLIP restructures a CLIP ViT-B/32 embedding so that specific dimension ranges are guaranteed to encode particular attributes:
| Subspace | Dimensions | Aligned with |
|---|---|---|
| Color | 0-15 (16 D) | ColorCLIP specialist model |
| Category | 16-79 (64 D) | HierarchyModel specialist model |
| General CLIP | 80-511 (432 D) | Standard CLIP semantic space |
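The layout in the table can be expressed as fixed slices over the 512-D embedding. A minimal sketch (the `split_subspaces` helper is illustrative, not part of the repo API; boundaries are taken from the table above):

```python
import torch
import torch.nn.functional as F

# Subspace boundaries from the table above
COLOR = slice(0, 16)      # dims 0-15
CATEGORY = slice(16, 80)  # dims 16-79
GENERAL = slice(80, 512)  # dims 80-511

def split_subspaces(embedding: torch.Tensor) -> dict:
    """Split [N, 512] GAP-CLIP embeddings into their attribute subspaces."""
    return {
        "color": embedding[:, COLOR],
        "category": embedding[:, CATEGORY],
        "general": embedding[:, GENERAL],
    }

emb = F.normalize(torch.randn(2, 512), dim=-1)  # stand-in for model output
parts = split_subspaces(emb)
print(parts["color"].shape, parts["category"].shape, parts["general"].shape)
```

Because the boundaries are fixed, the same slices apply identically to text and image embeddings.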
### Specialist Models (v2)

Both specialist models use frozen CLIP ViT-B/32 encoders with small trainable projection heads:

- ColorCLIP: frozen CLIP image/text encoder + `Linear(512, 16)` + L2 norm. ~16K trainable parameters.
- HierarchyModel: frozen CLIP image/text encoder + `MLP(512 -> 128 -> 64)` + LayerNorm + classifier heads. ~100K trainable parameters.

Using frozen CLIP backbones gives the specialist models the same visual-semantic understanding as the baseline, while the compact projection heads learn attribute-specific representations.
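The projection heads described above can be sketched as follows (a sketch under assumptions; the actual implementations live in `training/color_model.py` and `training/hierarchy_model.py`, and details such as activation choice and LayerNorm placement are guesses):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColorHead(nn.Module):
    """Sketch of ColorCLIP's trainable head: Linear(512, 16) + L2 norm."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(512, 16)

    def forward(self, clip_features):  # [N, 512] frozen CLIP features
        return F.normalize(self.proj(clip_features), dim=-1)

class HierarchyHead(nn.Module):
    """Sketch of HierarchyModel's head: MLP(512 -> 128 -> 64) + LayerNorm."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.LayerNorm(64),
        )

    def forward(self, clip_features):
        return self.mlp(clip_features)

color_head, hier_head = ColorHead(), HierarchyHead()
feats = torch.randn(4, 512)  # stand-in for frozen CLIP output
print(color_head(feats).shape, hier_head(feats).shape)
```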
## Main Model Training

The main CLIP model is fine-tuned end-to-end with an enhanced contrastive loss that combines:

- Triple contrastive loss (text-image, text-attributes, image-attributes)
- Alignment loss: MSE + cosine similarity between the main model's subspace dimensions and the specialist model embeddings (both text and image sides)
- Reference loss: optional regularization to stay close to the base CLIP text space

```
total_loss = (1 - alpha) * contrastive_loss + alpha * alignment_loss + beta * reference_loss
```

where `alpha = 0.2` (alignment weight) and `beta = 0.1` (reference weight).

Hyperparameters: `lr = 1.5e-5`, `temperature = 0.09`, `weight decay = 2.76e-5`, `batch size = 128`, trained for 10 epochs on a 100K-sample subset.
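The weighting above can be sketched as a small helper (the exact combination of MSE and cosine terms in `alignment_loss` is an assumption; the real losses are computed in `training/main_model.py`):

```python
import torch
import torch.nn.functional as F

def alignment_loss(main_subspace, specialist_emb):
    """MSE + (1 - cosine similarity) between main-model subspace and
    specialist embedding. The exact combination is an assumption."""
    mse = F.mse_loss(main_subspace, specialist_emb)
    cos = 1 - F.cosine_similarity(main_subspace, specialist_emb, dim=-1).mean()
    return mse + cos

def total_loss(contrastive, alignment, reference, alpha=0.2, beta=0.1):
    """Weighting from the formula above."""
    return (1 - alpha) * contrastive + alpha * alignment + beta * reference

loss = total_loss(torch.tensor(1.0), torch.tensor(0.5), torch.tensor(0.2))
print(loss.item())  # 0.8*1.0 + 0.2*0.5 + 0.1*0.2 = 0.92
```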
Project Structure
.
βββ config.py # Paths, dimensions, device detection
βββ example_usage.py # Load from HuggingFace + demo search
βββ setup.py # pip install -e .
βββ __init__.py
βββ README.md # This file (also the HF model card)
β
βββ training/
β βββ color_model.py # ColorCLIP: frozen CLIP + Linear(512,16)
β βββ hierarchy_model.py # HierarchyModel: frozen CLIP + MLP(512,128,64)
β βββ main_model.py # GAP-CLIP fine-tuning with enhanced loss
β
βββ evaluation/
β βββ run_all_evaluations.py # Orchestrator for all paper evaluations
β βββ sec51_color_model_eval.py # Table 1 β color accuracy
β βββ sec52_category_model_eval.py # Table 2 β category accuracy
β βββ sec533_clip_nn_accuracy.py # Table 3 β NN classification
β βββ sec5354_separation_semantic.py # Separation & zero-shot semantic
β βββ sec536_embedding_structure.py # Table 4 β structure tests A/B/C/D
β βββ annex92_color_heatmaps.py # Color similarity heatmaps
β βββ annex93_tsne.py # t-SNE visualizations
β βββ annex94_search_demo.py # Fashion search engine demo
β βββ utils/
β βββ datasets.py # Dataset loaders (internal, KAGL, FMNIST)
β βββ metrics.py # Separation score, accuracy metrics
β βββ model_loader.py # Model loading helpers (v2 checkpoint)
β
βββ models/ # Trained weights (git-ignored, on HF Hub)
β βββ color_model.pt # ColorCLIP checkpoint (~600 MB)
β βββ hierarchy_model.pth # HierarchyModel checkpoint (~600 MB)
β βββ gap_clip.pth # Main GAP-CLIP checkpoint (~1.7 GB)
β
βββ figures/ # Paper figures & evaluation outputs
β βββ scheme.png # Architecture diagram
β βββ training_curves.png # Training/validation loss curves
β βββ heatmap.png # GAP-CLIP color similarity heatmap
β βββ heatmap_baseline.jpg # Baseline color similarity heatmap
β βββ tsne_*.png # t-SNE visualizations (4 files)
β βββ *_red_dress.png # Search demo: "red dress"
β βββ *_blue_pant.png # Search demo: "blue pant"
β βββ confusion_matrices/ # Color (8) and hierarchy (12) matrices
β
βββ paper/
β βββ paper.ltx # LaTeX source
β βββ paper.pdf # Compiled paper
β
βββ data/ # Training data (git-ignored)
βββ fashion-mnist_test.csv # Fashion-MNIST evaluation set
## Usage

### Text Search

```python
from example_usage import load_models_from_hf

models = load_models_from_hf("Leacb4/gap-clip")

# Use specialist models directly
color_emb = models['color_model'].get_text_embeddings(["red"])            # [1, 16]
hierarchy_emb = models['hierarchy_model'].get_text_embeddings(["dress"])  # [1, 64]
```
### Image Search

```python
import torch
import torch.nn.functional as F
from PIL import Image

image = Image.open("path/to/image.jpg").convert("RGB")
image_inputs = models['processor'](images=[image], return_tensors="pt")
image_inputs = {k: v.to(models['device']) for k, v in image_inputs.items()}
with torch.no_grad():
    vision_outputs = models['main_model'].vision_model(**image_inputs)
    image_features = models['main_model'].visual_projection(vision_outputs.pooler_output)
image_features = F.normalize(image_features, dim=-1)

# Structured subspaces
color_emb = image_features[:, :16]
category_emb = image_features[:, 16:80]
general_emb = image_features[:, 80:]
```
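Because each attribute occupies a fixed dimension range, a search can weight the subspaces independently, e.g. to favor color matches. A hypothetical example (the `weighted_similarity` helper and its weights are illustrative, not part of the repo API):

```python
import torch
import torch.nn.functional as F

def weighted_similarity(query, gallery, w_color=2.0, w_category=1.0, w_general=1.0):
    """Per-subspace cosine similarity, combined with per-attribute weights."""
    bounds = [((0, 16), w_color), ((16, 80), w_category), ((80, 512), w_general)]
    score = torch.zeros(gallery.shape[0])
    for (lo, hi), w in bounds:
        score = score + w * F.cosine_similarity(query[:, lo:hi], gallery[:, lo:hi], dim=-1)
    return score

query = F.normalize(torch.randn(1, 512), dim=-1)    # stand-in for a text embedding
gallery = F.normalize(torch.randn(5, 512), dim=-1)  # stand-in for image embeddings
scores = weighted_similarity(query, gallery)
best = scores.argmax().item()
print(scores.shape, best)
```

Raising `w_color` makes the ranking more sensitive to the 16 color dimensions without retraining anything.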
### Alignment Check

```python
import torch.nn.functional as F

# Compare specialist vs main-model subspace
# (text_features comes from the text example above)
color_from_specialist = models['color_model'].get_text_embeddings(["red"])
color_from_main = text_features[:, :16]
similarity = F.cosine_similarity(color_from_specialist, color_from_main, dim=1)
print(f"Color alignment: {similarity.item():.4f}")
```
### CLI

```bash
# Load from HuggingFace and run example search
python example_usage.py --repo-id Leacb4/gap-clip --text "red summer dress"

# With an image
python example_usage.py --repo-id Leacb4/gap-clip --image path/to/image.jpg
```
## Training

### 1. Train the Color Model

```bash
# From the repository root:
python -m training.color_model
```

Trains ColorCLIP: frozen CLIP ViT-B/32 + trainable `Linear(512, 16)` projection. Converges in ~30 min on Apple Silicon MPS. Saves checkpoint to `models/color_model.pt`.
### 2. Train the Hierarchy Model

```bash
python -m training.hierarchy_model
```

Trains HierarchyModel: frozen CLIP ViT-B/32 + trainable `MLP(512 -> 128 -> 64)` + classifier heads. Multi-objective loss (classification + contrastive + consistency). Converges in ~60 min on MPS. Saves checkpoint to `models/hierarchy_model.pth`.
Steps 1 and 2 can run in parallel.
### 3. Train the Main GAP-CLIP Model

```bash
python -m training.main_model
```

Fine-tunes `laion/CLIP-ViT-B-32-laion2B-s34B-b79K` with the enhanced contrastive loss, using the specialist models as alignment targets. Training features:

- Enhanced data augmentation (rotation, color jitter, blur, affine transforms)
- Gradient clipping (`max_norm=1.0`)
- ReduceLROnPlateau scheduler (`patience=3`, `factor=0.5`)
- Early stopping (`patience=7`)
- Automatic best-model checkpointing
- Training curves saved to `figures/training_curves.png`
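A minimal sketch of how these features fit together in a training loop (toy model, data, and validation loss are stand-ins; the real loop lives in `training/main_model.py`):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model and batches
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-5, weight_decay=2.76e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, factor=0.5)

best_val, patience, bad_epochs, best_state = float("inf"), 7, 0, None
for epoch in range(10):
    # --- train step (stand-in batch) ---
    x, y = torch.randn(16, 8), torch.randn(16, 1)
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()

    # --- validation, LR scheduling, early stopping, checkpointing ---
    val_loss = loss.item()  # stand-in for a real validation pass
    scheduler.step(val_loss)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}  # best model
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break
print(f"best val loss: {best_val:.4f}")
```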
## Evaluation

Run all paper evaluations:

```bash
python evaluation/run_all_evaluations.py
```

Or specific sections:

```bash
python evaluation/run_all_evaluations.py --steps sec51,sec52,sec536
```
| Step | Paper Section | Description |
|---|---|---|
| `sec51` | Section 5.1 | Color model accuracy (Table 1) |
| `sec52` | Section 5.2 | Category model confusion matrices (Table 2) |
| `sec533` | Section 5.3.3 | NN classification accuracy (Table 3) |
| `sec5354` | Section 5.3.4-5 | Separation & zero-shot semantic eval |
| `sec536` | Section 5.3.6 | Embedding structure tests A/B/C/D (Table 4) |
| `annex92` | Annex 9.2 | Color similarity heatmaps |
| `annex93` | Annex 9.3 | t-SNE visualizations |
| `annex94` | Annex 9.4 | Fashion search engine demo |
All evaluations compare GAP-CLIP against the `patrickjohncyh/fashion-clip` baseline across three datasets: an internal fashion catalogue, KAGL Marqo (Hugging Face), and Fashion-MNIST.
## Configuration

All paths and hyperparameters are in `config.py`:

```python
import config

config.device             # Auto-detected: CUDA > MPS > CPU
config.color_emb_dim      # 16
config.hierarchy_emb_dim  # 64
config.main_emb_dim       # 512
config.print_config()     # Pretty-print settings
config.validate_paths()   # Check model files exist
```
## Repository Files on Hugging Face

| File | Description |
|---|---|
| `models/gap_clip.pth` | Main GAP-CLIP model checkpoint (~1.7 GB) |
| `models/color_model.pt` | ColorCLIP specialist checkpoint (~600 MB) |
| `models/hierarchy_model.pth` | HierarchyModel specialist checkpoint (~600 MB) |
## Citation

```bibtex
@misc{gap-clip-2025,
  title={GAP-CLIP: Guaranteed Attribute Positioning in CLIP Embeddings for Fashion Search},
  author={Sarfati, Lea Attia},
  year={2025},
  howpublished={\url{https://huggingface.co/Leacb4/gap-clip}},
}
```
## License

MIT License. See `LICENSE` for details.

## Contact

- Author: Lea Attia Sarfati
- Email: lea.attia@gmail.com
- Hugging Face: @Leacb4