GAP-CLIP: Guaranteed Attribute Positioning in CLIP Embeddings

Python 3.8+ · PyTorch 2.0+ · License: MIT · Hugging Face

A multimodal fashion search model that structures CLIP's 512-D embedding into dedicated color, category, and semantic subspaces through direct alignment with frozen-CLIP specialist models.


Quick Start

Installation

git clone https://github.com/Leacb4/gap-clip.git
cd gap-clip
pip install -e .

Load from Hugging Face

from example_usage import load_models_from_hf

models = load_models_from_hf("Leacb4/gap-clip")

# Extract structured embeddings from text
import torch
import torch.nn.functional as F

processor = models['processor']
main_model = models['main_model']
device = models['device']

text_inputs = processor(text=["red summer dress"], padding=True, return_tensors="pt")
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}

with torch.no_grad():
    text_outputs = main_model.text_model(**text_inputs)
    text_features = main_model.text_projection(text_outputs.pooler_output)
    text_features = F.normalize(text_features, dim=-1)

color_emb     = text_features[:, :16]     # dims 0-15:   color
category_emb  = text_features[:, 16:80]   # dims 16-79:  category
general_emb   = text_features[:, 80:]     # dims 80-511: general CLIP

Architecture

GAP-CLIP restructures a CLIP ViT-B/32 embedding so that specific dimension ranges are guaranteed to encode particular attributes:

| Subspace | Dimensions | Aligned with |
| --- | --- | --- |
| Color | 0-15 (16 D) | ColorCLIP specialist model |
| Category | 16-79 (64 D) | HierarchyModel specialist model |
| General CLIP | 80-511 (432 D) | Standard CLIP semantic space |

Specialist Models (v2)

Both specialist models use frozen CLIP ViT-B/32 encoders with small trainable projection heads:

  • ColorCLIP: Frozen CLIP image/text encoder + Linear(512, 16) + L2 norm. ~16K trainable parameters.
  • HierarchyModel: Frozen CLIP image/text encoder + MLP(512 -> 128 -> 64) + LayerNorm + classifier heads. ~100K trainable parameters.

Using frozen CLIP backbones gives the specialist models the same visual-semantic understanding as the baseline, while the compact projection heads learn attribute-specific representations.
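As a rough sketch of the projection-head idea (the class and method names here are illustrative, not the repository's actual API), a ColorCLIP-style head looks like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColorHead(nn.Module):
    """Illustrative ColorCLIP-style head: a 512-D feature from a frozen
    CLIP encoder is projected to a 16-D color subspace and L2-normalized."""
    def __init__(self, clip_dim=512, color_dim=16):
        super().__init__()
        # The linear projection is the only trainable part.
        self.proj = nn.Linear(clip_dim, color_dim)

    def forward(self, clip_features):
        # clip_features: [batch, 512] output of the frozen CLIP encoder
        return F.normalize(self.proj(clip_features), dim=-1)

head = ColorHead()
fake_clip_features = torch.randn(4, 512)  # stand-in for frozen-CLIP output
color_emb = head(fake_clip_features)      # [4, 16], unit-norm rows
```

The HierarchyModel follows the same pattern with a two-layer MLP head instead of a single linear layer.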

Main Model Training

The main CLIP model is fine-tuned end-to-end with an enhanced contrastive loss that combines:

  1. Triple contrastive loss (text-image, text-attributes, image-attributes)
  2. Alignment loss: MSE + cosine similarity between the main model's subspace dimensions and the specialist model embeddings (both text and image sides)
  3. Reference loss: optional regularization to stay close to the base CLIP text space

total_loss = (1 - alpha) * contrastive_loss + alpha * alignment_loss + beta * reference_loss

where alpha = 0.2 (alignment weight) and beta = 0.1 (reference weight).
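In code, the combination is just a weighted sum. A minimal sketch with the stated weights (the three component losses below are placeholder scalars; in training they are torch tensors):

```python
alpha = 0.2   # alignment weight
beta = 0.1    # reference weight

def combine_losses(contrastive_loss, alignment_loss, reference_loss):
    """Weighted sum combining the three training objectives."""
    return ((1 - alpha) * contrastive_loss
            + alpha * alignment_loss
            + beta * reference_loss)

# Placeholder scalar losses for illustration:
total = combine_losses(contrastive_loss=1.0, alignment_loss=0.5, reference_loss=0.2)
# 0.8 * 1.0 + 0.2 * 0.5 + 0.1 * 0.2 = 0.92
```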

Hyperparameters: lr = 1.5e-5, temperature = 0.09, weight decay = 2.76e-5, batch size = 128, trained for 10 epochs on a 100K-sample subset.


Project Structure

.
├── config.py                  # Paths, dimensions, device detection
├── example_usage.py           # Load from HuggingFace + demo search
├── setup.py                   # pip install -e .
├── __init__.py
├── README.md                  # This file (also the HF model card)
│
├── training/
│   ├── color_model.py         # ColorCLIP: frozen CLIP + Linear(512,16)
│   ├── hierarchy_model.py     # HierarchyModel: frozen CLIP + MLP(512,128,64)
│   └── main_model.py          # GAP-CLIP fine-tuning with enhanced loss
│
├── evaluation/
│   ├── run_all_evaluations.py # Orchestrator for all paper evaluations
│   ├── sec51_color_model_eval.py      # Table 1: color accuracy
│   ├── sec52_category_model_eval.py   # Table 2: category accuracy
│   ├── sec533_clip_nn_accuracy.py     # Table 3: NN classification
│   ├── sec5354_separation_semantic.py # Separation & zero-shot semantic
│   ├── sec536_embedding_structure.py  # Table 4: structure tests A/B/C/D
│   ├── annex92_color_heatmaps.py      # Color similarity heatmaps
│   ├── annex93_tsne.py                # t-SNE visualizations
│   ├── annex94_search_demo.py         # Fashion search engine demo
│   └── utils/
│       ├── datasets.py        # Dataset loaders (internal, KAGL, FMNIST)
│       ├── metrics.py         # Separation score, accuracy metrics
│       └── model_loader.py    # Model loading helpers (v2 checkpoint)
│
├── models/                    # Trained weights (git-ignored, on HF Hub)
│   ├── color_model.pt         # ColorCLIP checkpoint (~600 MB)
│   ├── hierarchy_model.pth    # HierarchyModel checkpoint (~600 MB)
│   └── gap_clip.pth           # Main GAP-CLIP checkpoint (~1.7 GB)
│
├── figures/                   # Paper figures & evaluation outputs
│   ├── scheme.png             # Architecture diagram
│   ├── training_curves.png    # Training/validation loss curves
│   ├── heatmap.png            # GAP-CLIP color similarity heatmap
│   ├── heatmap_baseline.jpg   # Baseline color similarity heatmap
│   ├── tsne_*.png             # t-SNE visualizations (4 files)
│   ├── *_red_dress.png        # Search demo: "red dress"
│   ├── *_blue_pant.png        # Search demo: "blue pant"
│   └── confusion_matrices/    # Color (8) and hierarchy (12) matrices
│
├── paper/
│   ├── paper.ltx              # LaTeX source
│   └── paper.pdf              # Compiled paper
│
└── data/                      # Training data (git-ignored)
    └── fashion-mnist_test.csv # Fashion-MNIST evaluation set

Usage

Text Search

from example_usage import load_models_from_hf

models = load_models_from_hf("Leacb4/gap-clip")

# Use specialist models directly
color_emb = models['color_model'].get_text_embeddings(["red"])           # [1, 16]
hierarchy_emb = models['hierarchy_model'].get_text_embeddings(["dress"]) # [1, 64]

Image Search

import torch
import torch.nn.functional as F
from PIL import Image

image = Image.open("path/to/image.jpg").convert("RGB")
image_inputs = models['processor'](images=[image], return_tensors="pt")
image_inputs = {k: v.to(models['device']) for k, v in image_inputs.items()}

with torch.no_grad():
    vision_outputs = models['main_model'].vision_model(**image_inputs)
    image_features = models['main_model'].visual_projection(vision_outputs.pooler_output)
    image_features = F.normalize(image_features, dim=-1)

# Structured subspaces
color_emb     = image_features[:, :16]
category_emb  = image_features[:, 16:80]
general_emb   = image_features[:, 80:]
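With text and image features structured identically, a search score can weight each subspace independently, e.g. boosting color for "red dress"-style queries. A sketch of the idea (the weights below are arbitrary examples, not values from the paper):

```python
import torch
import torch.nn.functional as F

def subspace_scores(text_features, image_features,
                    w_color=1.0, w_category=1.0, w_general=1.0):
    """Per-subspace cosine similarities between [batch, 512] GAP-CLIP
    embeddings, combined with caller-chosen weights."""
    sims = {
        'color':    F.cosine_similarity(text_features[:, :16],   image_features[:, :16],   dim=-1),
        'category': F.cosine_similarity(text_features[:, 16:80], image_features[:, 16:80], dim=-1),
        'general':  F.cosine_similarity(text_features[:, 80:],   image_features[:, 80:],   dim=-1),
    }
    total = (w_color * sims['color']
             + w_category * sims['category']
             + w_general * sims['general'])
    return sims, total

# Stand-in embeddings; in practice these come from the model as shown above.
t = F.normalize(torch.randn(2, 512), dim=-1)
i = F.normalize(torch.randn(2, 512), dim=-1)
sims, total = subspace_scores(t, i, w_color=2.0)  # emphasize color matches
```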

Alignment Check

import torch.nn.functional as F

# Compare specialist vs main-model subspace
# (text_features computed as in the Quick Start snippet above)
color_from_specialist = models['color_model'].get_text_embeddings(["red"])
color_from_main = text_features[:, :16]

similarity = F.cosine_similarity(color_from_specialist, color_from_main, dim=1)
print(f"Color alignment: {similarity.item():.4f}")

CLI

# Load from HuggingFace and run example search
python example_usage.py --repo-id Leacb4/gap-clip --text "red summer dress"

# With an image
python example_usage.py --repo-id Leacb4/gap-clip --image path/to/image.jpg

Training

1. Train the Color Model

# From the repository root:
python -m training.color_model

Trains ColorCLIP: frozen CLIP ViT-B/32 + trainable Linear(512, 16) projection. Converges in ~30 min on Apple Silicon MPS. Saves checkpoint to models/color_model.pt.

2. Train the Hierarchy Model

python -m training.hierarchy_model

Trains HierarchyModel: frozen CLIP ViT-B/32 + trainable MLP(512 -> 128 -> 64) + classifier heads. Multi-objective loss (classification + contrastive + consistency). Converges in ~60 min on MPS. Saves checkpoint to models/hierarchy_model.pth.

Steps 1 and 2 can run in parallel.

3. Train the Main GAP-CLIP Model

python -m training.main_model

Fine-tunes laion/CLIP-ViT-B-32-laion2B-s34B-b79K with the enhanced contrastive loss using specialist models as alignment targets. Training features:

  • Enhanced data augmentation (rotation, color jitter, blur, affine transforms)
  • Gradient clipping (max_norm=1.0)
  • ReduceLROnPlateau scheduler (patience=3, factor=0.5)
  • Early stopping (patience=7)
  • Automatic best-model checkpointing
  • Training curves saved to figures/training_curves.png
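The optimization loop behind those features can be sketched as follows; a toy linear model and random data stand in for the real fine-tuning setup, but the clipping, scheduler, early-stopping, and checkpointing calls mirror the listed settings:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)  # toy stand-in for the GAP-CLIP model
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-5, weight_decay=2.76e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=3)

best_val, best_state, patience, bad_epochs = float('inf'), None, 7, 0
for epoch in range(10):
    # --- training step (toy data) ---
    x, y = torch.randn(32, 8), torch.randn(32, 1)
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()

    # --- validation, LR schedule, early stopping, checkpointing ---
    val_loss = loss.item()            # placeholder for a real validation pass
    scheduler.step(val_loss)          # ReduceLROnPlateau reacts to val loss
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        # keep the best weights (the real code saves a checkpoint file)
        best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:    # early stopping
            break
```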

Evaluation

Run all paper evaluations:

python evaluation/run_all_evaluations.py

Or specific sections:

python evaluation/run_all_evaluations.py --steps sec51,sec52,sec536

| Step | Paper Section | Description |
| --- | --- | --- |
| sec51 | Section 5.1 | Color model accuracy (Table 1) |
| sec52 | Section 5.2 | Category model confusion matrices (Table 2) |
| sec533 | Section 5.3.3 | NN classification accuracy (Table 3) |
| sec5354 | Section 5.3.4-5 | Separation & zero-shot semantic eval |
| sec536 | Section 5.3.6 | Embedding structure tests A/B/C/D (Table 4) |
| annex92 | Annex 9.2 | Color similarity heatmaps |
| annex93 | Annex 9.3 | t-SNE visualizations |
| annex94 | Annex 9.4 | Fashion search engine demo |

All evaluations compare GAP-CLIP against the patrickjohncyh/fashion-clip baseline across three datasets: an internal fashion catalogue, KAGL Marqo (HuggingFace), and Fashion-MNIST.


Configuration

All paths and hyperparameters are in config.py:

import config

config.device              # Auto-detected: CUDA > MPS > CPU
config.color_emb_dim       # 16
config.hierarchy_emb_dim   # 64
config.main_emb_dim        # 512
config.print_config()      # Pretty-print settings
config.validate_paths()    # Check model files exist
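The CUDA > MPS > CPU preference can be implemented in a few lines; this is a sketch of the idea, not necessarily config.py's exact code:

```python
import torch

def detect_device():
    """Pick the best available backend: CUDA first, then Apple MPS, then CPU."""
    if torch.cuda.is_available():
        return torch.device('cuda')
    if torch.backends.mps.is_available():
        return torch.device('mps')
    return torch.device('cpu')

device = detect_device()
```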

Repository Files on Hugging Face

| File | Description |
| --- | --- |
| models/gap_clip.pth | Main GAP-CLIP model checkpoint (~1.7 GB) |
| models/color_model.pt | ColorCLIP specialist checkpoint (~600 MB) |
| models/hierarchy_model.pth | HierarchyModel specialist checkpoint (~600 MB) |

Citation

@misc{gap-clip-2025,
  title={GAP-CLIP: Guaranteed Attribute Positioning in CLIP Embeddings for Fashion Search},
  author={Sarfati, Lea Attia},
  year={2025},
  howpublished={\url{https://huggingface.co/Leacb4/gap-clip}},
}

License

MIT License. See LICENSE for details.

Contact

Author: Lea Attia Sarfati
Email: lea.attia@gmail.com
Hugging Face: @Leacb4
