# GAP-CLIP: Guaranteed Attribute Positioning in CLIP Embeddings

A multimodal fashion search model that structures CLIP's 512-D embedding into dedicated color, category, and semantic subspaces through direct alignment with frozen-CLIP specialist models.
## Quick Start

### Installation

```bash
git clone https://github.com/Leacb4/gap-clip.git
cd gap-clip
pip install -e .
```
### Load from Hugging Face

```python
import torch
import torch.nn.functional as F

from example_usage import load_models_from_hf

models = load_models_from_hf("Leacb4/gap-clip")
processor = models['processor']
main_model = models['main_model']
device = models['device']

# Extract structured embeddings from text
text_inputs = processor(text=["red summer dress"], padding=True, return_tensors="pt")
text_inputs = {k: v.to(device) for k, v in text_inputs.items()}
with torch.no_grad():
    text_outputs = main_model.text_model(**text_inputs)
    text_features = main_model.text_projection(text_outputs.pooler_output)
text_features = F.normalize(text_features, dim=-1)

color_emb    = text_features[:, :16]    # dims 0-15   → color
category_emb = text_features[:, 16:80]  # dims 16-79  → category
general_emb  = text_features[:, 80:]    # dims 80-511 → general CLIP
```
## Architecture

GAP-CLIP restructures a CLIP ViT-B/32 embedding so that specific dimension ranges are guaranteed to encode particular attributes:
| Subspace | Dimensions | Aligned with |
|---|---|---|
| Color | 0-15 (16 D) | ColorCLIP specialist model |
| Category | 16-79 (64 D) | HierarchyModel specialist model |
| General CLIP | 80-511 (432 D) | Standard CLIP semantic space |
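The layout in the table can be expressed as fixed slices over the 512-D embedding. A minimal sketch (the `split_subspaces` helper is illustrative, not part of the repo API; boundaries are taken from the table above):

```python
import torch
import torch.nn.functional as F

# Subspace boundaries from the table above
COLOR = slice(0, 16)      # dims 0-15
CATEGORY = slice(16, 80)  # dims 16-79
GENERAL = slice(80, 512)  # dims 80-511

def split_subspaces(embedding: torch.Tensor) -> dict:
    """Split [N, 512] GAP-CLIP embeddings into their attribute subspaces."""
    return {
        "color": embedding[:, COLOR],
        "category": embedding[:, CATEGORY],
        "general": embedding[:, GENERAL],
    }

emb = F.normalize(torch.randn(2, 512), dim=-1)  # stand-in for model output
parts = split_subspaces(emb)
print(parts["color"].shape, parts["category"].shape, parts["general"].shape)
```

Because the boundaries are fixed, the same slices apply identically to text and image embeddings.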
### Specialist Models (v2)

Both specialist models use frozen CLIP ViT-B/32 encoders with small trainable projection heads:

- ColorCLIP: frozen CLIP image/text encoder + `Linear(512, 16)` + L2 norm. ~16K trainable parameters.
- HierarchyModel: frozen CLIP image/text encoder + `MLP(512 -> 128 -> 64)` + LayerNorm + classifier heads. ~100K trainable parameters.

Using frozen CLIP backbones gives the specialist models the same visual-semantic understanding as the baseline, while the compact projection heads learn attribute-specific representations.
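The projection heads described above can be sketched as follows (a sketch under assumptions; the actual implementations live in `training/color_model.py` and `training/hierarchy_model.py`, and details such as activation choice and LayerNorm placement are guesses):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColorHead(nn.Module):
    """Sketch of ColorCLIP's trainable head: Linear(512, 16) + L2 norm."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(512, 16)

    def forward(self, clip_features):  # [N, 512] frozen CLIP features
        return F.normalize(self.proj(clip_features), dim=-1)

class HierarchyHead(nn.Module):
    """Sketch of HierarchyModel's head: MLP(512 -> 128 -> 64) + LayerNorm."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.LayerNorm(64),
        )

    def forward(self, clip_features):
        return self.mlp(clip_features)

color_head, hier_head = ColorHead(), HierarchyHead()
feats = torch.randn(4, 512)  # stand-in for frozen CLIP output
print(color_head(feats).shape, hier_head(feats).shape)
```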
## Main Model Training

The main CLIP model is fine-tuned end-to-end with an enhanced contrastive loss that combines:

- Triple contrastive loss (text-image, text-attributes, image-attributes)
- Alignment loss: MSE + cosine similarity between the main model's subspace dimensions and the specialist model embeddings (both text and image sides)
- Reference loss: optional regularization to stay close to the base CLIP text space

```
total_loss = (1 - alpha) * contrastive_loss + alpha * alignment_loss + beta * reference_loss
```

where `alpha = 0.2` (alignment weight) and `beta = 0.1` (reference weight).

Hyperparameters: `lr = 1.5e-5`, `temperature = 0.09`, `weight decay = 2.76e-5`, `batch size = 128`, trained for 10 epochs on a 100K-sample subset.
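The weighting above can be sketched as a small helper (the exact combination of MSE and cosine terms in `alignment_loss` is an assumption; the real losses are computed in `training/main_model.py`):

```python
import torch
import torch.nn.functional as F

def alignment_loss(main_subspace, specialist_emb):
    """MSE + (1 - cosine similarity) between main-model subspace and
    specialist embedding. The exact combination is an assumption."""
    mse = F.mse_loss(main_subspace, specialist_emb)
    cos = 1 - F.cosine_similarity(main_subspace, specialist_emb, dim=-1).mean()
    return mse + cos

def total_loss(contrastive, alignment, reference, alpha=0.2, beta=0.1):
    """Weighting from the formula above."""
    return (1 - alpha) * contrastive + alpha * alignment + beta * reference

loss = total_loss(torch.tensor(1.0), torch.tensor(0.5), torch.tensor(0.2))
print(loss.item())  # 0.8*1.0 + 0.2*0.5 + 0.1*0.2 = 0.92
```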
Project Structure
.
βββ config.py # Paths, dimensions, device detection
βββ example_usage.py # Load from HuggingFace + demo search
βββ setup.py # pip install -e .
βββ __init__.py
βββ README.md # This file (also the HF model card)
β
βββ training/
β βββ color_model.py # ColorCLIP: frozen CLIP + Linear(512,16)
β βββ hierarchy_model.py # HierarchyModel: frozen CLIP + MLP(512,128,64)
β βββ main_model.py # GAP-CLIP fine-tuning with enhanced loss
β
βββ evaluation/
β βββ run_all_evaluations.py # Orchestrator for all paper evaluations
β βββ sec51_color_model_eval.py # Table 1 β color accuracy
β βββ sec52_category_model_eval.py # Table 2 β category accuracy
β βββ sec533_clip_nn_accuracy.py # Table 3 β NN classification
β βββ sec5354_separation_semantic.py # Separation & zero-shot semantic
β βββ sec536_embedding_structure.py # Table 4 β structure tests A/B/C/D
β βββ annex92_color_heatmaps.py # Color similarity heatmaps
β βββ annex93_tsne.py # t-SNE visualizations
β βββ annex94_search_demo.py # Fashion search engine demo
β βββ utils/
β βββ datasets.py # Dataset loaders (internal, KAGL, FMNIST)
β βββ metrics.py # Separation score, accuracy metrics
β βββ model_loader.py # Model loading helpers (v2 checkpoint)
β
βββ models/ # Trained weights (git-ignored, on HF Hub)
β βββ color_model.pt # ColorCLIP checkpoint (~600 MB)
β βββ hierarchy_model.pth # HierarchyModel checkpoint (~600 MB)
β βββ gap_clip.pth # Main GAP-CLIP checkpoint (~1.7 GB)
β
βββ figures/ # Paper figures & evaluation outputs
β βββ scheme.png # Architecture diagram
β βββ training_curves.png # Training/validation loss curves
β βββ heatmap.png # GAP-CLIP color similarity heatmap
β βββ heatmap_baseline.jpg # Baseline color similarity heatmap
β βββ tsne_*.png # t-SNE visualizations (4 files)
β βββ *_red_dress.png # Search demo: "red dress"
β βββ *_blue_pant.png # Search demo: "blue pant"
β βββ confusion_matrices/ # Color (8) and hierarchy (12) matrices
β
βββ paper/
β βββ paper.ltx # LaTeX source
β βββ paper.pdf # Compiled paper
β
βββ data/ # Training data (git-ignored)
βββ fashion-mnist_test.csv # Fashion-MNIST evaluation set
## Usage

### Text Search

```python
from example_usage import load_models_from_hf

models = load_models_from_hf("Leacb4/gap-clip")

# Use specialist models directly
color_emb = models['color_model'].get_text_embeddings(["red"])            # [1, 16]
hierarchy_emb = models['hierarchy_model'].get_text_embeddings(["dress"])  # [1, 64]
```
### Image Search

```python
import torch
import torch.nn.functional as F
from PIL import Image

image = Image.open("path/to/image.jpg").convert("RGB")
image_inputs = models['processor'](images=[image], return_tensors="pt")
image_inputs = {k: v.to(models['device']) for k, v in image_inputs.items()}
with torch.no_grad():
    vision_outputs = models['main_model'].vision_model(**image_inputs)
    image_features = models['main_model'].visual_projection(vision_outputs.pooler_output)
image_features = F.normalize(image_features, dim=-1)

# Structured subspaces
color_emb = image_features[:, :16]
category_emb = image_features[:, 16:80]
general_emb = image_features[:, 80:]
```
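Because each attribute occupies a fixed dimension range, a search can weight the subspaces independently, e.g. to favor color matches. A hypothetical example (the `weighted_similarity` helper and its weights are illustrative, not part of the repo API):

```python
import torch
import torch.nn.functional as F

def weighted_similarity(query, gallery, w_color=2.0, w_category=1.0, w_general=1.0):
    """Per-subspace cosine similarity, combined with per-attribute weights."""
    bounds = [((0, 16), w_color), ((16, 80), w_category), ((80, 512), w_general)]
    score = torch.zeros(gallery.shape[0])
    for (lo, hi), w in bounds:
        score = score + w * F.cosine_similarity(query[:, lo:hi], gallery[:, lo:hi], dim=-1)
    return score

query = F.normalize(torch.randn(1, 512), dim=-1)    # stand-in for a text embedding
gallery = F.normalize(torch.randn(5, 512), dim=-1)  # stand-in for image embeddings
scores = weighted_similarity(query, gallery)
best = scores.argmax().item()
print(scores.shape, best)
```

Raising `w_color` makes the ranking more sensitive to the 16 color dimensions without retraining anything.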
### Alignment Check

```python
import torch.nn.functional as F

# Compare specialist vs main-model subspace
# (text_features comes from the text example above)
color_from_specialist = models['color_model'].get_text_embeddings(["red"])
color_from_main = text_features[:, :16]
similarity = F.cosine_similarity(color_from_specialist, color_from_main, dim=1)
print(f"Color alignment: {similarity.item():.4f}")
```
### CLI

```bash
# Load from HuggingFace and run example search
python example_usage.py --repo-id Leacb4/gap-clip --text "red summer dress"

# With an image
python example_usage.py --repo-id Leacb4/gap-clip --image path/to/image.jpg
```
## Training

### 1. Train the Color Model

```bash
# From the repository root:
python -m training.color_model
```

Trains ColorCLIP: frozen CLIP ViT-B/32 + trainable `Linear(512, 16)` projection. Converges in ~30 min on Apple Silicon MPS. Saves checkpoint to `models/color_model.pt`.
### 2. Train the Hierarchy Model

```bash
python -m training.hierarchy_model
```

Trains HierarchyModel: frozen CLIP ViT-B/32 + trainable `MLP(512 -> 128 -> 64)` + classifier heads. Multi-objective loss (classification + contrastive + consistency). Converges in ~60 min on MPS. Saves checkpoint to `models/hierarchy_model.pth`.
Steps 1 and 2 can run in parallel.
### 3. Train the Main GAP-CLIP Model

```bash
python -m training.main_model
```

Fine-tunes `laion/CLIP-ViT-B-32-laion2B-s34B-b79K` with the enhanced contrastive loss, using the specialist models as alignment targets. Training features:

- Enhanced data augmentation (rotation, color jitter, blur, affine transforms)
- Gradient clipping (`max_norm=1.0`)
- ReduceLROnPlateau scheduler (`patience=3`, `factor=0.5`)
- Early stopping (`patience=7`)
- Automatic best-model checkpointing
- Training curves saved to `figures/training_curves.png`
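A minimal sketch of how these features fit together in a training loop (toy model, data, and validation loss are stand-ins; the real loop lives in `training/main_model.py`):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model and batches
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-5, weight_decay=2.76e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, factor=0.5)

best_val, patience, bad_epochs, best_state = float("inf"), 7, 0, None
for epoch in range(10):
    # --- train step (stand-in batch) ---
    x, y = torch.randn(16, 8), torch.randn(16, 1)
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()

    # --- validation, LR scheduling, early stopping, checkpointing ---
    val_loss = loss.item()  # stand-in for a real validation pass
    scheduler.step(val_loss)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}  # best model
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break
print(f"best val loss: {best_val:.4f}")
```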
## Evaluation

Run all paper evaluations:

```bash
python evaluation/run_all_evaluations.py
```

Or specific sections:

```bash
python evaluation/run_all_evaluations.py --steps sec51,sec52,sec536
```
| Step | Paper Section | Description |
|---|---|---|
| `sec51` | Section 5.1 | Color model accuracy (Table 1) |
| `sec52` | Section 5.2 | Category model confusion matrices (Table 2) |
| `sec533` | Section 5.3.3 | NN classification accuracy (Table 3) |
| `sec5354` | Section 5.3.4-5 | Separation & zero-shot semantic eval |
| `sec536` | Section 5.3.6 | Embedding structure tests A/B/C/D (Table 4) |
| `annex92` | Annex 9.2 | Color similarity heatmaps |
| `annex93` | Annex 9.3 | t-SNE visualizations |
| `annex94` | Annex 9.4 | Fashion search engine demo |
All evaluations compare GAP-CLIP against the `patrickjohncyh/fashion-clip` baseline across three datasets: an internal fashion catalogue, KAGL Marqo (Hugging Face), and Fashion-MNIST.
## Configuration

All paths and hyperparameters are in `config.py`:

```python
import config

config.device             # Auto-detected: CUDA > MPS > CPU
config.color_emb_dim      # 16
config.hierarchy_emb_dim  # 64
config.main_emb_dim       # 512
config.print_config()     # Pretty-print settings
config.validate_paths()   # Check model files exist
```
## Repository Files on Hugging Face

| File | Description |
|---|---|
| `models/gap_clip.pth` | Main GAP-CLIP model checkpoint (~1.7 GB) |
| `models/color_model.pt` | ColorCLIP specialist checkpoint (~600 MB) |
| `models/hierarchy_model.pth` | HierarchyModel specialist checkpoint (~600 MB) |
## Citation

```bibtex
@misc{gap-clip-2025,
  title={GAP-CLIP: Guaranteed Attribute Positioning in CLIP Embeddings for Fashion Search},
  author={Sarfati, Lea Attia},
  year={2025},
  howpublished={\url{https://huggingface.co/Leacb4/gap-clip}},
}
```
## License

MIT License. See `LICENSE` for details.

## Contact

- Author: Lea Attia Sarfati
- Email: lea.attia@gmail.com
- Hugging Face: @Leacb4