+---
+license: apache-2.0
+tags:
+- vision
+- image-classification
+- clip
+- knowledge-distillation
+- semi-supervised-learning
+- imagenet
+datasets:
+- imagenet-1k
+library_name: pytorch
+pipeline_tag: image-classification
+---
+# DHO: Simple Few-shot Semi-supervised Knowledge Distillation
+[![arXiv](https://img.shields.io/badge/arXiv-2505.07675v1-b31b1b.svg)](https://arxiv.org/abs/2505.07675v1)
+[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/simple-semi-supervised-knowledge-distillation/semi-supervised-image-classification-on-1)](https://paperswithcode.com/sota/semi-supervised-image-classification-on-1?p=simple-semi-supervised-knowledge-distillation)
+[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/simple-semi-supervised-knowledge-distillation/semi-supervised-image-classification-on-2)](https://paperswithcode.com/sota/semi-supervised-image-classification-on-2?p=simple-semi-supervised-knowledge-distillation)
+This repository contains pretrained checkpoints for **DHO (Dual-Head Optimization)**, a simple yet effective approach for semi-supervised knowledge distillation from Vision-Language Models.
+## Model Description
+DHO introduces a dual-head optimization strategy that enables efficient knowledge transfer from large Vision-Language Models (e.g., CLIP) to smaller student models using minimal labeled data.
+The method achieves state-of-the-art performance on ImageNet semi-supervised learning benchmarks with only 1% and 10% labeled data.
+**Paper:** [Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization](https://arxiv.org/abs/2505.07675)
+**Authors:** Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Sung Ju Hwang
+## Key Features
+- ✨ **Dual-head optimization** strategy for semi-supervised distillation
+- 🏆 **State-of-the-art** performance on ImageNet with 1% and 10% labeled data
+- 🔄 Efficient transfer from VLMs (e.g., CLIP) to smaller student models
+- 🧩 Simple, scalable, and easy to integrate into existing pipelines
+## Available Checkpoints
+| Checkpoint Name | Student Model | Teacher Model | Labeled Data | Top-1 Acc. | Parameters |
+|:----------------|:--------------|:--------------|:-------------|:-----------|:-----------|
+| `vit_b_1.pt` | ViT-B/16 | ViT-H/14 (DFN5B) | 1% | 81.6% | 86M |
+| `vit_b_10.pt` | ViT-B/16 | ViT-H/14 (DFN5B) | 10% | 82.8% | 86M |
+| `vit_l_1.pt` | ViT-L/14 | ViT-H/14 (DFN5B) | 1% | 84.6% | 304M |
+| `vit_l_10.pt` | ViT-L/14 | ViT-H/14 (DFN5B) | 10% | 85.9% | 304M |
+## Usage
+### Loading a Checkpoint
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import clip
+from huggingface_hub import hf_hub_download
+# Define the DHO StudentModel architecture with dual heads
+class StudentModel(nn.Module):
+    def __init__(self, num_classes=1000, model_name='ViT-B-16'):
+        super().__init__()
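+        # Load CLIP backbone. The official OpenAI `clip` package names its models
+        # with slashes (e.g. 'ViT-B/16'), so translate the hyphenated name if needed.
+        openai_names = {'ViT-B-16': 'ViT-B/16', 'ViT-L-14': 'ViT-L/14',
+                        'ViT-L-14-336px': 'ViT-L/14@336px'}
+        clip_name = model_name
+        if model_name not in clip.available_models():
+            clip_name = openai_names.get(model_name, model_name)
+        clip_model, _ = clip.load(clip_name, device='cpu')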
+        self.backbone = clip_model.float().visual
+        # Feature dimensions per architecture
+        in_features = {
+            'RN50': 1024,
+            'ViT-B-16': 512,
+            'ViT-L-14': 768,
+            'ViT-L-14-336px': 768
+        }[model_name]
+        # Dual-head architecture
+        self.ce_head = nn.Linear(in_features, num_classes)  # CE branch
+        self.kd_head = nn.Linear(in_features, num_classes)  # KD branch
+    def forward(self, x):
+        features = self.backbone(x)
+        ce_out = self.ce_head(features)
+        kd_out = self.kd_head(F.normalize(features, dim=1)) * 100
+        return ce_out, kd_out
+# Download and load checkpoint
+device = "cuda" if torch.cuda.is_available() else "cpu"
+checkpoint_path = hf_hub_download(repo_id="erjui/dho", filename="vit_b_10.pt")
+checkpoint = torch.load(checkpoint_path, map_location=device)
+# Initialize model
+model = StudentModel(num_classes=1000, model_name='ViT-B-16').to(device)
+# Strip the leading 'module.' prefix that DistributedDataParallel adds to parameter names
+state_dict = checkpoint['model_state_dict']
+state_dict = {k.removeprefix('module.'): v for k, v in state_dict.items()}
+model.load_state_dict(state_dict)
+# Get optimal inference parameters
+alpha = checkpoint['alpha']  # Weight for CE head
+beta = checkpoint['beta']    # Temperature for KD head
+model.eval()
+# Inference example
+from PIL import Image
+import torchvision.transforms as transforms
+# CLIP preprocessing
+preprocess = transforms.Compose([
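+    # CLIP's own preprocessing resizes with bicubic interpolation
+    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),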
+    transforms.CenterCrop(224),
+    transforms.ToTensor(),
+    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
+                        std=(0.26862954, 0.26130258, 0.27577711))
+])
+# Convert to RGB so grayscale or RGBA files match the 3-channel normalization
+image = preprocess(Image.open("path/to/image.jpg").convert("RGB")).unsqueeze(0).to(device)
+with torch.no_grad():
+    ce_logits, kd_logits = model(image)
+    # Combine predictions using saved parameters
+    probs_ce = F.softmax(ce_logits, dim=1)
+    probs_kd = F.softmax(kd_logits / beta, dim=1)
+    probs = alpha * probs_ce + (1 - alpha) * probs_kd
+    predicted_class = probs.argmax(dim=1)
+    print(f"Predicted class: {predicted_class.item()}")
+```
+**Important Notes:**
+- DHO checkpoints contain: `model_state_dict`, `epoch`, `acc`, `alpha`, `beta`
+- The model has a **dual-head architecture** (CE head + KD head)
+- Use the saved `alpha` and `beta` parameters for optimal inference
+- For ViT-L checkpoints (`vit_l_1.pt`, `vit_l_10.pt`), set `model_name='ViT-L-14'` and keep the 224×224 input size; use 336 only with `ViT-L-14-336px`
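Picking the matching `model_name` for a checkpoint file can be made explicit with a small lookup. This is only a sketch: `CHECKPOINTS` and `student_config` are hypothetical names that are not part of the repository, and the entries simply restate the checkpoint table and the 224-pixel input size mentioned in the notes above.

```python
# Hypothetical lookup that restates the "Available Checkpoints" table.
CHECKPOINTS = {
    "vit_b_1.pt":  {"model_name": "ViT-B-16", "labeled": "1%",  "image_size": 224},
    "vit_b_10.pt": {"model_name": "ViT-B-16", "labeled": "10%", "image_size": 224},
    "vit_l_1.pt":  {"model_name": "ViT-L-14", "labeled": "1%",  "image_size": 224},
    "vit_l_10.pt": {"model_name": "ViT-L-14", "labeled": "10%", "image_size": 224},
}


def student_config(filename):
    """Return the student settings for a checkpoint file (illustrative helper)."""
    try:
        return CHECKPOINTS[filename]
    except KeyError:
        known = ", ".join(sorted(CHECKPOINTS))
        raise ValueError(f"Unknown checkpoint {filename!r}; expected one of: {known}") from None


cfg = student_config("vit_l_10.pt")
print(cfg["model_name"], cfg["image_size"])
```

The returned `model_name` can then be passed to `StudentModel(...)`, and `image_size` to the resize and crop transforms, so the file name and the architecture cannot drift apart.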
+### Training Your Own Model
+To train your own DHO model, please visit the [official GitHub repository](https://github.com/yourusername/DHO) for detailed instructions and training scripts.
+**Example training command:**
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 --master_port=29500 train_imgnet_semi.py \
+    --teacher_model "apple/DFN5B-CLIP-ViT-H-14-378" \
+    --student_model "ViT-B-16" \
+    --lr 5e-5 \
+    --train_epoch 32 \
+    --batch_size 256 \
+    --percent 10.0 \
+    | tee ./logs/imagenet/imgnet_lowshot.log
+```
+## Model Architecture
+The DHO student model consists of:
+- **Backbone:** CLIP Vision Transformer (ViT-B/16 or ViT-L/14)
+- **Two parallel heads:**
+  - **CE Head:** Optimized with cross-entropy loss on labeled data
+  - **KD Head:** Optimized with knowledge distillation loss from teacher predictions
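+During inference, the softmax outputs of the two heads are interpolated: `alpha` weights the CE head against the KD head, and `beta` is the temperature applied to the KD logits. Both values are stored in each checkpoint.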
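The combination step can be sketched without any dependencies. The snippet below is an illustration only: `combine_heads` is a hypothetical helper name, the logits are made-up numbers for a three-class toy problem, and the `alpha` and `beta` values are arbitrary rather than the ones saved in a checkpoint. It mirrors the mixing rule used in the loading example, `alpha * softmax(ce_logits) + (1 - alpha) * softmax(kd_logits / beta)`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_heads(ce_logits, kd_logits, alpha, beta):
    """Mix CE-head and KD-head probabilities (hypothetical helper)."""
    probs_ce = softmax(ce_logits)
    # beta acts as a temperature on the KD logits before the softmax
    probs_kd = softmax([z / beta for z in kd_logits])
    return [alpha * pc + (1 - alpha) * pk for pc, pk in zip(probs_ce, probs_kd)]


# Toy values, not taken from any checkpoint
ce_logits = [2.0, 1.0, 0.1]
kd_logits = [1.0, 3.0, 0.5]
probs = combine_heads(ce_logits, kd_logits, alpha=0.5, beta=1.0)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class, probs)
```

Because both inputs to the mix are probability distributions and the weights sum to one, the result is itself a distribution. Setting `alpha` to 1 or 0 recovers the CE head or the KD head alone.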
+## Performance
+### ImageNet Semi-supervised Learning
+| Student | Teacher | Labeled Data | Top-1 Accuracy |
+|:--------|:--------|:-------------|:---------------|
+| ViT-B/16 | ViT-H/14 | 1% | **81.6%** |
+| ViT-B/16 | ViT-H/14 | 10% | **82.8%** |
+| ViT-L/14 | ViT-H/14 | 1% | **84.6%** |
+| ViT-L/14 | ViT-H/14 | 10% | **85.9%** |
+These results set a new state of the art for semi-supervised learning on ImageNet-1K in the 1% and 10% labeled-data settings.
+## Citation
+If you use these models in your research, please cite:
+```bibtex
+@article{kang2025simple,
+  title={Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization},
+  author={Kang, Seongjae and Lee, Dong Bok and Jang, Hyungjoon and Hwang, Sung Ju},
+  journal={arXiv preprint arXiv:2505.07675},
+  year={2025}
+}
+```
+## License
+This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
+## Acknowledgments
+We appreciate the open-source implementations from:
+- [Tip-Adapter](https://github.com/gaopengcuhk/Tip-Adapter)
+- [CLIP](https://github.com/openai/CLIP)
+- [OpenCLIP](https://github.com/mlfoundations/open_clip)
+## Contact
+For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/DHO) or contact the authors.