Commit 88d3ff8 by erjui · verified · 1 Parent(s): 736852b

Upload model_card.md with huggingface_hub

  1. model_card.md +202 -0
model_card.md ADDED
@@ -0,0 +1,202 @@
+ ---
+ license: apache-2.0
+ tags:
+ - vision
+ - image-classification
+ - clip
+ - knowledge-distillation
+ - semi-supervised-learning
+ - imagenet
+ datasets:
+ - imagenet-1k
+ library_name: pytorch
+ pipeline_tag: image-classification
+ ---
+
+ # DHO: Simple Few-shot Semi-supervised Knowledge Distillation
+
+ [![arXiv](https://img.shields.io/badge/arXiv-2505.07675v1-b31b1b.svg)](https://arxiv.org/abs/2505.07675v1)
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/simple-semi-supervised-knowledge-distillation/semi-supervised-image-classification-on-1)](https://paperswithcode.com/sota/semi-supervised-image-classification-on-1?p=simple-semi-supervised-knowledge-distillation)
+ [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/simple-semi-supervised-knowledge-distillation/semi-supervised-image-classification-on-2)](https://paperswithcode.com/sota/semi-supervised-image-classification-on-2?p=simple-semi-supervised-knowledge-distillation)
+
+ This repository contains pretrained checkpoints for **DHO (Dual-Head Optimization)**, a simple yet effective approach for semi-supervised knowledge distillation from Vision-Language Models.
+
+ ## Model Description
+
+ DHO trains a student model with two prediction heads, one supervised by cross-entropy on labeled data and one by distillation from the teacher, to transfer knowledge from large Vision-Language Models (e.g., CLIP) to smaller student models using little labeled data.
+ The paper reports state-of-the-art results on the ImageNet semi-supervised learning benchmarks with 1% and 10% labeled data.
+
+ **Paper:** [Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization](https://arxiv.org/abs/2505.07675)
+
+ **Authors:** Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Sung Ju Hwang
+
+ ## Key Features
+
+ - ✨ **Dual-head optimization** strategy for semi-supervised distillation
+ - 🏆 **State-of-the-art** performance on ImageNet with 1% and 10% labeled data
+ - 🔄 Efficient transfer from VLMs (e.g., CLIP) to smaller student models
+ - 🧩 Simple, scalable, and easy to integrate into existing pipelines
+
+ ## Available Checkpoints
+
+ | Checkpoint Name | Student Model | Teacher Model | Labeled Data | Top-1 Acc. | Parameters |
+ |:----------------|:--------------|:--------------|:-------------|:-----------|:-----------|
+ | `vit_b_1.pt` | ViT-B/16 | ViT-H/14 (DFN5B) | 1% | 81.6% | 86M |
+ | `vit_b_10.pt` | ViT-B/16 | ViT-H/14 (DFN5B) | 10% | 82.8% | 86M |
+ | `vit_l_1.pt` | ViT-L/14 | ViT-H/14 (DFN5B) | 1% | 84.6% | 304M |
+ | `vit_l_10.pt` | ViT-L/14 | ViT-H/14 (DFN5B) | 10% | 85.9% | 304M |
+
+ ## Usage
+
+ ### Loading a Checkpoint
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ import clip
+ from huggingface_hub import hf_hub_download
+
+ # Define the DHO StudentModel architecture with dual heads
+ class StudentModel(nn.Module):
+     def __init__(self, num_classes=1000, model_name='ViT-B-16'):
+         super().__init__()
+         # Load CLIP backbone. Note: upstream openai/CLIP names this model
+         # 'ViT-B/16'; the hyphenated names here assume the `clip` module
+         # used by the DHO code.
+         clip_model, _ = clip.load(model_name, device='cpu')
+         self.backbone = clip_model.float().visual
+
+         # Feature dimensions per architecture
+         in_features = {
+             'RN50': 1024,
+             'ViT-B-16': 512,
+             'ViT-L-14': 768,
+             'ViT-L-14-336px': 768
+         }[model_name]
+
+         # Dual-head architecture
+         self.ce_head = nn.Linear(in_features, num_classes)  # CE branch
+         self.kd_head = nn.Linear(in_features, num_classes)  # KD branch
+
+     def forward(self, x):
+         features = self.backbone(x)
+         ce_out = self.ce_head(features)
+         # KD head sees L2-normalized features; its logits are scaled by 100
+         kd_out = self.kd_head(F.normalize(features, dim=1)) * 100
+         return ce_out, kd_out
+
+ # Download and load checkpoint
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ checkpoint_path = hf_hub_download(repo_id="erjui/dho", filename="vit_b_10.pt")
+ checkpoint = torch.load(checkpoint_path, map_location=device)
+
+ # Initialize model
+ model = StudentModel(num_classes=1000, model_name='ViT-B-16').to(device)
+
+ # Handle DDP-wrapped state_dict: strip the leading 'module.' prefix if present
+ state_dict = {
+     (k[len('module.'):] if k.startswith('module.') else k): v
+     for k, v in checkpoint['model_state_dict'].items()
+ }
+ model.load_state_dict(state_dict)
+
+ # Get optimal inference parameters
+ alpha = checkpoint['alpha']  # Weight for CE head
+ beta = checkpoint['beta']  # Temperature for KD head
+ model.eval()
+
+ # Inference example
+ from PIL import Image
+ import torchvision.transforms as transforms
+
+ # CLIP preprocessing
+ preprocess = transforms.Compose([
+     transforms.Resize(224),
+     transforms.CenterCrop(224),
+     transforms.ToTensor(),
+     transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
+                          std=(0.26862954, 0.26130258, 0.27577711))
+ ])
+
+ # Convert to RGB so grayscale or RGBA files match the 3-channel normalization
+ image = Image.open("path/to/image.jpg").convert("RGB")
+ image = preprocess(image).unsqueeze(0).to(device)
+ with torch.no_grad():
+     ce_logits, kd_logits = model(image)
+
+     # Combine predictions using saved parameters
+     probs_ce = F.softmax(ce_logits, dim=1)
+     probs_kd = F.softmax(kd_logits / beta, dim=1)
+     probs = alpha * probs_ce + (1 - alpha) * probs_kd
+
+ predicted_class = probs.argmax(dim=1)
+ print(f"Predicted class: {predicted_class.item()}")
+ ```
+
+ **Important Notes:**
+ - Each DHO checkpoint is a dictionary with the keys `model_state_dict`, `epoch`, `acc`, `alpha`, and `beta`
+ - The model has a **dual-head architecture** (CE head + KD head), so its forward pass returns two sets of logits
+ - Use the saved `alpha` and `beta` values when combining the two heads at inference
+ - For ViT-L checkpoints, set `model_name='ViT-L-14'` and keep the 224-pixel image size (use 336 only with `ViT-L-14-336px`)
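
The notes above can be condensed into a small lookup. This is only a sketch under stated assumptions: `CHECKPOINT_SETTINGS` and `checkpoint_settings` are hypothetical names that are not part of the DHO repository, the filename-to-backbone mapping is read off the checkpoint table, and the 224-pixel input size is taken from the loading example and the last note.

```python
# Hypothetical helper (not part of the DHO code): maps each published
# checkpoint filename to the StudentModel argument and input size
# suggested by the table and notes in this model card.
CHECKPOINT_SETTINGS = {
    "vit_b_1.pt": {"model_name": "ViT-B-16", "image_size": 224},
    "vit_b_10.pt": {"model_name": "ViT-B-16", "image_size": 224},
    "vit_l_1.pt": {"model_name": "ViT-L-14", "image_size": 224},
    "vit_l_10.pt": {"model_name": "ViT-L-14", "image_size": 224},
}


def checkpoint_settings(filename):
    """Return the assumed settings for a checkpoint filename."""
    try:
        return CHECKPOINT_SETTINGS[filename]
    except KeyError:
        known = ", ".join(sorted(CHECKPOINT_SETTINGS))
        raise ValueError(
            f"Unknown checkpoint {filename!r}; expected one of: {known}"
        ) from None


settings = checkpoint_settings("vit_l_10.pt")
print(settings["model_name"], settings["image_size"])
```

The returned `model_name` would be passed to `StudentModel`, and `image_size` would replace the two `224` values in the preprocessing pipeline.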
+
+ ### Training Your Own Model
+
+ To train your own DHO model, please visit the [official GitHub repository](https://github.com/yourusername/DHO) for detailed instructions and training scripts.
+
+ **Example training command:**
+ ```bash
+ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 --master_port=29500 train_imgnet_semi.py \
+     --teacher_model "apple/DFN5B-CLIP-ViT-H-14-378" \
+     --student_model "ViT-B-16" \
+     --lr 5e-5 \
+     --train_epoch 32 \
+     --batch_size 256 \
+     --percent 10.0 \
+     | tee ./logs/imagenet/imgnet_lowshot.log
+ ```
+
+ ## Model Architecture
+
+ The DHO student model consists of:
+ - **Backbone:** CLIP Vision Transformer (ViT-B/16 or ViT-L/14)
+ - **Two parallel heads:**
+   - **CE Head:** Optimized with cross-entropy loss on labeled data
+   - **KD Head:** Optimized with knowledge distillation loss from teacher predictions
+
+ During inference, the softmax outputs of the two heads are mixed using the `alpha` and `beta` values stored in each checkpoint: `alpha` weights the CE head against the KD head, and `beta` is the temperature applied to the KD logits.
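
To make the mixing rule concrete, here is a dependency-free sketch of the arithmetic used in the loading example: `probs = alpha * softmax(ce_logits) + (1 - alpha) * softmax(kd_logits / beta)`. The helper names `softmax` and `combine_heads`, the three-class logits, and the `alpha`/`beta` values are all made up for illustration; real values come from the checkpoint.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def combine_heads(ce_logits, kd_logits, alpha, beta):
    """Mix CE-head and KD-head probabilities as in the loading example."""
    probs_ce = softmax(ce_logits)
    probs_kd = softmax(kd_logits, temperature=beta)
    return [alpha * pc + (1 - alpha) * pk for pc, pk in zip(probs_ce, probs_kd)]


# Toy three-class inputs; alpha and beta are illustrative, not checkpoint values.
probs = combine_heads([2.0, 1.0, 0.1], [1.0, 3.0, 0.2], alpha=0.3, beta=2.0)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

Because both terms are probability vectors and the weights sum to one, the mixture is itself a probability vector; setting `alpha` to 1 or 0 recovers the CE head or the KD head alone.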
+
+ ## Performance
+
+ ### ImageNet Semi-supervised Learning
+
+ | Student | Teacher | Labeled Data | Top-1 Accuracy |
+ |:--------|:--------|:-------------|:---------------|
+ | ViT-B/16 | ViT-H/14 | 1% | **81.6%** |
+ | ViT-B/16 | ViT-H/14 | 10% | **82.8%** |
+ | ViT-L/14 | ViT-H/14 | 1% | **84.6%** |
+ | ViT-L/14 | ViT-H/14 | 10% | **85.9%** |
+
+ The paper reports these results as a new state of the art for semi-supervised learning on ImageNet-1K.
+
+ ## Citation
+
+ If you use these models in your research, please cite:
+
+ ```bibtex
+ @article{kang2025simple,
+   title={Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization},
+   author={Kang, Seongjae and Lee, Dong Bok and Jang, Hyungjoon and Hwang, Sung Ju},
+   journal={arXiv preprint arXiv:2505.07675},
+   year={2025}
+ }
+ ```
+
+ ## License
+
+ This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
+
+ ## Acknowledgments
+
+ We appreciate the open-source implementations from:
+ - [Tip-Adapter](https://github.com/gaopengcuhk/Tip-Adapter)
+ - [CLIP](https://github.com/openai/CLIP)
+ - [OpenCLIP](https://github.com/mlfoundations/open_clip)
+
+ ## Contact
+
+ For questions or issues, please open an issue on the [GitHub repository](https://github.com/yourusername/DHO) or contact the authors.
+