---
title: Image-Attention-Visualizer
emoji: 🔥
colorFrom: blue
colorTo: purple
sdk: gradio
app_file: app.py
license: mit
pinned: true
tags:
- gradio
- pytorch
- computer-vision
- nlp
- multimodal
- vision-language
- image-to-text
- attention
- attention-visualization
- interpretability
- explainability
- xai
- demo
---
# [GitHub repo](https://github.com/devMuniz02/Image-Attention-Visualizer)
# [TRY IT NOW ON HUGGING FACE SPACES!](https://huggingface.co/spaces/manu02/image-attention-visualizer)

# Image-Attention-Visualizer
Image Attention Visualizer is an interactive Gradio app that **generates text from an image using a custom multimodal model** and **visualizes cross-modal attention in real time** between image tokens and generated text tokens. It provides three synchronized views (original image, attention overlay, and heatmap) plus a **word-level visualization** showing how each generated word attends to visual regions, letting researchers and developers see, token by token, how different parts of an image influence the model's textual output.
---
## ✨ What the app does
* **Generates text** from an image input using your custom model (`create_complete_model`).
* Displays **three synchronized views**:
  1. 🖼️ **Original image**
  2. 🔥 **Overlay** (original + attention heatmap)
  3. 🌈 **Heatmap alone**
* **Word-level attention viewer**: select any generated word to see how its attention is distributed across the image and previously generated words.
* Works directly with your **custom tokenizer (`model.decoder.tokenizer`)**.
* Fixed-length **1024 image tokens (32×32 grid)** projected as a visual heatmap (see the sketch after this list).
* Adjustable options: **Layer**, **Head**, or **Mean Across Layers/Heads**.
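To make the projection concrete, here is a minimal sketch of turning a 1024-length attention vector into an overlay. The function name, the red-channel colormap, and the blending weight are illustrative assumptions, not the app's actual `_make_overlay`:
```python
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image

def make_attention_overlay(image: Image.Image, attn: torch.Tensor, alpha: float = 0.5):
    """Reshape a 1024-length attention vector into a 32x32 grid,
    upsample it to the image size, and blend it over the image."""
    grid = attn.reshape(1, 1, 32, 32).float()              # one weight per image token
    grid = F.interpolate(grid, size=image.size[::-1],      # PIL size is (W, H)
                         mode="bilinear", align_corners=False)
    heat = grid[0, 0].cpu().numpy()
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # normalize to [0, 1]

    # Simple red-channel heatmap; swap in a matplotlib colormap for nicer colors.
    heatmap = np.zeros((*heat.shape, 3), dtype=np.uint8)
    heatmap[..., 0] = (heat * 255).astype(np.uint8)

    base = np.asarray(image.convert("RGB"), dtype=np.float32)
    overlay = ((1 - alpha) * base + alpha * heatmap).astype(np.uint8)
    return heatmap, overlay
```
The interpolation mode largely determines the perceived resolution: bilinear gives a smooth heatmap, while nearest-neighbor keeps the underlying 32×32 token grid visible.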
---
## 🚀 Quickstart
### 1) Clone
```bash
git clone https://github.com/devMuniz02/Image-Attention-Visualizer
cd Image-Attention-Visualizer
```
### 2) (Optional) Create a virtual environment
**Windows (PowerShell):**
```powershell
python -m venv venv
.\venv\Scripts\Activate.ps1
```
**macOS / Linux (bash/zsh):**
```bash
python3 -m venv venv
source venv/bin/activate
```
### 3) Install requirements
```bash
pip install -r requirements.txt
```
### 4) Run the app
```bash
python app.py
```
You should see something like:
```
Running on local URL: http://127.0.0.1:7860
```
### 5) Open in your browser
Navigate to `http://127.0.0.1:7860` to use the app.
---
## 🧭 How to use
1. **Upload an image** or load a random sample from your dataset folder.
2. **Set generation parameters**:
   * Max New Tokens
   * Layer/Head selection (or average across all)
3. Click **Generate**: the model will produce a textual description or continuation.
4. **Select a generated word** from the list (see the slicing sketch after this list):
   * The top row shows:
     * Left → **Original image**
     * Center → **Overlay (attention on image regions)**
     * Right → **Colored heatmap**
   * The bottom section highlights attention strength over the generated words.
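Under the hood, selecting a word amounts to slicing one query row out of the stored attention tensors. A minimal sketch, assuming Hugging Face-style attentions shaped `(batch, heads, query_len, key_len)` per step and layer, with the 1024 image tokens occupying the first key positions:
```python
import torch

def attention_for_word(attentions, step: int, layer: int, head=None,
                       n_img_tokens: int = 1024):
    """Split the selected word's attention into image and text parts.

    `attentions[step][layer]` is assumed shaped (batch, heads, q_len, k_len)."""
    attn = attentions[step][layer][0]              # drop the batch dim
    attn = attn.mean(dim=0) if head is None else attn[head]
    row = attn[-1]                                 # the newly generated token's row
    return row[:n_img_tokens], row[n_img_tokens:]  # image part, text part
```
Passing `head=None` averages across heads, which matches the app's "Mean Across Layers/Heads" option (averaging over layers would loop this over the layer index as well).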
---
## 🧩 Files
* `app.py` — Main Gradio interface and visualization logic.
* `utils/models/complete_model.py` — Model definition and generation method.
* `utils/processing.py` — Image preprocessing utilities.
* `requirements.txt` — Dependencies.
* `README.md` — This file.
---
## 🛠️ Troubleshooting
* **Black or blank heatmap:** Make sure `.generate()` is called with `output_attentions=True` so the model actually returns attentions.
* **Low resolution or distortion:** Adjust `img_size` or the interpolation method inside `_make_overlay`.
* **Tokenizer error:** Make sure `model.decoder.tokenizer` exists and is loaded correctly.
* **OOM errors:** Reduce `max_new_tokens` or use a smaller model checkpoint.
* **Color or shape mismatch:** Verify that the number of image tokens is exactly 1024 (for a 32×32 layout).
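When it is unclear which of these applies, a quick sanity check on one generation call usually narrows it down. This sketch assumes the interface described in the integration notes below; exact shapes depend on your model:
```python
gen_ids, gen_text, attentions = model.generate(
    pixel_values, max_new_tokens=8, output_attentions=True
)
assert attentions, "Empty attentions: is output_attentions=True plumbed through?"

first = attentions[0][0]                       # step 0, layer 0
print("attention shape:", tuple(first.shape))  # e.g. (batch, heads, q_len, k_len)
print("key positions:", first.shape[-1])       # should cover the 1024 image tokens
```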
---
## 🧪 Model integration notes
* The app is compatible with any **encoder–decoder or vision–language model** that meets the following interface (a minimal adapter sketch follows this list):
  * Accepts `pixel_values` as input.
  * Returns `(gen_ids, gen_text, attentions)` from `generate(..., output_attentions=True)`.
  * Uses the tokenizer from `model.decoder.tokenizer`.
* Designed for research in **vision-language interpretability**, **cross-modal explainability**, and **attention visualization**.
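For plugging in your own model, a hypothetical adapter might look like the following. Class and attribute names are illustrative, and the decoder is assumed to be a Hugging Face-style generative model with a tokenizer attached:
```python
import torch
import torch.nn as nn

class VisualizerCompatibleModel(nn.Module):
    """Illustrative adapter exposing the interface the app expects."""

    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder   # maps pixel_values -> (batch, 1024, dim)
        self.decoder = decoder   # must expose .tokenizer and .generate(...)

    @torch.no_grad()
    def generate(self, pixel_values, max_new_tokens=32, output_attentions=True):
        image_tokens = self.encoder(pixel_values)
        out = self.decoder.generate(
            inputs_embeds=image_tokens,
            max_new_tokens=max_new_tokens,
            output_attentions=output_attentions,
            return_dict_in_generate=True,
        )
        gen_text = self.decoder.tokenizer.batch_decode(
            out.sequences, skip_special_tokens=True
        )
        return out.sequences, gen_text, out.attentions
```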
---
## 📣 Acknowledgments
* Built with [Gradio](https://www.gradio.app/) and [Hugging Face Transformers](https://huggingface.co/docs/transformers).
* Inspired by the original [Token-Attention-Viewer](https://github.com/devMuniz02/Token-Attention-Viewer) project.
* Special thanks to the open-source community advancing **vision-language interpretability**. |