---
title: Image-Attention-Visualizer
emoji: 🔥
colorFrom: blue
colorTo: purple
sdk: gradio
app_file: app.py
license: mit
pinned: true
tags:
  - gradio
  - pytorch
  - computer-vision
  - nlp
  - multimodal
  - vision-language
  - image-to-text
  - attention
  - attention-visualization
  - interpretability
  - explainability
  - xai
  - demo
---
# [GitHub repo](https://github.com/devMuniz02/Image-Attention-Visualizer)
# [TRY IT NOW ON HUGGING FACE SPACES!!](https://huggingface.co/spaces/manu02/image-attention-visualizer)

# Image-Attention-Visualizer
**Image-Attention-Visualizer** is an interactive Gradio app that **generates text from an image using a custom multimodal model** and **visualizes the cross-modal attention between image tokens and generated text tokens in real time**. It lets researchers and developers see how different parts of an image influence the model's textual output, token by token, through three synchronized views (original image, attention overlay, and heatmap) plus a **word-level visualization** showing how each generated word attends to visual regions.
---
## ✨ What the app does
* **Generates text** from an image input using your custom model (`create_complete_model`).
* Displays **three synchronized views**:
  1. 🖼️ **Original image**
  2. 🔥 **Overlay** (original + attention heatmap)
  3. 🌈 **Heatmap alone**
* **Word-level attention viewer**: select any generated word to see how its attention is distributed across the image and previously generated words.
* Works directly with your **custom tokenizer** (`model.decoder.tokenizer`).
* Uses a fixed-length sequence of **1024 image tokens (a 32×32 grid)** projected as a visual heatmap (see the sketch after this list).
* Adjustable options: **Layer**, **Head**, or **Mean Across Layers/Heads**.
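To make the 1024-token projection concrete, here is a minimal sketch of how a per-word attention vector over the 32×32 image-token grid can be turned into an overlay. This is not the app's actual `_make_overlay` implementation; the function name `attention_to_overlay`, the choice of colormap, and the blend factor are illustrative assumptions.

```python
import numpy as np
from PIL import Image
from matplotlib import cm

def attention_to_overlay(attn_1024: np.ndarray, image: Image.Image, alpha: float = 0.5) -> Image.Image:
    """Project a 1024-long attention vector (32x32 image-token grid) onto the input image.

    `attn_1024` is assumed to already be the attention from one generated word to the
    1024 image tokens, for a chosen layer/head (or averaged across them).
    """
    grid = attn_1024.reshape(32, 32)                                   # 1024 tokens -> 32x32 grid
    grid = (grid - grid.min()) / (grid.max() - grid.min() + 1e-8)      # normalize to [0, 1]

    # Upsample the coarse grid to the image resolution (bilinear resize via PIL).
    heat = Image.fromarray((grid * 255).astype(np.uint8)).resize(image.size, Image.BILINEAR)

    # Colorize with a matplotlib colormap and blend with the original image.
    heat_rgb = (cm.jet(np.asarray(heat) / 255.0)[..., :3] * 255).astype(np.uint8)
    return Image.blend(image.convert("RGB"), Image.fromarray(heat_rgb), alpha)
```

The same normalized grid can be shown on its own as the standalone heatmap view.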
---
## 🚀 Quickstart
### 1) Clone
```bash
git clone https://github.com/devMuniz02/Image-Attention-Visualizer
cd Image-Attention-Visualizer
```
### 2) (Optional) Create a virtual environment
**Windows (PowerShell):**
```powershell
python -m venv venv
.\venv\Scripts\Activate.ps1
```
**macOS / Linux (bash/zsh):**
```bash
python3 -m venv venv
source venv/bin/activate
```
### 3) Install requirements
```bash
pip install -r requirements.txt
```
### 4) Run the app
```bash
python app.py
```
You should see something like:
```
Running on local URL: http://127.0.0.1:7860
```
### 5) Open in your browser
Navigate to `http://127.0.0.1:7860` to use the app.
---
## 🧭 How to use
1. **Upload an image** or load a random sample from your dataset folder.
2. **Set generation parameters**:
   * Max New Tokens
   * Layer/Head selection, or an average across all layers/heads (see the sketch after this list for how this maps onto the attention tensors).
3. Click **Generate**: the model produces a textual description or continuation.
4. **Select a generated word** from the list:
   * The top row shows:
     * Left → **Original image**
     * Center → **Overlay** (attention on image regions)
     * Right → **Colored heatmap**
   * The bottom section highlights attention strength over the generated words.
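For context on the Layer/Head options, here is a minimal sketch of how a per-word image-attention vector might be pulled out of attention outputs. It assumes a Hugging Face-style layout where `attentions` is a tuple of per-layer tensors shaped `(batch, heads, query_len, key_len)` and the 1024 image tokens occupy the first key positions; the function name and slicing are illustrative, not the app's actual code.

```python
import torch

def select_image_attention(attentions, word_index: int, layer=None, head=None,
                           num_image_tokens: int = 1024) -> torch.Tensor:
    """Return a (num_image_tokens,) attention vector for one generated word.

    `layer` / `head` pick a specific slice; passing None means "mean across" that axis,
    matching the Mean Across Layers/Heads option in the UI.
    """
    stacked = torch.stack(attentions)                        # (layers, batch, heads, q, k)
    attn = stacked[:, 0, :, word_index, :num_image_tokens]   # (layers, heads, image_tokens)

    attn = attn[layer] if layer is not None else attn.mean(dim=0)   # (heads, image_tokens)
    attn = attn[head] if head is not None else attn.mean(dim=0)     # (image_tokens,)
    return attn                                                      # ready to reshape to 32x32
```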
---
## 🧩 Files
* `app.py` — Main Gradio interface and visualization logic.
* `utils/models/complete_model.py` — Model definition and generation method.
* `utils/processing.py` — Image preprocessing utilities.
* `requirements.txt` — Dependencies.
* `README.md` — This file.
| ## 🛠️ Troubleshooting | |
| * **Black or blank heatmap:** Ensure your model returns `output_attentions=True` in `.generate()`. | |
| * **Low resolution or distortion:** Adjust `img_size` or the interpolation method inside `_make_overlay`. | |
| * **Tokenizer error:** Make sure `model.decoder.tokenizer` exists and is loaded correctly. | |
| * **OOM errors:** Reduce `max_new_tokens` or use a smaller model checkpoint. | |
| * **Color or shape mismatch:** Verify that your image tokens length = 1024 (for a 32×32 layout). | |
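When the heatmap comes out blank or mis-shaped, a quick check on what `generate` returns usually narrows it down. The snippet below assumes the `(gen_ids, gen_text, attentions)` contract described in the integration notes, with `model` and `pixel_values` already in scope; the exact nesting of `attentions` depends on your model.

```python
# Minimal diagnostic: confirm attentions are present and the key axis covers the image tokens.
gen_ids, gen_text, attentions = model.generate(
    pixel_values, max_new_tokens=16, output_attentions=True
)

print("generated text:", gen_text)
print("attention entries:", len(attentions))        # typically one per layer (or per decoding step)
first = attentions[0]
print("first entry shape:", tuple(first.shape))     # expect (batch, heads, query_len, key_len)
assert first.shape[-1] >= 1024, "key axis shorter than 1024: image tokens may be missing"
assert not bool((first == 0).all()), "all-zero attentions usually mean output_attentions was not enabled"
```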
---
## 🧪 Model integration notes
* The app is compatible with any **encoder–decoder or vision–language model** that (a sketch of this contract appears after this list):
  * Accepts `pixel_values` as input.
  * Returns `(gen_ids, gen_text, attentions)` from `generate(..., output_attentions=True)`.
  * Uses the tokenizer from `model.decoder.tokenizer`.
* Designed for research in **vision-language interpretability**, **cross-modal explainability**, and **attention visualization**.
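To make that contract concrete, here is a hedged sketch of an adapter that exposes a Hugging Face-style encoder-decoder model through the interface above. The class name `VisualizerAdapter`, the use of `cross_attentions`, and the nested-tuple layout of the returned attentions are assumptions to adapt to your own model (decoder-only models expose `attentions` instead).

```python
import torch
from types import SimpleNamespace

class VisualizerAdapter:
    """Illustrative wrapper exposing a Hugging Face-style VLM through the interface above."""

    def __init__(self, vlm, tokenizer):
        self.vlm = vlm
        # The app looks up the tokenizer at `model.decoder.tokenizer`.
        self.decoder = SimpleNamespace(tokenizer=tokenizer)

    @torch.no_grad()
    def generate(self, pixel_values, max_new_tokens=32, output_attentions=True, **kwargs):
        out = self.vlm.generate(
            pixel_values=pixel_values,
            max_new_tokens=max_new_tokens,
            output_attentions=output_attentions,
            return_dict_in_generate=True,
            **kwargs,
        )
        gen_ids = out.sequences
        gen_text = self.decoder.tokenizer.decode(gen_ids[0], skip_special_tokens=True)
        # Encoder-decoder models return cross-attentions; the nesting differs per architecture.
        attentions = out.cross_attentions
        return gen_ids, gen_text, attentions
```

Usage would look like `model = VisualizerAdapter(my_vlm, my_tokenizer)` before handing the wrapped model to the app.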
---
## 📣 Acknowledgments
* Built with [Gradio](https://www.gradio.app/) and [Hugging Face Transformers](https://huggingface.co/docs/transformers).
* Inspired by the original [Token-Attention-Viewer](https://github.com/devMuniz02/Token-Attention-Viewer) project.
* Special thanks to the open-source community advancing **vision-language interpretability**.