Add comprehensive model card for Monet-7B
This PR adds a comprehensive model card for the Monet-7B model, linking it to the paper [Monet: Reasoning in Latent Visual Space Beyond Images and Language](https://huggingface.co/papers/2511.21395).
It includes:
- Essential metadata: `license: cc-by-nc-4.0`, `library_name: transformers`, and `pipeline_tag: image-text-to-text`.
- Links to the official paper and the GitHub repository.
- An overview image from the GitHub repository.
- A sample usage code snippet demonstrating how to use the model with the `transformers` library, noting that `trust_remote_code=True` is required due to custom components.
- The BibTeX citation for the paper.
This update will significantly improve discoverability and ease of use for researchers and developers on the Hugging Face Hub.
---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: image-text-to-text
---
# Monet: Reasoning in Latent Visual Space Beyond Images and Language

**Monet** is a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. It aims to achieve human-like abstract visual thinking, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps.

This model is introduced in the paper:
[**Monet: Reasoning in Latent Visual Space Beyond Images and Language**](https://huggingface.co/papers/2511.21395)

<p align="center">
  <img src="https://github.com/NOVAglow646/Monet/raw/main/images/overview.png" alt="Monet Overview" width="700">
</p>
## Installation and Code

The official implementation, training scripts, and further details can be found in the project's GitHub repository:
[https://github.com/NOVAglow646/Monet](https://github.com/NOVAglow646/Monet)

To set up the environment, please refer to the installation instructions in the GitHub repository. Note that the model uses customized `Qwen2.5-VL-7B` components, which require specific modifications as detailed there.
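Before loading the checkpoint, it can be useful to sanity-check the local environment. The snippet below is only an illustrative sketch, not part of the official setup; the version threshold is an assumption (Qwen2.5-VL support landed in recent `transformers` releases), so defer to the exact requirements pinned in the GitHub repository.

```python
# Illustrative environment sanity check (not part of the official setup).
# The minimum version below is an assumption; follow the repository's pinned requirements.
import torch
import transformers
from packaging import version  # packaging ships as a transformers dependency

print(f"transformers: {transformers.__version__}")
print(f"torch: {torch.__version__}, CUDA available: {torch.cuda.is_available()}")

if version.parse(transformers.__version__) < version.parse("4.49.0"):
    print("Consider upgrading transformers; Qwen2.5-VL-based models need a recent release.")
```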
## Usage

The Monet-7B model can be loaded and used with the Hugging Face `transformers` library. Due to custom model components, `trust_remote_code=True` is required.
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load the model and processor.
# The checkpoint ships custom components, so trust_remote_code=True is required.
# AutoModelForImageTextToText needs a recent transformers release.
model_id = "NOVAglow646/Monet-7B"  # Replace with the actual model repository name if different
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Example: image understanding (image-text-to-text)
image = Image.open("path/to/your/image.png")  # Replace with the path to your image

# Prepare the chat messages; the image itself is passed to the processor below
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the image in detail."},
        ],
    }
]

# Apply the chat template and build the model inputs
prompt_text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt_text, images=image, return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

# The model also supports richer multimodal interaction, including latent visual reasoning
# with `<abs_vis_token>`. Refer to the GitHub repository for advanced usage examples and demos.
```
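For reference, the same `model` and `processor` can also be queried without an image. The sketch below is an illustrative follow-up, not an official example; it assumes the checkpoint's chat template handles plain-text content items the way the stock Qwen2.5-VL template does.

```python
# Text-only follow-up query, reusing `model` and `processor` from the snippet above.
# Illustrative sketch; adjust the prompt and generation settings as needed.
text_messages = [
    {"role": "user", "content": [{"type": "text", "text": "Explain the idea of reasoning in a latent visual space in two sentences."}]}
]
text_prompt = processor.apply_chat_template(text_messages, add_generation_prompt=True, tokenize=False)
text_inputs = processor(text=text_prompt, return_tensors="pt").to(model.device)
text_output = model.generate(**text_inputs, max_new_tokens=256, do_sample=False)
print(processor.batch_decode(text_output[:, text_inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```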

## Citation

If you find our work helpful or inspiring, please feel free to cite it:

```bibtex
@misc{wang2025monetreasoninglatentvisual,
      title={Monet: Reasoning in Latent Visual Space Beyond Images and Language},
      author={Qixun Wang and Yang Shi and Yifei Wang and Yuanxing Zhang and Pengfei Wan and Kun Gai and Xianghua Ying and Yisen Wang},
      year={2025},
      eprint={2511.21395},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21395},
}
```