Add comprehensive model card for Monet-7B
#1
by nielsr (HF Staff) - opened
README.md (ADDED)
@@ -0,0 +1,91 @@
---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# Monet: Reasoning in Latent Visual Space Beyond Images and Language

**Monet** is a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. It aims to achieve human-like abstract visual thinking, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps.

This model is introduced in the paper:
[**Monet: Reasoning in Latent Visual Space Beyond Images and Language**](https://huggingface.co/papers/2511.21395)

<p align="center">
  <img src="https://github.com/NOVAglow646/Monet/raw/main/images/overview.png" alt="Monet Overview" width="700">
</p>
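
To make the idea concrete, the toy sketch below (an illustration only, not the authors' implementation) contrasts a discrete text token with a continuous latent visual thought: the latter is an embedding appended directly to the reasoning sequence, with no vocabulary lookup in between.

```python
import torch

# Toy illustration only (not Monet's actual code): a latent visual thought is a
# continuous embedding inserted into the sequence directly, rather than a discrete
# token that is first decoded from the vocabulary and then re-embedded.
hidden_size = 8
text_embeddings = torch.randn(1, 4, hidden_size)        # embeddings of 4 text tokens
latent_visual_thought = torch.randn(1, 1, hidden_size)  # continuous embedding, no token id
sequence = torch.cat([text_embeddings, latent_visual_thought], dim=1)
print(sequence.shape)  # torch.Size([1, 5, 8])
```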

## Installation and Code

The official implementation, training scripts, and further details can be found in the project's GitHub repository:
[https://github.com/NOVAglow646/Monet](https://github.com/NOVAglow646/Monet)

To set up the environment, please refer to the installation instructions in the GitHub repository. Note that the model uses customized `Qwen2.5-VL-7B` components, which require the specific modifications detailed there.
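
After installation, a quick sanity check like the sketch below can confirm that the core dependencies import and that a GPU is visible (the exact package versions the authors use are listed in the repository and are not restated here):

```python
# Minimal environment check; consult the GitHub repository for the exact
# version requirements, which are intentionally not duplicated here.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```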

## Usage

The Monet-7B model can be loaded and used with the Hugging Face `transformers` library. Due to custom model components, `trust_remote_code=True` is required.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load the model and processor.
# The model ships custom code, so trust_remote_code=True is required. If the
# AutoModelForImageTextToText class does not match the custom classes in the
# repository, use the model class referenced in the repo's config instead.
model_id = "NOVAglow646/Monet-7B"  # Replace with the actual model repository name if different
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Example: image understanding (image-to-text).
# Replace "path/to/your/image.png" with the path to an actual image file,
# e.g. image = Image.open("your_image.png").
try:
    image = Image.open("path/to/your/image.png")  # placeholder path
except FileNotFoundError:
    print("Please replace 'path/to/your/image.png' with a valid image file path.")
    raise SystemExit

# Prepare the chat messages; the image itself is passed to the processor below.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the image in detail."},
        ],
    }
]

# Apply the chat template and process inputs
prompt_text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt_text, images=image, return_tensors="pt").to(model.device)

# Generate output
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

# Decode only the newly generated tokens (dropping the prompt) and print the response
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

# The model is also capable of other tasks like Text-to-Image and Omni-Potent
# multimodal interactions. Refer to the GitHub repository for more advanced usage
# examples and demos, especially regarding latent reasoning with `<abs_vis_token>`.
```
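
The example above mentions `<abs_vis_token>`, the special token associated with latent visual reasoning. As a sketch only (the token name and how it is registered should be verified against the repository), you can continue from the loaded `processor` and check whether the token exists in the tokenizer's vocabulary before building on it:

```python
# Sketch: check whether the latent-reasoning token is registered in the tokenizer.
# The token string "<abs_vis_token>" is taken from the description above and is an
# assumption; adjust it if the released tokenizer defines a different name.
token = "<abs_vis_token>"
vocab = processor.tokenizer.get_vocab()
if token in vocab:
    print(f"{token} is registered with id {vocab[token]}")
else:
    print(f"{token} is not in the vocabulary; see the GitHub repository for setup details.")
```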

## Citation

If you find our work helpful or inspiring, please feel free to cite it:

```bibtex
@misc{wang2025monetreasoninglatentvisual,
      title={Monet: Reasoning in Latent Visual Space Beyond Images and Language},
      author={Qixun Wang and Yang Shi and Yifei Wang and Yuanxing Zhang and Pengfei Wan and Kun Gai and Xianghua Ying and Yisen Wang},
      year={2025},
      eprint={2511.21395},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21395},
}
```