nielsr (HF Staff) committed · Commit 7318f00 · verified · 1 Parent(s): 8bdef08

Add comprehensive model card for Monet-7B


This PR adds a comprehensive model card for the Monet-7B model, linking it to the paper [Monet: Reasoning in Latent Visual Space Beyond Images and Language](https://huggingface.co/papers/2511.21395).

It includes:
- Essential metadata: `license: cc-by-nc-4.0`, `library_name: transformers`, and `pipeline_tag: image-text-to-text`.
- Links to the official paper and the GitHub repository.
- An overview image from the GitHub repository.
- A sample usage code snippet demonstrating how to use the model with the `transformers` library, noting that `trust_remote_code=True` is required due to custom components.
- The BibTeX citation for the paper.

This update improves discoverability and ease of use for researchers and developers on the Hugging Face Hub.

Files changed (1)
  1. README.md +91 -0
README.md ADDED
@@ -0,0 +1,91 @@
---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# Monet: Reasoning in Latent Visual Space Beyond Images and Language

**Monet** is a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. It aims to achieve human-like abstract visual thinking, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps.

This model is introduced in the paper [**Monet: Reasoning in Latent Visual Space Beyond Images and Language**](https://huggingface.co/papers/2511.21395).

<p align="center">
  <img src="https://github.com/NOVAglow646/Monet/raw/main/images/overview.png" alt="Monet Overview" width="700">
</p>

## Installation and Code
The official implementation, training scripts, and further details can be found in the project's GitHub repository:
[https://github.com/NOVAglow646/Monet](https://github.com/NOVAglow646/Monet)

To set up the environment, please refer to the installation instructions in the GitHub repository. Note that the model uses customized `Qwen2.5-VL-7B` components, requiring specific modifications as detailed in the repository.

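As an optional, purely illustrative sanity check before downloading the full weights, you can load only the configuration and inspect which custom architecture and `Auto` classes the repository declares. The repo id below mirrors the usage example further down and should be adjusted if the model is published under a different name.

```python
from transformers import AutoConfig

# Sketch: inspect the custom components declared by the repository before loading the model.
config = AutoConfig.from_pretrained("NOVAglow646/Monet-7B", trust_remote_code=True)
print(getattr(config, "architectures", None))  # custom Qwen2.5-VL-derived class name(s), if declared
print(getattr(config, "auto_map", None))       # mapping from Auto* classes to the custom code, if any
```
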
## Usage
The Monet-7B model can be loaded and used with the Hugging Face `transformers` library. Due to custom model components, `trust_remote_code=True` is required.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load the model and processor.
# The exact Auto class is determined by the repository's custom code (its `auto_map`);
# `AutoModelForImageTextToText` matches the image-text-to-text pipeline tag, but check
# the repository if loading fails.
model_id = "NOVAglow646/Monet-7B"  # Replace with the actual model repository name if different
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Example: image understanding (image-to-text).
# Replace "path/to/your/image.png" with the path to an actual image file.
try:
    image = Image.open("path/to/your/image.png")  # Placeholder path
except FileNotFoundError:
    raise SystemExit("Please replace 'path/to/your/image.png' with a valid image file path.")

# Prepare the chat messages. The image itself is passed to the processor below;
# the message only marks where the image goes in the prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the image in detail."},
        ],
    }
]

# Apply the chat template and process the inputs.
prompt_text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt_text], images=[image], return_tensors="pt").to(model.device)

# Generate output.
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

# Decode only the newly generated tokens (dropping the prompt) and print the response.
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

# The model also targets other tasks, such as text-to-image and broader multimodal interaction.
# Refer to the GitHub repository for more advanced usage examples and demos,
# especially regarding latent reasoning with `<abs_vis_token>`.
```

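The latent-reasoning interface built around `<abs_vis_token>` is documented in the GitHub repository. As a minimal, unofficial sketch that continues from the snippet above, you can check whether the released tokenizer registers that token and how often it appears in a generation (the token name is an assumption taken from the repository's description):

```python
# Sketch only: assumes the checkpoint's tokenizer registers "<abs_vis_token>" as described
# in the GitHub repository; adjust if the released latent-reasoning interface differs.
tokenizer = processor.tokenizer
vocab = tokenizer.get_vocab()

if "<abs_vis_token>" in vocab:
    abs_vis_id = vocab["<abs_vis_token>"]
    # Count latent visual "thought" slots in the generation from the example above.
    n_latent = (output_ids[0] == abs_vis_id).sum().item()
    print(f"<abs_vis_token> id: {abs_vis_id}, occurrences in output: {n_latent}")
else:
    print("`<abs_vis_token>` is not in the tokenizer vocabulary; see the GitHub "
          "repository for the intended latent-reasoning workflow.")
```
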
## Citation
If you find our work helpful or inspiring, please feel free to cite it:
```bibtex
@misc{wang2025monetreasoninglatentvisual,
      title={Monet: Reasoning in Latent Visual Space Beyond Images and Language},
      author={Qixun Wang and Yang Shi and Yifei Wang and Yuanxing Zhang and Pengfei Wan and Kun Gai and Xianghua Ying and Yisen Wang},
      year={2025},
      eprint={2511.21395},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21395},
}
```