+---
+license: apache-2.0
+language:
+- en
+base_model:
+- Qwen/Qwen2-VL-2B-Instruct
+tags:
+- remote-sensing
+---
+# Adapting Multimodal Large Language Models to Domains via Post-Training
+This repo contains the **remote sensing MLLM developed from Qwen2-VL-2B-Instruct** in our paper: [On Domain-Specific Post-Training for Multimodal Large Language Models](https://huggingface.co/papers/2411.19930).
+The main project page is [Adapt-MLLM-to-Domains](https://huggingface.co/AdaptLLM/Adapt-MLLM-to-Domains).
+## Resources
+**🤗 We share our data and models with example usage. Feel free to open issues or discussions! 🤗**
+| Model                                                                       | Repo ID in HF 🤗                           | Domain       | Base Model              | Training Data                                                                                  | Evaluation Benchmark |
+|:----------------------------------------------------------------------------|:--------------------------------------------|:--------------|:-------------------------|:------------------------------------------------------------------------------------------------|-----------------------|
+| [Visual Instruction Synthesizer](https://huggingface.co/AdaptLLM/visual-instruction-synthesizer) | AdaptLLM/visual-instruction-synthesizer     | -  | open-llava-next-llama3-8b    | VisionFLAN and ALLaVA | -                   |
+| [AdaMLLM-med-2B](https://huggingface.co/AdaptLLM/biomed-Qwen2-VL-2B-Instruct) | AdaptLLM/biomed-Qwen2-VL-2B-Instruct     | Biomedicine  | Qwen2-VL-2B-Instruct    | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark)                   |
+| [AdaMLLM-food-2B](https://huggingface.co/AdaptLLM/food-Qwen2-VL-2B-Instruct) | AdaptLLM/food-Qwen2-VL-2B-Instruct     | Food  | Qwen2-VL-2B-Instruct    | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) | [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark)                   |
+| [AdaMLLM-remote-sensing-2B](https://huggingface.co/AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct) | AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct     | Remote Sensing  | Qwen2-VL-2B-Instruct    | [remote-sensing-visual-instructions](https://huggingface.co/datasets/AdaptLLM/remote-sensing-visual-instructions) | [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark)                   |
+| [AdaMLLM-med-8B](https://huggingface.co/AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B) | AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B     | Biomedicine  | open-llava-next-llama3-8b    | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark)                   |
+| [AdaMLLM-food-8B](https://huggingface.co/AdaptLLM/food-LLaVA-NeXT-Llama3-8B) |AdaptLLM/food-LLaVA-NeXT-Llama3-8B     | Food  | open-llava-next-llama3-8b    | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) |  [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark)                   |
+| [AdaMLLM-remote-sensing-8B](https://huggingface.co/AdaptLLM/remote-sensing-LLaVA-NeXT-Llama3-8B) | AdaptLLM/remote-sensing-LLaVA-NeXT-Llama3-8B     | Remote Sensing  | open-llava-next-llama3-8b    | [remote-sensing-visual-instructions](https://huggingface.co/datasets/AdaptLLM/remote-sensing-visual-instructions) | [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark)                   |
+| [AdaMLLM-med-11B](https://huggingface.co/AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct     | Biomedicine  | Llama-3.2-11B-Vision-Instruct    | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark)                   |
+| [AdaMLLM-food-11B](https://huggingface.co/AdaptLLM/food-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/food-Llama-3.2-11B-Vision-Instruct     | Food | Llama-3.2-11B-Vision-Instruct    | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) |  [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark)                   |
+| [AdaMLLM-remote-sensing-11B](https://huggingface.co/AdaptLLM/remote-sensing-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/remote-sensing-Llama-3.2-11B-Vision-Instruct     | Remote Sensing | Llama-3.2-11B-Vision-Instruct    | [remote-sensing-visual-instructions](https://huggingface.co/datasets/AdaptLLM/remote-sensing-visual-instructions) | [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark)                   |
+**Code**: [https://github.com/bigai-ai/QA-Synthesizer](https://github.com/bigai-ai/QA-Synthesizer)
+## 1. To Chat with AdaMLLM
+Our model architecture matches that of the base model, Qwen2-VL-2B-Instruct. We provide a usage example below; refer to the official [Qwen2-VL-2B-Instruct repository](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) for more advanced usage.
+**Note:** For AdaMLLM, always place the image at the beginning of the input instruction in the messages.
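
The image-first rule in the note can be enforced with a small helper. The following is a minimal sketch, not part of the model's or Qwen's API: `build_messages` is a hypothetical name introduced here for illustration, and it only arranges the content list in the message format used by the usage example, with the image entry ahead of the text.

```python
from typing import Dict, List


def build_messages(image: str, instruction: str) -> List[Dict]:
    """Return a single-turn chat whose image entry precedes the text.

    Hypothetical helper for illustration; it is not provided by any library.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},      # image always comes first
                {"type": "text", "text": instruction},  # instruction follows the image
            ],
        }
    ]


messages = build_messages(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    "Describe this image.",
)
# Show the order of content entries in the user turn.
print([item["type"] for item in messages[0]["content"]])
```

The returned list has the same shape as the `messages` variable in the inference example, so it can be passed to `processor.apply_chat_template` and `process_vision_info` in the same way.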
+<details>
+<summary> Click to expand </summary>
+1. Set up
+```bash
+pip install qwen-vl-utils
+```
+2. Inference
+```python
+from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
+from qwen_vl_utils import process_vision_info
+# default: Load the model on the available device(s)
+model = Qwen2VLForConditionalGeneration.from_pretrained(
+    "AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
+)
+# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
+# This variant also needs `import torch` for torch.bfloat16.
+# model = Qwen2VLForConditionalGeneration.from_pretrained(
+#     "AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct",
+#     torch_dtype=torch.bfloat16,
+#     attn_implementation="flash_attention_2",
+#     device_map="auto",
+# )
+# default processor
+processor = AutoProcessor.from_pretrained("AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct")
+# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
+# min_pixels = 256*28*28
+# max_pixels = 1280*28*28
+# processor = AutoProcessor.from_pretrained("AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
+# NOTE: For AdaMLLM, always place the image at the beginning of the input instruction in the messages.
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "image",
+                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
+            },
+            {"type": "text", "text": "Describe this image."},
+        ],
+    }
+]
+# Preparation for inference
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+# Move inputs to the device the model was loaded on (works with or without a GPU)
+inputs = inputs.to(model.device)
+# Inference: Generation of the output
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+```
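
The `min_pixels` and `max_pixels` comments in the example express a visual-token budget in pixels, with each token accounting for a 28x28 pixel area. A minimal sketch of that arithmetic follows; `pixel_budget` is a hypothetical helper name, not a library function.

```python
from typing import Tuple

# Each visual token corresponds to a 28x28 pixel area, as in the comments of the example.
PIXELS_PER_TOKEN = 28 * 28


def pixel_budget(min_tokens: int, max_tokens: int) -> Tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels).

    Hypothetical helper for illustration only.
    """
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


# The 256-1280 token range mentioned in the example.
min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)  # 200704 1003520
```

The two values can then be passed as `min_pixels` and `max_pixels` to `AutoProcessor.from_pretrained`, as in the commented-out lines of the example.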
+</details>
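
In the usage example, `model.generate` returns each sequence as the prompt tokens followed by the newly generated tokens, which is why the first `len(in_ids)` tokens are sliced off before decoding. The same slicing is shown here on plain Python lists; the token IDs are made up for illustration and do not come from the real tokenizer.

```python
# Toy token IDs invented for illustration; real IDs come from the processor and the model.
input_ids = [[101, 7, 8], [101, 9, 10]]  # two prompts, three tokens each
generated_ids = [
    [101, 7, 8, 42, 43],    # prompt 1 followed by two new tokens
    [101, 9, 10, 55, 56],   # prompt 2 followed by two new tokens
]

# Drop the prompt prefix from each output so only the new tokens remain.
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)  # [[42, 43], [55, 56]]
```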
+## 2. To Evaluate Any MLLM on Domain-Specific Benchmarks
+Refer to the [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark) to reproduce our results and evaluate many other MLLMs on domain-specific benchmarks.
+## 3. To Reproduce this Domain-Adapted MLLM
+See the [Post-Train Guide](https://github.com/bigai-ai/QA-Synthesizer/blob/main/docs/Post_Train.md) to adapt MLLMs to domains.
+## Citation
+If you find our work helpful, please cite us.
+[AdaMLLM](https://huggingface.co/papers/2411.19930)
+```bibtex
+@article{adamllm,
+  title={On Domain-Specific Post-Training for Multimodal Large Language Models},
+  author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
+  journal={arXiv preprint arXiv:2411.19930},
+  year={2024}
+}
+```
+[Adapt LLM to Domains](https://huggingface.co/papers/2309.09530) (ICLR 2024)
+```bibtex
+@inproceedings{cheng2024adapting,
+  title={Adapting Large Language Models via Reading Comprehension},
+  author={Daixuan Cheng and Shaohan Huang and Furu Wei},
+  booktitle={The Twelfth International Conference on Learning Representations},
+  year={2024},
+  url={https://openreview.net/forum?id=y886UXPEZ0}
+}
+```