arxiv:2510.23479

MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding

Published on Oct 27 · Submitted by Xin Jin on Oct 28
AI-generated summary

MergeMix, a training-time augmentation method, combines attention-aware image mixing with preference-driven training to improve vision-language alignment in multi-modal large language models, with gains in both efficiency and accuracy.

Abstract

Vision-language alignment in multi-modal large language models (MLLMs) typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). SFT is stable and efficient but requires large-scale human annotations and cannot capture subtle preferences, while RL introduces a reward signal at the cost of computational overhead and training instability. These limitations highlight a trade-off between scalability, robustness, and alignment quality. To address this, we propose MergeMix, a training-time augmentation paradigm that bridges SFT and RL. It first applies attention-aware image mixing via token merging, which preserves richer cluster representations and spatial context; it then introduces a preference-driven training paradigm for MLLMs, building preference pairs from mixed and raw images and optimizing with the SimPO loss. As a mixup augmentation, MergeMix enhances attention consistency and efficiency, surpassing other heuristic-based methods in classification. Extensive experiments demonstrate that MergeMix achieves competitive accuracy with improved efficiency, providing a scalable approach to preference alignment in classification and MLLMs.
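
For readers unfamiliar with the loss the abstract mentions: SimPO is a reference-model-free preference objective that scores each response by its length-normalized log-likelihood. Below is a minimal PyTorch sketch of that objective. The β/γ values and the pairing direction (which response in a raw-vs-mixed pair counts as preferred) are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps: torch.Tensor,
               rejected_logps: torch.Tensor,
               chosen_lens: torch.Tensor,
               rejected_lens: torch.Tensor,
               beta: float = 2.0,
               gamma: float = 0.5) -> torch.Tensor:
    """SimPO: length-normalized log-likelihood as the implicit reward,
    with a target reward margin gamma and no reference model.

    chosen_logps / rejected_logps: summed token log-probs of each response,
    e.g. the same answer conditioned on the raw vs. the mixed image
    (the assumed MergeMix-style pairing).
    """
    # Average log-probability per token acts as the implicit reward.
    reward_chosen = beta * chosen_logps / chosen_lens
    reward_rejected = beta * rejected_logps / rejected_lens
    # Bradley-Terry preference loss with margin gamma.
    return -F.logsigmoid(reward_chosen - reward_rejected - gamma).mean()
```

Because there is no reference model, the only per-pair inputs are the policy's own log-probabilities and response lengths, which is what makes the objective cheap relative to DPO-style losses.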

Community

Paper submitter

TL;DR: We propose MergeMix, a unified Mixup-based augmentation method.
It leverages the source matrix from Token Merge to generate mixed augmented samples with richer feature continuity for image classification, and uses these augmented samples for preference-pair tuning to strengthen the alignment capability of MLLMs.
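
The page does not spell out how the mixing mask is derived from the Token Merge source matrix, so the sketch below is only illustrative: it assumes a hypothetical per-patch score (for example, how many original tokens each merged cluster absorbs, as a saliency proxy) and keeps the top-scoring patches of one image while filling the rest from the other, CutMix-style.

```python
import torch

def source_guided_mix(img_a, img_b, patch_scores, lam=0.5, patch=16):
    """Illustrative, CutMix-style mixing guided by a per-patch score.

    img_a, img_b:  (C, H, W) tensors of the same size
    patch_scores:  (num_patches,) hypothetical saliency proxy for img_a,
                   e.g. cluster sizes read off a ToMe source matrix
    lam:           fraction of img_a patches to keep
    """
    C, H, W = img_a.shape
    gh, gw = H // patch, W // patch
    n_keep = max(1, int(lam * gh * gw))

    # Keep img_a's highest-scoring patches; take the rest from img_b.
    keep = torch.zeros(gh * gw, dtype=torch.bool)
    keep[patch_scores.topk(n_keep).indices] = True
    mask = (keep.view(gh, gw)
                .repeat_interleave(patch, dim=0)
                .repeat_interleave(patch, dim=1))

    mixed = torch.where(mask, img_a, img_b)
    # Mix labels with the realized area ratio, as in CutMix:
    # y = lam_eff * y_a + (1 - lam_eff) * y_b
    lam_eff = mask.float().mean().item()
    return mixed, lam_eff
```

Selecting patches by a token-merge-derived score rather than a random box is what makes the mix attention-aware; the exact score construction in MergeMix may differ from this sketch.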

🌐 Webpage: https://github.com/JinXins/MergeMix_Web
💻 Code:

🌟 In this work, we mainly address two core challenges:

  1. How to achieve an optimal trade-off between efficiency and performance in saliency-based Mixup methods.
  2. How to properly extend Mixup to preference tuning, transitioning from traditional image corruption to data-dependent sample generation.

📊 Our results show that:

  • For image classification, our Token Merge–based design achieves an excellent balance between performance and computational efficiency.
  • On LLaVA benchmarks, even with a small number of vision tokens during training and inference, MergeMix surpasses the performance of the full-token LLaVA model.
  • It also brings consistent improvements in robustness across both image classification and multi-modal tasks.

💡 This work represents a new attempt. We aim to:

  1. Look back: revisit classical machine learning methods and explore their potential in the era of large models;
  2. Repurpose: enable traditional techniques like Mixup to shine anew in the LLM/MLLM era.

