Multimodal Analysis
Analyzing The Language of Visual Tokens
Paper • 2411.05001 • Published • 23
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
Paper • 2411.14982 • Published • 19
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
Paper • 2411.17686 • Published • 19
On the Limitations of Vision-Language Models in Understanding Image Transforms
Paper • 2503.09837 • Published • 10
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Paper • 2503.12605 • Published • 35
When Less is Enough: Adaptive Token Reduction for Efficient Image Representation
Paper • 2503.16660 • Published • 72
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
Paper • 2503.12821 • Published • 9
Scaling Laws for Native Multimodal Models
Paper • 2504.07951 • Published • 30
Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models
Paper • 2505.14071 • Published • 1
MLLMs are Deeply Affected by Modality Bias
Paper • 2505.18657 • Published • 5
To Trust Or Not To Trust Your Vision-Language Model's Prediction
Paper • 2505.23745 • Published • 4
Vision Language Models are Biased
Paper • 2505.23941 • Published • 23
Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning
Paper • 2506.04755 • Published • 37
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
Paper • 2506.09040 • Published • 34
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks
Paper • 2507.01955 • Published • 35
Robust Multimodal Large Language Models Against Modality Conflict
Paper • 2507.07151 • Published • 5
Automating Steering for Safe Multimodal Large Language Models
Paper • 2507.13255 • Published • 3
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Paper • 2507.20198 • Published • 26
Adapting Vision-Language Models Without Labels: A Comprehensive Survey
Paper • 2508.05547 • Published • 11
Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success
Paper • 2508.04280 • Published • 35
Controlling Multimodal LLMs via Reward-guided Decoding
Paper • 2508.11616 • Published • 7
IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding
Paper • 2508.09456 • Published • 8
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Paper • 2508.18264 • Published • 25
Visual Representation Alignment for Multimodal Large Language Models
Paper • 2509.07979 • Published • 83
Lost in Embeddings: Information Loss in Vision-Language Models
Paper • 2509.11986 • Published • 28
LLM-I: LLMs are Naturally Interleaved Multimodal Creators
Paper • 2509.13642 • Published • 9
When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs
Paper • 2509.16633 • Published • 2
Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Paper • 2509.22496 • Published • 3
Visual Jigsaw Post-Training Improves MLLMs
Paper • 2509.25190 • Published • 35
On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models
Paper • 2510.09008 • Published • 15
RL makes MLLMs see better than SFT
Paper • 2510.16333 • Published • 48
Revisiting Multimodal Positional Encoding in Vision-Language Models
Paper • 2510.23095 • Published • 20
Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
Paper • 2510.25616 • Published • 96
Contamination Detection for VLMs using Multi-Modal Semantic Perturbation
Paper • 2511.03774 • Published • 12
Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings
Paper • 2511.05017 • Published • 8
10 Open Challenges Steering the Future of Vision-Language-Action Models
Paper • 2511.05936 • Published • 5
Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models
Paper • 2511.09809 • Published • 4
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
Paper • 2511.19418 • Published • 28
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Paper • 2511.17487 • Published • 10
Architecture Decoupling Is Not All You Need For Unified Multimodal Model
Paper • 2511.22663 • Published • 29
Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs
Paper • 2511.22826 • Published • 7
Rethinking Chain-of-Thought Reasoning for Videos
Paper • 2512.09616 • Published • 17
Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
Paper • 2512.08923 • Published
An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges
Paper • 2512.11362 • Published • 21
Masking Teacher and Reinforcing Student for Distilling Vision-Language Models
Paper • 2512.22238 • Published • 18