LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
Abstract
LongVT, an end-to-end framework, enhances long video reasoning by interleaving global and local analysis using multimodal tools, outperforming existing methods on challenging benchmarks.
Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought, yet they remain vulnerable to hallucinations, especially on long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via an interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool that zooms in on a specific video clip and resamples finer-grained video frames. This global-to-local reasoning loop continues until the answer is grounded in the retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for long video reasoning, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. The training data comprise 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning. The evaluation benchmark consists of 1,280 QA pairs carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms strong baselines across four challenging long-video understanding and reasoning benchmarks. Our code, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT.
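To make the global-to-local loop concrete, below is a minimal Python sketch of the skim-then-zoom tool-calling pattern described above. The `chat_model` callable, the `<crop>start,end</crop>` tag format, the `sample_frames` helper, and the sampling rates are illustrative assumptions for exposition, not LongVT's actual interface; see the released code for the real implementation.

```python
import re
import cv2  # OpenCV, used here only to decode frames at a given sampling rate

# Hypothetical tool-call tag the model emits to request a temporal crop,
# e.g. "<crop>1830.0, 1895.5</crop>". LongVT's real tool schema may differ.
CROP_TOOL = re.compile(r"<crop>\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*</crop>")

def sample_frames(video_path, start_s=None, end_s=None, fps=0.5):
    """Decode frames from [start_s, end_s) at roughly `fps` frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    duration = cap.get(cv2.CAP_PROP_FRAME_COUNT) / native_fps
    start_s = 0.0 if start_s is None else max(start_s, 0.0)
    end_s = duration if end_s is None else min(end_s, duration)
    frames, t = [], start_s
    while t < end_s:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
        t += 1.0 / fps
    cap.release()
    return frames

def answer_long_video(video_path, question, chat_model,
                      max_rounds=4, coarse_fps=0.2, fine_fps=2.0):
    """Global-to-local loop: skim sparse frames of the whole video, then let the
    model zoom in on clips it deems relevant until it commits to an answer."""
    # Round 0: global skim over sparsely sampled frames of the full video.
    frames = sample_frames(video_path, fps=coarse_fps)
    messages = [{"role": "user", "content": [question, *frames]}]
    reply = ""

    for _ in range(max_rounds):
        reply = chat_model(messages)      # interleaved reasoning + optional tool call
        messages.append({"role": "assistant", "content": reply})

        call = CROP_TOOL.search(reply)
        if call is None:                  # no tool call -> treat reply as the final answer
            return reply

        # Local zoom-in: resample the requested clip at a higher frame rate
        # and feed the finer-grained frames back as the tool response.
        start_s, end_s = float(call.group(1)), float(call.group(2))
        clip = sample_frames(video_path, start_s, end_s, fps=fine_fps)
        messages.append({"role": "tool", "content": clip})

    return reply  # budget exhausted: return the last (possibly ungrounded) reply
```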
Community
Project Page: https://evolvinglmms-lab.github.io/LongVT/
Tech Report: https://arxiv.org/abs/2511.20785
GitHub Repo: https://github.com/EvolvingLMMs-Lab/LongVT
Data and Models: https://huggingface.co/collections/lmms-lab/longvt
Demo App: https://huggingface.co/spaces/longvideotool/LongVT-Demo
Blog Post: https://www.lmms-lab.com/posts/longvt/
Related papers (recommended via the Semantic Scholar API):
- Video-CoM: Interactive Video Reasoning via Chain of Manipulations (2025)
- Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence (2025)
- Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination (2025)
- DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning (2025)
- Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models (2025)
- Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding (2025)
- Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning (2025)