Trending Papers

by AK and the research community

Submitted by

taesiri

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy, respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.

54 authors

· Published on Nov 14, 2025

GitHub 4.59k arXiv Page

Submitted by

taesiri

LTX-2: Efficient Joint Audio-Visual Foundation Model

LTX-2 is an open-source audiovisual diffusion model that generates synchronized video and audio content using a dual-stream transformer architecture with cross-modal attention and classifier-free guidance.

29 authors

· Published on Jan 6, 2026

GitHub 2.14k arXiv Page

Submitted by

xiaochonglinghu

Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization

Large vision-language models are enhanced for image geolocalization by incorporating map-based reasoning and agent-in-the-map loop optimization, achieving superior accuracy compared to existing models.

alibaba-inc · Published on Jan 8, 2026

GitHub 102 arXiv Page

Submitted by

thenlper

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

The Qwen3-VL-Embedding and Qwen3-VL-Reranker models form an end-to-end multimodal search pipeline, leveraging multi-stage training and cross-attention mechanisms to achieve high-precision retrieval across diverse modalities.

Qwen · Published on Jan 8, 2026

GitHub 600 arXiv Page

Submitted by

JiaaqiLiu

SimpleMem: Efficient Lifelong Memory for LLM Agents

To support reliable long-term interaction in complex environments, LLM agents require memory systems that efficiently manage historical experiences. Existing approaches either retain full interaction histories via passive context extension, leading to substantial redundancy, or rely on iterative reasoning to filter noise, incurring high token costs. To address this challenge, we introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization: (1) Semantic Structured Compression, which applies entropy-aware filtering to distill unstructured interactions into compact, multi-view indexed memory units; (2) Recursive Memory Consolidation, an asynchronous process that integrates related units into higher-level abstract representations to reduce redundancy; and (3) Adaptive Query-Aware Retrieval, which dynamically adjusts retrieval scope based on query complexity to construct precise context efficiently. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost, achieving an average F1 improvement of 26.4% while reducing inference-time token consumption by up to 30-fold, demonstrating a superior balance between performance and efficiency. Code is available at https://github.com/aiming-lab/SimpleMem.

8 authors

· Published on Jan 5, 2026

GitHub 824 arXiv Page

Submitted by

andito

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

SmolDocling is a compact vision-language model that performs end-to-end document conversion with robust performance across various document types using 256M parameters and a new markup format.

IBM Granite · Published on Mar 14, 2025

GitHub 49.8k arXiv Page

VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos

VideoRAG enhances large language models for multi-modal video processing with a dual-channel architecture that integrates textual knowledge grounding and multi-modal context encoding.

6 authors

· Published on Feb 3, 2025

GitHub 2.43k arXiv Page

Submitted by

Viglong

Orient Anything V2: Unifying Orientation and Rotation Understanding

Orient Anything V2 enhances 3D orientation understanding through scalable 3D asset synthesis, symmetry-aware periodic distribution fitting, and multi-frame relative rotation prediction, achieving state-of-the-art performance across multiple benchmarks.

8 authors

· Published on Jan 9, 2026

GitHub 82 arXiv Page

Submitted by

hao-li

Agent READMEs: An Empirical Study of Context Files for Agentic Coding

Agentic coding tools receive goals written in natural language as input, break them down into specific tasks, and write or execute the actual code with minimal human intervention. Central to this process are agent context files ("READMEs for agents") that provide persistent, project-level instructions. In this paper, we conduct the first large-scale empirical study of 2,303 agent context files from 1,925 repositories to characterize their structure, maintenance, and content. We find that these files are not static documentation but complex, difficult-to-read artifacts that evolve like configuration code, maintained through frequent, small additions. Our content analysis of 16 instruction types shows that developers prioritize functional context, such as build and run commands (62.3%), implementation details (69.9%), and architecture (67.7%). We also identify a significant gap: non-functional requirements like security (14.5%) and performance (14.5%) are rarely specified. These findings indicate that while developers use context files to make agents functional, they provide few guardrails to ensure that agent-written code is secure or performant, highlighting the need for improved tooling and practices.

11 authors

· Published on Nov 17, 2025

GitHub 14.9k arXiv Page

Scaling Large-Language-Model-based Multi-Agent Collaboration

Multi-agent collaboration networks enhance collective intelligence, outperforming baselines across various topologies and showing emergent abilities earlier than neural scaling laws suggest.

10 authors

· Published on Jun 11, 2024

GitHub 28.3k arXiv Page

Submitted by

akhaliq

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified framework enabling efficient fine-tuning of large language models across various tasks using a web-based user interface.

5 authors

· Published on Mar 20, 2024

GitHub 65.5k arXiv Page

Submitted by

taesiri

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

PaddleOCR-VL, a vision-language model combining NaViT-style dynamic resolution and ERNIE, achieves state-of-the-art performance in document parsing and element recognition with high efficiency.

PaddlePaddle · Published on Oct 16, 2025

GitHub 67.9k arXiv Page

Submitted by

akhaliq

Efficient Memory Management for Large Language Model Serving with PagedAttention

PagedAttention algorithm and vLLM system enhance the throughput of large language models by efficiently managing memory and reducing waste in the key-value cache.

9 authors

· Published on Sep 12, 2023

GitHub 67.4k arXiv Page

Submitted by

wanderkid

MinerU: An Open-Source Solution for Precise Document Content Extraction

MinerU is an open-source tool that enhances document content extraction using fine-tuned models and pre/postprocessing rules across diverse document types.

18 authors

· Published on Sep 27, 2024

GitHub 52k arXiv Page

Submitted by

taesiri

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.

61 authors

· Published on Sep 26, 2025

GitHub 52k arXiv Page

Submitted by

taesiri

HunyuanVideo 1.5 Technical Report

HunyuanVideo 1.5 is a lightweight video generation model with state-of-the-art visual quality and motion coherence, using a DiT architecture with SSTA and an efficient video super-resolution network.

81 authors

· Published on Nov 24, 2025

GitHub 3.16k arXiv Page

Submitted by

SteveZeyuZhang

AnyDepth: Depth Estimation Made Easy

A lightweight monocular depth estimation framework uses DINOv3 as visual encoder and a compact transformer decoder to achieve higher accuracy with reduced computational overhead and improved data quality.

Peking University · Published on Jan 6, 2026

GitHub 47 arXiv Page

Multi-Agent Software Development through Cross-Team Collaboration

Cross-Team Collaboration improves software quality by enabling multiple LLM agent teams to propose and communicate decisions.

8 authors

· Published on Jun 13, 2024

GitHub 28.3k arXiv Page

Submitted by

taesiri

Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models

Youtu-LLM is a lightweight language model optimized for computational efficiency and agentic intelligence through a compact architecture, STEM-focused training curriculum, and scalable mid-training strategies for planning and reasoning tasks.

Tencent · Published on Dec 31, 2025

GitHub 398 arXiv Page

Submitted by

sliuau

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Multi-reward reinforcement learning suffers from reward normalization collapse in GRPO, which GDPO addresses by decoupling reward normalization for improved training stability and performance across reasoning tasks.

NVIDIA · Published on Jan 8, 2026

GitHub 173 arXiv Page

Submitted by

buaahsh

BitNet Distillation

BitNet Distillation fine-tunes large language models to 1.58-bit precision using SubLN, multi-head attention distillation, and continual pre-training, achieving comparable performance with significant memory and inference speed improvements.

Microsoft Research · Published on Oct 15, 2025

GitHub 25.7k arXiv Page

TradingAgents: Multi-Agents LLM Financial Trading Framework

A multi-agent framework using large language models for stock trading simulates real-world trading firms, improving performance metrics like cumulative returns and Sharpe ratio.

4 authors

· Published on Dec 28, 2024

GitHub 28k arXiv Page

Self-Supervised Prompt Optimization

A self-supervised framework optimizes prompts for both closed and open-ended tasks by evaluating LLM outputs without external references, reducing costs and required data.

9 authors

· Published on Feb 7, 2025

GitHub 62.8k arXiv Page

Bitnet.cpp: Efficient Edge Inference for Ternary LLMs

Bitnet.cpp enhances edge inference for ternary LLMs using a novel mixed-precision matrix multiplication library, achieving significant speed improvements over baselines.

10 authors

· Published on Feb 17, 2025

GitHub 25.7k arXiv Page

Submitted by

hongyuw

BitNet b1.58 2B4T Technical Report

BitNet b1.58 2B4T, a 1-bit Large Language Model with 2 billion parameters, matches the performance of full-precision models while improving computational efficiency.

8 authors

· Published on Apr 16, 2025

GitHub 25.7k arXiv Page

Submitted by

XuGuo699

DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer

A novel video face swapping framework combines image face swapping techniques with diffusion transformers and curriculum learning to achieve superior identity preservation and visual realism.

ByteDance · Published on Jan 4, 2026

GitHub 404 arXiv Page

Submitted by

akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

5 authors

· Published on Apr 28, 2025

GitHub 45.4k arXiv Page

Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting

Dolphin, a multimodal document image parsing model, uses heterogeneous anchor prompting to achieve state-of-the-art performance on diverse page-level and element-level tasks through an efficient analyze-then-parse paradigm.

13 authors

· Published on May 20, 2025

GitHub 8.61k arXiv Page

Submitted by

unilm

VibeVoice Technical Report

VibeVoice synthesizes long-form multi-speaker speech using next-token diffusion and a highly efficient continuous speech tokenizer, achieving superior performance and fidelity.

Microsoft Research · Published on Aug 26, 2025

GitHub 20.2k arXiv Page

Submitted by

rajkumarrawal

Recursive Language Models

We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs successfully handle inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of base LLMs and common long-context scaffolds across four diverse long-context tasks, while having comparable (or cheaper) cost per query.

Massachusetts Institute of Technology · Published on Dec 31, 2025

GitHub 927 arXiv Page

Submitted by

Paper99

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image, a 6B-parameter Scalable Single-Stream Diffusion Transformer (S3-DiT) model, achieves high-performance image generation with reduced computational cost, offering sub-second inference and compatibility with consumer hardware.

Tongyi-MAI · Published on Nov 27, 2025

GitHub 8.9k arXiv Page

Submitted by

Cxxs

Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield

The study reveals that in text-to-image generation, CFG Augmentation is the primary driver of few-step distillation in Distribution Matching Distillation (DMD), while the distribution matching term acts as a regularizer.

Tongyi-MAI · Published on Nov 27, 2025

GitHub 8.9k arXiv Page

Submitted by

taesiri

UniVideo: Unified Understanding, Generation, and Editing for Videos

UniVideo, a dual-stream framework combining a Multimodal Large Language Model and a Multimodal DiT, extends unified modeling to video generation and editing, achieving state-of-the-art performance and supporting task composition and generalization.

Kling Team · Published on Oct 9, 2025

GitHub 263 arXiv Page

Submitted by

akhaliq

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

OpenDevin is a platform for developing AI agents that interact with the world by writing code, using command lines, and browsing the web, with support for multiple agents and evaluation benchmarks.

24 authors · Published on Jul 23, 2024

GitHub 66.5k arXiv Page

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

IndexTTS, an enhanced text-to-speech system combining XTTS and Tortoise models, offers improved naturalness, enhanced voice cloning, and controllable usage through hybrid character-pinyin modeling and optimized vector quantization.

5 authors · Published on Feb 8, 2025

GitHub 17.8k arXiv Page

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep, a memory layer service, outperforms MemGPT in the DMR benchmark and LongMemEval by excelling in dynamic knowledge integration and temporal reasoning, critical for enterprise use cases.

5 authors · Published on Jan 20, 2025

GitHub 21.9k arXiv Page

MediaPipe: A Framework for Building Perception Pipelines

The MediaPipe framework facilitates the development of perception applications by providing tools for combining components, prototyping, and measuring performance across platforms.

14 authors · Published on Jun 14, 2019

GitHub 33k arXiv Page

Submitted by

amael-apple

Sharp Monocular View Synthesis in Less Than a Second

SHARP synthesizes photorealistic views from a single image using a 3D Gaussian representation, achieving state-of-the-art results with rapid processing.

Apple · Published on Dec 11, 2025

GitHub 6.77k arXiv Page

Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

mmGRPO, a multi-module extension of GRPO, enhances accuracy in modular AI systems by optimizing LM calls and prompts across various tasks.

13 authors · Published on Aug 6, 2025

GitHub 31.5k arXiv Page

Submitted by

taesiri

SAM 3D: 3Dfy Anything in Images

SAM 3D is a generative model that reconstructs 3D objects from single images using a multi-stage training framework that includes synthetic pretraining and real-world alignment, achieving high performance in human preference tests.

AI at Meta · Published on Nov 20, 2025

GitHub 5.49k arXiv Page

Submitted by

taesiri

DeepCode: Open Agentic Coding

DeepCode, a fully autonomous framework, addresses the challenges of document-to-codebase synthesis by optimizing information flow through source compression, structured indexing, knowledge injection, and error correction, achieving state-of-the-art performance and surpassing human experts.

5 authors · Published on Dec 8, 2025

GitHub 13.8k arXiv Page

Submitted by

taesiri

NitroGen: An Open Foundation Model for Generalist Gaming Agents

NitroGen is a vision-action foundation model trained on extensive gameplay data that demonstrates strong cross-game generalization and effective transfer learning capabilities.

NVIDIA · Published on Jan 4, 2026

GitHub 1.6k arXiv Page

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

The PyTorch distributed data parallel module optimizes large-scale model training using techniques like gradient bucketing, computation-communication overlap, and selective synchronization to achieve near-linear scalability.

11 authors · Published on Jun 28, 2020

GitHub 96.6k arXiv Page

Submitted by

Weiyun1025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

InternVL3 is a multimodal pre-trained language model that jointly learns from both multimodal data and text, improving performance and scalability through advanced techniques and setting a new state-of-the-art in multimodal tasks.

47 authors · Published on Apr 14, 2025

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

PyTorch Fully Sharded Data Parallel (FSDP) enables efficient and scalable training of large models across hardware configurations.

16 authors · Published on Apr 21, 2023

GitHub 96.6k arXiv Page

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving better accuracy and response times.

5 authors · Published on Oct 8, 2024

GitHub 27.2k arXiv Page

Submitted by

Rbin

RAG-Anything: All-in-One RAG Framework

RAG-Anything is a unified framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching, outperforming existing methods on complex benchmarks.

Data Intelligence Lab@HKU · Published on Oct 14, 2025

GitHub 12.1k arXiv Page

Submitted by

dyyyyyyyy

FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Flawed-Aware Policy Optimization (FAPO) enhances reinforcement learning with verifiable rewards by penalizing flawed-positive rollouts, improving reasoning capability and training stability in large language models.

6 authors · Published on Oct 26, 2025

GitHub 18.3k arXiv Page

PDFMathTranslate: Scientific Document Translation Preserving Layouts

PDFMathTranslate enables layout-preserving scientific document translation using large language models and precise layout detection, offering improved precision, flexibility, and efficiency.

4 authors · Published on Jul 2, 2025

GitHub 31.2k arXiv Page

Submitted by

taesiri

SAM 3: Segment Anything with Concepts

Segment Anything Model 3 achieves state-of-the-art performance in promptable concept segmentation and tracking by leveraging a unified model architecture with decoupled recognition and localization.

AI at Meta · Published on Nov 20, 2025

GitHub 6.97k arXiv Page
