NexaAI/qwen3vl-30B-A3B-mlx
Quickstart
Run directly with the nexa-sdk CLI:
nexa infer NexaAI/qwen3vl-30B-A3B-mlx
Note: You need at least 64 GB of RAM on your Mac to run this model.
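The memory requirement can be checked before launching. The following is a minimal sketch, not part of nexa-sdk: the helper names (`total_ram_bytes`, `has_enough_ram`, `build_command`, `run_if_possible`) are hypothetical, and it assumes that "64 GB" means 64 GiB of physical memory and that the `nexa` CLI is already on your PATH.

```python
import os
import subprocess

REQUIRED_GIB = 64  # from the note above; treating "64 GB" as 64 GiB is an assumption


def total_ram_bytes() -> int:
    """Total physical memory, read through POSIX sysconf (macOS and Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def has_enough_ram(total_bytes: int, required_gib: int = REQUIRED_GIB) -> bool:
    """True if the machine meets the stated minimum."""
    return total_bytes >= required_gib * 1024**3


def build_command(model: str = "NexaAI/qwen3vl-30B-A3B-mlx") -> list[str]:
    """The same command shown in the Quickstart, as an argument list."""
    return ["nexa", "infer", model]


def run_if_possible() -> None:
    """Launch the model only when the RAM check passes (hypothetical wrapper)."""
    if has_enough_ram(total_ram_bytes()):
        subprocess.run(build_command(), check=True)
    else:
        print(f"Less than {REQUIRED_GIB} GiB of RAM detected; not launching.")
```

Calling `run_if_possible()` runs the Quickstart command only on a machine that meets the stated minimum.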
Model Overview
Qwen3-VL-30B-A3B-Instruct is a vision-language model from the Qwen3 series, offering advanced reasoning, spatial perception, long-context understanding, and tight integration of text and visual data. In Qwen's naming, "30B-A3B" denotes an instruct-tuned Mixture-of-Experts model with about 30B total parameters, of which roughly 3B are activated per token.
Key Features
- Visual Agent Capabilities: Understands and interacts with GUIs, software tools, and system elements for agentic task automation.
- Visual Coding Generation: Converts images or video layouts into HTML, CSS, JS, or diagram formats such as Draw.io.
- Spatial & Temporal Reasoning: Handles complex spatial tasks (2D/3D object grounding, occlusion) and aligns language with video events.
- Multimodal Reasoning: Targets STEM, math, and logic tasks with causal, evidence-based answers across text, image, and video inputs.
- 256K+ Context Length: Handles very long documents and hours of video input with second-level indexing and full recall.
- High-Performance OCR: Recognizes 32 languages, including ancient scripts and scientific notation, and performs well on low-light or blurry images.
- Multilingual & Instruction Following: Supports over 100 languages with multilingual instruction tuning and translation.
Architecture Details
Model Type: Vision-Language Causal Transformer
Architecture Enhancements:
- Interleaved-MRoPE: Improved positional embeddings for long-horizon vision tasks.
- DeepStack: Multi-level ViT feature fusion for fine-grained alignment.
- Text-Timestamp Alignment: Enhanced video temporal localization.
Context Length: Up to 256K tokens (expandable to 1M)
Model Size: 30B total parameters
Architecture: MoE (Mixture of Experts), with about 3B parameters activated per token
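To put the parameter count next to the 64 GB RAM note, here is a back-of-the-envelope estimate of weight memory at a few precisions. It is a rough sketch only: the quantization used by this MLX build is not stated in this card, so the bit widths below are illustrative assumptions, and the figures cover weights alone (no KV cache, vision encoder activations, or OS overhead). With an MoE model, all experts must be resident in memory even though only about 3B parameters are active per token.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal GB: params * bits / 8 bytes."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 30e9  # "Model Size: 30B parameters"

# Bit widths are illustrative assumptions, not the precision of this repo.
for bits in (4, 8, 16):
    print(f"{bits}-bit weights: ~{weight_footprint_gb(TOTAL_PARAMS, bits):.0f} GB")
# prints:
# 4-bit weights: ~15 GB
# 8-bit weights: ~30 GB
# 16-bit weights: ~60 GB
```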