AWorld Multi-Agent System Hits #1 on GAIA Leaderboard
Abstract
The rapid advancement of large language models (LLMs) has enabled intelligent agents to use a variety of external tools to solve complex real-world problems. However, as agents rely on more tools, they face new challenges. Longer context from multiple sources and the introduction of noisy or irrelevant tool outputs can reduce reliability and accuracy. These issues highlight the need for greater stability in agent-based systems.
To address this, we present a robust Multi-Agent System (MAS) architecture built on the AWorld framework. In our approach, the Execution Agent calls the Guard Agent at key steps to verify and correct the reasoning process. This design helps reduce noise-related errors and improves the robustness of problem-solving.
Through rigorous, controlled experiments on the GAIA test dataset, we show that introducing the Guard Agent significantly improves both the effectiveness and the stability of solutions, compared to a Single Agent System or other standard tool-augmented systems. Our findings demonstrate the practical value of collaborative agent roles for building more reliable and trustworthy intelligent systems.
Multi-Agent System (MAS) Design and Implementation
Approach
- Developed a Multi-Agent System (MAS) based on the AWorld framework, utilizing an "agent as tool" mechanism and introducing a Guard Agent for logical verification.
- Adaptive intervention: The Execution Agent dynamically determines when to invoke additional agents based on system prompts and contextual analysis.
- Logical validation: The Execution Agent initiates the problem-solving process, while the Guard Agent monitors, corrects, and provides reminders regarding the logical process, thereby improving solution accuracy.
- The Guard Agent is implemented using the same foundational model as the Execution Agent (e.g., Gemini 2.5 Pro), ensuring consistency between agents and stronger collaboration; the "agent as tool" wiring is sketched below.
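The "agent as tool" mechanism can be illustrated with a minimal sketch. This is not the actual AWorld API: the names below (`call_llm`, `TOOLS`, `execution_agent`, `guard_agent`) are hypothetical, chosen only to show how the Guard Agent is exposed to the Execution Agent as one more callable tool.

```python
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    # Stub standing in for the shared foundation model (e.g., Gemini 2.5 Pro);
    # a real system would call the model API here.
    return "LOGIC OK"

def guard_agent(reasoning_trace: str, proposed_answer: str) -> str:
    """Guard Agent: the same base model, re-prompted as a logic reviewer."""
    prompt = (
        "You are a reviewer. Check the reasoning below for logical errors, "
        "unsupported leaps, and misread tool outputs. Reply with corrections "
        "or 'LOGIC OK'.\n\n"
        f"Reasoning:\n{reasoning_trace}\n\nProposed answer: {proposed_answer}"
    )
    return call_llm(prompt)

# "Agent as tool": the Guard Agent sits in the tool registry like any other
# tool, and the Execution Agent decides from its system prompt and context
# when to invoke it (adaptive intervention).
TOOLS: Dict[str, Callable[..., str]] = {
    "google_search": lambda query: call_llm(f"search: {query}"),  # placeholder tool
    "guard_agent": guard_agent,
}

def execution_agent(question: str) -> str:
    trace = f"Question: {question}\nStep 1: gather evidence via tools..."
    answer = "placeholder answer"
    # Key step: ask the Guard Agent to verify the logic before finalizing.
    verdict = TOOLS["guard_agent"](trace, answer)
    if verdict != "LOGIC OK":
        trace += f"\nGuard feedback: {verdict}"  # corrected context; retry from here
    return answer

print(execution_agent("example GAIA question"))
```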
Experiments
Problem Set
- 109 questions from the GAIA test set, split by difficulty as Level 1: 56 and Level 2: 53
- Question characteristics:
  - Office-related: Excel, Word, PPT, TXT, code, and download tools
  - Search-related: Google Search, Wikipedia, etc.
- The experimental setup minimizes the influence of external factors (such as browser instability), providing a consistently controlled environment for comparing different agent construction methodologies.
Experimental Version Design
- Gemini 2.5 Pro: Direct question answering by a single Gemini 2.5 Pro model, without tool invocation or agent collaboration.
- Single Agent System (SAS): A single model (Gemini 2.5 Pro) plus tools; the model decides, based on the question and context, whether to use external tools or answer independently.
- Multi-Agent System (MAS): Execution Agent + Guard Agent. Builds on the Single Agent System by adding the Guard Agent as an additional candidate tool; the Execution Agent may invoke it for real-time logic verification during problem-solving.
Experimental Running Settings
- Three independent runs over the 109 tasks for each version, all using Gemini 2.5 Pro with temperature set to 0.1.
- Any task whose answer is invalid due to incorrect formatting is re-run until a valid answer is obtained.
- Each version reports pass@1 accuracy over the 109 questions for each round, plus pass@3 accuracy across all three runs (a computation sketch follows below).
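For reference, the pass@1 and pass@3 figures reported below can be reproduced from per-question correctness as in this minimal sketch. The boolean matrix is placeholder data, not our actual run logs; the population standard deviation is the statistic that matches the reported Pass@1_std values.

```python
import statistics

# results[r][q] is True iff question q was answered correctly in run r
# (3 runs x 109 questions in the real experiment; a tiny made-up matrix
# is used here so the sketch runs end to end).
results = [
    [True, False, True, True],
    [True, True, False, True],
    [False, True, False, True],
]

n = len(results[0])
pass1_per_run = [sum(run) / n for run in results]           # one pass@1 per round
pass1_avg = statistics.mean(pass1_per_run)                  # Pass@1_avg
pass1_std = statistics.pstdev(pass1_per_run)                # Pass@1_std (population std)
pass3 = sum(any(run[q] for run in results) for q in range(n)) / n  # solved in >= 1 run

print(f"pass@1 per round: {pass1_per_run}")
print(f"pass@1 avg={pass1_avg:.4f}, std={pass1_std:.5f}, pass@3={pass3:.4f}")
```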
Experimental Results
| Metric | Gemini 2.5 Pro | SAS | Gemini 2.5 Pro vs SAS | MAS | SAS vs MAS |
|---|---|---|---|---|---|
| Round 1 Pass@1 | 32.11% | 57.80% | | 71.56% | |
| Round 2 Pass@1 | 30.28% | 64.22% | | 65.14% | |
| Round 3 Pass@1 | 32.11% | 65.14% | | 66.97% | |
| Pass@3 | 38.53% | 81.65% | +111.91% | 83.49% | +2.25% |
| Pass@1_avg | 31.50% | 62.39% | +98.06% | 67.89% | +8.82% |
| Pass@1_std | 0.00863 | 0.03265 | +278.33% | 0.02701 | -17.30% |
Key Findings:
Integrating the Guard Agent into the multi-agent setup increases problem-solving accuracy:
- The base model correctly solves an average of 31.5% (pass@1) of the questions using internal knowledge and test-time reasoning alone.
- The Single Agent System, which adds tool usage to the model, significantly boosts accuracy (average pass@1 = 62.39%, nearly a 2x improvement) by expanding the context with real-world data acquisition.
- The MAS version, with the Guard Agent calibrating key solution steps, further increases accuracy (average pass@1 = 67.89%, a relative gain of 8.82% over the Single Agent System; pass@3 = 83.49%, a relative gain of 2.25%).
Incorporating the Guard Agent also enhances stability:
- The base model's pass@1 standard deviation is 0.0086 at temperature 0.1.
- The Single Agent System nearly quadruples the pass@1 standard deviation (+278.33%) due to the uncertainty introduced by external tools.
- The MAS setup, thanks to the Guard Agent's logical constraints, reduces the pass@1 standard deviation to 0.027, a 17.3% reduction relative to the Single Agent System (see the worked calculation below).
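For concreteness, the stability gain is the relative reduction in pass@1 standard deviation, using the values from the table above:

$$
\frac{\sigma_{\text{SAS}} - \sigma_{\text{MAS}}}{\sigma_{\text{SAS}}} = \frac{0.03265 - 0.02701}{0.03265} \approx 17.3\%
$$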
Insights
A Good Q&A Model Does Not Equal a Good Tool User
The base model (Gemini 2.5 Pro) solves a significant portion of GAIA tasks out of the box, indicating extensive relevant knowledge acquired during pretraining. However:
- The model cannot reliably determine, for a given problem, whether to rely solely on internal knowledge versus when to invoke external tools.
- Adding tool access may not preserve previously available (internal) solution paths. For instance, at least one task is solved by the base model under pass@3 but not by the Single Agent System or the MAS version.
Different contexts elicit different modes:
- In plain Q&A mode, the base model answers from internal knowledge (akin to "recitation", a zero-order mode), framed by a question-answering prompt.
- In "agent" mode, the system prompt, tool list, and injected outputs construct a run-time context, making the model prioritize external information while potentially suppressing internal knowledge search (akin to first-order reasoning).
- Most models lack sufficient self-awareness to reliably decide when/which mode to use. Thus, a good Q&A model is not automatically a good tool user.
Although the base model already handles a substantial proportion of questions, stable mechanisms for self-driven mode switching are still lacking. Given the experimental observation that tool-integrated agents dramatically improve accuracy, such agent architectures remain a promising pathway toward generalized intelligent solutions.
Context Optimization and Logical Convergence: The "Second Pair of Eyes" Effect
The introduction of numerous external tools significantly improves problem-solving accuracy, but it also greatly increases context length, placing higher demands on solution stability. Experimental results show that, compared to Gemini 2.5 Pro alone, the pass@1 standard deviation of the Single Agent System nearly quadruples (+278.33%).
Drawing inspiration from the "solver–reviewer" multi-agent paradigm in IMO competitions, our approach enables the Execution Agent to call upon the Guard Agent for review. This process essentially shifts the conversational perspective, optimizing the context. When querying the same underlying model, this mechanism prompts the model to focus on logical details that may have previously been blurred by excessively long context. The Guard Agent then generates better prompts as refreshed context for the Execution Agent, helping to reorient its attention and facilitating convergence toward the correct answer. Experiments indicate that introducing the Guard Agent leads to a 17.3% reduction in pass@1 standard deviation compared to the Single Agent System.
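The perspective shift can be made concrete with a short sketch (hypothetical prompts and helper names, not the actual AWorld implementation): the reviewer sees a condensed trace instead of the full tool-output history, and its feedback re-enters the solver's loop as compact, refreshed context.

```python
from typing import Callable

def condense(trace: list[str], max_steps: int = 5) -> str:
    """Keep only the most recent reasoning steps, dropping bulky tool dumps,
    so the reviewer attends to the logic rather than the long raw context."""
    return "\n".join(trace[-max_steps:])

def review(model: Callable[[str], str], trace: list[str], answer: str) -> str:
    """Guard Agent call: the same underlying model, reviewer framing, fresh context."""
    return model(
        "Review this reasoning for logical gaps or misused evidence. "
        "If it is sound, reply 'OK'; otherwise give a corrected next step.\n\n"
        f"{condense(trace)}\n\nProposed answer: {answer}"
    )

def solve_with_guard(model: Callable[[str], str], question: str, max_rounds: int = 3) -> str:
    trace = [f"Question: {question}"]
    answer = model("Solve step by step: " + question)
    for _ in range(max_rounds):
        feedback = review(model, trace, answer)
        if feedback.strip() == "OK":
            break  # the reviewer found no issues; keep the answer
        # The reviewer's feedback becomes refreshed context that reorients
        # the solver's attention toward previously blurred logical details.
        trace.append(f"Reviewer: {feedback}")
        answer = model(condense(trace) + "\nRevise your answer accordingly.")
    return answer

# `model` is any str -> str callable wrapping the shared LLM (e.g., Gemini 2.5 Pro).
print(solve_with_guard(lambda p: "OK" if p.startswith("Review") else "draft answer", "toy question"))
```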
Potential Improvements
The current experimental version serves as a rapid technical validation. There is significant room for enhanced capabilities—for example, enabling the Guard Agent to independently call other tools (such as search engines) for higher-quality cross-validation and further improved stability.
Further research and development can also focus on enhancing the model's capacity for autonomous mode switching. With advances in model architecture, self-reflection mechanisms, and adaptive prompting strategies, future iterations of such systems may be able to more reliably determine when to leverage internal knowledge and when to invoke external tools. This progress could enable AI agents to achieve even greater flexibility, efficiency, and accuracy across a broad spectrum of complex tasks.
Authors
Zhitian Xie, Qintong Wu, Chengyue Yu, Chenyi Zhuang, Jinjie Gu
AWorld Team, Inclusion AI
GitHub Repository
This technical report presents our AWorld-based Multi-Agent System, demonstrating the practical value of collaborative agent roles for building more reliable and trustworthy intelligent systems.


