AWorld Multi-Agent System Hits #1 on GAIA Leaderboard
Abstract
The rapid advancement of large language models (LLMs) has enabled intelligent agents to use a variety of external tools to solve complex real-world problems. However, as agents rely on more tools, they face new challenges. Longer context from multiple sources and the introduction of noisy or irrelevant tool outputs can reduce reliability and accuracy. These issues highlight the need for greater stability in agent-based systems.
To address this, we present a robust Multi-Agent System (MAS) architecture built on the AWorld framework. In our approach, the Execution Agent calls the Guard Agent at key steps to verify and correct the reasoning process. This design helps reduce noise-related errors and improves the robustness of problem-solving.
Through rigorous, controlled experiments on the GAIA test dataset, we show that introducing the Guard Agent significantly improves both the effectiveness and the stability of solutions, compared to a Single Agent System or other standard tool-augmented systems. Our findings demonstrate the practical value of collaborative agent roles for building more reliable and trustworthy intelligent systems.
Multi-Agent System (MAS) Design and Implementation
Approach
- Developed a Multi-Agent System (MAS) based on the AWorld framework, utilizing an "agent as tool" mechanism and introducing a Guard Agent for logical verification.
- Adaptive intervention: The Execution Agent dynamically determines when to invoke additional agents based on system prompts and contextual analysis.
- Logical validation: The Execution Agent initiates the problem-solving process, while the Guard Agent monitors, corrects, and provides reminders regarding the logical process, thereby improving solution accuracy.
- The Guard Agent is implemented using the same foundational model as the Execution Agent (e.g., Gemini 2.5 Pro), ensuring consistency between agents and stronger collaboration; the "agent as tool" wiring is sketched below.
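The "agent as tool" mechanism can be illustrated with a minimal sketch. This is not the actual AWorld API: the names below (`call_llm`, `TOOLS`, `execution_agent`, `guard_agent`) are hypothetical, chosen only to show how the Guard Agent is exposed to the Execution Agent as one more callable tool.

```python
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    # Stub standing in for the shared foundation model (e.g., Gemini 2.5 Pro);
    # a real system would call the model API here.
    return "LOGIC OK"

def guard_agent(reasoning_trace: str, proposed_answer: str) -> str:
    """Guard Agent: the same base model, re-prompted as a logic reviewer."""
    prompt = (
        "You are a reviewer. Check the reasoning below for logical errors, "
        "unsupported leaps, and misread tool outputs. Reply with corrections "
        "or 'LOGIC OK'.\n\n"
        f"Reasoning:\n{reasoning_trace}\n\nProposed answer: {proposed_answer}"
    )
    return call_llm(prompt)

# "Agent as tool": the Guard Agent sits in the tool registry like any other
# tool, and the Execution Agent decides from its system prompt and context
# when to invoke it (adaptive intervention).
TOOLS: Dict[str, Callable[..., str]] = {
    "google_search": lambda query: call_llm(f"search: {query}"),  # placeholder tool
    "guard_agent": guard_agent,
}

def execution_agent(question: str) -> str:
    trace = f"Question: {question}\nStep 1: gather evidence via tools..."
    answer = "placeholder answer"
    # Key step: ask the Guard Agent to verify the logic before finalizing.
    verdict = TOOLS["guard_agent"](trace, answer)
    if verdict != "LOGIC OK":
        trace += f"\nGuard feedback: {verdict}"  # corrected context; retry from here
    return answer

print(execution_agent("example GAIA question"))
```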
Experiments
Problem Set
- 109 questions from the GAIA test set, split by difficulty as Level 1: 56 and Level 2: 53
- Question characteristics:
  - Office-related: Excel, Word, PPT, TXT, code, and download tools
  - Search-related: Google Search, Wikipedia, etc.
- The experimental setup minimizes the influence of external factors (such as browser instability), providing a consistently controlled environment for comparing different agent construction methodologies.
Experimental Version Design
- Gemini 2.5 Pro: Direct question answering by a single Gemini 2.5 Pro model, without tool invocation or agent collaboration.
- Single Agent System (SAS): A single model (Gemini 2.5 Pro) plus tools; the model decides, based on the question and context, whether to use external tools or answer independently.
- Multi-Agent System (MAS): Execution Agent + Guard Agent. Builds on the Single Agent System by adding the Guard Agent as an additional candidate tool; the Execution Agent may invoke it for real-time logic verification during problem-solving.
Experimental Running Settings
- Three independent runs over the 109 tasks for each version, all using Gemini 2.5 Pro with temperature set to 0.1.
- Any task whose answer is invalid due to incorrect formatting is re-run until a valid answer is obtained.
- Each version reports pass@1 accuracy over the 109 questions for each round, plus pass@3 accuracy across all three runs (a computation sketch follows below).
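For reference, the pass@1 and pass@3 figures reported below can be reproduced from per-question correctness as in this minimal sketch. The boolean matrix is placeholder data, not our actual run logs; the population standard deviation is the statistic that matches the reported Pass@1_std values.

```python
import statistics

# results[r][q] is True iff question q was answered correctly in run r
# (3 runs x 109 questions in the real experiment; a tiny made-up matrix
# is used here so the sketch runs end to end).
results = [
    [True, False, True, True],
    [True, True, False, True],
    [False, True, False, True],
]

n = len(results[0])
pass1_per_run = [sum(run) / n for run in results]           # one pass@1 per round
pass1_avg = statistics.mean(pass1_per_run)                  # Pass@1_avg
pass1_std = statistics.pstdev(pass1_per_run)                # Pass@1_std (population std)
pass3 = sum(any(run[q] for run in results) for q in range(n)) / n  # solved in >= 1 run

print(f"pass@1 per round: {pass1_per_run}")
print(f"pass@1 avg={pass1_avg:.4f}, std={pass1_std:.5f}, pass@3={pass3:.4f}")
```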
Experimental Results
| Metric | Gemini 2.5 Pro | SAS | Gemini 2.5 Pro vs SAS | MAS | SAS vs MAS |
|---|---|---|---|---|---|
| Round 1 Pass@1 | 32.11% | 57.80% | | 71.56% | |
| Round 2 Pass@1 | 30.28% | 64.22% | | 65.14% | |
| Round 3 Pass@1 | 32.11% | 65.14% | | 66.97% | |
| Pass@3 | 38.53% | 81.65% | +111.91% | 83.49% | +2.25% |
| Pass@1_avg | 31.50% | 62.39% | +98.06% | 67.89% | +8.82% |
| Pass@1_std | 0.00863 | 0.03265 | +278.33% | 0.02701 | -17.30% |
Key Findings:
Integrating the Guard Agent into the multi-agent setup increases problem-solving accuracy:
- The base model correctly solves an average of 31.5% (pass@1) of the questions using internal knowledge and test-time reasoning alone.
- The Single Agent System, which adds tool usage to the model, significantly boosts accuracy (average pass@1 = 62.39%, nearly a 2x improvement) by expanding the context with real-world data acquisition.
- The MAS version, with the Guard Agent calibrating key solution steps, further increases accuracy (average pass@1 = 67.89%, a relative gain of 8.82% over the Single Agent System; pass@3 = 83.49%, a relative gain of 2.25%).
Incorporating the Guard Agent also enhances stability:
- The base model's pass@1 standard deviation is 0.0086 at temperature 0.1.
- The Single Agent System nearly quadruples the pass@1 standard deviation (+278.33%) due to the uncertainty introduced by external tools.
- The MAS setup, thanks to the Guard Agent's logical constraints, reduces the pass@1 standard deviation to 0.027, a 17.3% reduction relative to the Single Agent System (see the worked calculation below).
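For concreteness, the stability gain is the relative reduction in pass@1 standard deviation, using the values from the table above:

$$
\frac{\sigma_{\text{SAS}} - \sigma_{\text{MAS}}}{\sigma_{\text{SAS}}} = \frac{0.03265 - 0.02701}{0.03265} \approx 17.3\%
$$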
Insights
A Good Q&A Model Does Not Equal a Good Tool User
The base model (Gemini 2.5 Pro) solves a significant portion of GAIA tasks out of the box, indicating extensive relevant knowledge acquired during pretraining. However:
- The model cannot reliably determine, for a given problem, whether to rely solely on internal knowledge versus when to invoke external tools.
- Adding tool access may not preserve previously available (internal) solution paths. For instance, at least one task is solved by the base model under pass@3 but not by the Single Agent System or the MAS version.
Different contexts elicit different modes:
- In plain Q&A mode, the base model answers from internal knowledge (akin to "recitation", a zero-order mode), framed by a question-answering prompt.
- In "agent" mode, the system prompt, tool list, and injected outputs construct a run-time context, making the model prioritize external information while potentially suppressing internal knowledge search (akin to first-order reasoning).
- Most models lack sufficient self-awareness to reliably decide when/which mode to use. Thus, a good Q&A model is not automatically a good tool user.
Although the base model already handles a substantial proportion of questions, stable mechanisms for self-driven mode switching are still lacking. Given the experimental observation that tool-integrated agents dramatically improve accuracy, such agent architectures remain a promising pathway toward generalized intelligent solutions.
Context Optimization and Logical Convergence: The "Second Pair of Eyes" Effect
The introduction of numerous external tools significantly improves problem-solving accuracy, but it also greatly increases context length, placing higher demands on solution stability. Experimental results show that, compared to Gemini 2.5 Pro alone, the pass@1 standard deviation of the Single Agent System nearly quadruples (+278.33%).
Drawing inspiration from the "solver–reviewer" multi-agent paradigm in IMO competitions, our approach enables the Execution Agent to call upon the Guard Agent for review. This process essentially shifts the conversational perspective, optimizing the context. When querying the same underlying model, this mechanism prompts the model to focus on logical details that may have previously been blurred by excessively long context. The Guard Agent then generates better prompts as refreshed context for the Execution Agent, helping to reorient its attention and facilitating convergence toward the correct answer. Experiments indicate that introducing the Guard Agent leads to a 17.3% reduction in pass@1 standard deviation compared to the Single Agent System.
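The perspective shift can be made concrete with a short sketch (hypothetical prompts and helper names, not the actual AWorld implementation): the reviewer sees a condensed trace instead of the full tool-output history, and its feedback re-enters the solver's loop as compact, refreshed context.

```python
from typing import Callable

def condense(trace: list[str], max_steps: int = 5) -> str:
    """Keep only the most recent reasoning steps, dropping bulky tool dumps,
    so the reviewer attends to the logic rather than the long raw context."""
    return "\n".join(trace[-max_steps:])

def review(model: Callable[[str], str], trace: list[str], answer: str) -> str:
    """Guard Agent call: the same underlying model, reviewer framing, fresh context."""
    return model(
        "Review this reasoning for logical gaps or misused evidence. "
        "If it is sound, reply 'OK'; otherwise give a corrected next step.\n\n"
        f"{condense(trace)}\n\nProposed answer: {answer}"
    )

def solve_with_guard(model: Callable[[str], str], question: str, max_rounds: int = 3) -> str:
    trace = [f"Question: {question}"]
    answer = model("Solve step by step: " + question)
    for _ in range(max_rounds):
        feedback = review(model, trace, answer)
        if feedback.strip() == "OK":
            break  # the reviewer found no issues; keep the answer
        # The reviewer's feedback becomes refreshed context that reorients
        # the solver's attention toward previously blurred logical details.
        trace.append(f"Reviewer: {feedback}")
        answer = model(condense(trace) + "\nRevise your answer accordingly.")
    return answer

# `model` is any str -> str callable wrapping the shared LLM (e.g., Gemini 2.5 Pro).
print(solve_with_guard(lambda p: "OK" if p.startswith("Review") else "draft answer", "toy question"))
```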
Potential Improvements
The current experimental version serves as a rapid technical validation. There is significant room for enhanced capabilities—for example, enabling the Guard Agent to independently call other tools (such as search engines) for higher-quality cross-validation and further improved stability.
Further research and development can also focus on enhancing the model's capacity for autonomous mode switching. With advances in model architecture, self-reflection mechanisms, and adaptive prompting strategies, future iterations of such systems may be able to more reliably determine when to leverage internal knowledge and when to invoke external tools. This progress could enable AI agents to achieve even greater flexibility, efficiency, and accuracy across a broad spectrum of complex tasks.
Authors
Zhitian Xie, Qintong Wu, Chengyue Yu, Chenyi Zhuang, Jinjie Gu
AWorld Team, Inclusion AI
GitHub Repository
This technical report presents our AWorld-based Multi-Agent System, demonstrating the practical value of collaborative agent roles for building more reliable and trustworthy intelligent systems.


