Sunday, May 11, 2025

Multi vs Single-agent: Navigating the Realities of Agentic Systems

In March 2025, UC Berkeley released a paper that, IMO, has not been circulated enough. But first, let’s go back to 2024. Everyone was building their first agentic system. I did. You probably did too.

We sorted these systems into different maturity levels:

Level 1: Knowledge Retrieval - The system queries knowledge sources to fulfill its purpose, but does not perform workflows or tool execution.

Level 2: Workflow Routing - The system follows predefined LLM routing logic to run simple workflows, including tool execution and knowledge retrieval.

Level 3: Tool Selection Agent - The agent chooses and executes tools from an available portfolio according to specific instructions (a minimal sketch of this level follows the list).

Level 4: Multi-Step Agent - The agent combines tools and multi-step workflows based on context and can act autonomously on its judgment.

Level 5: Multi-Agent Orchestrator - The agents independently determine workflows and flexibly invoke other agents to accomplish their purpose.
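
To make Level 3 concrete, here is a minimal Python sketch of a tool-selection agent. Everything in it is hypothetical: call_llm stands in for a real model call, and the two tools are toys.

```python
import json

def check_weather(city: str) -> str:          # hypothetical tool
    return f"Sunny in {city}"

def search_hotels(city: str) -> str:          # hypothetical tool
    return f"3 hotels found in {city}"

TOOLS = {"check_weather": check_weather, "search_hotels": search_hotels}

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; a real system would send the tool
    portfolio to the model and parse its structured tool-call response."""
    return json.dumps({"tool": "check_weather", "args": {"city": "Hanoi"}})

def run_agent(user_request: str) -> str:
    # The model picks ONE tool per instruction; no multi-step planning yet.
    choice = json.loads(call_llm(user_request))
    tool = TOOLS[choice["tool"]]
    return tool(**choice["args"])

print(run_agent("What's the weather like in Hanoi?"))
```

The defining trait is that the model picks the tool, while the surrounding control flow stays a single fixed step; Levels 4 and 5 progressively hand that control flow to the model itself.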

You might have seen different lists, yet I bet that no matter how many others you have seen, there is a common element: they all depict a multi-agent system as the highest level of sophistication. This approach promises better results through specialization and collaboration.

The elegant theory of multi-agent systems

The multi-agent collaboration model offers several theoretically compelling advantages over single-agent approaches:

Specialization and Expertise: Each agent can be tailored to a specific domain or subtask, leveraging unique strengths. One agent might excel at generating code while another specializes in testing or reviewing it.

Distributed Problem-Solving: Complex tasks can be broken into smaller, manageable pieces that agents tackle in parallel or in sequence. By splitting a problem (e.g., travel planning into weather checking, hotel search, and route optimization), the system can solve parts independently and more efficiently (a toy fan-out sketch follows this list).

Built-in Error Correction: Multiple agents provide a form of cross-checking. If one agent produces a faulty output, a supervisor or peer agent might detect and correct it, improving reliability.

Scalability and Extensibility: As problem scope grows, it's often easier to add new specialized agents than to retrain or overly complicate a single agent.
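
As a toy illustration of the distributed problem-solving point, here is a sketch with three hypothetical specialist agents fanned out concurrently; the asyncio.sleep calls stand in for real model or tool calls.

```python
import asyncio

async def weather_agent(city: str) -> str:
    await asyncio.sleep(0.1)                  # stands in for a real LLM/tool call
    return f"weather({city}): clear skies"

async def hotel_agent(city: str) -> str:
    await asyncio.sleep(0.1)
    return f"hotels({city}): 2 options under budget"

async def route_agent(city: str) -> str:
    await asyncio.sleep(0.1)
    return f"route({city}): 2h drive via coastal road"

async def plan_trip(city: str) -> str:
    # The sub-problems are independent, so they can run in parallel;
    # the hard part in practice is merging and reconciling the answers.
    results = await asyncio.gather(
        weather_agent(city), hotel_agent(city), route_agent(city)
    )
    return "\n".join(results)

print(asyncio.run(plan_trip("Da Nang")))
```

The fan-out itself is trivial; as the Berkeley findings below suggest, the hard part is everything around it: merging, reconciling, and verifying what comes back.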

The theory maps beautifully onto how we understand human teams to work: specialized individuals collaborating under coordination often outperform even the most talented generalist. It is every tech CEO's wet dream: your agents talk to my agents and figure out what to do.

This model has shown remarkable success in some industrial-scale applications.

Then March 2025 landed with a thud!

The Berkeley reality check

A Berkeley-led team assessed the state of current multi-agent system implementations. They ran seven popular multi-agent frameworks across more than 200 tasks and identified 14 unique failure modes, organized into three categories: specification and system design failures, inter-agent misalignment, and task verification and termination.

The paper Why Do Multi-Agent LLM Systems Fail? showed that some state-of-the-art systems achieved only 33.33% correctness on seemingly straightforward tasks like implementing Tic-Tac-Toe or Chess games. For AI, this is the equivalent of getting an F.

The paper provides empirical evidence for what many of us have experienced: today's multi-agent implementations are hard to get right. We can't seem to realize the elegant theory outside a demo, with many failures stemming from coordination and communication issues rather than from limitations in the underlying LLMs themselves. And if I hadn't compiled the failure-mode taxonomy table myself, I would have thought it was a summary of my university assignments.

The disconnect between PR and reality

Looking back, it's easy to see how the industry became so enthusiastic about multi-agent approaches:

  1. Big-cloud benchmarks looked impressive. AWS reported that letting Bedrock agents cooperate yielded "marked improvements" in internal task-success and accuracy metrics for complex workflows. I am sure AWS has no hidden agenda here. It is not like they are in a race with anyone, right? Right?
  2. Flagship research prototypes beat single LLMs on niche benchmarks. HuggingGPT and ChatDev each reported better aggregate scores than a single GPT-4 baseline on their chosen tasks. In the same way the show Are You Smarter Than A 5th Grader works. But to be frank, the same argument can be used for the Berkeley paper.
  3. Thought-leaders said the same. Andrew Ng's 2024 "Agentic Design Patterns" talks frame multi-agent collaboration as the pattern that "often beats one big agent" on hard problems.
  4. An analogy we intuitively get. Divide-and-conquer, role specialization, debate for error-catching: all map neatly onto LLM quirks (limited context, hallucinations, etc.). But just as with humans, coordination overhead grows fast with agent count: with n agents there are n(n-1)/2 possible pairwise channels, so five agents already means ten conversations to keep coherent.
  5. Early adopters were vocal. Start-ups demoed agent-teams creating slide decks, marketing campaigns, and code bases - with little visible human help - which looked like higher autonomy. Until reality's ugly little details turned this bed of roses into a can of worms.

The Berkeley paper exposed these challenges, but it also pointed toward potential solutions.

Enter A2A: Plumbing made easy

Google's Agent-to-Agent (A2A) protocol arrived in April 2025 with backing from more than 50 launch partners, including Salesforce, Atlassian, LangChain, and MongoDB. While the spec is still a draft, the participation signal is strong: the industry finally has a common transport, discovery, and task-lifecycle layer for autonomous agents.

A2A directly targets three of the 14 failure modes identified in the Berkeley audit (a wire-level sketch follows the list):

  1. Role/Specification Issues - By standardizing the agent registry, A2A creates clear declarations of capabilities, skills, and endpoints. This addresses failures to follow task requirements and agent roles.
  2. Context Loss Problems - A2A's message chunking and streaming prevent critical information from being lost during handoffs.
  3. Communication Infrastructure - The HTTP + JSON-RPC/SSE protocol with a canonical task lifecycle provides consistent, reliable agent communication.
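
For a feel of the plumbing, here is a rough client-side sketch in Python. The endpoint is hypothetical, and the shapes (/.well-known/agent.json, the tasks/send method, message.parts) follow my reading of the April 2025 draft spec, which may still change.

```python
import uuid
import requests

BASE = "https://agents.example.com"  # hypothetical remote agent

# 1. Discovery: the Agent Card declares capabilities, skills, and endpoint.
card = requests.get(f"{BASE}/.well-known/agent.json", timeout=10).json()
print(card["name"], [s["id"] for s in card.get("skills", [])])

# 2. Task lifecycle: submit a task with the canonical JSON-RPC envelope.
#    Method and field names track the draft spec and are not guaranteed stable.
payload = {
    "jsonrpc": "2.0",
    "id": str(uuid.uuid4()),
    "method": "tasks/send",
    "params": {
        "id": str(uuid.uuid4()),          # task id
        "message": {
            "role": "user",
            "parts": [{"type": "text", "text": "Source three candidates"}],
        },
    },
}
resp = requests.post(card["url"], json=payload, timeout=30).json()
print(resp["result"]["status"]["state"])   # e.g. "submitted" -> "completed"
```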

Several vendors have begun piloting A2A implementations, with promising but still preliminary results. However, it's important to note that quantitative proof remains scarce. None of these vendors has released side-by-side benchmark tables yet, only directional statements or blog interviews. Google's own launch blog shows a candidate-sourcing workflow where one agent hires multiple remote agents, but provides no timing or accuracy metrics. I won't include names here because I believe we established earlier that vendor tests can be unpublished, cherry-picked, and padded with proprietary orchestration code.

In other words, A2A is to agents what MCP is to tool use today. And just as supporting MCP doesn't make your tools' code any better, A2A doesn't make agents smarter.

What A2A Doesn't Solve

While something like A2A will fix the failures stemming from the lack of a protocol, it cannot fix issues that do not share the same root cause.

  • LLM-inherent limitations - Individual agent hallucinations, reasoning errors, and misunderstandings. Just LLMs being LLMs.
  • Verification gaps - The lack of formal verification procedures, cross-checking mechanisms, or voting systems (a crude version is sketched after this list). Without verification, you're essentially trusting an AI that thinks 2+2=5 to check the math of another AI that thinks 2+2=banana.
  • Orchestration intelligence - The supervisory logic for retry strategies, error recovery, and termination decisions. Just the other day, Claude was downloading my blog posts, hit a token limit, retried, hit the limit again, and looped forever.
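
None of this is impossible to mitigate crudely; the point is that the mitigation has to live in your code, not in the protocol. Here is a minimal sketch of two such guardrails, with a hypothetical flaky_agent standing in for a real model call: a majority vote for cross-checking, and a retry budget so a failing step terminates instead of looping forever.

```python
import random
from collections import Counter

def flaky_agent(question: str) -> str:
    """Stand-in for an LLM agent that is right most, not all, of the time."""
    return random.choice(["4", "4", "4", "5", "banana"])  # ignores `question`

def majority_vote(question: str, n: int = 5) -> str:
    # Cross-check: sample several independent answers and keep the mode.
    answers = [flaky_agent(question) for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    if count <= n // 2:
        raise ValueError(f"no majority among {answers}")
    return answer

def run_with_budget(question: str, max_attempts: int = 3) -> str:
    # Termination logic: a hard attempt ceiling, not "retry until it works".
    for attempt in range(1, max_attempts + 1):
        try:
            return majority_vote(question)
        except ValueError as err:
            print(f"attempt {attempt} failed: {err}")
    raise RuntimeError("giving up: verification never converged")

print(run_with_budget("What is 2 + 2?"))
```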

Those three categories cover 11 of the 14 failure modes. They require additional innovation beyond standardized communication protocols: better verification agents, formal verification systems, or improved orchestration frameworks. I would love to go deeper into the innovation layers required to solve the "cognitive" failures of multi-agent models in another post.

Conclusion

Multi-agent systems are not the universal solution for today's problems (yet). Some are here and working, but only at scales that no single-agent approach can reach. These systems deliver real benefits, but at a substantial cost, one that can easily exceed an organization's capacity to invest or to find people with the relevant expertise.

Rather than viewing multi-agent as the holy grail to be sought at all costs, we should approach it with careful consideration: an option to explore when simpler approaches fail. The claimed benefits rarely justify the implementation complexity for most everyday use cases, particularly when a well-designed single-agent system with appropriate guardrails can deliver acceptable results at a fraction of the engineering cost.

Still, the conversation around multi-agent systems will keep resurfacing, and with increasing frequency, especially in light of the arrival of protocols like A2A. MCP is still the hype for now; organizations are busy integrating MCP servers and writing their own before turning their attention to the next big thing. A2A could be that. Better agent alignment could also be that. Or it may take a new cognitive paradigm to make agents smarter. Things will probably need another six months to unfold, about as long as it has been since the MCP announcement.

Either way, what's clear is that agentic systems still have a long way to go before they reach their theoretical performance ceiling.