Thursday, May 29, 2025

LLM Agents Comparison

I was designing an AI-powered workflow for Product Management. The flow looks somewhat like this.


The details of the flow are largely not relevant to the topic of this post. The important gist is: it is an end-to-end flow too big to fit in most modern LLMs’ context window. Since the flow has to be broken down into sequential phases, it lends itself nicely to a multi-agent setup, with intermediate artifacts carrying context from one agent to the next. It was the AI-as-Judge part that reminded me that, just as a developer cannot reliably test their own work, an agent should not be trusted to judge its own work.

"Act as a critical Product Requirement Adjudicator and evaluate the following product requirement for clarity, completeness, user value, and potential risks. [Paste your product requirement here]. Provide specific, actionable feedback to help me improve its robustness and effectiveness."

(A) I have a multi-agent setup. (B) I have a legitimate reason to introduce more than one LLM to reduce model blind spots. The next logical step is to decide which model makes the best agent.
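To make that concrete, here is a minimal sketch of the shape of such a pipeline: phase agents run in sequence, hand off intermediate artifacts, and a judge backed by a different model grades the output so no agent marks its own homework. The call_llm helper and the model names are placeholders, not any vendor's actual SDK.

```python
from dataclasses import dataclass

def call_llm(model: str, prompt: str) -> str:
    # Placeholder: wire this to whatever SDK you actually use (OpenAI, Anthropic, Gemini).
    return f"[{model} output for: {prompt[:60]}...]"

@dataclass
class Artifact:
    """Intermediate artifact handed from one phase agent to the next."""
    phase: str
    content: str

def run_phase(model: str, instruction: str, upstream: Artifact | None) -> Artifact:
    context = upstream.content if upstream else ""
    draft = call_llm(model, f"{instruction}\n\nContext:\n{context}")
    return Artifact(phase=instruction[:40], content=draft)

def judge(model: str, artifact: Artifact) -> str:
    # The adjudicator prompt from above; crucially, `model` is not the model
    # that wrote the artifact, so it is not grading its own work.
    prompt = (
        "Act as a critical Product Requirement Adjudicator and evaluate the "
        "following for clarity, completeness, user value, and risks:\n\n"
        + artifact.content
    )
    return call_llm(model, prompt)

# Hypothetical assignment: one model drafts, a different one judges.
requirements = run_phase("claude", "Draft the product requirement", None)
verdict = judge("gemini", requirements)
```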

An empirical comparison

While composing the aforementioned workflow, I fed the same prompts to Claude 3.7, Gemini 2.5 Pro (preview), and ChatGPT o3, all in their native environments (desktop app for Claude and ChatGPT, web page for Gemini). I first asked each agent to passively consume my notes while waiting for my cue to participate. In each note, I wrote one use case where I thought GenAI would benefit a product manager's productivity. When an idea came from the Internet, I provided a link. Once my notes were exhausted, I asked the agent to perform a web search to find other novel uses of GenAI in product management that I might have missed. Finally, with the combined input from me and the web search, I asked the agent to compose a comprehensive report. I ended up with three conversations and three reports. Below are my observations on the pros and cons of each. An asterisk (*) marks the clear winner.

LLM as a thought partner

This is probably the most subjective section of this subjective comparison. Feel free to let me know if you disagree.

ChatGPT:

I didn’t like ChatGPT as a partner. Its feedback sometimes focused on form over substance: whether or not I followed it made no difference to my work. ChatGPT was also stubborn and prone to hallucination, a dangerous combination. It kept trying to convince me that Substack supports some form of table, first suggesting a markdown format, then insisting on a non-existent HTML mode.

Claude *:

I dig Claude (with certain caveats explored below). Its feedback had a good hit-to-miss ratio. The future-proof section at the end of this post was its suggestion (the writing mine). There seemed to be fewer hallucinations too. Empirically speaking, I had fewer “this can’t be right” moments.

Gemini:

Gemini’s feedback was not as vain as ChatGPT’s, but it was limited in creativity; had I followed it, I doubt my work would have turned out drastically different. Gemini gave me a glimpse into what a good LLM can do, yet left me desiring more.

Web search

ChatGPT *:

The web search was genuinely good. It started from my initial direction and covered good ground across internet knowledge. The output contributed new insights that didn’t exist in my initial notes while managing to stay on course with the topic.

Claude:

The web search seemed rather limited. Instead of surfacing novel uses, it reinforced earlier discussions with new data points. More of the same: good for validation but bad for new insights. Furthermore, Claude’s internal knowledge is cut off at October 2024; it could list half a dozen expansions of the acronym MCP, except Model Context Protocol!

Gemini *:

Gemini’s web search was good: it performed on par with ChatGPT’s, and was even faster.

Deep thinking

Aka extended thinking. Aka deep research. You know it when you see it.

ChatGPT *:

Before entering the deep thinking mode, it would ask clarifying questions. These questions proved very useful in steering the agent in the right direction and saving time. The agent would show the thinking process while it was happening, but hide it once done. Fun fact: ChatGPT used to hide this thinking process completely until DeepSeek exposed its own, “nudging” the OG reasoning model to follow suit.

Claude:

Extended thinking was still in preview. It didn’t ask for clarification; instead, uncharacteristically (for Claude), it jumped head-first into thinking mode. It also scanned through an excessive number of sources. Compared to the other two agents, it didn’t seem to know when to stop, and in the process it consumed an excessive number of tokens. This, however, is very much characteristic of Claude. I am on the Team license, and my daily quota was consistently depleted after 3-4 questions. Claude did show the thinking process and kept it visible afterward.

Gemini:

Deep research did not ask clarifying questions; it showed the research plan and asked if I would like to modify it. After confirmation, it launched the deep research mode. Arguably not as friendly as ChatGPT’s approach, but good for power users. Gemini’s thinking process was the most detailed and displayed in the friendliest UI of the bunch. However, it was very slow. Ironically, Gemini didn’t handle disconnection very well, even though at that speed, deep research should be an async task.

Context window

ChatGPT: 128-200k

The agent implements some sort of rolling context window where earlier chat is summarized into the agent’s memory. The tokens never ran out during my sessions, but this approach is well known for collapsing context: the summary misses earlier details, and the agent starts to fabricate facts to fill in the gaps.
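My mental model of how such a rolling window behaves is sketched below. This is an illustration of the failure mode, not OpenAI's actual implementation; the token counter and summarize step are crude stand-ins.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer; roughly 4 characters per token.
    return len(text) // 4

def summarize(messages: list[str]) -> str:
    # In the real system this would be another LLM call; details are lost here,
    # which is exactly where "context collapse" starts.
    return "SUMMARY OF: " + " | ".join(m[:30] for m in messages)

def build_context(history: list[str], budget: int = 128_000) -> list[str]:
    """Keep recent turns verbatim; fold older turns into a lossy summary."""
    kept: list[str] = []
    used = 0
    for msg in reversed(history):  # newest first
        tokens = count_tokens(msg)
        if used + tokens > budget:
            older = history[: len(history) - len(kept)]
            return [summarize(older)] + kept
        kept.insert(0, msg)
        used += tokens
    return kept

history = [f"turn {i}: " + "lorem ipsum " * 200 for i in range(50)]
context = build_context(history, budget=2_000)  # older turns get summarized away
```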

Claude: 200k

Not bad. Yet reportedly the way Claude chunks its text is not as token-efficient as ChatGPT’s. When the context window was exceeded, I had to start a new chat, effectively losing all my previous content unless I had made intermediate artifacts beforehand to capture the data. I lived in constant fear of running out of tokens, similar to using a desktop computer in electricity-deficient Vietnam during the 90s.

Gemini *: 1M!

In none of my sessions did I manage to exceed this limit. I don’t know if Gemini simply forgets earlier chat, summarizes it, or requires a new chat. Google (via Gemini web search) does not disclose the precise mechanism it uses to handle extremely long chat sessions.

Human-in-the-loop collaboration

ChatGPT:

ChatGPT did not voluntarily show Canvas when I was on the app but would bring it up when requested. I could highlight any part of the document and ask the agent to improve or explain it. There were some simple styling options. It showed changes in Git-diff style. ChatGPT’s charts were not part of an editable Canvas. Finally, it did not give me a list of all Canvases; they were just buried in the chat.

Claude *:

Claude’s is called Artifacts. Claude would make an artifact whenever it felt appropriate or was requested. Any part could be revised or explained via highlighting. Any change would create a version. I couldn’t edit the artifact directly though. Understandably, there were also no styling options. Not only could Artifacts display code, they also executed JavaScript to generate charts, making Claude’s charts the most customizable of the bunch. To be clear, ChatGPT could show charts; it just did so right in the chat as a response.

Gemini:

Gemini’s Canvas copied ChatGPT’s homework, all the good bits and the bad bits. I had to summon it; there was no Canvas list and no versions, but I could undo. Chart rendering, however, was the same as Claude's: HTML was generated. Gemini added slightly more advanced styling options and a button to export to Google Docs. Perhaps it was just me, but I had several Canvas pages randomly go from full of content to blank, and there was nothing I could do to fix the glitch. It was madness.

Cross-app collaboration

ChatGPT:

ChatGPT has some proprietary integrations with various desktop apps, but the interoperability is nowhere near as extensive as MCP’s. For example, where an MCP filesystem server can edit any file within its permitted scope, ChatGPT can only “look” at a terminal screen. MCP support was announced in April 2025, so this might improve soon.

Claude *:

MCP! Need I say more?
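For the curious, this is roughly what wiring a filesystem MCP server into Claude Desktop looks like: a JSON config that names the server command and the directory it is allowed to touch. The path and package name below follow the commonly documented macOS setup; verify them against your own install.

```python
import json
from pathlib import Path

# Sketch of a Claude Desktop MCP configuration: one filesystem server whose
# permission scope is the directory passed as the last argument.
config = {
    "mcpServers": {
        "filesystem": {
            "command": "npx",
            "args": [
                "-y",
                "@modelcontextprotocol/server-filesystem",
                str(Path.home() / "Documents" / "product-docs"),
            ],
        }
    }
}

config_path = Path.home() / "Library" / "Application Support" / "Claude" / "claude_desktop_config.json"
print(f"Would write to: {config_path}")
print(json.dumps(config, indent=2))  # write it out only after reviewing the scope
```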

Gemini:

It is a web page, a closed garden with only Google Workspace integrations (though for many, that “only” might be the number one reason to stick with Gemini). It is also unclear to me whether Google’s intended usage for Gemini the LLM is via NotebookLM, the Gemini web page, or the Gemini widget in other products. MCP support has been announced, but being a web page, Gemini will be limited to remote MCP servers.

Characteristics

ChatGPT:

An (over)confident colleague. It is designed to be an independent, competent entity with limited collaboration with the user other than receiving input. Excellent at handling asynchronous tasks. Also known for excessive sycophancy (even before the rollback).

Claude:

A calm, collected intern. It is designed to be a thought partner working alongside the user; Artifacts and MCP reinforce this stereotype. It works best with internal docs. Web search and extended thinking still need much refinement.

Gemini:

An enabling analyst. Gemini’s research plan and co-editable Canvas place a lot of power in the hands of its users. It requires an engaging session to deliver a mind-blowing result, partly because if you turn away from it, the deep research session might die and you have to start again.

Putting the “AI Crew” Together

With that comparison in mind, my AI crew for product management today looks like this (a routing sketch in code follows the list).

  • Researcher:
    • Web-heavy discovery: ChatGPT
    • Huge bounded internal doc: Claude
  • Risk assessor: Claude (Claude works well with bounded context)
  • Data analyst:
    • Claude with a handful of MCPs
    • Gemini if the data is in spreadsheets
  • Backlog optimizer: Claude with Atlassian MCP
  • Draft writer:
    • ChatGPT for one-shot work and/or if I know little about the topic
    • Claude/Gemini if I’m gonna edit later
  • Editor/Reviewer: Claude Artifacts = Gemini Canvas
  • Publisher: Confluence MCP / Manual save
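The same assignment, written down as a small routing table so it can be reviewed and reshuffled like any other config. The role keys are my own naming, nothing standard.

```python
# A minimal routing table mirroring the crew above. Swapping a model for a
# role is a one-line change, which matters once the quarterly review kicks in.
CREW: dict[str, str] = {
    "researcher/web": "chatgpt",
    "researcher/internal-docs": "claude",
    "risk-assessor": "claude",
    "data-analyst/mcp": "claude",
    "data-analyst/spreadsheets": "gemini",
    "backlog-optimizer": "claude",        # via Atlassian MCP
    "draft-writer/one-shot": "chatgpt",
    "draft-writer/iterative": "claude",   # or gemini
    "editor-reviewer": "claude",          # Artifacts ~ Gemini Canvas
    "publisher": "confluence-mcp",
}

def pick_model(role: str) -> str:
    # Default to Claude for MCP interoperability, per the rules of thumb below.
    return CREW.get(role, "claude")
```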

Some rules of thumb

  • Claude works well with a definite set of documentation. If discovery is needed, ChatGPT and Gemini are better.
  • Between ChatGPT and Gemini, the former is suitable for less technical users who are less likely to engage in a conversation with AI.
  • In some cases, Claude is selected simply because the use case calls for MCP, and it is the only consumer app supporting MCP today. Also because of this, for tasks where all three LLMs perform equally well, I go with Claude for future interoperability.

Future proof

When I finished the rules of thumb, I had an unsettling feeling: this would not be the end of it. As the agents evolve, and they do, breakneck fast, this guide will soon become obsolete. I don’t plan to keep the guide forever correct, nor am I able to. But I can lay out the ground rules for reshuffling and re-evaluating the agents as new improvements arrive.

  • Periodic review - What was Claude's weakness yesterday might be its strength tomorrow (looking at you, extended thinking). Meanwhile, a model's strength could become commoditized, diluting its specialization value. Set a quarterly review cadence to assess if the current AI assignments still make sense. Even better if the review can be done as soon as a release drops.
  • Maintain a repeatable test - Develop a standardized benchmark that reflects your actual work. Run identical prompts through different models periodically to compare outputs objectively (a minimal harness is sketched after this list). This preserves your most effective prompts while creating a consistent evaluation framework. Your human judgment of these comparative results becomes the compass that guides crew reorganization as models evolve. If the use case evolves, evolve the test as well.
  • Build platform-independent intermediate artifacts - As workflows travel between AI platforms, establish standardized formats for handoff artifacts (PRDs, market analyses, etc.); a sketch of one such format also follows below. This reduces lock-in and makes crew substitution painless. A good intermediate artifact should work regardless of which model produced it or which model will consume it next. In my product management use case, some of the artifacts are stored in Confluence.
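On the repeatable test: a minimal harness, assuming each provider's SDK is wrapped behind the same plain text-in, text-out callable. The prompts and output file name are placeholders for your own benchmark; scoring stays with a human (or a rubric) afterward.

```python
import json
from datetime import date
from typing import Callable

# Each provider is wrapped behind the same signature; wire these to the
# OpenAI / Anthropic / Gemini SDKs you actually use.
Model = Callable[[str], str]

BENCHMARK_PROMPTS = [
    "Draft a one-page PRD for in-app feedback collection.",
    "List the top 5 risks of migrating billing to a new vendor.",
]

def run_benchmark(models: dict[str, Model]) -> None:
    results = {}
    for prompt in BENCHMARK_PROMPTS:
        results[prompt] = {name: model(prompt) for name, model in models.items()}
    # Persist raw outputs; human judgment (or a rubric) does the scoring later.
    out = f"benchmark-{date.today().isoformat()}.json"
    with open(out, "w") as f:
        json.dump(results, f, indent=2)

# Example wiring (hypothetical helpers standing in for real SDK calls):
# run_benchmark({"claude": call_claude, "chatgpt": call_chatgpt, "gemini": call_gemini})
```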
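And on intermediate artifacts: a sketch of what one such handoff format could look like, plain data plus enough metadata to know where it came from, serializable to JSON or markdown so any model (or Confluence) can consume it. Field names are illustrative.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class HandoffArtifact:
    """Model-agnostic intermediate artifact passed between agents and tools."""
    kind: str            # e.g. "prd", "market-analysis"
    title: str
    body_markdown: str   # markdown keeps it readable by humans and models alike
    produced_by: str     # which model/agent wrote it
    source_links: list[str]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

prd = HandoffArtifact(
    kind="prd",
    title="In-app feedback collection",
    body_markdown="## Problem\n...\n## Requirements\n...",
    produced_by="claude",
    source_links=[],
)
print(prd.to_json())
```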



Sunday, May 11, 2025

Multi vs Single-agent: Navigating the Realities of Agentic Systems

In March 2025, UC Berkeley released a paper that, IMO, has not been circulated enough. But first, let’s go back to 2024. Everyone was building their first agentic system. I did. You probably did too.

We sorted these systems into different maturity levels:

Level 1: Knowledge Retrieval - The system queries knowledge sources to fulfill its purpose, but does not perform workflows or tool execution.

Level 2: Workflow Routing - The system follows predefined LLM routing logic to run simple workflows, including tool execution and knowledge retrieval.

Level 3: Tool Selection Agent - The agent chooses and executes tools from an available portfolio according to specific instructions.

Level 4: Multi-Step Agent - The agent combines tools and multi-step workflows based on context and can act autonomously on its judgment.

Level 5: Multi-Agent Orchestrator - The agents independently determine workflows and flexibly invoke other agents to accomplish their purpose.

You might have seen different lists, yet I bet that no matter how many others you have seen, there is a common element: they all depict a multi-agent system as the highest level of sophistication. This approach promises better results through specialization and collaboration.

The elegant theory of multi-agent systems

The multi-agent collaboration model offers several theoretically compelling advantages over single-agent approaches:

Specialization and Expertise: Each agent can be tailored to a specific domain or subtask, leveraging unique strengths. One agent might excel at generating code while another specializes in testing or reviewing it.

Distributed Problem-Solving: Complex tasks can be broken into smaller, manageable pieces that agents tackle in parallel or in sequence. By splitting a problem (e.g., travel planning into weather checking, hotel search, and route optimization), the system can solve the parts independently and more efficiently (a toy fan-out sketch follows this list).

Built-in Error Correction: Multiple agents provide a form of cross-checking. If one agent produces a faulty output, a supervisor or peer agent might detect and correct it, improving reliability.

Scalability and Extensibility: As problem scope grows, it's often easier to add new specialized agents than to retrain or overly complicate a single agent.
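The distributed-problem-solving promise is easy to show in code: independent sub-tasks fan out concurrently and a planner merges the results. The sub-agents below are stubs; in a real system each would be its own LLM or tool call.

```python
import asyncio

# Stub sub-agents; in a real system each would be its own LLM/tool call.
async def check_weather(city: str) -> str:
    return f"{city}: sunny"

async def search_hotels(city: str) -> str:
    return f"{city}: 3 hotels under budget"

async def optimize_route(city: str) -> str:
    return f"{city}: 2-day walking route"

async def plan_trip(city: str) -> str:
    # Fan out the independent sub-problems, then let a "planner" merge them.
    weather, hotels, route = await asyncio.gather(
        check_weather(city), search_hotels(city), optimize_route(city)
    )
    return f"Plan for {city}:\n- {weather}\n- {hotels}\n- {route}"

print(asyncio.run(plan_trip("Hanoi")))
```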

The theory maps beautifully onto how we understand human teams to work: specialized individuals collaborating under coordination often outperform even the most talented generalist. It is every tech CEO’s wet dream: your agents talk to my agents and figure out what to do.

This model has shown remarkable success in some industrial-scale applications.

Then March 2025 landed with a thud!

The Berkeley reality check

A Berkeley-led team assessed the state of current multi-agent system implementations. They ran seven popular multi-agent systems across over 200 tasks and identified 14 unique failure modes, organized into three categories: specification and system design failures, inter-agent misalignment, and task verification and termination.

The paper Why Do Multi-Agent LLM Systems Fail? showed that some state-of-the-art systems achieved only 33.33% correctness on seemingly straightforward tasks like implementing Tic-Tac-Toe or Chess games. For AI, this is the equivalent of getting an F.

The paper provides empirical evidence for what many of us have experienced: implementing multi-agent systems today is hard. We can’t seem to fulfill the elegant theory outside a demo, with many failures stemming from coordination and communication issues rather than limitations in the underlying LLMs themselves. And if I hadn’t compiled the table of failure modes myself, I would think it was a summary of my university assignments.

The disconnect between PR and reality

Looking back, it's easy to see how the industry became so enthusiastic about multi-agent approaches:

  1. Big-cloud benchmarks looked impressive. AWS reported that letting Bedrock agents cooperate brought “marked improvements” in internal task-success and accuracy metrics for complex workflows. I am sure AWS has no hidden agenda here. It is not like they are in a race with anyone, right? Right?
  2. Flagship research prototypes beat single LLMs on niche benchmarks. HuggingGPT and ChatDev each reported better aggregate scores than a single GPT-4 baseline on their chosen tasks. In the same way the show Are You Smarter Than A 5th Grader works. But to be frank, the same argument can be used for the Berkeley paper.
  3. Thought-leaders said the same. Andrew Ng's 2024 "Agentic Design Patterns" talks frame multi-agent collaboration as the pattern that "often beats one big agent" on hard problems.
  4. An analogy we intuitively get. Divide-and-conquer, role specialization, debate for error-catching: all map neatly onto LLM quirks (limited context, hallucinations, etc.). But just as with humans, coordination overhead grows rapidly with agent count.
  5. Early adopters were vocal. Start-ups demoed agent teams creating slide decks, marketing campaigns, and code bases - with little visible human help - which looked like higher autonomy. Until reality’s ugly little details turned this bed of roses into a can of worms.

The Berkeley paper exposed these challenges, but it also pointed toward potential solutions.

Enter A2A: Plumbing made easy

Google's Agent-to-Agent (A2A) protocol arrived in April 2025 with backing from more than 50 launch partners, including Salesforce, Atlassian, LangChain, and MongoDB. While the spec is still a draft, the participation signal is strong: the industry finally has a common transport, discovery, and task-lifecycle layer for autonomous agents.

A2A directly targets 3 of the 14 failure modes identified in the Berkeley audit:

  1. Role/Specification Issues - By standardizing the agent registry, A2A creates clear declarations of capabilities, skills, and endpoints. This addresses failures to follow task requirements and agent roles.
  2. Context Loss Problems - A2A's message chunking and streaming prevent critical information from being lost during handoffs.
  3. Communication Infrastructure - The HTTP + JSON-RPC/SSE protocol with a canonical task lifecycle provides consistent, reliable agent communication (a minimal request sketch follows this list).
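As an illustration of that plumbing, here is roughly what talking to an A2A agent looks like under the draft spec: fetch its Agent Card from a well-known path, then POST a JSON-RPC tasks/send request. Method and field names follow my reading of the draft and may change; the endpoint URL is made up.

```python
import json
import uuid
import urllib.request

AGENT_BASE = "https://agents.example.com/researcher"  # hypothetical endpoint

def fetch_agent_card(base: str) -> dict:
    # Agent Cards are served from a well-known path and declare skills/capabilities.
    with urllib.request.urlopen(f"{base}/.well-known/agent.json") as resp:
        return json.load(resp)

def send_task(base: str, text: str) -> dict:
    payload = {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),
        "method": "tasks/send",  # method name per the draft A2A spec
        "params": {
            "id": str(uuid.uuid4()),  # task id
            "message": {"role": "user", "parts": [{"type": "text", "text": text}]},
        },
    }
    req = urllib.request.Request(
        base,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# card = fetch_agent_card(AGENT_BASE)
# result = send_task(AGENT_BASE, "Summarize competitor pricing changes this quarter.")
```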

Several vendors have begun piloting A2A implementations, with promising but still preliminary results. However, it's important to note that quantitative proof remains scarce. None of these vendors has released side-by-side benchmark tables yet; there are only directional statements or blog interviews. Google's own launch blog shows a candidate-sourcing workflow where one agent hires multiple remote agents, but provides no timing or accuracy metrics. I won’t include names here because I believe we established earlier that vendor tests can be unpublished, cherry-picked, and may include proprietary orchestration code.

In other words, A2A is to agents what MCP is to tool use today. And just as supporting MCP doesn’t make the code behind your tools better, A2A doesn’t make agents smarter.

What A2A Doesn't Solve

While something like A2A will fix the failures stemming from the lack of a protocol, it cannot fix issues that do not share the same root cause.

  • LLM-inherent limitations - Individual agent hallucinations, reasoning errors, and misunderstandings. Just LLM being LLM.
  • Verification gaps - The lack of formal verification procedures, cross-checking mechanisms, or voting systems. Without verification, you're essentially trusting an AI that thinks 2+2=5 to check the math of another AI that thinks 2+2=banana.
  • Orchestration intelligence - The supervisory logic for retry strategies, error recovery, and termination decisions. Just the other day, Claude was downloading my blog posts, hit a token limit, tried again, kept hitting the token limit, and looped forever. (A guard-railed retry sketch follows this list.)
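As promised above, a minimal sketch of the kind of guardrails those last two bullets call for: a majority-vote cross-check plus a hard retry budget so the loop terminates instead of burning tokens forever. The agent and verifier callables are stand-ins for whatever models you run.

```python
from typing import Callable

def run_with_guardrails(
    agent: Callable[[str], str],
    verifiers: list[Callable[[str], bool]],
    task: str,
    max_attempts: int = 3,
) -> str | None:
    """Retry with a hard budget; accept only if a majority of verifiers agree."""
    for attempt in range(max_attempts):
        output = agent(task)
        votes = sum(1 for check in verifiers if check(output))
        if votes > len(verifiers) // 2:  # simple majority vote
            return output
        # Feed the rejection back into the next attempt's prompt.
        task = f"{task}\n\nPrevious attempt rejected by {len(verifiers) - votes} checks."
    return None  # terminate instead of looping forever

# result = run_with_guardrails(my_agent, [compiles, passes_tests, judge_approves], "Implement Tic-Tac-Toe")
```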

Those three bullets cover 11 of the 14 failure modes. These areas require additional innovation beyond standardized communication protocols. Better verification agents, formal verification systems, or improved orchestration frameworks will be needed to address these challenges. I would love to go deeper into the innovation layers required to solve the “cognitive” failures of multi-agent models in another post.

Conclusion

Multi-agent systems are not the universal solution for today’s problems (yet). Many are here and working, but only at a scale that no single-agent approach can reach. These multi-agent systems deliver potential benefits, but at a substantial cost, one that can easily exceed an organization's capacity to invest or to find people with the relevant expertise.

Rather than viewing multi-agent as a holy grail to be sought at all costs, we should approach it with careful consideration: an option to explore when simpler approaches fail. The claimed benefits rarely justify the implementation complexity for most everyday use cases, particularly when a well-designed single-agent system with appropriate guardrails can deliver acceptable results at a fraction of the engineering cost.

Still, the conversation about multi-agent systems will keep coming back, with increasing frequency, especially in light of the arrival of protocols like A2A. MCP is still the hype now; organizations are busy integrating MCP servers and writing their own before turning their attention to the next big thing. A2A could be that. Better agent alignment could also be that. Or it could take a new cognitive paradigm to make agents smarter. Things will take another six months to unfold, roughly the amount of time that has passed since the MCP announcement.

Either way, what’s clear is that agentic systems still have a long way to go before hitting their theoretical performance ceiling.