I was designing an AI-powered workflow for Product Management. The flow looks somewhat like this.
The details of the flow are largely not relevant to the topic of this post. The important gist is: it is an end-to-end flow too big to fit in most modern LLMs’ context windows. Since the flow has to be broken down into sequential phases, it lends itself nicely to a multi-agent setup, with intermediate artifacts carrying context from one agent to the next. It was the AI-as-Judge part that reminded me: just as a developer cannot reliably test their own work, an agent should not be trusted to judge its own work.
"Act as a critical Product Requirement Adjudicator and evaluate the following product requirement for clarity, completeness, user value, and potential risks. [Paste your product requirement here]. Provide specific, actionable feedback to help me improve its robustness and effectiveness."
(A) I have a multi-agent setup. (B) I have a legitimate reason to introduce more than one LLM to reduce model blind spots. The next logical step is to decide which model makes the best agent.
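To make that concrete, here is a minimal sketch of one phase of such a pipeline: one model writes the artifact, a different model adjudicates it. The stub models, function names, and prompt wiring are all mine, not any vendor’s API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical model callers: each takes a prompt and returns text.
# In practice these would wrap the Anthropic / OpenAI / Gemini SDKs.
ModelFn = Callable[[str], str]

ADJUDICATOR_PROMPT = (
    "Act as a critical Product Requirement Adjudicator and evaluate the "
    "following product requirement for clarity, completeness, user value, "
    "and potential risks.\n\n{artifact}\n\n"
    "Provide specific, actionable feedback."
)

@dataclass
class Phase:
    name: str
    writer: ModelFn  # the model that produces the artifact
    judge: ModelFn   # a *different* model that critiques it

    def run(self, task: str) -> tuple[str, str]:
        artifact = self.writer(task)
        critique = self.judge(ADJUDICATOR_PROMPT.format(artifact=artifact))
        return artifact, critique

# Stub models so the sketch runs without API keys.
claude = lambda prompt: f"[Claude draft for: {prompt[:40]}]"
chatgpt = lambda prompt: f"[ChatGPT critique of: {prompt[:40]}]"

prd_phase = Phase("Write PRD", writer=claude, judge=chatgpt)
artifact, critique = prd_phase.run("Draft a PRD for in-app search")
```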
An empirical comparison
While composing the aforementioned workflow, I fed the same prompts to Claude 3.7, Gemini 2.5 Pro (preview), and ChatGPT o3, each in its native environment (desktop app for Claude and ChatGPT, web page for Gemini). I first asked each agent to passively consume my notes while waiting for my cue to participate. Each note captured one use case where I think GenAI benefits a product manager’s productivity; when the idea came from the Internet, I provided the link(s). Once my notes were exhausted, I asked the agent to perform a web search for novel uses of GenAI in product management that I might have missed. Finally, with the combined inputs from my notes and the web search, I asked the agent to compose a comprehensive report. I ended up with three conversations and three reports. Below are my observations on the pros and cons of each. An asterisk (*) marks the clear winner in each category.
LLM as a thought partner
This is probably the most subjective section of this subjective comparison. Feel free to let me know if you disagree.
ChatGPT:
I didn’t like ChatGPT as a partner. Its feedback sometimes focused on form over substance - whether or not I followed it made no difference to my work. ChatGPT was also stubborn and prone to hallucination, a dangerous combination. It kept trying to convince me that Substack supports some form of table, first suggesting a markdown format, then insisting on a non-existent HTML mode.
Claude *:
I dig Claude (with certain caveats explored below). Its feedback had a good hit-to-miss ratio. The future-proof section at the end of this post was its suggestion (the writing mine). There seemed to be fewer hallucinations too. Empirically speaking, I had fewer “this can’t be right” moments.
Gemini:
Gemini’s feedback was not as vain as ChatGPT’s but limited in creativity; had I followed it, I doubt my work would have turned out drastically different. Gemini gave me a glimpse into what a good LLM can do, yet left me desiring more.
Web search
ChatGPT *:
The web search was genuinely good. It started from my initial direction and covered a wide swath of internet knowledge. The output contributed new insights that didn’t exist in my initial notes while staying on course with the topic.
Claude:
The web search seemed rather limited. Instead of surfacing novel uses, it reinforced earlier discussions with new data points. More of the same: good for validation but bad for new insights. Furthermore, Claude’s internal knowledge is limited to October 2024; it could list half a dozen expansions of the acronym MCP, except Model Context Protocol!
Gemini *:
Gemini’s web search was good: it performed on par with ChatGPT’s, and was even faster.
Deep thinking
Aka extended thinking. Aka deep research. You know it when you see it.
ChatGPT *:
Before entering deep thinking mode, it would ask clarification questions. These questions proved very useful in steering the agent in the right direction and saving time. The agent would show the thinking process while it was happening, but hide it once done. Fun fact: ChatGPT used to hide this thinking process completely until DeepSeek exposed its own, “nudging” the OG reasoning model to follow suit.
Claude:
Extended thinking was still in preview. It didn’t ask for clarification; instead it uncharacteristically (for Claude) jumped head-first into thinking mode. It also scanned through a huge number of sources; compared to the other two agents, it didn’t seem to know when to stop. In the process, it consumed an excessive number of tokens. This, however, is very much characteristically Claude. I am on the Team license, and my daily quota was consistently depleted after 3-4 questions. Claude did show the thinking process and kept it visible afterwards.
Gemini:
Deep research did not ask clarification questions; it showed the research plan and asked if I would like to modify it. Post confirmation, it launched into deep research mode. Arguably not as friendly as ChatGPT’s approach, but good for power users. Gemini’s thinking process was the most detailed and displayed in the friendliest UI of the bunch. However, it was very slow. Ironically, Gemini didn’t handle disconnection very well, even though at that speed, deep research should be an async task.
Context window
ChatGPT: 128-200k
The agent implemented some sort of rolling context window where earlier chat is summarized in the agent’s memory. I never ran out of tokens during my sessions, but this approach is well known for context collapse: the summary misses earlier details, and the agent starts to fabricate facts to fill in the gaps.
Claude: 200k
Not bad. Yet reportedly the way Claude chunks text is not as token-efficient as ChatGPT’s. When the context window was exceeded, I had to start a new chat, effectively losing all my previous content unless I had made intermediate artifacts beforehand to capture the data. I lived in constant fear of running out of tokens, similar to using a desktop in electricity-deficient Vietnam during the 90s.
Gemini *: 1M!
In none of my sessions did I manage to exceed this limit. I don’t know if Gemini simply forgets earlier chat, summarizes it, or requires a new chat; Google (via Gemini web search) does not disclose the precise mechanism it uses to handle extremely long chat sessions.
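For illustration, here is how a rolling context window of the kind ChatGPT seems to use might work. This is a hypothetical reconstruction, since no vendor discloses the actual mechanism; token counting and summarization are crude stand-ins.

```python
# A minimal sketch of a rolling context window: when the transcript
# exceeds the budget, older turns are compressed into one summary turn.

def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def summarize(turns: list[str]) -> str:
    # Stand-in for an LLM summarization call. Detail is lost here,
    # which is exactly where "context collapse" creeps in.
    return "SUMMARY: " + " | ".join(turn[:30] for turn in turns)

def build_context(history: list[str], budget: int = 1000) -> list[str]:
    kept, used = [], 0
    # Keep the most recent turns verbatim until the budget is spent...
    for turn in reversed(history):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    # ...and fold everything older into a single summary turn.
    older = history[: len(history) - len(kept)]
    return ([summarize(older)] if older else []) + kept
```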
Human-in-the-loop collaboration
ChatGPT:
ChatGPT did not voluntarily show Canvas when I was on the app but would bring it up when requested. I could highlight any part of the document and ask the agent to improve or explain it. There were some simple styling options. It showed changes in Git-diff style. ChatGPT’s charts were not part of an editable Canvas. Finally, it did not give me a list of all Canvases; they were just buried in the chat.
Claude *:
Claude’s is called Artifacts. Claude would create an artifact whenever it felt appropriate or was asked to. Any part could be revised or explained via highlighting. Any change would create a version. I couldn’t edit the artifact directly, though. Understandably, there were also no styling options. Not only could Artifacts display code, they also executed JavaScript to generate charts, making Claude’s charts the most customizable of the bunch. To be clear, ChatGPT could show charts; it just did so right in the chat as a response.
Gemini:
Gemini’s Canvas copied ChatGPT’s homework, all the good bits and the bad. I had to summon it; there was no Canvas list and no versions, though I could undo. Chart rendering, however, was the same as Claude’s: HTML was generated. Gemini added slightly more advanced styling options and a button to export to Google Docs. Perhaps it was just me, but I had several Canvas pages that randomly went from full of content to blank, and there was nothing I could do to fix the glitch. It was madness.
Cross-app collaboration
ChatGPT:
ChatGPT has some proprietary integrations with various desktop apps, but the interoperability is nowhere near as extensive as MCP’s. For example, where the MCP filesystem server can edit any file within its scope of permission, ChatGPT can only “look” at a terminal screen. MCP support was announced in April 2025, so this might improve soon.
Claude *:
MCP! Need I say more?
Gemini:
It is a web page; it is a walled garden with only Google Workspace integrations (though for many, that “only” might be the number one reason to stick with Gemini). It is also unclear to me whether Google’s intended gateway to Gemini the LLM is NotebookLM, the Gemini web page, or the Gemini widget in other products. MCP support was announced, but Gemini being a web page, it will be limited to remote MCP servers.
Characteristics
ChatGPT:
An (over)confident colleague. It is designed to be an independent, competent entity with limited collaboration with the user other than receiving input. Excellent at handling asynchronous tasks. Also known for excessive sycophancy (even before the rollback).
Claude:
A calm, collected intern. It is designed to be a thought partner working alongside the user. Artifacts and MCPs reinforce this stereotype. It works best with internal docs. Web search/extended thinking need much refinement.
Gemini:
An enabling analyst. Gemini’s research plan and co-editable Canvas place a lot of power in the hands of its users. It requires an engaged session to deliver a mind-blowing result, not least because if you turn away from it, the deep research session might die and you have to start again.
Putting the “AI Crew” Together
With that comparison in mind, my AI crew for product management today looks like this.
- Researcher:
- Web-heavy discovery: ChatGPT
- Huge bounded internal doc: Claude
- Risk assessor: Claude (Claude works well with bounded context)
- Data analyst:
- Claude with a handful of MCPs
- Gemini if the data is in spreadsheets
- Backlog optimizer: Claude with Atlassian MCP
- Draft writer:
- ChatGPT for one-shot work and/or if I know little about the topic
- Claude/Gemini if I’m gonna edit later
- Editor/Reviewer: Claude Artifacts = Gemini Canvas
- Publisher: Confluence MCP / Manual save
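Boiled down to configuration, the roster is just a routing table. A toy sketch (the task names and fallback rule are made up for illustration):

```python
# Toy routing table for the crew above.
CREW = {
    "research.web": "chatgpt",
    "research.internal_docs": "claude",
    "risk_assessment": "claude",
    "data_analysis.mcp": "claude",
    "data_analysis.spreadsheet": "gemini",
    "backlog": "claude",          # via Atlassian MCP
    "draft.one_shot": "chatgpt",
    "draft.iterative": "claude",  # or "gemini"
    "edit_review": "claude",      # Artifacts; Gemini Canvas works too
    "publish": "claude",          # Confluence MCP, else manual save
}

def pick_model(task: str) -> str:
    # When all models tie, default to Claude for MCP interoperability
    # (see the rules of thumb below).
    return CREW.get(task, "claude")
```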
Some rules of thumb
- Claude works well with a fixed, bounded set of documentation. If discovery is needed, ChatGPT and Gemini are better.
- Between ChatGPT and Gemini, the former is suitable for less technical users who are less likely to engage in a conversation with AI.
- In some cases, Claude is selected simply because the use case calls for MCP, and Claude is the only consumer app supporting MCP today. Also because of this, for tasks where all three LLMs perform equally well, I go with Claude for future interoperability.
Future-proof
When I finished the rules of thumb, I had an unsettling feeling: this would not be the end of it. As the agents evolve, and they do, at breakneck speed, this guide will soon become obsolete. I don’t plan to keep the guide forever correct, nor am I able to. But I can lay out ground rules for reshuffling and reevaluating the agents as new improvements arrive.
- Periodic review - What was Claude's weakness yesterday might be its strength tomorrow (looking at you, extended thinking). Meanwhile, a model's strength could become commoditized, diluting its specialization value. Set a quarterly review cadence to assess if the current AI assignments still make sense. Even better if the review can be done as soon as a release drops.
- Maintain a repeatable test - Develop a standardized benchmark that reflects your actual work. Run identical prompts through different models periodically to compare outputs objectively (a minimal harness is sketched after this list). This preserves your most effective prompts while creating a consistent evaluation framework. Your human judgment of these comparative results becomes the compass that guides crew reorganization as models evolve. If the use case evolves, evolve the test as well.
- Build platform-independent intermediate artifacts - As workflows travel between AI platforms, establish standardized formats for handoff artifacts (PRDs, market analyses, etc.); one possible shape is sketched below. This reduces lock-in and makes crew substitution painless. A good intermediate artifact should work regardless of which model produced it or which model will consume it next. In my product management use case, some of the artifacts are stored in Confluence.
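Here is what such a repeatable test could look like as a minimal harness. The model callers are stubs, and the prompt set, file layout, and naming are all assumptions of mine:

```python
# Run the same prompts through each model and store outputs side by
# side for human review; diff this quarter's file against the last.
import datetime
import json
from pathlib import Path

PROMPTS = {
    "prd_review": "Act as a critical Product Requirement Adjudicator...",
    "web_discovery": "Find novel uses of GenAI in product management.",
}

MODELS = {  # stubs; in practice these wrap the vendor SDKs
    "chatgpt": lambda p: f"[ChatGPT answer to: {p[:30]}]",
    "claude": lambda p: f"[Claude answer to: {p[:30]}]",
    "gemini": lambda p: f"[Gemini answer to: {p[:30]}]",
}

def run_benchmark(out_dir: str = "benchmarks") -> Path:
    results = {
        name: {model: call(prompt) for model, call in MODELS.items()}
        for name, prompt in PROMPTS.items()
    }
    path = Path(out_dir)
    path.mkdir(exist_ok=True)
    out = path / f"{datetime.date.today().isoformat()}.json"
    out.write_text(json.dumps(results, indent=2))
    return out

if __name__ == "__main__":
    print(run_benchmark())
```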
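As for the handoff format itself, one possible shape is plain Markdown with a small metadata header, so any model (or human) can consume it. The field names are illustrative, not a standard:

```python
# A platform-independent handoff artifact rendered as Markdown.
from dataclasses import dataclass, field

@dataclass
class Artifact:
    kind: str         # e.g. "PRD", "market-analysis"
    title: str
    produced_by: str  # which model wrote it; consumers shouldn't care
    body_md: str
    sources: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        header = (
            f"<!-- kind: {self.kind} | produced_by: {self.produced_by} -->\n"
            f"# {self.title}\n\n"
        )
        refs = "".join(f"- {s}\n" for s in self.sources)
        return header + self.body_md + ("\n\n## Sources\n" + refs if refs else "")
```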