Thursday, May 29, 2025

LLM Agents Comparison

 I was designing an AI-powered workflow for Product Management. The flow looks somewhat like this.


The details of the flow are largely not relevant to the topic of this post. The important gist is: it is an end-to-end flow that would be too big to fit in most modern LLMs' context window. Since the flow has to be broken down into sequential phases, it lends itself nicely to a multi-agent setup, with intermediate artifacts carrying context from one agent to the next. It was the AI-as-Judge part that reminded me: just as a developer cannot reliably test their own work, an agent should not be trusted to judge its own work.

"Act as a critical Product Requirement Adjudicator and evaluate the following product requirement for clarity, completeness, user value, and potential risks. [Paste your product requirement here]. Provide specific, actionable feedback to help me improve its robustness and effectiveness."

(A) I have a multi-agent setup. (B) I have a legitimate reason to introduce more than one LLM to reduce model blind spots. The next logical step is to decide which model makes the best agent.
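
To make the judge step concrete, here is a minimal sketch of the idea, with hypothetical call_claude and call_gemini helpers standing in for the vendor SDKs. The only point it illustrates is that the model drafting the requirement and the model judging it are different, so one model's blind spots don't get to grade themselves.

```python
# Minimal sketch: one model drafts, a *different* model judges.
# call_claude / call_gemini are hypothetical wrappers around the vendor SDKs.

ADJUDICATOR_PROMPT = (
    "Act as a critical Product Requirement Adjudicator and evaluate the "
    "following product requirement for clarity, completeness, user value, "
    "and potential risks.\n\n{requirement}\n\n"
    "Provide specific, actionable feedback to improve its robustness."
)

def draft_requirement(notes: str) -> str:
    """Drafting agent: turns raw PM notes into a product requirement."""
    return call_claude(f"Draft a product requirement from these notes:\n{notes}")

def judge_requirement(requirement: str) -> str:
    """Judging agent: a second model reviews the draft to reduce blind spots."""
    return call_gemini(ADJUDICATOR_PROMPT.format(requirement=requirement))

def run_phase(notes: str) -> dict:
    draft = draft_requirement(notes)
    feedback = judge_requirement(draft)
    # Both become intermediate artifacts handed to the next phase of the flow.
    return {"draft": draft, "feedback": feedback}
```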

An empirical comparison

While composing the aforementioned workflow, I fed the same prompts to Claude 3.7, Gemini 2.5 Pro (preview), and ChatGPT o3, all in their native environments (desktop app for Claude and ChatGPT, web page for Gemini). I first asked each agent to passively consume my notes while waiting for my cue to participate. In each note, I described one use case where I think GenAI benefits the productivity of a product manager. When an idea came from the Internet, I provided the link(s). Once my notes were exhausted, I asked the agent to perform a web search to find other novel uses of GenAI in product management that I might have missed. Finally, with the combined inputs from me and the web search, I asked the agent to compose a comprehensive report. I ended up with three conversations and three reports. Below are my observations on the pros and cons of each. An asterisk (*) marks the clear winner.

LLM as a thought partner

This is probably the most subjective section of this subjective comparison. Feel free to let me know if you disagree.

ChatGPT:

I didn't like ChatGPT as a partner. Its feedback sometimes focused on form over substance - whether or not I followed it made no difference to my work. ChatGPT was also stubborn and prone to hallucination, a dangerous combination. It kept trying to convince me that Substack supports some form of table, first suggesting a markdown format, then insisting on a non-existent HTML mode.

Claude *:

I dig Claude (with certain caveats explored below). Its feedback had a good hit-to-miss ratio. The future-proof section at the end of this post was its suggestion (the writing mine). There seemed to be fewer hallucinations too. Empirically speaking, I had fewer “this can’t be right” moments.

Gemini:

Gemini's feedback was not as vain as ChatGPT's but limited in creativity; had I followed it, I doubt my work would have turned out drastically different. Gemini gave me a glimpse into what a good LLM can do, yet left me desiring more.

Web search

ChatGPT *:

The web search was genuinely good. It started with my initial direction and covered a good swath of internet knowledge. The output contributed new insights that didn't exist in my initial notes while managing to stay on course with the topic.

Claude:

The web search seemed rather limited. Instead of surfacing novel uses, it reinforced earlier discussions with new data points. More of the same: good for validation but bad for new insights. Furthermore, Claude's internal knowledge cuts off at October 2024; it could list half a dozen expansions of the acronym MCP, except Model Context Protocol!

Gemini *:

Gemini's web search was good; it performed on par with ChatGPT's, and was even faster.

Deep thinking

Aka extended thinking. Aka deep research. You know it when you see it.

ChatGPT *:

Before entering the deep thinking mode, it would ask clarifying questions. These questions proved very useful in steering the agent in the right direction and saving time. The agent showed the thinking process while it was happening, but hid it once done. Fun fact: ChatGPT used to hide this thinking process completely until DeepSeek exposed its own, "nudging" the OG reasoning model to follow suit.

Claude:

Extended thinking was still in preview. It didn't ask for clarification; instead it uncharacteristically (for Claude) jumped head-on into thinking mode. It also scanned through an excessive number of sources; compared to the other two agents, it didn't seem to know when to stop. In the process, it consumed an excessive number of tokens. This, however, is very much characteristically Claude. I am on the Team license, and my daily quota was consistently depleted after 3-4 questions. Claude did show the thinking process and kept displaying it afterwards.

Gemini:

Deep research did not ask clarifying questions; it showed the research plan and asked if I would like to modify it. Post confirmation, it launched the deep research mode. Arguably not as friendly as ChatGPT's approach, but good for power users. Gemini's thinking process was the most detailed and displayed in the friendliest UI of the bunch. However, it was very slow. Ironically, Gemini didn't handle disconnection very well, even though at that speed, deep research really ought to be an async task.

Context window

ChatGPT: 128-200k

The agent implemented some sort of rolling context window where earlier chat is summarized into the agent's memory. I never ran out of tokens during my sessions, but this approach is well-known for context collapse: the summary misses earlier details, and the agent starts to fabricate facts to fill in the gaps.
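
My mental model of that rolling window, as a rough sketch only (count_tokens and summarize are hypothetical helpers; OpenAI has not published the actual mechanism):

```python
# Rough sketch of a rolling context window: when the transcript grows past a
# token budget, the oldest turns are collapsed into a running summary that
# stays in context. count_tokens and summarize are hypothetical helpers; this
# is not the actual mechanism ChatGPT uses.

def compact_history(messages: list[dict], budget: int) -> list[dict]:
    summary = ""
    while count_tokens(messages) > budget and len(messages) > 2:
        oldest, messages = messages[0], messages[1:]              # drop the oldest turn
        summary = summarize(summary + "\n" + oldest["content"])   # lossy compression step
    if summary:
        messages = [{"role": "system",
                     "content": f"Earlier chat, summarized: {summary}"}] + messages
    return messages
```

The lossy step is where the collapse happens: whatever the summary drops, the model later papers over with plausible-sounding inventions.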

Claude: 200k

Not bad. Yet reportedly the way Claude chunks its text is not as token-efficient as ChatGPT's. When the context window was exceeded, I had to start a new chat, effectively losing all my previous content unless I had made intermediate artifacts beforehand to capture the data. I lived in constant fear of running out of tokens, similar to using a desktop in electricity-deficient Vietnam during the 90s.

Gemini *: 1M!

In none of my sessions did I manage to exceed this limit. I don’t know if Gemini simply forgets earlier chat, summarizes it, or requires a new chat. Google (via Gemini web search) does not disclose the precise mechanism it chooses to handle extremely long chat sessions.

Human-in-the-loop collaboration

ChatGPT:

ChatGPT did not voluntarily show Canvas when I was on the app but would bring it up when requested. I could highlight any part of the document and ask the agent to improve or explain it. There were some simple styling options. It showed changes in Git-diff style. ChatGPT's charts were not part of an editable Canvas. Finally, it did not give me a list of all Canvases; they were just buried in the chat.

Claude *:

Claude's equivalent is called Artifacts. Claude would make an artifact whenever it feels appropriate or is requested. Any part can be revised or explained via highlighting. Any change creates a new version. I couldn't edit the artifact directly though. Understandably, there were also no styling options. Not only could Artifacts display code, they also executed JavaScript to generate charts, making Claude's charts the most customizable in the bunch. To be clear, ChatGPT could show charts, it just did so right in the chat as a response.

Gemini:

Gemini's Canvas copied ChatGPT's homework, all the good bits and the bad bits. I had to summon it; there was no Canvas list, no versions, but I could undo. Chart rendering, however, was the same as Claude's - HTML was generated. Gemini added slightly more advanced styling options and a button to export to Google Docs. Perhaps it was just me, but I had several Canvas pages that randomly went from being full of content to blank, and there was nothing I could do to fix the glitch. It was madness.

Cross-app collaboration

ChatGPT:

ChatGPT has some proprietary integrations with various desktop apps, but the interoperability is nowhere near as extensive as MCP's. For example, where an MCP filesystem server can edit any file within its scope of permission, ChatGPT can only "look" at a terminal screen. MCP support was announced in April 2025, so this might improve soon.

Claude *:

MCP! Need I say more?

Gemini:

It is a web page; it is a walled garden with only Google Workspace integrations (though for many, that "only" might be the number one reason to stick with Gemini). It is also unclear to me whether Google's intended gateway to Gemini the LLM is NotebookLM, the Gemini web page, or the Gemini widget in other products. MCP support was announced, but being a web page, it will be limited to remote MCP servers.

Characteristics

ChatGPT:

An (over)confident colleague. It is designed to be an independent, competent entity with limited collaboration with the user other than receiving input. Excellent at handling asynchronous tasks. Also known for excessive sycophancy (even before the rollback).

Claude:

A calm, collected intern. It is designed to be a thought partner working alongside the user. Artifacts and MCPs reinforce this stereotype. It works best with internal docs. Web search/extended thinking need much refinement.

Gemini:

An enabling analyst. Gemini's research plan and co-editable canvas place a lot of power in the hands of its users. It requires an engaged session to deliver a mind-blowing result - also because if you turn away from it, the deep research session might die and you have to start again.

Putting the “AI Crew” Together

With that comparison in mind, my AI crew for product management today looks like this (a minimal routing sketch follows the list).

  • Researcher:
    • Web-heavy discovery: ChatGPT
    • Huge bounded internal doc: Claude
  • Risk assessor: Claude (Claude works well with bounded context)
  • Data analyst:
    • Claude with a handful of MCPs
    • Gemini if the data is in spreadsheets
  • Backlog optimizer: Claude with Atlassian MCP
  • Draft writer:
    • ChatGPT for one-shot work and/or if I know little about the topic
    • Claude/Gemini if I’m gonna edit later
  • Editor/Reviewer: Claude Artifacts = Gemini Canvas
  • Publisher: Confluence MCP / Manual save
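
As referenced above, here is a minimal routing sketch of that crew. The model names are illustrative strings, not pinned versions, and pick_model is just a lookup with a fallback.

```python
# Minimal crew routing table; model names are illustrative, not pinned versions.
CREW = {
    "researcher_web":      "chatgpt",  # web-heavy discovery
    "researcher_internal": "claude",   # huge but bounded internal docs
    "risk_assessor":       "claude",   # bounded context
    "data_analyst":        "claude",   # or "gemini" when the data lives in spreadsheets
    "backlog_optimizer":   "claude",   # Atlassian MCP
    "draft_writer":        "chatgpt",  # one-shot drafts; claude/gemini when I edit later
    "editor_reviewer":     "claude",   # Artifacts, roughly on par with Gemini Canvas
}

def pick_model(role: str, default: str = "claude") -> str:
    """Return the model assigned to a crew role, falling back to the default."""
    return CREW.get(role, default)
```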

Some rules of thumb

  • Claude works well with a definite set of documentation. If discovery is needed, ChatGPT and Gemini are better.
  • Between ChatGPT and Gemini, the former is suitable for less technical users who are less likely to engage in a conversation with AI.
  • In some cases, Claude is selected simply because the use case calls for MCP, and it is the only consumer app supporting MCP today. Also because of this, for tasks where all 3 LLMs perform equally well, I go with Claude for later interoperability.

Future proof

When I finished the rules of thumb, I had an unsettling feeling that this would not be the end of it. As the agents evolve - and they do, at breakneck speed - this guide will soon become obsolete. I don't plan to keep the guide forever correct, nor am I able to. But I can lay out the ground rules for reshuffling and reevaluating the agents as new improvements arrive.

  • Periodic review - What was Claude's weakness yesterday might be its strength tomorrow (looking at you, extended thinking). Meanwhile, a model's strength could become commoditized, diluting its specialization value. Set a quarterly review cadence to assess if the current AI assignments still make sense. Even better if the review can be done as soon as a release drops.
  • Maintain a repeatable test - Develop a standardized benchmark that reflects your actual work. Run identical prompts through different models periodically to compare outputs objectively (a minimal harness sketch follows this list). This preserves your most effective prompts while creating a consistent evaluation framework. Your human judgment of these comparative results becomes the compass that guides crew reorganization as models evolve. If the use case evolves, evolve the test as well.
  • Build platform-independent intermediate artifacts - As workflows travel between AI platforms, establish standardized formats for handoff artifacts (PRDs, market analyses, etc.). This reduces lock-in and makes crew substitution painless. A good intermediate artifact should work regardless of which model produced it or which model will consume it next. In my use case of product management, some of the artifacts are stored in Confluence.
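
A minimal sketch of the repeatable test mentioned above, assuming a hypothetical ask_model(provider, prompt) wrapper around each vendor's SDK. The harness only collects outputs side by side; the scoring stays human.

```python
# Minimal benchmark harness: run identical prompts through each model and store
# the outputs for side-by-side human review. ask_model(provider, prompt) is a
# hypothetical wrapper around the respective SDKs; the prompts are examples.
import datetime
import json
import pathlib

PROMPTS = {
    "prd_review": "Act as a critical Product Requirement Adjudicator and evaluate ...",
    "market_scan": "Find novel uses of GenAI in product management and summarize ...",
}
PROVIDERS = ["claude", "chatgpt", "gemini"]

def run_benchmark(out_dir: str = "benchmarks") -> pathlib.Path:
    results = {
        name: {provider: ask_model(provider, prompt) for provider in PROVIDERS}
        for name, prompt in PROMPTS.items()
    }
    path = pathlib.Path(out_dir)
    path.mkdir(exist_ok=True)
    out_file = path / f"{datetime.date.today()}.json"
    out_file.write_text(json.dumps(results, indent=2))
    return out_file  # review side by side; your judgment is the scorer
```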


 

Sunday, May 11, 2025

Multi vs Single-agent: Navigating the Realities of Agentic Systems

In March 2025, UC Berkeley released a paper that, IMO, has not been circulated enough. But first, let’s go back to 2024. Everyone was building their first agentic system. I did. You probably did too.

We classified these systems into different maturity levels:

Level 1: Knowledge Retrieval - The system queries knowledge sources to fulfill its purpose, but does not perform workflows or tool execution.

Level 2: Workflow Routing - The system follows predefined LLM routing logic to run simple workflows, including tool execution and knowledge retrieval.

Level 3: Tool Selection Agent - The agent chooses and executes tools from an available portfolio according to specific instructions.

Level 4: Multi-Step Agent - The agent combines tools and multi-step workflows based on context and can act autonomously on its judgment.

Level 5: Multi-Agent Orchestrator - The agents independently determine workflows and flexibly invoke other agents to accomplish their purpose.

You might have seen different lists, yet I bet that no matter how many others you have seen, there is a common element: they all depict a multi-agent system as the highest level of sophistication. This approach promises better results through specialization and collaboration.

The elegant theory of multi-agent systems

The multi-agent collaboration model offers several theoretically compelling advantages over single-agent approaches:

Specialization and Expertise: Each agent can be tailored to a specific domain or subtask, leveraging unique strengths. One agent might excel at generating code while another specializes in testing or reviewing it.

Distributed Problem-Solving: Complex tasks can be broken into smaller, manageable pieces that agents tackle in parallel or sequence. By splitting a problem (e.g., travel planning into weather checking, hotel search, route optimization), the system can solve parts independently and more efficiently.
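
As a toy illustration of that decomposition (the three sub-task calls and the merging coordinator are hypothetical agents, not any particular framework):

```python
# Sketch of distributed problem-solving: independent sub-tasks run in parallel,
# then a coordinator merges the partial results. All four agent calls are
# hypothetical stand-ins for specialized agents.
import asyncio

async def plan_trip(destination: str, dates: str) -> str:
    weather, hotels, route = await asyncio.gather(
        check_weather(destination, dates),   # agent 1: weather checking
        search_hotels(destination, dates),   # agent 2: hotel search
        optimize_route(destination),         # agent 3: route optimization
    )
    return await merge_plan(weather, hotels, route)  # coordinator agent
```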

Built-in Error Correction: Multiple agents provide a form of cross-checking. If one agent produces a faulty output, a supervisor or peer agent might detect and correct it, improving reliability.

Scalability and Extensibility: As problem scope grows, it's often easier to add new specialized agents than to retrain or overly complicate a single agent.

The theory maps beautifully onto how we understand human teams to work: specialized individuals collaborating under coordination often outperform even the most talented generalist. It is every tech CEO's wet dream: your agents talk to my agents and figure out what to do.

This model has shown remarkable success in some industrial-scale applications.

Then March 2025 landed with a thud!

The Berkeley reality check

A Berkeley-led team assessed the state of current multi-agent system implementations. They ran seven popular multi-agent systems across over 200 tasks and identified 14 unique failure modes, organized into three categories.

The paper Why Do Multi-Agent LLM Systems Fail? showed that some state-of-the-art systems achieved only 33.33% correctness on seemingly straightforward tasks like implementing Tic-Tac-Toe or Chess games. For AI, this is the equivalent of getting an F.

The paper provides empirical evidence for what many of us have experienced: today's multi-agent implementation is hard. We can’t seem to fulfill the elegant theory outside a demo, with many failures stemming from coordination and communication issues rather than limitations in the underlying LLMs themselves. And if I didn’t make the table above myself, I would think it was a summary of my university assignments.

The disconnection between PR and reality

Looking back, it's easy to see how the industry became so enthusiastic about multi-agent approaches:

  1. Big-cloud benchmarks looked impressive. AWS reported that letting Bedrock agents cooperate yielded "marked improvements" in internal task-success and accuracy metrics for complex workflows. I am sure AWS has no hidden agenda here. It is not like they are in a race with anyone, right? Right?
  2. Flagship research prototypes beat single LLMs on niche benchmarks. HuggingGPT and ChatDev each reported better aggregate scores than a single GPT-4 baseline on their chosen tasks. In the same way the show Are You Smarter Than A 5th Grader works. But to be frank, the same argument can be used for the Berkeley paper.
  3. Thought-leaders said the same. Andrew Ng's 2024 "Agentic Design Patterns" talks frame multi-agent collaboration as the pattern that "often beats one big agent" on hard problems.
  4. An analogy we intuitively get. Divide-and-conquer, role specialization, debate for error-catching - all map neatly onto LLM quirks (limited context, hallucinations, etc.). But just as with humans, coordination overhead grows exponentially with agent count.
  5. Early adopters were vocal. Start-ups demoed agent-teams creating slide decks, marketing campaigns, and code bases - with little visible human help - which looked like higher autonomy. Till reality’s ugly little details turn this bed of roses into a can of worms.

The Berkeley paper exposed these challenges, but it also pointed toward potential solutions.

Enter A2A: Plumbing made easy

Google's Agent-to-Agent (A2A) protocol arrived in April 2025 with backing from more than 50 launch partners, including Salesforce, Atlassian, LangChain, and MongoDB. While the spec is still a draft, the participation signal is strong: the industry finally has a common transport, discovery, and task-lifecycle layer for autonomous agents.

A2A directly targets 3 of the 14 failure modes identified in the Berkeley audit:

  1. Role/Specification Issues - By standardizing the agent registry, A2A creates clear declarations of capabilities, skills, and endpoints (see the sketch after this list). This addresses failures to follow task requirements and agent roles.
  2. Context Loss Problems - A2A's message chunking and streaming prevent critical information from being lost during handoffs.
  3. Communication Infrastructure - The HTTP + JSON-RPC/SSE protocol with a canonical task lifecycle provides consistent, reliable agent communication.
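
For a feel of the first point, here is an illustrative capability declaration - the kind of thing an agent registry standardizes so callers know a remote agent's role, skills, and endpoint before handing it work. This is a rough sketch, not the exact A2A schema.

```python
# Illustrative agent capability declaration (not the exact A2A schema): the
# registry entry tells callers what a remote agent does and where to reach it.
AGENT_CARD = {
    "name": "requirement-adjudicator",
    "description": "Reviews product requirements for clarity, completeness, and risk.",
    "url": "https://agents.example.com/adjudicator",  # hypothetical task endpoint
    "capabilities": {"streaming": True},
    "skills": [
        {"id": "prd-review", "description": "Critique a product requirement document"},
    ],
}
```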

Several vendors have begun piloting A2A implementations, with promising but still preliminary results. However, it's important to note that quantitative proof remains scarce. None of these vendors has released side-by-side benchmark tables yet—only directional statements or blog interviews. Google's own launch blog shows a candidate-sourcing workflow where one agent hires multiple remote agents, but provides no timing or accuracy metrics. I won’t include names here because I believe we have established earlier that vendor tests can be unpublished, cherry-picked and may have included proprietary orchestration code.

In other words, A2A is to agents what MCP is to tool use today. And just as supporting MCP doesn't make the code of your tools better, A2A doesn't make agents smarter.

What A2A Doesn't Solve

While something like A2A will fix the failures stemming from the lack of a protocol, it cannot fix issues that do not share the same root cause.

  • LLM-inherent limitations - Individual agent hallucinations, reasoning errors, and misunderstandings. Just LLM being LLM.
  • Verification gaps - The lack of formal verification procedures, cross-checking mechanisms, or voting systems (a toy voting sketch follows this list). Without verification, you're essentially trusting an AI that thinks 2+2=5 to check the math of another AI that thinks 2+2=banana.
  • Orchestration intelligence - The supervisory logic for retry strategies, error recovery, and termination decisions. Just the other day, Claude was downloading my blog posts, hit a token limit, tried to repeat, continued to hit the token limit, and looped forever.
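
As referenced in the verification point, here is a toy cross-checking sketch - majority voting over several models, escalating to a human when there is no quorum. ask_model is a hypothetical wrapper, and real verification needs far more than string matching.

```python
# Toy voting-based cross-check: ask several models the same question and keep
# the majority answer; no quorum means a human takes over. ask_model is a
# hypothetical wrapper around the vendor SDKs.
from collections import Counter

def vote(question: str, providers: list[str], quorum: float = 0.5) -> str | None:
    answers = [ask_model(p, question).strip().lower() for p in providers]
    best, count = Counter(answers).most_common(1)[0]
    return best if count / len(answers) > quorum else None  # None -> human review
```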

Those are 11 out of 14 failure modes. These areas require additional innovation beyond standardized communication protocols. Better verification agents, formal verification systems, or improved orchestration frameworks will be needed to address these challenges. I would love to go deeper into the innovation layers required to solve the “cognitive” failure of multi-agent models in another post.

Conclusion

Multi-agent systems are not the universal solution to today's problems (yet). Many are here and working, but only at a scale that no single-agent approach can reach. These multi-agent systems deliver benefits, but at a substantial cost, one that can easily exceed an organization's capacity to invest or to find people with the relevant expertise.

Rather than viewing multi-agent as the holy grail to be sought at all costs, we should approach it with careful consideration—an option to explore when simpler approaches fail. The claimed benefits rarely justify the implementation complexity for most everyday use cases, particularly when a well-designed single-agent system with appropriate guardrails can deliver acceptable results at a fraction of the engineering complexity.

Still, the conversation about multi-agent systems will always be around, with increasing frequency, especially in light of the arrival of protocols like A2A. MCP is still the hype for now; organizations are busy integrating MCP servers and writing their own before turning their attention to the next big thing. A2A could be that. Better agent alignment could also be that. Or it may take a new cognitive paradigm to improve agents' smartness. Things will take another 6 months to unfold, which is about the amount of time that has passed since the MCP announcement.

Either way, what's clear is that agentic systems still have a long way to go before they hit their theoretical performance plateau.

Sunday, April 20, 2025

Finally, a break

Hi guys,

The news was out last Friday. I am going to leave Parcel Perform. 

For a sabbatical leave. Between May and October.

Sorry for the gasp. You come here for drama.

This has been planned since last year. I originally planned to leave soon after the 2024 BFCM, after the horizontal scalability of the system became a solved problem. I would like to say permanently, but I have learned that nothing ever is - the scalability, not me leaving for good. But then there were such and such issues with our data source (if you know, you know), and AI took the industry by storm. So I stayed. Eventually though, I knew I needed this break.

I have been on this job for almost 10 years. The first line of code was written in October 2015. I thought I would be done and move on in 5 years! I have been around longer than most furniture in the office and gone through 4 office changes. A decade is indeed a long time. It is a wet blanket that dampens any excitement and buzz that comes out of the work. Things get repetitive. Except for system incidents: I have lost count of how many ways things can combine to blow up. I am thankful every morning I wake up to no new alert.

When I was 23, I was fired from my job and was unemployed for 6 months. More like unemployable. It could have been longer had my savings not gone dry. I wrote, read, cooked, rode, swam, organized events, and lived a different life I didn't know I should. It was the time I needed to recover from depression. It was the best time of my life. I want to experience that one more time.

Upon this news, I received some questions; the most common ones are below.

Why are you leaving now? Is something bad happening?

I am still the CTO of Parcel Perform, just on sabbatical leave. And on the contrary, I think this is a good window for me to take a break. The business is in its best shape since the painful layoff in 2022. We are positive about the future, we are actively expanding the team for the first time in 3 years. The Tech Hub, of which I am directly responsible, has demonstrated that in the face of unprecedented incidents, we are resilient, innovative, and get things done. With the multi-cluster architecture and other innovations, we won't face an existential scalability problem for a long time.

In the last 6 months, we have invested in incorporating GenAI into our product. I believe we have the right data, tech stack, and an experiment-focused process, though only time can tell. To be frank, all the fast-paced experiments we are doing, known internally as POC, reminded me of all the things I loved about working here in the early days. Ideas are discussed in a small circle, stretched on a whiteboard, implemented in less than a day, and repeated. It has been so fun recently that I got cold feet. Perhaps I shouldn't take this break yet. But I am not getting younger, I am getting married, and soon will start a family. I won't have time for myself in a long time. It has to be now.

What will you do during the break?

Oh wow, I'm gonna play so many computer games my brain rots. I have been an avid fan of Age of Empires II since the time there was only one kid in my neighborhood with a computer good enough to play the game. I am an average player, slow even, so perhaps we are looking at more losses than wins. But hey, it builds character.

I will host board game nights here and there. It's another long-lasting hobby of mine, and a perfect social glue for my group of friends. While I am at that, I probably want to up my mocktail game too. My friends are largely in my age bracket, so for the same reasons above, my ultimate goal is to have more quality time.

As far as dopamine goes, that's it. I am not planning for retirement after all. Can't afford that yet.

To be frank, the pretext of this break is that I want to work on my sleep, which has been less than ideal for a long time. I couldn't figure out a single thing that would improve my sleep, so it is gonna be a total revamp. Distance from work. White bed sheets. Sleep hygiene. Gotta catch 'em all.

I am probably still awake more than 12 hours a day though. I will be reading as much as I can, fiction, non-fiction, and whatnot. Real life is crazy these days; the distinction is getting vague. There are some long Murakami novels I want to go through. I find that reading his works in one go, or at least with a minimal pause, offers the ideal immersive experience.

What I read, I write. I hope I can find an interesting topic to write about every month. If you are keeping track, the last few days have been quite productive ;) I am starting my first sub-track, AI Crossroads, because that's how I feel these days: an important moment in my life, our lives, that I cannot afford to miss. I am excited and confused. I am sure somebody out there is feeling the same.

And I will pick up Robotics as a new hobby. As GenAI gets "solved", its reach will not stop at the virtual world. Robotics seems to be the next logical frontier where a new generation of autonomous devices crops up and occupies our world. 

Writing about these things, I'm already pumped!

Who will replace your role?

The good thing about cooking up this plan from last year is that I have had plenty of time to arrange for my departure. The level of disruption should be minimal. People won't notice when I am gone and when I am back. Or so I hope.

There isn't a simple 1 to 1 replacement. Parcel Perform is not getting a new CTO, and there will still be a job for me when I am back. Or so I hope.

As a CTO, my work comes in 3 buckets: feature delivery, technical excellence, and AI research.

Feature delivery is where we have the most robust structure. Over the years, we have managed to find and grow a Head of Engineering and two Engineering Managers. The tribe-squad structure is stable. We are getting exponentially better at cross-squad collaboration as well as reorganization. There is a handful of external support, ranging from QA and infrastructure to data and insight, to ensure the troops have the oomph they need to crush deliveries.

Technical excellence means making sure Parcel Perform's tech stack stays relevant for the next 5 years. This is an increasingly ambitious goal. Our tech stack is no longer a single web server. The team is growing. The world is moving faster. But we have 4 extremely bright Staff Engineers. They have each spent years at the organization, are widely regarded for the depth of their technical knowledge, and are definitely better than me on my best day in their fields of expertise. We have spent the last couple of months aligning their goals with the needs of Parcel Perform. The alignment is at its strongest point ever since we adopted a goal-oriented performance review system.

Lastly, AI research is the preparation of the organization for the AI future, across technologies, processes, and strategic values. While I will continue the research in my own time, there is now a dedicated AI team that has been made the center of Parcel Perform's AI movement. Despite the humble beginning, the team will double in size in the coming months and won't let us "go gentle into that good night" that is our post-apocalyptic lives with the AI overlords.

I think we are in good hands.

What will you do when you come back?

Honest answer, I don't know.

Also honest answer, I don't think it is gonna be the same as what I am doing today. Sure, some aspects are gonna be the same. 5 months isn't that long. Neither is it short. The organization will continue to grow and evolve to meet the demand of the market, and the gap I leave will be filled. When I am back, the organization will undoubtedly be different from what it is today. I will have to relearn how to operate effectively. I will need to identify the overlap between my interests, my abilities, and the needs of the new Parcel Perform.

Final honest answer, I am anxious for that future, and it is the best part.

AI is an insult to life itself. Or is it?

The Quote That's Suddenly Everywhere

"AI is an insult to life itself."

I only stumbled across this quote a couple of months ago, attributed to Hayao Miyazaki - the man behind the famed Studio Ghibli. Since then, I've noticed it's literally everywhere, particularly because, as we speak, there's a massive trend of people using ChatGPT and other image generators to create pictures in the distinctive Ghibli aesthetic. My social feeds are flooded with AI-crafted images of ordinary scenes transformed with that unmistakable Ghibli magic - the soft lighting, the whimsical elements, the characteristic designs, the stuff of my childhood. You know what, I am not good with this kind of words. Here is one from yours truly.

The trend has brought Mr Miyazaki's quote back into the spotlight, wielded as a battle cry against this specific type of AI creation. There's just one small problem - this quote is being horribly misused.

I got curious about the context (because apparently I have nothing better to do than fact-check AI quotes when I should be testing that MCP server), so I dug deeper. Turns out, Mr Miyazaki wasn't condemning all artificial intelligence. He was reacting to a specific demonstration in 2016 where researchers showed him an AI program that had created a disturbing, headless humanoid figure that crawled across the ground like something straight out of a horror movie. For God's sake, the animation reminded him of a disabled friend who couldn't move freely. Yeah, I quite agree, that was the stuff of visceral nightmare.

Src: https://www.youtube.com/watch?v=ngZ0K3lWKRc&t=3s

It's also worth noting that the AI Mr Miyazaki was shown in 2016 was primitive compared to today's models like ChatGPT, Claude, or Midjourney. We have no idea how he might react to the current generation of AI systems that can create stunningly convincing Ghibli-style imagery. His reaction to that zombie-like figure doesn't necessarily tell us what he'd think about today's much more advanced and coherent AI creations. Yet the quote lives on, stripped of this crucial context, repurposed as a condescending, blanket condemnation of all generative AI.

The Eerie Valley of AI Art

Here's where it gets complicated for me. When I look at these AI-generated Ghibli scenes, they instantly evoke powerful emotions - nostalgia, wonder, warmth - all the feelings I've associated with films like "Spirited Away" or "Princess Mononoke" over years of watching them (for what it's worth, not a big fan of Totoro, it's ok). The visual language of Ghibli taps directly into something deep and meaningful in my experience.

That is what art does. That is what magic does. But this is not that, is it? These mass-produced imitations feel like they're borrowing those emotions without earning them. I feel an unsettling hollowness in the "art" - like hearing your mother's voice coming from a stranger. The signal is correct, but the source feels wrong.

I'm confronted with a puzzling contradiction: if a human artist were to draw in the Ghibli style (and many talented illustrators do), I wouldn't feel nearly the same unease. Fan art is celebrated, artistic influence is natural, and learning by imitation is as old as art itself. So why does the AI version feel different?

So while the quote is misused, I wonder if Mr Miyazaki's statement might contain an insight that applies here too. These AI creations, in their skillful but soulless imitation, do feel like a kind of insult, not to life itself perhaps, but to the deep human relationship between artist, art, and audience that developed organically over decades.

The Other Side

As a human, I consume AI features. Yet as an engineer, I build AI features. And there is another side to this.

You probably have heard it. Every time on the Internet there is a complaint that the US' gun culture is a nut case, there is a voice from a dark and forgotten 4chan corner screaming back "A gun is just a tool. A gun does not kill people. People do. And if you take the gun away, they will find something else anyway".

But this argument increasingly fails to capture the reality of AI image generators. These systems aren't neutral tools - they've been trained on massive datasets of human art, often without explicit permission from the artists. When I prompt an AI to create "art in Ghibli style," I'm not merely using a neutral tool - I'm activating a complex system that has analyzed and learned from thousands of frames created by Studio Ghibli artists.

This is fundamentally different from a human artist studying and being influenced by Mr Miyazaki's work. The human artist brings their own lived experience, makes conscious choices about what to incorporate, and adds their unique perspective. The AI system statistically aggregates patterns at a scale no human could match, without discernment, attribution, or compensation.

I've built enough software systems to know that complexity breeds emergence. When algorithms make thousands or millions of decisions across vast datasets, the traditional model of direct human control becomes more of a fiction than a reality. You can't just look at the code and know exactly what it will do in every situation. Trust me, I've tried to "study" deep learning.

Perhaps most significantly, as these systems advance, the distance between the creator's intentions and the system's outputs grows. The developers at OpenAI didn't specifically write code that says "here's exactly how to draw a flattering image of a dude taking note on a motorcycle" - they created a system that learned from millions of images, and now it can generate Ghibli-style art that no human specifically programmed it to make. These AI systems develop abilities their creators didn't directly put there and often can't fully predict. This expanding gap between intention and outcome makes the "tools are neutral" argument increasingly unsatisfying.

This isn't to say humans have lost control entirely. Through system design, regulation, and deployment choices, we retain significant influence. But the "tools are neutral" framing no longer adequately captures the complex, bidirectional relationship between humans and increasingly sophisticated AI.

Why We Can't Resist Oversimplification

So far, there are two camps. The "AI is an insult" reflects people whose work and lives are negatively impacted. And the "Tools are neutral" defends AI creators. I tried, but I am sure I have done a less-than-stellar job capturing the thought processes from both camps. Still, as poorly as it is, it feels fairly complex. This complexity is exactly why we humans lean toward simplified narratives like "AI is an insult to life itself" or "It's just a tool like any other." The reality is messy, contradictory, and doesn't fit neatly into either camp.

Humans are notoriously lazy thinkers. I know I am. Give me a simple explanation over a complex one any day of the week. My brain has enough to worry about with keeping our production systems alive. I've reached the point where I celebrate every morning that the #system-alert-critical channel has no new message.

This pattern repeats throughout history. Complex truths get routinely reduced to easily digestible (and often wrong) summaries. Darwin's nuanced theory of evolution became "survival of the fittest." Einstein's revolutionary relativity equations became "everything is relative." Nietzsche's exploration of morality became "God is dead." In each case, profound ideas were flattened into bumper sticker slogans that lost the original meaning. Make good YouTube thumbnails though.

This happens because complexity requires effort. Our brains, evolved for quick decision-making in simpler environments (like not getting eaten by tigers), naturally gravitate toward cognitive shortcuts. A single authoritative quote from someone like Mr Miyazaki provides an easy way to validate existing beliefs without engaging with the messier reality.

There's also power in simple narratives. "AI threatens human creativity" creates a clear villain and a straightforward moral framework. It's far more emotionally satisfying than grappling with the ambiguous benefits and risks of a transformative technology. I get it - it's much easier to be either terrified of AI or blindly optimistic about it than to sit with the uncertainty.

I am afraid that in the coming weeks and months, we cannot afford such simplification.

The choice we have to make

Young technologists today (myself included) find ourselves in an extraordinary position. We're both consumers of AI tools created by others and creators of systems that will be used by countless others. We stand at the edge of perhaps the most transformative wave of innovation in human history, with the collective power to influence how this technology shapes our future and the future of our children. FWIW, I don't have a child yet, I like to think I will.

The questions raised by AI-generated Ghibli art - about originality, attribution, the value of human craft, the economics of creation - aren't going away. They'll only become more urgent as these systems improve and proliferate.

The longer I work in tech, the more I realize that the most important innovations aren't purely technical - they're sociotechnical. Building AI systems that benefit humanity requires more than clever algorithms; it requires thoughtful consideration of how these systems integrate with human values and creative traditions.

For those of us in this pivotal position, neither absolute rejection nor blind embrace provides adequate guidance. We will need to navigate through this, hopefully with better clarity than Christopher Columbus when he "lost" his way to discovering America. My CEO made me read AI-2027 - there is a scenario where humans fail to align AI superintelligence and get wiped out. Brave new world.

1. Embrace Intentional Design and Shared Responsibility

We need to be deliberate about what values and constraints we build into creative AI systems, considering not just what they can do, but what they should do. This might mean designing systems that explicitly credit their influences, or that direct compensation to original creators whose styles have been learned.

When my team started writing our first agent, we entirely focused on what was technically possible. Is this an agent or a workflow? Is this a tool call or a node in the graph? Long context or knowledge base? I know, technical gibberish. The point is, we will soon evolve past that learning curve, and what comes next is thinking through the implications.

2. Prioritize Augmentation Over Replacement

The most valuable AI applications enhance human creativity rather than simply mimicking or replacing it. We should seek opportunities to create tools that make artists more capable, not less necessary.

When I see the flood of AI-generated Ghibli art, I wonder if we're asking the right questions. The most exciting creative AI tools don't just imitate existing styles - they help artists discover new possibilities they wouldn't have found otherwise. The difference between a tool that helps you create and one that creates instead of you may seem subtle, but it's profound. 

I have been lucky enough to be part of meetings where the goal of AI agents is to free the human colleagues from boring, repetitive tasks. I sure hope that trajectory continues. Technology should serve human values, not the other way around.

3. Ensure Diverse Perspectives and Continuous Assessment

The perspectives that inform both the creation and governance of AI systems should reflect the diversity of populations affected by them. This is especially true for creative AI, where cultural context and artistic traditions vary enormously across different communities.

It's so easy to build for people in my immediate circle and call it a day. As an Asian, I see how AI systems trained predominantly on Western datasets create a distorted view of creativity and culture. Without clarification, a genAI model would assume I am a white male living in the US. Bias, prejudice, stereotype. We have seen this.

Finding My Way in the AI Landscape

The reality of our relationship with AI is beyond simple characterization. It is neither an existential threat to human creativity nor a neutral tool entirely under our control. It represents something new - a technology with growing capabilities that both reflects and reshapes our creative traditions.

Those Ghibli-style images generated by AI leave me with mixed feelings that I'm still sorting through. On one hand, I'm amazed by the technical achievement and can't deny the genuine emotions they evoke. On the other hand, I feel I am being conditioned to feel that way.

Perhaps this ambivalence is exactly where we need to be right now - neither rejecting the technology outright nor embracing it uncritically, but sitting with the discomfort of its complexity while we figure out how to move forward thoughtfully.

For our generation that will guide AI's development, the challenge is to move beyond reductive arguments. Neither blind techno-optimism nor reflexive technophobia will serve us well. Instead, we need the wisdom to recognize both the extraordinary potential and the legitimate concerns about these systems, and the courage to chart a course that honors what makes human creativity valuable in the first place.


This post was written with assistance from Claude, which felt a bit meta given the topic. They suggested a lot of the structure, but all the half-baked jokes and questionable analogies are mine alone. And it still took me a beautiful Saturday to pull everything together in my style.


Sunday, April 6, 2025

MCP vs Agent

On 26th Nov 2024, Anthropic introduced the world to MCP - the Model Context Protocol. Four months later, OpenAI announced the adoption of MCP across its products, making the protocol the de facto standard of the industry. Put another way: a few months ago, we were figuring out when something is a workflow and when it is an agent (and we still are). Today, the question is how much MCP Kool-Aid we should drink.

What is MCP?

MCP is a standard for connecting AI models to external data sources and tools. It allows models to integrate with external systems independently of platform and implementation. Before MCP, there was tool use, but the integration was platform-specific, be it LangChain, CrewAI, LlamaIndex, or whatnot. An MCP Postgres server, however, works with all platforms and applications supporting the protocol. MCP is to AI what HTTP is to the internet. I think so; I was a baby when HTTP was invented.
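
For a feel of how small the server side can be, here is a minimal sketch using the FastMCP helper from the official Python SDK (the mcp package); the tool itself is a made-up example.

```python
# Minimal MCP server sketch using the official Python SDK (package: mcp).
# The tool below is a made-up example; any MCP-capable client can discover and
# call it, regardless of which agent framework or app sits on top.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("parcel-tools")

@mcp.tool()
def track_parcel(tracking_number: str) -> str:
    """Return the latest status for a parcel (stubbed for illustration)."""
    return f"Parcel {tracking_number}: in transit"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, suitable for desktop clients
```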

I won't attempt to paraphrase the components of MCP; modelcontextprotocol.io is dedicated to that. Here is a quick screenshot.

If you have invested extensively in tool use, the rise of MCP doesn't necessarily mean that your system is obsolete. Mind you, tool use is probably still the most popular integration method out there today, and can be made MCP-compatible with a wrapper layer. Here is one from LangChain.

A system with both tool use and MCP looks something like this.

For a 4-month-old protocol, MCP is extremely promising, yet far from the silver bullet for all AI integration problems. The most noticeable problems MCP has not resolved are:

  • Installation. MCP servers run on local machines and are meant for desktop use. While this is okay-ish for professional users, especially developers, it is unsuitable for casual use cases such as customer support.
  • Security. This post describes "tool poisoning" and it is definitely not the last.
  • Authorization. MCP's OAuth 2.1 adoption is a work in progress. OAuth 2.1 itself is still a draft.
  • Official MCP Registry. As far as I know, there are attempts to collect MCP servers, but they are all ad-hoc and incomplete, like this and this. The quality is hit and miss; official MCP servers tend to fare better than community ones.
  • Lack of features. Streaming support, stateless connection, proactive server behavior, to name a few.
None of these problems indicates a fundamental flaw of MCP and in time all will be fixed. As with any emerging technology, it is good to be optimistically cautious when dealing with these hiccups.

MCP in SaaS

I work at a SaaS company, building applications with AI interfaces. I was initially confused by the distinction between agent and MCP. Comparing agents to MCP is apples to oranges, I know; MCP is more on par with tool use (on steroids). Yet the comparison makes sense in this context: if I only have so many hours to build assistance features for our Customer Support staff, should I build a proprietary agent or an MCP server connecting to, say, the Claude Desktop App, assuming both get the work done? Perhaps at this point, it is also worth noting that I believe office workers in the future will spend as much time on AI clients like Claude or ChatGPT as they do on browsers today. If not more. Because the AI overlord doesn't sleep and neither will you!

Though I ranted about today's limitation of MCP, the advantages are appealing. Established MCP clients, like Claude Desktop App, handle gnarly engineering challenges such as output token pagination, human-in-the-loop, or rich content processing with satisfying UX. More importantly, a user can install multiple MCP servers on their device, both in-house and open-sourced, which opens up various workflow automation possibilities.

When I built agents, I noticed that a considerable amount of time went to plumbing - overhead tasks required to get an agent up and running. It couldn't be overcome with better boilerplate code, but still... An agent is also a closed environment where any non-trivial (and sometimes trivial) change requires code modification and deployment. The usability is limited to what the agent's builder provides. And the promise of multi-agent collaboration, even though compelling, has not quite been delivered yet. Finally, an agent is only as smart as the system prompts it was given. Bad prompts, like the ones made by yours truly, can make an agent perform far worse than the Claude Desktop App.

Then why are agents not yesterday's news yet? As far as I know, implementing an agent is the only method that provides full behavior control. Tool calls (even via MCP) in an agent are deterministic, and everything related to agent reasoning can be logged and monitored. The Claude Desktop App, as awesome as it is, offers little explanation of why it does what it does (though that too may change in the future). Drinking too much MCP Kool-Aid could also mean giving too much control of the system to third-party components - the MCP clients - which can become an existential threat.

Conclusion

The rise of MCP is certain. Just as MCP clients might replace browsers as the most important productivity tool, at some point, vendors might cease asking for APIs and seek MCP integration instead. Yet it will not be a binary choice between MCP servers and agents. Each offers distinct advantages for different use cases.
  • MCP allows orchestrating a large number of tools, making it suitable for internal dogfooding and fast exploration. MCP would also see more adoption among professional users than casual ones.
  • Agents, being more controlled, tamed, and preferably well tested, would be the default choice for customer-facing applications.
Within a year, we'll likely see hybrid systems where MCP and agent-based approaches coexist. Of course, innovations such as AGI, or MCP growing beyond its initial local, desktop use, could change this balance. There is no point in predicting the long-term future at this point.

Tuesday, February 25, 2025

The Book Series - Management 3.0



* Management 1.0 is traditional top-down decision-making.
* Management 2.0 improves upon traditional management with agile principles adoption and continuous improvement.

That, ladies and gentlemen, is the summary I never remember when I think of this book. 

I love this book. Jurgen Appelo is a great speaker and an even greater writer. He is characteristically blunt, though he would probably prefer "Dutch frankness" instead. One of the chapters goes along the lines of: a manager delegating to his team members is not an act of altruism; he does it so that he has the time to do the things he wants to - and then proceeds to retreat to his ivory tower. He tells jokes with an emotionless straight face. My kind of joke.

Management 3.0 laid out a complete framework to revamp leadership. There are 5-6 principles built around other theories - complexity, games, and networks. There is more to it than that half-assed summary; the details are just a Google away if you are interested. What makes this book unique to me is actually the discussion of management theories before diving into good/bad practices. Each principle is split into two chapters, one on the theory and the other on the execution. Other works in this genre tend to be more dogmatic, if I may say so.
 
I am no expert in complexity science. There is a non-zero chance all this theory inclusion is pseudo-science. But I like the attempt, and pseudo or not, there is a framework to cling to when in doubt. And there will be a lot of doubts. Management-books-praise-human-ability-to-self-organize-but-that-doesn't-seem-to-cover-my-staff-inertia-to-stay-put kind of doubt. To be frank, the stark contrast between fiction and non-fiction that typically comes with management literature exists here as well. There is no unicorn.

Management 3.0 has exploded in popularity since its debut a decade ago, though. There are training courses, consultant programs, and god damned certifications. It is going down the same dark, commercial, well-trodden path that Agile has gone down before. And that dented my heart a bit. But hey, one can only retreat to his ivory tower with a fat bank account.

I still think you should give it a try though.

Sunday, February 23, 2025

The sun has set on MultiUni

Facebook will shut down the MultiUni page in a few days due to inactivity. That is generous because I don't think anything of value was posted to the page in the last decade. MultiUni ceased to exist long before this day, like my social life.

MultiUni was founded in 2009 by Huy Zing. Huy was my lecturer at RMIT. He understood the hurdles one had to jump through to open a course at an institution like a university, because he had never succeeded in doing so. And he understood that formal education was not suitable for everyone, because he was more of a software engineer than a lecturer. I assisted in a couple of courses. I have never seen a mission statement for MultiUni, but if there was one, it probably read like this: to bring the latest and most important technology courses to as many people in the local community as possible. We did exclusively in-person classes. We also did it for free. Pretty sure that was why I ended up writing this sunset post.

Free in-person tech courses. That might sound like a strange idea today, but the world was also stranger back then. Open courses from the likes of MIT, Stanford, or Harvard were still in their infancy. Coursera was not founded until 3 years later, in 2012. English was a bigger barrier than it is today. Not everyone was a social media native; I remember we had to discuss whether MultiUni on Facebook, yeah the one being shut down, should be a page or a group.

The first course was #iphonedev. The App Store had debuted the year before, and mobile app development was all the rage. There was no iOS, only iPhone OS. There was no Swift, only Objective-C. There were few Macs, but plenty of hackintosh wannabes. Huy taught this class himself. Phu and I were his assistants. We would read the material the week before, do the exercises, internalize them, and bring them to the class the week after. Well, at least ideally. There were times we started reading the material on the morning of the evening class. I met Kien in that class, who eventually ended up becoming a mobile developer for no less than Google Singapore.

The second course was probably an Android one. It felt like a compulsory follow-up to the iPhone course. The Yin and the Yang. Probably, because I wasn't part of that class. Huy met Binh online - Binh, I was told, was among the first self-taught Android developers in Saigon in those days - and convinced him that a community course was the next best thing he could do. Binh went on to hit many high notes in his career; Huy was probably right.

Then my memory got blurry. School got crazier. I ran my own club. Huy returned to the US at some point. Between here and there, there might have been a couple more courses, but I couldn't really tell. In 2012, Phu and I (and many others) graduated, and we both worked at Cogini Inc. We thought it would be cool to bring the courses back, so we made Web Tech 141. It was a 5-week course where each week we went through one web framework in a different programming language. PHP. Java. Ruby. Python. I couldn't recall the last one for the life of me. Must be something not great. Erlang perhaps? I met An in that class; he is in the US now. And that was the last of it. 2012 was the year of Coursera, Khan Academy, and many more. Virtually every university worth its salt published its courses' recordings and materials online. MultiUni sits in a corner of my profile to this day, though I don't know how many still remember it.

The timing of this sunset is uncanny. Mobile development took the world by storm. It indeed changed the world in many ways, better than Blockchain ever would. Today GenAI is making a storm 10x the size. Vietnam in the late 2000s had a shortage of training materials and exposure. People shared what they knew more... primitively, for lack of a better word. What I mean is that the organization was loose, and the presentation was dull even by the standard back then. Yet there was no hidden agenda; there were definitely personal interests, but rarely a monetary one, and the attempts to spread knowledge seemed genuine. Today, if someone puts a tutorial or a course online for free, you had better check whether it is there to hype up a particular technology, an attempt to demonstrate the influence of a person in his/her little bubble of the Internet, or a demo version of consultancy or paid courses. Speaking of paid courses, MOOCs enable training on virtually any topic one can think of, good or bad. In fact, because it is easier to make bad content than good, MOOCs are diamonds mixed in with land mines. There are more hidden agendas. Or it could be that the world has always been like that, the people have always been like that, and I was just naive. I wouldn't disagree; I was technically a teenager in the first MultiUni class, what did I know. And I would never know. No one bathes in the same river twice. MultiUni is an obsolete invention, a stepping stone no longer needed by its environment, and a cache of fond memories in my coming-of-age journey.

I dug this logo up from a slide. That is everything I have left. 

P/S: introducing Huy as one of my lecturers doesn't do him justice. Huy taught me a single course out of the 24 or so courses at RMIT. But he gave me my first job, exposed me and a bunch of others of my generation to the infectious Silicon Valley startup mindset, and started MultiUni and Barcamp Saigon, both of which I ran for a while. Who I am today is the realization of the vision Huy laid out in those early days. For this, I am forever grateful.