Monday, July 7, 2025

AI's Impact on Developer Productivity

May was an exciting month for Tech. There was Google I/O, where Google Glass tried to make a comeback. There was Microsoft Build, which, to be honest, I watched for the first time in a while. I am keeping an eye on NLWeb. I almost missed LangChain Interrupt. But my favorite one is Code With Claude, Anthropic’s first developer conference:

  • Claude Opus 4 and Sonnet 4 were released.
  • Claude Code became available in VS Code (and its forks) and JetBrains IDEs.

Essentially all IDEs out there now have access to the same state-of-the-art model and coding agent. So what does this mean for us developers?

Let’s examine the code generation landscape.

Levels of use case complexity

The field of automatic code generation is exploding. Take Cursor, for example: its features include Tab, Ctrl+K, Chat, and Agent. They all generate code in some shape or form. They also serve vastly different use cases, to the extent that it is really awkward to use one feature for another’s use case. The abundance of variation means “developers who use AI are more productive” makes as much sense as announcing “mathematicians with calculators are better than those without.”

Aki Ranin made a framework to categorize the sophistication of AI agents we’ll be interacting with.

An agent starts with a low level of autonomy. It plays a reactive role, responding to human requests. It gradually becomes more active, responds to system events, and might or might not require a human supervisor. The number of actions it can take and the creativity of its solutions are still rather limited. Finally, the agent takes on a human-like role and, within its boundaries, can handle a task end-to-end.

Mapping that to the features of Cursor and other AI-assisted IDEs, I categorize AI support for developers into 4 levels.


Level 1: Foundational Code Assistance: Characterized by real-time, localized suggestions with minimal immediate context. The interaction is primarily reactive: the developer types, the AI suggests, and the developer accepts, rejects, or ignores the suggestion. Autonomy is low, relying heavily on pattern matching.

Level 2: Contextual Code Composition and Understanding: AI tools at this level utilize broader file or local project context and engage in more interactive exchanges. They can generate larger code blocks, such as functions or classes, and perform basic code understanding tasks. Developers typically provide prompts, comments, or select code for AI action.

Level 3: Advanced Co-Development and Workflow Automation: These AI systems exhibit deep codebase awareness, potentially including multi-file understanding. They can automate more complex tasks within the Software Development Life Cycle, assisting in intricate decision-making. The developer delegates specific, bounded tasks to the AI.

Level 4: Sophisticated AI Coding Agents and Autonomous Systems: This level represents high AI autonomy, including the ability to plan and execute multi-step tasks towards end-to-end completion. These systems can interact with external tools and environments, requiring minimal oversight for defined goals. The developer defines high-level objectives or complex tasks, which the AI agent then plans and executes, with the developer primarily reviewing and intervening as necessary.

Impact on Developer Productivity

Measuring developer productivity is a notoriously tricky topic. Take all the metrics below with a heavy grain of salt.

The bright side

Level 1: Foundational Code Assistance

Most developers actually like this level: it handles the boring stuff without getting in the way. Many find these tools "extremely useful" for such scenarios, appreciating the reduction in manual typing.

GitHub, in its own paper "Measuring the impact of GitHub Copilot", boasts a 55% faster task completion rate when using "predictive text". GitHub obviously has the incentive to be a bit liberal on how this was measured. It's like asking a barber if you need a haircut. Independent studies suggest the real number is more 'your mileage may vary' than 'rocket ship to productivity paradise.' Research from Zoominfo and Eleks says the number in practice is closer to 10-15%.

Level 2: Contextual Code Composition and Understanding

At this level, the AI assistant not only generates larger code blocks than Level 1, it also serves as a utility for learning and code comprehension.

As AI generates larger and more complex code blocks, the perception is also more mixed. Typical complaints are inconsistencies in output quality and almost-correct code that ultimately takes more time to modify than writing new code. It's the uncanny valley of code generation: close enough to look right, wrong enough to ruin your afternoon. Code comprehension, fortunately, enjoys more universally positive feedback for being valuable in grasping unfamiliar code segments or new programming concepts.

On one hand, IBM's internal testing of Watsonx Code Assistant projected substantial time savings: 90% on code explanation tasks and a 38% time reduction in code generation and testing activities. On the other hand, a study focusing on Copilot found that only 28.7% of its suggestions for resolving coding issues were entirely correct, with 51.2% being somewhat correct and 20.1% being erroneous. This is what we continue to observe as the level of autonomy increases. Welcome to the future. It's complicated.

Level 3: Advanced Co-Development and Workflow Automation

This level of AI assistance is characterized by a multi-step thinking process and multi-file context. At this point, AI starts feeling less like a tool and more like that overachieving colleague who reorganizes the entire codebase while you're at lunch. Helpful? Yes. Slightly terrifying? Also yes.

Though we continue to see the correlation between high autonomy and high failure rate as we do in Level 2, Level 3 is where we start to see a new class of AI-first projects. These are projects planned specifically to incorporate AI capabilities into their development life cycle. For example: API convention enforcement in CI/CD, automated test generation, and feature customization within distinct bounded contexts. There is a clear appreciation for the automation of time-consuming tasks.
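To make the CI/CD example concrete, here is a minimal sketch of such a convention gate, assuming a hypothetical review_diff() helper in place of whatever LLM client you actually use; only the git plumbing is standard.

```python
# A minimal sketch of an AI convention check in CI. review_diff() is a
# hypothetical stand-in for your LLM provider of choice.
import subprocess
import sys

def review_diff(diff: str) -> list[str]:
    """Placeholder: send the diff plus your API conventions to an LLM and
    return a list of violations (an empty list means the gate passes)."""
    raise NotImplementedError("wire this to your LLM provider")

def main() -> int:
    diff = subprocess.run(
        ["git", "diff", "origin/main...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    violations = review_diff(diff)
    for v in violations:
        print(f"convention violation: {v}")
    return 1 if violations else 0   # non-zero exit fails the pipeline

if __name__ == "__main__":
    sys.exit(main())
```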

Given the wide range of AI-assisted workflows, concrete productivity gains are harder to find in research. However, the adoption of Level 3 AI capabilities is undoubtedly growing. The 2024 Stack Overflow Developer Survey revealed that developers anticipate AI tools will become increasingly integrated into processes like documenting code (81% of respondents) and testing code (80%). GitHub's announcement of Copilot Code Review highlighted that over 1 million developers had already used the feature during its public preview phase. Level 3 resonates well with the "shift left" paradigm in software development.

Level 4: Sophisticated AI Coding Agents and Autonomous Systems

Level 4 is the holy grail of autonomous AI agents. Claude Code is a CLI tool: you chat with it, but you don’t code with it. Devin goes a step further: you chat with Devin through Slack.

There is an overall excitement about the future of software development with the arrival of these autonomous agents. However, Level 4 agents are still in their early days. Independent testing by researchers at Answer.AI painted a more sobering picture: Devin reportedly completed only 3 out of 20 assigned real-world tasks.

Some other Level 4 agents demonstrate impressive capabilities on benchmarks like SWE-Bench. Claude Opus 4 got 72.5%, which is probably higher than mine. Yet their application to real-world complex software development tasks reveals a significant "last mile" problem. Outside of the controlled environments, these agents often struggle with ambiguity, unforeseen edge cases, and tasks that require deep, nuanced human-like reasoning or interaction with poorly documented, unstable, or unpredictable external systems.


The other side

I wouldn’t necessarily refer to this section as “the dark side”. I don’t think the AI future is apocalyptic. However, there is plenty of evidence that the integration of AI into the fabric of software engineering is not rosy. You probably have heard about Klarna.

Running a team of developers, I see another challenge in utilizing AI for productivity gains: pushback from developers.

Poor code quality

GitClear tore apart GitHub’s own “55% faster” study and traced large spikes in bugs, rewrites, and copy-pasted blocks associated with automatic code generation. The study projected that "code churn", the percentage of code discarded less than two weeks after being written, would double, suggesting that AI-generated code often requires substantial revisions.

For my team, we can only use around 40-50% of the generated code. Though of course there is a matter of better prompts and context, we see that some code blocks are sub-optimal or almost correct. Sub-optimal code is still functional, but unlikely to make it past code review. Almost-correct code is far worse. Rewriting a piece of almost-correct code takes longer than writing it from scratch. If it sneaks past code review, it becomes a production bug.

While AI is writing code at superhuman speed, it is mounting tech debt just as fast.

Over-reliance and potential deskilling

Beyond code quality, reliance on AI outputs can diminish cognitive engagement, erode essential problem-solving skills, and weaken the deep understanding of core coding principles. As one developer put it, "writing code yourself remains important because writing code is not just writing code: it's an organic process, which familiarizes you with the language, which gives you the philosophy of the language". Failure to develop this intimate knowledge of the languages and system designs will hinder career development.

Furthermore, many developers reported that they shifted their focus from creative problem-solving to merely verifying AI outputs. The thing is, many got into programming not because of the software business but because they found the act of programming a fulfilling experience. They don’t just focus on shipping the code, they also enjoy the journey of getting there. Downgrading that experience to babysitting an AI agent deprives them of job satisfaction.

“Vibe coding” was coined in February 2025. I don’t think there has been enough time for people to lose their programming skills, but I can confirm that the underlying fear of becoming "passive participants" and eventually redundant in the coding process prevents some from fully embracing the advantages of AI.

Context limitations

This last challenge is not technosocial, it is purely technical. Today's AI coding assistants, while powerful, face significant context limitations that deter full adoption in software development. Their finite context windows mean they're constantly forgetting what you told them five minutes ago. It's like trying to build a house with a contractor who has amnesia: every morning you have to re-explain why the bathroom shouldn't be in the kitchen. These systems struggle with large codebases or numerous interconnected modules, and they get confused as the number of tools and MCPs increases. Progress.

Look, managing all this context is a pain in the ass. You're constantly refining prompts and working around these inherent technological drawbacks. Consequently, some prefer to await more mature AI solutions that can seamlessly handle large-scale context and maintain conversational memory, rather than investing significant effort in mastering the current generation's limitations.

The Pragmatic Path Forward

Great things may come to those who wait, but only the things left by those who hustle.

LLM is the closest we have ever gotten to AGI. It is coming not like a wave but a tsunami. It will sweep away people trying to go against the current of the age. And for that, we must not wait. Skills and experiences come with being in the field, getting to know all the moving parts, and being ready for the latest additions.

There are the usual sayings of “treating AI like a junior pair programmer” and “learning the art of writing clear, concise, and context-rich prompts”, which I think are definitely important. But they further emphasize the notion that the future of software development is a barren land, hyper-focused on productivity and deprived of creative joy. But what is work without the enjoyment?

Build your AI crew

Programming is changing, and with it, developers like me need to adapt. I got into programming because I like to build things, not get things built (there is a slight difference there on which my existence depends). I like to get into the zone for complex problem-solving, (criticizing) system architecture, and figuring out what new technologies mean for the business. But I also have to admit that after 15 years, I dread building the next login page, HTML email template, or coding standard. The Goldilocks zone for me has always been to figure out a way to keep the work I like for myself and push everything else away, to other humans or machines alike.

Delegating to junior developers means defining well-scoped tickets, giving them the hands-on experience (plus the chance to fail), and growing their technical skills. It takes multiple AI agents to be remotely comparable to a single junior developer. Their context is also limited, not enough to handle an end-to-end task. Delegating to AI agents, therefore, means tinkering with a multi-agent setup where each concrete task, such as analysing requirements, defining relevant unit tests, or implementing a piece of business logic, is distributed to different agents communicating via some intermediate medium like a document, a spreadsheet, or the code base itself. In a demo at Code With Claude 2025, one promoted workflow was to have the AI implement a mock UI (mock.png), then give it Puppeteer to take screenshots, and ask it to iterate until pixel-perfect. Once I got the gist of it, it is actually less delegating and more system design, building a platform to automate the boring work away.
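For illustration, here is a minimal sketch of that screenshot-and-iterate loop. It is not the demo's actual code: the agent client, the screenshot helper, and the pixel-diff threshold are all hypothetical placeholders you would wire to your own tooling.

```python
# A minimal sketch of the "iterate until pixel-perfect" loop.
# All three helpers are hypothetical placeholders, not the demo's code.
from pathlib import Path

def ask_agent(prompt: str) -> str:
    """Placeholder for a call to a coding agent (SDK, CLI, or MCP client)."""
    raise NotImplementedError("wire this to your agent of choice")

def screenshot(url: str, out: Path) -> Path:
    """Placeholder for a Puppeteer/Playwright screenshot of the rendered UI."""
    raise NotImplementedError("wire this to a headless browser")

def pixel_diff(a: Path, b: Path) -> float:
    """Placeholder for an image diff returning the fraction of differing pixels."""
    raise NotImplementedError("use an image library such as Pillow")

def iterate_until_pixel_perfect(mock: Path, url: str,
                                threshold: float = 0.01, max_rounds: int = 5) -> None:
    ask_agent(f"Implement the UI shown in {mock} as a web page served at {url}.")
    for round_no in range(max_rounds):
        shot = screenshot(url, Path(f"round_{round_no}.png"))
        diff = pixel_diff(mock, shot)
        if diff <= threshold:          # close enough to the mock: stop iterating
            break
        ask_agent(
            f"The rendered page differs from {mock} on {diff:.1%} of pixels "
            f"(screenshot: {shot}). Adjust the implementation and try again."
        )
```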

AI-first projects

I find that sometimes we give AI the most gnarly bits of the system to work on, bits that we don’t even want to look at ourselves, get disappointed at the mess, and declare AI is not ready yet. The last part is correct, they are junior, the eldest is around 3 in human years. Think about what most 3-year-olds do: they put things in their mouth, draw on walls, and occasionally produce something brilliant by pure accident. Sound familiar?

Greenfield projects, where the code base is still small and AI-friendly design principles (such as modularity, clear APIs, good documentation) are easy to enforce, play to the strengths of AI and negate some of its most significant drawbacks.

Some projects take it a step further. From their inception, certain parts of the software were destined to be written by AI.

The human code orchestrates the service's overall behavior and is kept segregated from the AI code. Think of the strategy or decorator design pattern on steroids.

The AI code is heavily modularized, has a lower standard of code quality, and can be rewritten (or rather, regenerated) at a moment's notice.

Requirements and tests are emphasized because they are the primary deliverable, not the expendable code. Their creation, in turn, can also be AI-assisted as part of the AI crew concept.
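A minimal sketch of that segregation, with hypothetical module and class names: the human-owned orchestration code depends only on a small contract, and the AI-generated implementation behind it can be regenerated wholesale.

```python
# A minimal sketch of keeping human-written orchestration segregated from
# regenerable AI-written code. Names are illustrative assumptions.
from typing import Protocol

class EmailTemplateRenderer(Protocol):
    """Contract owned by humans; implementations may be AI-generated."""
    def render(self, order: dict) -> str: ...

class AiGeneratedRenderer:
    """Lives in an AI-generated module; can be regenerated at a moment's notice."""
    def render(self, order: dict) -> str:
        return f"<html><body>Order {order['id']} is on its way!</body></html>"

def send_shipping_update(order: dict, renderer: EmailTemplateRenderer) -> None:
    # Human-written orchestration: validation, routing, delivery.
    if "id" not in order:
        raise ValueError("order must have an id")
    html = renderer.render(order)
    print(f"sending email for order {order['id']}: {len(html)} bytes")

if __name__ == "__main__":
    send_shipping_update({"id": "A123"}, AiGeneratedRenderer())
```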

We do a fair share of custom HTML email templates for our customers. Crafting HTML and CSS is not the most thought-provoking activity, and the constant back-and-forth checks to support pixel-perfect design across browsers and email clients are nerve-wracking. Needless to say, turning that module into an AI-first project was enthusiastically supported.

Brownfield projects

Yet not every project can be an AI-first project. Despite constantly adding new services and breaking down old ones, I reckon many of the critical code blocks reside in a handful of projects that have been around since the beginning of time and are lovingly referred to as legacy code.

These projects are challenging even to the best human developers. A single feature might span dozens of files, each with its own decade-old conventions. Every developer who touched it left their mark. Now it's a Frankenstein’s monster of coding styles. And AI output is a function of its input.

We can adopt an incremental approach ("Strangler Fig" Pattern):
  • Before code modification, give the system’s documentation an overhaul. AI can generate initial drafts of documentation and summaries. The effort can be complemented with interviews of long-tenured developers to capture "tribal knowledge" - AI transcription is particularly helpful. Finish up with manual validation and refinement. A happy side effect is that this logic documentation is also fundamental in building an AI data analysis experience.
  • If the code base is a behemoth, the context window will be a problem, especially with my favorite Claude. Divide the “memory” into tiers (see the sketch after this list). The most basic level is the immediate prompt and the ongoing conversation, which forms the memory of the task at hand. Second is the project memory - claude.md, .cursor/rules, and whatnot - this hosts key decisions, patterns, and learned codebase knowledge. It is added to the context window of every conversation, so be really selective about what goes in there. The last layer is an overview of the project, and sometimes the entire code base because of intersystem dependencies. This is the cutting edge of automatic code generation at the moment; companies are looking into RAG and GraphRAG solutions to bridge this gap.
  • Once the ground knowledge is there, avoid a "big bang" rewrite. Instead, use AI to help understand and build interfaces and unit tests around specific modules of the legacy system. Gradually replace or refactor these modules, with AI assisting in understanding the old logic and integrating new components.
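Here is a minimal sketch of how those three memory tiers could be stitched into a single prompt. The file name and the retrieval helper are assumptions for illustration, not any particular tool's API.

```python
# A minimal sketch of assembling the three memory tiers into one prompt.
# The file name and retrieval helper are assumptions, not a specific tool's API.
from pathlib import Path

def retrieve_codebase_overview(question: str) -> str:
    """Placeholder for the third tier: RAG/GraphRAG over the wider code base."""
    raise NotImplementedError("wire this to your retrieval pipeline")

def build_context(task_prompt: str, history: list[str]) -> str:
    project_memory = Path("CLAUDE.md").read_text()       # tier 2: always included, keep it lean
    overview = retrieve_codebase_overview(task_prompt)    # tier 3: fetched on demand
    conversation = "\n".join(history[-20:])                # tier 1: the task at hand
    return "\n\n".join([project_memory, overview, conversation, task_prompt])
```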

Conclusion

From 10-15% productivity gains in Level 1 to the promises and pitfalls of Level 4 autonomous agents, one thing is clear: AI is reshaping software development at every level. The path forward isn't about choosing between human creativity and AI efficiency. It's about finding the right blend for your context. There is a whole spectrum to choose from, starting with delegating boilerplate to AI while keeping complex problem-solving for yourself, to architecting AI-first systems, to teaching AI to understand your decade-old codebase. In the end, the developers who thrive will be those who build better systems, be it a code production engine or features. It has always been that way.

Thursday, May 29, 2025

LLM Agents Comparison

 I was designing an AI-powered workflow for Product Management. The flow looks somewhat like this.


The details of the flow are largely not relevant to the topic of this post. The important gist is: it is an end-to-end flow that would be too big to fit in most modern LLMs’ context window. As the flow has to be broken down into sequential phases, it lends itself nicely to a multi-agent setup with intermediate artifacts to assist communication from one agent to another. It was the AI as Judge part that reminded me: just as a developer cannot reliably test their own work, an agent should not be trusted to judge its own work.

"Act as a critical Product Requirement Adjudicator and evaluate the following product requirement for clarity, completeness, user value, and potential risks. [Paste your product requirement here]. Provide specific, actionable feedback to help me improve its robustness and effectiveness."

(A) I have a multi-agent setup. (B) I have a legitimate reason to introduce more than one LLM to reduce model blind spots. The next logical step is to decide which model makes the best agent.
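As an illustration of that separation, here is a minimal sketch where one model drafts and a different model adjudicates, reusing the prompt above. The call() helper is a hypothetical stand-in for each vendor's SDK.

```python
# A minimal sketch of "never let an agent judge its own work":
# one model drafts, a different model adjudicates. call() is a placeholder.
JUDGE_PROMPT = (
    "Act as a critical Product Requirement Adjudicator and evaluate the "
    "following product requirement for clarity, completeness, user value, "
    "and potential risks.\n\n{requirement}\n\n"
    "Provide specific, actionable feedback to improve its robustness."
)

def call(model: str, prompt: str) -> str:
    """Placeholder for a call to the given vendor's chat API."""
    raise NotImplementedError("wire this to the respective SDK")

def draft_and_judge(brief: str) -> tuple[str, str]:
    requirement = call("claude", f"Draft a product requirement for: {brief}")
    feedback = call("gemini", JUDGE_PROMPT.format(requirement=requirement))
    return requirement, feedback
```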

An empirical comparison

While composing the aforementioned workflow, I fed the same prompts to Claude 3.7, Gemini 2.5 Pro (preview), and ChatGPT-o3. All in their native environments (desktop app for Claude and ChatGPT, web page for Gemini). I first asked the agent to passively consume my note while waiting for my cue to participate. For each note, I wrote one use case where I think GenAI is beneficial to the productivity of a product manager. When the source of the idea was from the Internet, I provided a link(s). Once my notes were exhausted, I asked the agent to perform a search to find other novel uses of GenAI in product management that I might have missed. Finally, with the combined inputs from me and web search, I asked the agent to compose a comprehensive report. I ended up with three conversations and three reports. Below are my observations on the pros and cons of each. An asterisk (*) marks the clear winner.

LLM as a thought partner

This is probably the most subjective section of this subjective comparison. Feel free to let me know if you disagree.

ChatGPT:

I didn’t like ChatGPT as a partner. Its feedback sometimes focused on form over substance - whether or not I followed the feedback made no difference to my work. ChatGPT was also stubborn and prone to hallucination, a dangerous combination. It kept trying to convince me that Substack supports some form of table, first suggesting a markdown format, then insisting on a non-existent HTML mode.

Claude *:

I dig Claude (with certain caveats explored below). Its feedback had a good hit-to-miss ratio. The future-proof section at the end of this post was its suggestion (the writing mine). There seemed to be fewer hallucinations too. Empirically speaking, I had fewer “this can’t be right” moments.

Gemini:

Gemini’s feedback was not as vain as ChatGPT’s but limited in creativity, and had I followed it, I doubt it would have made my work drastically different. Gemini gave me a glimpse into what a good LLM can do, yet left me desiring more.

Web search

ChatGPT *:

The web search was genuinely good. It started with my initial direction and covered a good stretch of internet knowledge. The output contributed new insights that didn’t exist in my initial notes while managing to stay on course with the topic.

Claude:

The web search seemed rather limited. Instead of including novel uses, it reinforced earlier discussions with new data points. More of the same. Good for validation but bad for new insights. Furthermore, Claude’s internal knowledge is limited to Oct 2024; it could list half a dozen expansions of MCP, except Model Context Protocol!

Gemini *:

Gemini’s web search was good: it performed on par with ChatGPT’s, and was even faster.

Deep thinking

Aka extended thinking. Aka deep research. You know it when you see one.

ChatGPT *:

Before entering the deep thinking mode, it would ask clarification questions. These questions have proven to be very useful in steering the agent in the right direction and saving time. The agent would show the thinking process while it was happening, but hide it once done. Fun fact: ChatGPT used to hide this thinking process completely until DeepSeek exposed it, “nudging” the OG reasoning model to follow suit.

Claude:

Extended thinking was still in preview. It didn’t ask for clarification; instead it uncharacteristically (for Claude) jumped head-on into the thinking mode. It also scanned through an excessive number of sources. Compared to the other 2 agents, it didn’t seem to know when to stop. In that process, it consumed an excessive number of tokens. This, however, is very much characteristically Claude. I am on the Team license, and my daily quota was consistently depleted after 3-4 questions. Claude did show the thinking process and kept it visible afterwards.

Gemini:

Deep research did not ask clarification questions; it showed the research plan and asked if I would like to modify it. Post confirmation, it launched the deep research mode. Arguably not as friendly as ChatGPT’s approach but good for power users. Gemini’s thinking process was the most detailed and displayed on the friendliest UI of the bunch. However, it was very slow. Ironically, Gemini didn’t deal with disconnection very well, even though at that speed, deep research should be an async task.

Context window

ChatGPT: 128-200k

The agent implemented some sort of rolling context window where the earlier chat is summarized in the agent’s memory. The tokens never ran out during my sessions, but this approach is well-known for collapsing context: the summary misses earlier details, and the agent starts to fabricate facts to fill in the gap.
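For the curious, a rolling window is often approximated like the sketch below: once the history exceeds a token budget, the oldest turns are folded into a running summary. This is an assumption about the general technique, not OpenAI's actual (undisclosed) implementation.

```python
# A minimal sketch of a rolling context window: old turns get folded into a
# summary once a token budget is exceeded. An assumption about the general
# technique, not OpenAI's actual implementation.
def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def summarize(turns: list[str]) -> str:
    """Placeholder: ask an LLM to compress old turns into a short summary."""
    raise NotImplementedError("wire this to an LLM call")

def build_history(summary: str, turns: list[str],
                  budget: int = 100_000) -> tuple[str, list[str]]:
    while count_tokens(summary + "".join(turns)) > budget and len(turns) > 2:
        summary = summarize([summary] + turns[:2])  # fold the two oldest turns in
        turns = turns[2:]
    return summary, turns                            # summary + recent turns fit the budget
```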

Claude: 200k

Not bad. Yet reportedly the way Claude chunks its text is not as token-efficient as ChatGPT’s. When the context window was exceeded, I had to start a new chat, effectively losing all my previous content unless I had made intermediate artifacts beforehand to capture the data. I lived in constant fear of running out of tokens, similar to using a desktop in electricity-deficient Vietnam during the 90s.

Gemini *: 1M!

In none of my sessions did I manage to exceed this limit. I don’t know if Gemini simply forgets earlier chat, summarizes it, or requires a new chat. Google (via Gemini web search) does not disclose the precise mechanism it chooses to handle extremely long chat sessions.

Human-in-the-loop collaboration

ChatGPT:

ChatGPT did not voluntarily show Canvas when I was on the app but would bring it up when requested. I could highlight any part of the document and ask the agent to improve or explain it. There were some simple styling options. It showed changes in Git diff style. ChatGPT’s charts were not part of an editable Canvas. Finally, it did not give me a list of all Canvases. They were just buried in the chat.

Claude *:

Claude’s is called Artifacts. Claude would make an artifact whenever it felt appropriate or was requested. Any part could be revised or explained via highlighting. Any change would create a version. I couldn’t edit the artifact directly though. Understandably, there were also no styling options. Not only could Artifacts display code, they also executed JavaScript to generate charts, making Claude’s charts the most customizable in the bunch. To be clear, ChatGPT could show charts, it just did so right in the chat as a response.

Gemini:

Gemini’s Canvas copied ChatGPT’s homework, all the good bits and bad bits. I had to summon it, no Canvas list, no versions but I could undo. Chart rendering however was the same as Claude's - HTML was generated. Gemini added slightly more advanced styling options and a button to export to Google Docs. Perhaps it was just me, but I had several Canvas pages that went from being full of content to blank randomly, and there was nothing I could do to fix the glitch. It was madness.

Cross-app collaboration

ChatGPT:

ChatGPT has some proprietary integrations with various desktop apps, but the interoperability is nowhere near as extensive as MCP’s. For example, where the MCP filesystem server can edit any files within its scope of permission, ChatGPT can only “look” at a terminal screen. MCP support was announced in April 2025, so this might improve soon.

Claude *:

MCP! Need I say more?

Gemini:

It is a web page; it is a walled garden with only Google Workspace integrations (“only” for many, this might be the number one reason to stick with Gemini). It is also unclear to me if Google’s intended usage for Gemini the LLM is via NotebookLM, the Gemini webpage, or the Gemini widget in other products. MCP support was announced, but being a web page, it will be limited to remote MCP servers.

Characteristics

ChatGPT:

An (over)confident colleague. It is designed to be an independent, competent entity with limited collaboration with the user other than receiving input. Excellent at handling asynchronous tasks. Also known for excessive sycophancy (even before the rollback).

Claude:

A calm, collected intern. It is designed to be a thought partner working alongside the user. Artifacts and MCPs reinforce this stereotype. It works best with internal docs. Web search/extended thinking need much refinement.

Gemini:

An enabling analyst. Gemini’s research plan and co-edit canvas place a lot of power into the hands of its users. It requires an engaging session to deliver a mind-blowing result. Also because if you turn away from it, the deep research session might die, and you have to start again.

Putting the “AI Crew” Together

With that comparison in mind, my AI crew for product management today looks like this.

  • Researcher:
    • Web-heavy discovery: ChatGPT
    • Huge bounded internal doc: Claude
  • Risk assessor: Claude (Claude works well with bounded context)
  • Data analyst:
    • Claude with a handful of MCPs
    • Gemini if the data is in spreadsheets
  • Backlog optimizer: Claude with Atlassian MCP
  • Draft writer:
    • ChatGPT for one-shot work and/or if I know little about the topic
    • Claude/Gemini if I’m gonna edit later
  • Editor/Reviewer: Claude Artifacts = Gemini Canvas
  • Publisher: Confluence MCP / Manual save

Some rules of thumb

  • Claude works well with a definite set of documentation. If discovery is needed, ChatGPT and Gemini are better.
  • Between ChatGPT and Gemini, the former is suitable for less technical users who are less likely to engage in a conversation with AI.
  • In some cases, Claude is selected simply because the use case calls for MCP, and it is the only consumer app supporting MCP today. Also because of this, for tasks where all 3 LLMs perform equally well, I go with Claude for later interoperability.

Future proof

When I finished the rules of thumb, I had an unsettling feeling that this would not be the end of it. As the agents evolve, and they do - breakneck fast, this guide will soon become obsolete. I don’t plan to keep the guide forever correct, nor am I able to. But I can lay out the ground rules to keep reshuffling and reevaluating the agents as new improvements arrive.

  • Periodic review - What was Claude's weakness yesterday might be its strength tomorrow (looking at you, extended thinking). Meanwhile, a model's strength could become commoditized, diluting its specialization value. Set a quarterly review cadence to assess if the current AI assignments still make sense. Even better if the review can be done as soon as a release drops.
  • Maintain a repeatable test - Develop a standardized benchmark that reflects your actual work. Run identical prompts through different models periodically to compare outputs objectively (a minimal harness is sketched after this list). This preserves your most effective prompts while creating a consistent evaluation framework. Your human judgment of these comparative results becomes the compass that guides crew reorganization as models evolve. If the use case evolves, evolve the test as well.
  • Build platform-independent intermediate artifacts - As workflows travel between AI platforms, establish standardized formats for handoff artifacts (PRDs, market analyses, etc.). This reduces lock-in and makes crew substitution painless. A good intermediate artifact should work regardless of which model produced it or which model will consume it next. In my use case of product management, some of the artifacts are stored in Confluence.
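The repeatable test from the second rule can be as small as the sketch below: identical prompts, multiple models, one timestamped snapshot for human comparison. The ask() helper and prompts.json are assumptions to be replaced with your own SDK calls and prompt set.

```python
# A minimal sketch of a repeatable, model-agnostic benchmark run.
# ask() and prompts.json are assumptions, not a specific vendor's API.
import datetime
import json
from pathlib import Path

def ask(model: str, prompt: str) -> str:
    """Placeholder: route the prompt to the named model's API."""
    raise NotImplementedError("wire this to the respective SDKs")

def run_benchmark(prompt_file: Path, models: list[str]) -> Path:
    prompts = json.loads(prompt_file.read_text())            # your standardized prompts
    results = {
        model: {name: ask(model, prompt) for name, prompt in prompts.items()}
        for model in models
    }
    out = Path(f"benchmark_{datetime.date.today()}.json")     # snapshot for human comparison
    out.write_text(json.dumps(results, indent=2))
    return out

# Example: run_benchmark(Path("prompts.json"), ["claude", "chatgpt", "gemini"])
```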


 

Sunday, May 11, 2025

Multi vs Single-agent: Navigating the Realities of Agentic Systems

In March 2025, UC Berkeley released a paper that, IMO, has not been circulated enough. But first, let’s go back to 2024. Everyone was building their first agentic system. I did. You probably did too.

We classified these systems into different maturity levels:

Level 1: Knowledge Retrieval - The system queries knowledge sources to fulfill its purpose, but does not perform workflows or tool execution.

Level 2: Workflow Routing - The system follows predefined LLM routing logic to run simple workflows, including tool execution and knowledge retrieval.

Level 3: Tool Selection Agent - The agent chooses and executes tools from an available portfolio according to specific instructions.

Level 4: Multi-Step Agent - The agent combines tools and multi-step workflows based on context and can act autonomously on its judgment.

Level 5: Multi-Agent Orchestrator - The agents independently determine workflows and flexibly invoke other agents to accomplish their purpose.

You might have seen different lists, yet I bet that no matter how many others you have seen, there is a common element: they all depict a multi-agent system as the highest level of sophistication. This approach promises better results through specialization and collaboration.

The elegant theory of multi-agent systems

The multi-agent collaboration model offers several theoretically compelling advantages over single-agent approaches:

Specialization and Expertise: Each agent can be tailored to a specific domain or subtask, leveraging unique strengths. One agent might excel at generating code while another specializes in testing or reviewing it.

Distributed Problem-Solving: Complex tasks can be broken into smaller, manageable pieces that agents tackle in parallel or sequence. By splitting a problem (e.g., travel planning into weather checking, hotel search, route optimization), the system can solve parts independently and more efficiently.

Built-in Error Correction: Multiple agents provide a form of cross-checking. If one agent produces a faulty output, a supervisor or peer agent might detect and correct it, improving reliability.

Scalability and Extensibility: As problem scope grows, it's often easier to add new specialized agents than to retrain or overly complicate a single agent.
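The travel-planning example above might look like this minimal sketch: independent sub-agents fan out in parallel and a coordinator merges their results. The agent_call() helper is a hypothetical placeholder, not any specific framework's API.

```python
# A minimal sketch of divide-and-conquer across specialized sub-agents.
# agent_call() is a hypothetical placeholder, not a specific framework's API.
from concurrent.futures import ThreadPoolExecutor

def agent_call(role: str, task: str) -> str:
    """Placeholder for invoking a specialized agent."""
    raise NotImplementedError("wire this to your agent framework")

def plan_trip(destination: str, dates: str) -> dict[str, str]:
    subtasks = {
        "weather": f"Check the weather in {destination} for {dates}",
        "hotels": f"Shortlist three hotels in {destination} for {dates}",
        "route": f"Plan the transport from the airport to downtown {destination}",
    }
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(agent_call, name, task)
                   for name, task in subtasks.items()}
        return {name: f.result() for name, f in futures.items()}  # coordinator merges results
```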

The theory maps beautifully to how we understand human teams work: specialized individuals collaborating under coordination often outperform even the most talented generalist. It is every tech CEO’s wet dream: your agent talks to my agents and figures out what to do.

This model has shown remarkable success in some industrial-scale applications.

Then March 2025 landed with a thud!

The Berkeley reality check

A Berkeley-led team assessed the state of current multi-agent system implementations. They ran seven popular multi-agent systems across over 200 tasks and identified 14 unique failure modes organized into three categories: specification and system design failures, inter-agent misalignment, and task verification and termination.

The paper Why Do Multi-Agent LLM Systems Fail? showed that some state-of-the-art systems achieved only 33.33% correctness on seemingly straightforward tasks like implementing Tic-Tac-Toe or Chess games. For AI, this is the equivalent of getting an F.

The paper provides empirical evidence for what many of us have experienced: today's multi-agent implementation is hard. We can’t seem to fulfill the elegant theory outside a demo, with many failures stemming from coordination and communication issues rather than limitations in the underlying LLMs themselves. And if I didn’t make the table above myself, I would think it was a summary of my university assignments.

The disconnect between PR and reality

Looking back, it's easy to see how the industry became so enthusiastic about multi-agent approaches:

  1. Big-cloud benchmarks looked impressive. AWS reported that letting Bedrock agents cooperate produced "marked improvements" in internal task-success and accuracy metrics for complex workflows. I am sure AWS has no hidden agenda here. It is not like they are in a race with anyone, right? Right?
  2. Flagship research prototypes beat single LLMs on niche benchmarks. HuggingGPT and ChatDev each reported better aggregate scores than a single GPT-4 baseline on their chosen tasks. In the same way the show Are You Smarter Than A 5th Grader works. But to be frank, the same argument can be used for the Berkeley paper.
  3. Thought-leaders said the same. Andrew Ng's 2024 "Agentic Design Patterns" talks frame multi-agent collaboration as the pattern that "often beats one big agent" on hard problems.
  4. An analogy we intuitively get. Divide-and-conquer, role specialization, debate for error-catching — all map neatly onto LLM quirks (limited context, hallucinations, etc.). But just as with humans, coordination overhead grows exponentially with agent count.
  5. Early adopters were vocal. Start-ups demoed agent-teams creating slide decks, marketing campaigns, and code bases - with little visible human help - which looked like higher autonomy. Till reality’s ugly little details turn this bed of roses into a can of worms.

The Berkeley paper exposed these challenges, but it also pointed toward potential solutions.

Enter A2A: Plumbing made easy

Google's Agent-to-Agent (A2A) protocol arrived in April 2025 with backing from more than 50 launch partners, including Salesforce, Atlassian, LangChain, and MongoDB. While the spec is still a draft, the participation signal is strong: the industry finally has a common transport, discovery, and task-lifecycle layer for autonomous agents.

A2A directly targets 3 of the 14 failure modes identified in the Berkeley audit:

  1. Role/Specification Issues - By standardizing the agent registry, A2A creates clear declarations of capabilities, skills, and endpoints (an illustrative registry entry is sketched after this list). This addresses failures to follow task requirements and agent roles.
  2. Context Loss Problems - A2A's message chunking and streaming prevent critical information from being lost during handoffs.
  3. Communication Infrastructure - The HTTP + JSON-RPC/SSE protocol with a canonical task lifecycle provides consistent, reliable agent communication.
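To make the registry idea tangible, here is a hypothetical agent entry in the spirit of A2A's agent card. Field names are approximate and for illustration only; consult the A2A spec for the real schema.

```python
# A hypothetical agent registry entry in the spirit of A2A's "agent card".
# Field names are approximate; this is not the actual A2A schema.
AGENT_CARD = {
    "name": "test-writer",
    "description": "Generates unit tests for Python modules",
    "url": "https://agents.example.com/test-writer",   # assumed endpoint
    "capabilities": {"streaming": True},                # can stream/chunk long replies
    "skills": [
        {"id": "write-tests", "description": "Produce pytest tests for a given module"},
    ],
}
```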

Several vendors have begun piloting A2A implementations, with promising but still preliminary results. However, it's important to note that quantitative proof remains scarce. None of these vendors has released side-by-side benchmark tables yet—only directional statements or blog interviews. Google's own launch blog shows a candidate-sourcing workflow where one agent hires multiple remote agents, but provides no timing or accuracy metrics. I won’t include names here because I believe we have established earlier that vendor tests can be unpublished, cherry-picked and may have included proprietary orchestration code.

In other words, A2A is to agents what MCP is to tool use today. And just as supporting MCP doesn’t make the code of your tools better, A2A doesn’t make agents smarter.

What A2A Doesn't Solve

While something like A2A will fix the failures stemming from the lack of a protocol, it cannot fix issues that do not share the same root cause.

  • LLM-inherent limitations - Individual agent hallucinations, reasoning errors, and misunderstandings. Just LLM being LLM.
  • Verification gaps - The lack of formal verification procedures, cross-checking mechanisms, or voting systems. Without verification, you're essentially trusting an AI that thinks 2+2=5 to check the math of another AI that thinks 2+2=banana.
  • Orchestration intelligence - The supervisory logic for retry strategies, error recovery, and termination decisions. Just the other day, Claude was downloading my blog posts, hit a token limit, tried to repeat, continued to hit the token limit, and looped forever.

Those are 11 out of 14 failure modes. These areas require additional innovation beyond standardized communication protocols. Better verification agents, formal verification systems, or improved orchestration frameworks will be needed to address these challenges. I would love to go deeper into the innovation layers required to solve the “cognitive” failure of multi-agent models in another post.
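The orchestration gap is the one I run into most, so here is a minimal sketch of the kind of supervisory logic a protocol alone won't give you: bounded retries with an explicit termination decision, so an agent that keeps hitting a token limit cannot loop forever. The run_agent() helper is a placeholder.

```python
# A minimal sketch of supervisory orchestration: bounded retries plus an
# explicit termination decision. run_agent() is a hypothetical placeholder.
import time

def run_agent(task: str) -> tuple[bool, str]:
    """Placeholder: run the agent once, return (succeeded, output_or_error)."""
    raise NotImplementedError("wire this to your agent")

def supervise(task: str, max_attempts: int = 3, backoff_s: float = 5.0) -> str:
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        ok, result = run_agent(task)
        if ok:
            return result
        last_error = result
        if "token limit" in result.lower():      # same failure class: retrying won't help
            break
        time.sleep(backoff_s * attempt)          # transient failure: back off and retry
    raise RuntimeError(f"terminated after {attempt} attempt(s): {last_error}")
```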

Conclusion

Multi-agent systems are not the universal solution for today’s problems (yet). Many are here and working, but only at a scale that no single-agent approach can reach. These multi-agent systems deliver potential benefits, but at a substantial cost, one that can easily exceed an organization's capacity to invest or find people with the relevant expertise.

Rather than viewing multi-agent as the holy grail to be sought at all costs, we should approach it with careful consideration—an option to explore when simpler approaches fail. The claimed benefits rarely justify the implementation complexity for most everyday use cases, particularly when a well-designed single-agent system with appropriate guardrails can deliver acceptable results at a fraction of the engineering complexity.

Still, the conversation about multi-agent systems will always be around, with increasing frequency, especially in light of the arrival of protocols like A2A. MCP is still the hype now; organizations are busy integrating MCP servers and writing their own before turning their attention to the next big thing. A2A could be that. Better agent alignment could also be that. Or it could require a new cognitive paradigm to improve agents’ smartness. Things will take another 6 months to unfold, which is the amount of time since the MCP announcement.

Either way, what’s clear is that agentic systems still have a long way to go before they hit their theoretical performance ceiling.

Sunday, April 20, 2025

Finally, a break

Hi guys,

The news was out last Friday. I am going to leave Parcel Perform. 

For a sabbatical leave. Between May and October.

Sorry for the gasp. You come here for drama.

This has been planned since last year. I originally planned to leave soon after the 2024 BFCM, after the horizontal scalability of the system became a solved problem. I would like to say permanently, but I have learned that nothing ever is - the scalability, not me leaving for good. But then there were such and such issues with our data source (if you know, you know), and AI took the industry by storm. So I stayed. Eventually though, I knew I needed this break.

I have been on this job for almost 10 years. The first line of code was October 2015. I thought I would be done and move on in 5 years! I have been around longer than most furniture in the office and gone through 4 office changes. A decade is indeed a long time. It is a wet blanket that dampens any excitement and buzz that comes out of the work. Things get repetitive. Except for system incidents, I have lost count of how many ways things can combine to blow up. I am thankful every morning I wake up to no new alert.

When I was 23, I was fired from my job, and I was unemployed for 6 months. More like unemployable. It could have been longer had my savings not gone dry. I wrote, read, cooked, rode, swam, organized events, and lived a different life I didn't know I should. It was the time I needed to recover from depression. It was the best time of my life. I want to experience that one more time.

Upon this news, I received some questions, the most common ones are below.

Why are you leaving now? Is something bad happening?

I am still the CTO of Parcel Perform, just on sabbatical leave. And on the contrary, I think this is a good window for me to take a break. The business is in its best shape since the painful layoff in 2022. We are positive about the future, we are actively expanding the team for the first time in 3 years. The Tech Hub, of which I am directly responsible, has demonstrated that in the face of unprecedented incidents, we are resilient, innovative, and get things done. With the multi-cluster architecture and other innovations, we won't face an existential scalability problem for a long time.

In the last 6 months, we have invested in incorporating GenAI into our product. I believe we have the right data, tech stack, and an experiment-focused process, though only time can tell. To be frank, all the fast-paced experiments we are doing, known internally as POC, reminded me of all the things I loved about working here in the early days. Ideas are discussed in a small circle, stretched on a whiteboard, implemented in less than a day, and repeated. It has been so fun recently that I got cold feet. Perhaps I shouldn't take this break yet. But I am not getting younger, I am getting married, and soon will start a family. I won't have time for myself in a long time. It has to be now.

What will you do during the break?

Oh wow, I'm gonna play so many computer games my brain rots. I have been an avid fan of Age of Empires II since the time there was only one kid in my neighborhood with a computer good enough to play the game. I am an average player, slow even, so perhaps we are looking at more losses than wins. But hey, it builds character.

I will host board game nights here and there. It's another long-lasting hobby of mine, and a perfect social glue for my group of friends. While I am at that, I probably want to up my mocktail game too. My friends are largely in my age bracket, so for the same reasons above, my ultimate goal is to have more quality time.

As far as dopamine goes, that's it. I am not planning for retirement after all. Can't afford that yet.

To be frank, the pretext of this break is that I want to work on my sleep, which has been less than ideal for a long time. I couldn't figure out a single thing that could improve my sleep, so it is gonna be a total revamp. Distance from work. White bed sheets. Sleep hygiene. Gotta catch em all.

I am probably still awake more than 12 hours a day though. I will be reading as much as I can: fiction, non-fiction, and whatnot. Real life is crazy these days; the distinction between the two is getting vague. There are some long Murakami novels I want to go through. I find that reading his works in one go, or at least with minimal pauses, offers the ideal immersive experience.

What I read, I write. I hope I can find an interesting topic to write every month. If you are keeping track, the last few days have been quite productive ;) I am starting my first subtrack AI Crossroads because that's how I feel these days: an important moment in my life, our lives, that I cannot afford to miss. I am excited and confused. I am sure somebody out there is feeling the same. 

And I will pick up Robotics as a new hobby. As GenAI gets "solved", its reach will not stop at the virtual world. Robotics seems to be the next logical frontier where a new generation of autonomous devices crops up and occupies our world. 

Writing about these things, I'm already pumped!

Who will replace your role?

The good thing about cooking up this plan from last year is that I have had plenty of time to arrange for my departure. The level of disruption should be minimal. People won't notice when I am gone and when I am back. Or so I hope.

There isn't a simple 1 to 1 replacement. Parcel Perform is not getting a new CTO, and there will still be a job for me when I am back. Or so I hope.

As a CTO, my work comes in 3 buckets: feature delivery, technical excellence, and AI research.

Feature delivery is where we have the most robust structure. Over the years, we have managed to find and grow a Head of Engineering and two Engineering Managers. The tribe-squad structure is stable. We are getting exponentially better at cross-squad collaboration as well as reorganization. There is a handful of external support, ranging from QA and infrastructure to data and insight, to ensure the troops have the oomph they need to crush deliveries.

Technical excellence means making sure Parcel Perform's tech stack stays relevant for the next 5 years. This is an increasingly ambitious goal. Our tech stack is no longer a single web server. The team is growing. The world is moving faster. But we have 4 extremely bright Staff Engineers. They each have spent years at the organization, are widely regarded for the depth of their technical knowledge, and are definitely better than me on my best day in their field of expertise. We have spent the last couple of months aligning their goals with the needs of Parcel Perform. The alignment is at its strongest point ever since we adopted a goal-oriented performance review system.

Lastly, AI research is the preparation of the organization for the AI future, across technologies, processes, and strategic values. While I will continue the research in my own time, there is now a dedicated AI team that has been made the center of Parcel Perform's AI movement. Despite the humble beginning, the team will double in size in the coming months and won't let us "go gentle into that good night" that is our post-apocalyptic lives with the AI overlords.

I think we are in good hands.

What will you do when you come back?

Honest answer, I don't know.

Also honest answer, I don't think it's gonna be the same as what I am doing today. Sure, some aspects are gonna be the same. 5 months isn't that long. Neither is it short. The organization will continue to grow and evolve to meet the demand of the market, and the gap I leave will be filled. When I am back, the organization will undoubtedly be different from what it is today. I will have to relearn how to operate effectively again. I will need to identify the overlap between my interests, my abilities, and the needs of the new Parcel Perform.

Final honest answer, I am anxious for that future, and it is the best part.

AI is an insult to life itself. Or is it?

The Quote That's Suddenly Everywhere

"AI is an insult to life itself."

I only stumbled across this quote a couple of months ago, attributed to Hayao Miyazaki - the man behind the famed Studio Ghibli. Since then, I've noticed it's literally everywhere, particularly because, as we speak, there's a massive trend of people using ChatGPT and other image generators to create pictures in the distinctive Ghibli aesthetic. My social feeds are flooded with AI-crafted images of ordinary scenes transformed with that unmistakable Ghibli magic - the soft lighting, the whimsical elements, the characteristic designs, the stuff of my childhood. You know what, I am not good with this kind of words. Here is one from yours truly.

The trend has brought Mr Miyazaki's quote back into the spotlight, wielded as a battle cry against this specific type of AI creation. There's just one small problem - this quote is being horribly misused.

I got curious about the context (because apparently I have nothing better to do than fact-check AI quotes when I should be testing that MCP server), so I dug deeper. Turns out, Mr Miyazaki wasn't condemning all artificial intelligence. He was reacting to a specific demonstration in 2016 where researchers showed him an AI program that had created a disturbing, headless humanoid figure that crawled across the ground like something straight out of a horror movie. For God's sake, the animation reminded him of a disabled friend who couldn't move freely. Yeah, I quite agree, that was the stuff of visceral nightmare.

Src: https://www.youtube.com/watch?v=ngZ0K3lWKRc&t=3s

It's also worth noting that the AI Mr Miyazaki was shown in 2016 was primitive compared to today's models like ChatGPT, Claude, or Midjourney. We have no idea how he might react to the current generation of AI systems that can create stunningly convincing Ghibli-style imagery. His reaction to that zombie-like figure doesn't necessarily tell us what he'd think about today's much more advanced and coherent AI creations. Yet the quote lives on, stripped of this crucial context, repurposed as a condescending umbrella on all generative AI.

The Eerie Valley of AI Art

Here's where it gets complicated for me. When I look at these AI-generated Ghibli scenes, they instantly evoke powerful emotions - nostalgia, wonder, warmth - all the feelings I've associated with films like "Spirited Away" or "Princess Mononoke" over years of watching them (for what it's worth, not a big fan of Totoro, it's ok). The visual language of Ghibli taps directly into something deep and meaningful in my experience.

That is what art does. That is what magic does. But this is not that, is it? These mass-produced imitations feel like they're borrowing those emotions without earning them. I feel an unsettling hollowness to the "art" - like hearing your mother's voice coming from a stranger. The signal is correct, but the source feels wrong.

I'm confronted with a puzzling contradiction: if a human artist were to draw in the Ghibli style (and many talented illustrators do), I wouldn't feel nearly the same unease. Fan art is celebrated, artistic influence is natural, and learning by imitation is as old as art itself. So why does the AI version feel different?

So while the quote is misused, I wonder if Mr Miyazaki's statement might contain an insight that applies here too. These AI creations, in their skillful but soulless imitation, do feel like a kind of insult, not to life itself perhaps, but to the deep human relationship between artist, art, and audience that developed organically over decades.

The Other Side

As a human, I consume AI features. Yet as an engineer, I build AI features. And there is another side to this.

You probably have heard it. Every time on the Internet there is a complaint that the US' gun culture is a nut case, there is a voice from a dark and forgotten 4chan corner screaming back "A gun is just a tool. A gun does not kill people. People do. And if you take the gun away, they will find something else anyway".

But this argument increasingly fails to capture the reality of AI image generators. These systems aren't neutral tools - they've been trained on massive datasets of human art, often without explicit permission from the artists. When I prompt an AI to create "art in Ghibli style," I'm not merely using a neutral tool - I'm activating a complex system that has analyzed and learned from thousands of frames created by Studio Ghibli artists.

This is fundamentally different from a human artist studying and being influenced by Mr Miyazaki's work. The human artist brings their own lived experience, makes conscious choices about what to incorporate, and adds their unique perspective. The AI system statistically aggregates patterns at a scale no human could match, without discernment, attribution, or compensation.

I've built enough software systems to know that complexity breeds emergence. When algorithms make thousands or millions of decisions across vast datasets, the traditional model of direct human control becomes more of a fiction than a reality. You can't just look at the code and know exactly what it will do in every situation. Trust me, I've tried to "study" deep learning.

Perhaps most significantly, as these systems advance, the distance between the creator's intentions and the system's outputs grows. The developers at OpenAI didn't specifically write code that says "here's exactly how to draw a flattering image of a dude taking notes on a motorcycle" - they created a system that learned from millions of images, and now it can generate Ghibli-style art that no human specifically programmed it to make. These AI systems develop abilities their creators didn't directly put there and often can't fully predict. This expanding gap between intention and outcome makes the "tools are neutral" argument increasingly unsatisfying.

This isn't to say humans have lost control entirely. Through system design, regulation, and deployment choices, we retain significant influence. But the "tools are neutral" framing no longer adequately captures the complex, bidirectional relationship between humans and increasingly sophisticated AI.

Why We Can't Resist Oversimplification

So far, there are two camps. The "AI is an insult" camp reflects people whose work and lives are negatively impacted. The "Tools are neutral" camp defends AI creators. I tried, but I am sure I have done a less-than-stellar job capturing the thought processes of both camps. Still, poor as my summary is, it feels fairly complex. This complexity is exactly why we humans lean toward simplified narratives like "AI is an insult to life itself" or "It's just a tool like any other." The reality is messy, contradictory, and doesn't fit neatly into either camp.

Humans are notoriously lazy thinkers. I know I am. Give me a simple explanation over a complex one any day of the week. My brain has enough to worry about with keeping our production systems alive. I've reached the point where I celebrate every morning that the #system-alert-critical channel has no new message.

This pattern repeats throughout history. Complex truths get routinely reduced to easily digestible (and often wrong) summaries. Darwin's nuanced theory of evolution became "survival of the fittest." Einstein's revolutionary relativity equations became "everything is relative." Nietzsche's exploration of morality became "God is dead." In each case, profound ideas were flattened into bumper sticker slogans that lost the original meaning. Make good YouTube thumbnails though.

This happens because complexity requires effort. Our brains, evolved for quick decision-making in simpler environments (like not getting eaten by tigers), naturally gravitate toward cognitive shortcuts. A single authoritative quote from someone like Mr Miyazaki provides an easy way to validate existing beliefs without engaging with the messier reality.

There's also power in simple narratives. "AI threatens human creativity" creates a clear villain and a straightforward moral framework. It's far more emotionally satisfying than grappling with the ambiguous benefits and risks of a transformative technology. I get it - it's much easier to be either terrified of AI or blindly optimistic about it than to sit with the uncertainty.

I am afraid that in the coming weeks and months, we cannot afford such simplification.

The choice we have to make

Young technologists today (myself included) find ourselves in an extraordinary position. We're both consumers of AI tools created by others and creators of systems that will be used by countless others. We stand at the edge of perhaps the most transformative wave of innovation in human history, with the collective power to influence how this technology shapes our future and the future of our children. FWIW, I don't have a child yet, but I like to think I will.

The questions raised by AI-generated Ghibli art - about originality, attribution, the value of human craft, the economics of creation - aren't going away. They'll only become more urgent as these systems improve and proliferate.

The longer I work in tech, the more I realize that the most important innovations aren't purely technical - they're sociotechnical. Building AI systems that benefit humanity requires more than clever algorithms; it requires thoughtful consideration of how these systems integrate with human values and creative traditions.

For those of us in this pivotal position, neither absolute rejection nor blind embrace provides adequate guidance. We will need to navigate through this, hopefully with better clarity than Christopher Columbus when he "lost" his way to discovering America. My CEO made me read AI-2027 - there is a scenario where humans fail to align AI superintelligence and get wiped out. Brave new world.

1. Embrace Intentional Design and Shared Responsibility

We need to be deliberate about what values and constraints we build into creative AI systems, considering not just what they can do, but what they should do. This might mean designing systems that explicitly credit their influences, or that direct compensation to original creators whose styles have been learned.

When my team started writing our first agent, we focused entirely on what was technically possible. Is this an agent or a workflow? Is this a tool call or a node in the graph? Long context or knowledge base? I know, technical gibberish. The point is, we will soon evolve past that learning curve, and what comes next is thinking through the implications.

2. Prioritize Augmentation Over Replacement

The most valuable AI applications enhance human creativity rather than simply mimicking or replacing it. We should seek opportunities to create tools that make artists more capable, not less necessary.

When I see the flood of AI-generated Ghibli art, I wonder if we're asking the right questions. The most exciting creative AI tools don't just imitate existing styles - they help artists discover new possibilities they wouldn't have found otherwise. The difference between a tool that helps you create and one that creates instead of you may seem subtle, but it's profound. 

I have been lucky enough to be part of meetings where the goal of AI agents is to free the human colleagues from boring, repetitive tasks. I sure hope that trajectory continues. Technology should serve human values, not the other way around.

3. Ensure Diverse Perspectives and Continuous Assessment

The perspectives that inform both the creation and governance of AI systems should reflect the diversity of populations affected by them. This is especially true for creative AI, where cultural context and artistic traditions vary enormously across different communities.

It's so easy to build for people in my immediate circle and call it a day. As an Asian, I see how AI systems trained predominantly on Western datasets create a distorted view of creativity and culture. Without clarification, a genAI model would assume I am a white male living in the US. Bias, prejudice, stereotype. We have seen this.

Finding My Way in the AI Landscape

The reality of our relationship with AI is beyond simple characterization. It is neither an existential threat to human creativity nor a neutral tool entirely under our control. It represents something new - a technology with growing capabilities that both reflects and reshapes our creative traditions.

Those Ghibli-style images generated by AI leave me with mixed feelings that I'm still sorting through. On one hand, I'm amazed by the technical achievement and can't deny the genuine emotions they evoke. On the other hand, I feel I am being conditioned to feel that way.

Perhaps this ambivalence is exactly where we need to be right now - neither rejecting the technology outright nor embracing it uncritically, but sitting with the discomfort of its complexity while we figure out how to move forward thoughtfully.

For our generation that will guide AI's development, the challenge is to move beyond reductive arguments. Neither blind techno-optimism nor reflexive technophobia will serve us well. Instead, we need the wisdom to recognize both the extraordinary potential and the legitimate concerns about these systems, and the courage to chart a course that honors what makes human creativity valuable in the first place.


This post was written with assistance from Claude, which felt a bit meta given the topic. They suggested a lot of the structure, but all the half-baked jokes and questionable analogies are mine alone. And it still took me a beautiful Saturday to pull everything together in my style.


Sunday, April 6, 2025

MCP vs Agent

On 26th Nov 2024, Anthropic introduced the world to MCP - Model Context Protocol. Four months later, OpenAI announced the adoption of MCP across its products, making the protocol the de facto standard of the industry. Put another way: a few months ago, we were figuring out when something is a workflow and when it is an agent (and we still are). Today, the question is how much MCP Kool-Aid we should drink.

What is MCP?

MCP is a standard for connecting AI models to external data sources and tools. It lets models integrate with external systems independently of platform and implementation. Before MCP, there was tool use, but the integration was platform-specific, be it LangChain, CrewAI, LlamaIndex, or whatnot. An MCP Postgres server, however, works with all platforms and applications supporting the protocol. MCP is to AI what HTTP is to the internet. I think so, anyway; I was a baby when HTTP was invented.
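
To make that less abstract, here is a minimal sketch of what an MCP server can look like, using the official Python SDK's FastMCP helper. It assumes the mcp package is installed; the tool name, its canned response, and the stdio default are my assumptions for illustration, not something from this post.

    # A minimal MCP server sketch using the official Python SDK's FastMCP helper.
    # Assumption: the `mcp` package is installed; the tool below is made up.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("weather")

    @mcp.tool()
    def get_forecast(city: str) -> str:
        """Return a (fake) one-line forecast for a city."""
        return f"{city}: sunny, 25°C"

    if __name__ == "__main__":
        mcp.run()  # stdio transport (assumed default), so a desktop client can launch it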

I won't attempt to paraphrase the components of MCP; modelcontextprotocol.io is dedicated to that. Here is a quick screenshot.

If you have invested extensively in tool use, the rise of MCP doesn't necessarily mean that your system is obsolete. Mind you, tool use is probably still the most popular integration method out there today, and can be made MCP-compatible with a wrapper layer. Here is one from LangChain.

A system with both tool use and MCP looks something like this.
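
As a rough code sketch of that dual setup (my own illustration, not the LangChain adapter linked above), the same plain Python function can stay registered with an in-house agent's tool registry while also being exposed through an MCP server. The function, the registry, and the business logic here are all hypothetical.

    # One function, two integration paths: native tool use and MCP.
    # Assumption: the `mcp` package is installed; everything else is made up.
    from mcp.server.fastmcp import FastMCP

    def lookup_order(order_id: str) -> str:
        """Return a short status summary for an order (stubbed out here)."""
        return f"Order {order_id}: shipped"

    # 1) Native tool use: the in-house agent keeps calling the function directly.
    NATIVE_TOOLS = {"lookup_order": lookup_order}

    # 2) MCP: the same function is registered with an MCP server, so any
    #    MCP-capable client (Claude Desktop, etc.) can call it too.
    mcp = FastMCP("orders")
    mcp.tool()(lookup_order)  # apply the decorator as a plain call to avoid rewriting the function

    if __name__ == "__main__":
        mcp.run()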

For a 4-month-old protocol, MCP is extremely promising, yet it is far from a silver bullet for all AI integration problems. The most noticeable problems MCP has not yet resolved are:

  • Installation. MCP servers run on local machines and are meant for desktop use. While this is okay-ish for professional users, especially developers, it is unsuitable for casual use cases such as customer support.
  • Security. This post describes "tool poisoning", and it is definitely not the last attack of its kind.
  • Authorization. MCP's OAuth 2.1 adoption is a work in progress. OAuth 2.1 itself is still a draft.
  • Official MCP Registry. As far as I know, the attempts to collect MCP servers are all ad hoc and incomplete, like this and this. The quality is hit and miss; official MCP servers tend to fare better than community ones.
  • Lack of features. Streaming support, stateless connection, proactive server behavior, to name a few.
None of these problems indicates a fundamental flaw in MCP, and in time they will all be fixed. As with any emerging technology, it is good to stay optimistically cautious when dealing with these hiccups.

MCP in SaaS

I work at a SaaS company, building applications with AI interfaces. I was initially confused by the distinction between agents and MCP. Comparing an agent to MCP is apples to oranges, I know; MCP is more on par with tool use (on steroids). Yet the comparison makes sense in this context: if I only have so many hours to build assistance features for our Customer Support staff, should I build a proprietary agent or an MCP server connecting to, say, the Claude Desktop App, assuming both get the work done? Perhaps at this point, it is also worth noting that I believe office workers in the future will spend as much time on AI clients like Claude or ChatGPT as they do on browsers today. If not more. Because the AI overlord doesn't sleep and neither will you!

Though I ranted about today's limitations of MCP, the advantages are appealing. Established MCP clients, like the Claude Desktop App, handle gnarly engineering challenges such as output token pagination, human-in-the-loop, and rich content processing with satisfying UX. More importantly, a user can install multiple MCP servers on their device, both in-house and open-sourced, which opens up various workflow automation possibilities.

When I build agents, I notice that a considerable amount of time goes to plumbing - the overhead tasks required to get an agent up and running. Better boilerplate code doesn't make that overhead go away, but still... An agent is also a closed environment where any non-trivial (and sometimes trivial) change requires code modification and deployment. The usability is limited to what the agent's builder provides. And the promise of multi-agent collaboration, compelling as it is, has not quite been delivered yet. Finally, an agent is only as smart as the system prompts it was given. Bad prompts, like the ones made by yours truly, can make an agent perform far worse than the Claude Desktop App.

Then why aren't agents yesterday's news yet? As far as I know, implementing an agent is the only method that provides full behavior control. Tool calls (even via MCP) in an agent are deterministic, and everything related to the agent's reasoning can be logged and monitored. The Claude Desktop App, as awesome as it is, offers little explanation of why it does what it does (though that too may change in the future). Drinking too much MCP Kool-Aid could also mean handing too much control of the system to third-party components - the MCP clients - which can become an existential threat.
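
To show what I mean by logging and monitoring, here is a rough sketch of the thin wrapper an in-house agent could put around every tool call. The tool registry and the ticket tool are hypothetical; the point is only that each call and each result leaves a trace we control.

    import json
    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("agent")

    # Hypothetical in-house tools the agent is allowed to call.
    TOOLS = {
        "get_ticket": lambda ticket_id: {"id": ticket_id, "status": "open"},
    }

    def run_tool_call(name: str, arguments: dict) -> dict:
        """Execute a tool call chosen by the model, logging the call and its result."""
        if name not in TOOLS:
            log.warning("model requested unknown tool: %s", name)
            return {"error": f"unknown tool {name}"}
        log.info("tool call: %s(%s)", name, json.dumps(arguments))
        result = TOOLS[name](**arguments)
        log.info("tool result: %s", json.dumps(result))
        return result

    # Example: what happens when the model asks for get_ticket("T-42").
    print(run_tool_call("get_ticket", {"ticket_id": "T-42"}))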

Conclusion

The rise of MCP is certain. Just as MCP clients might replace browsers as the most important productivity tool, at some point, vendors might cease asking for APIs and seek MCP integration instead. Yet it will not be a binary choice between MCP servers and agents. Each offers distinct advantages for different use cases.
  • MCP makes it easy to coordinate a large number of tools, which suits internal dogfooding and fast exploration. MCP would also see more adoption among professional users than casual ones.
  • Agents, being more controlled, tame, and (preferably) well tested, would be the default choice for customer-facing applications.
Within a year, we'll likely see hybrid systems where MCP and agent-based approaches coexist. Of course, innovations such as AGI, or MCP moving beyond its initial local, desktop-bound use, could change this balance. There is no point in predicting the long-term future at this point.

Tuesday, February 25, 2025

The Book Series - Management 3.0



* Management 1.0 is traditional top-down decision-making.
* Management 2.0 improves upon traditional management with agile principles adoption and continuous improvement.

That, ladies and gentlemen, is the summary I never remember when I think of this book. 

I love this book. Jurgen Appelo is a great speaker and an even greater writer. He is characteristically blunt, though he would probably prefer "Dutch frankness" instead. One of the chapters goes along the lines of: a manager delegating to his team members is not an act of altruism; he does it so that he has time for the things he wants to do - and then he proceeds to retreat to his ivory tower. He tells jokes with an emotionless straight face. My kind of joke.

Management 3.0 laid out a complete framework to revamp leadership. There are five or six principles built around other theories, like complexity, games, and networks. There is more to it than that half-assed summary; the details are just a Google search away if you are interested. What makes this book unique to me is the discussion of management theories before diving into good and bad practices. Each principle is split into two chapters, one on the theory and the other on the execution. Other works in this genre tend to be more dogmatic, if I may say so.
 
I am no expert in complexity science. There is a non-zero chance all this theory inclusion is pseudo-science. But I like the attempt, and pseudo or not, there is a framework to cling to when in doubt. And there will be a lot of doubts. The management-books-praise-the-human-ability-to-self-organize-but-that-doesn't-seem-to-cover-my-staff's-inertia-to-stay-put kind of doubt. To be frank, the stark contrast between fiction and non-fiction that typically comes with management literature exists here as well. There is no unicorn.

Management 3.0 has exploded in popularity since its debut a decade ago, though. There are training courses, consultant programs, and god damned certifications. It is going down the same dark, well-beaten commercial path that Agile has trodden before. And that dented my heart a bit. But hey, one can only retreat to one's ivory tower with a fat bank account.

I still think you should give it a try though.