My short answer: I ship production AI agents with Claude API direct, Trigger.dev for orchestration, Supabase for memory and state, and Langfuse for observability. That stack runs every agent I have in production for multi-location clients. I tested LangChain, AutoGen, and CrewAI. I do not use any of them in production. The frameworks add layers of abstraction that look clean in a demo and break the moment you need to debug something at 3am. At Formaum, this is the exact stack I ship with for multi-location operations clients.

The honest take: most "agent frameworks" are not what you need

Every "best AI agent frameworks 2026" list I read ranks tools by features. That is the wrong axis. The right axis is what survives in production over six months with real users, real edge cases, and real money on the line.

An AI agent is four things. An LLM that reasons. Tools it can call. Memory it can read and write. Observability so you can see what it did. That is the whole picture. Frameworks bundle those four into one abstraction and sell you the bundle. The bundle leaks. When it leaks, you are debugging the framework, not the agent.

Every agent I have shipped that lasted more than ninety days was built by composing those four layers directly. Every agent I built on top of a heavyweight framework either got rewritten or got abandoned.

The 4-layer architecture I actually use

Before naming frameworks, here is the architecture. If a framework cannot map cleanly onto these four layers, it is doing too much.

  1. LLM layer. The model that decides what to do next. Claude Sonnet 4.7 or Opus 4.7 for everything I ship. GPT-5 occasionally when the client already pays for it.
  2. Tool layer. The functions the agent can call. CRM updates, calendar bookings, SMS sends, database queries. Plain Python or TypeScript functions. No wrappers.
  3. Memory layer. Where state lives between runs. Supabase, every time. Sometimes a thin Redis cache in front.
  4. Observability layer. Traces of every run. Inputs, outputs, tool calls, latency, cost. Langfuse.

That is the spec. Everything below is what I picked for each layer and what I rejected.

Frameworks I actually ship with

Claude API direct

Verdict: the default for custom agents. Skip the wrappers.

I call the Anthropic SDK directly. Tools are defined as Python or TypeScript functions and passed in the tools array. Prompt caching is on. Extended thinking is on for anything that needs to plan. The whole agent loop is maybe sixty lines of code I can read in one screen.

Most posts will tell you to pick a framework so you do not have to write the agent loop yourself. That loop is fifty lines. Writing it yourself gives you control over retries, error handling, max iterations, tool validation, and cost guards. Frameworks hide all of that behind decorators.

If you only learn one thing from this post: read the Anthropic SDK docs, write the loop once, and you will never want a framework again for single-agent work.

Claude Code (for engineering agents)

Verdict: the only coding agent I trust with my own codebase.

Claude Code is different from the other tools here. It is an agent designed for one job: working on code. I use it for refactors, migrations, debugging, and writing new features. The hooks system, subagents, and skills architecture are what make it production-grade rather than a toy.

If you are building an agent that writes or modifies code, the answer is Claude Code or the Claude Agent SDK underneath it. Do not roll your own coding agent on top of a generic framework. The orchestration of file edits, test runs, and grep loops is non-trivial and Anthropic has already solved it.

Trigger.dev (orchestration)

Verdict: the missing piece nobody talks about.

Every framework comparison post forgets the operational layer. An agent that runs once when you call it is a demo. An agent that runs every thirty minutes, retries on failure, fans out to subtasks, and tells you when it crashed is a product.

Trigger.dev does that. I write the agent logic as a normal task function. Trigger handles scheduling, retries, queueing, dead-letter handling, and the dashboard. When something breaks I see exactly which run failed and at which step.

The alternative is rolling your own with cron, a queue, a database, and a dashboard. I have done that. Trigger replaces a week of infrastructure with one config file.

Langfuse (observability)

Verdict: non-negotiable for any agent past prototype.

You cannot debug what you cannot see. Langfuse traces every LLM call, every tool call, every input, every output, the latency, and the cost. When a client says "the agent did something weird last Tuesday", I open Langfuse, find the trace, and have the answer in two minutes.

I tried LangSmith. It works but locks you into the LangChain ecosystem. Langfuse is open source, self-hostable, and model-agnostic. For my stack it is the right call.

Frameworks I tested and rejected

LangChain

Verdict: too much surface area, too many breaking changes.

LangChain is the framework most lists rank number one. I tried it on two production builds in 2024 and 2025. Both times the abstraction layer cost me more than it saved. Chains and agents have overlapping APIs. The library updates break my code every few months. The documentation is split across LangChain, LangGraph, and LangSmith and they do not always agree.

If you are building a RAG pipeline with twenty retrievers and need every loader and chunker under the sun, LangChain is fine. For agents, write the loop yourself.

LangGraph

Verdict: the best of the framework options, but still more complexity than I need.

LangGraph is genuinely well designed. The state graph model is the right mental model for stateful agents. If you have a team that needs an opinionated framework with human-in-the-loop and checkpointing baked in, this is the one to pick.

I do not pick it because I get the same thing from Trigger.dev plus a sixty-line agent loop, without learning a new abstraction or coupling myself to the LangChain ecosystem. For a solo engineer or a small team, the framework tax is not worth it.

AutoGen

Verdict: a research framework, not a production framework.

AutoGen popularised conversational multi-agent systems. Agents talk to each other to solve a problem. It is fun to play with. It is hard to constrain when you need predictable behaviour. The Microsoft v0.4 rewrite and the AG2 community fork split the ecosystem. I tested both and shipped neither.

If you are writing a research paper or exploring agent dynamics, use it. If you are shipping for a paying client, do not.

CrewAI

Verdict: nice demos, weak production story.

CrewAI uses a researcher-writer-reviewer metaphor. You define agents with roles and goals and they collaborate. The demo videos are great. The production reality is that role-based collaboration usually compresses into one well-prompted agent with the right tools.

Every time I scoped a CrewAI build, I ended up rewriting it as a single Claude API call with a system prompt that played all the roles. Faster, cheaper, easier to debug, same result.

How to pick for your use case

Three questions. Answer them honestly.

  1. Is this a single agent or a multi-agent system? If single, Claude API direct. If multi, you almost always still want single with subagents triggered as tools. True multi-agent is rarely the right answer.
  2. Does it run on a schedule or fan out to subtasks? If yes, Trigger.dev underneath. If it is request-response only, a serverless function is fine.
  3. Will this run for more than thirty days in production? If yes, Langfuse from day one. Do not retrofit observability.

If you answered yes to all three and you still want a framework, pick LangGraph. It is the cleanest of the framework options. Otherwise, compose the four layers and own the code.

Common mistakes

Picking the framework before defining the agent. Frameworks shape what your agent can do. Decide what the agent does first. Then pick the smallest tool that supports it.

Skipping observability. Every team that skips observability comes back to add it after the first production incident. Add Langfuse on day one. It takes thirty minutes.

Multi-agent when single agent works. Roles, crews, and conversational agents look sophisticated. They almost always collapse into one well-prompted agent with the right tools. Try the single agent first.

Treating the LLM as the whole product. The model is one of four layers. The other three are tools, memory, and observability. If you only think about the prompt, the agent will not survive contact with real users.

Hourly billing your way through a framework refactor. Every framework rewrite I have done burned two weeks. The math on framework switching costs is brutal. Pick the simplest stack you can ship on and stay there.

The stack I would build today

If I started a new agent project tomorrow morning, here is the exact stack I would use:

That is the whole list. No LangChain, no CrewAI, no AutoGen. Five tools, each doing one job, each with a clean API, each replaceable. Production agents that run on a Tuesday at 3am when nobody is watching.

That is the work. The frameworks are the distraction.

Run on a stack that's holding you back?

Book a 45-minute discovery call. I'll map what moves, what stays, and what makes sense for your operation.

Book a call

Frequently Asked Questions

What is the best AI agent framework in 2026?
There is no single best framework. The best stack is Claude API direct plus Trigger.dev for orchestration plus Supabase for state plus Langfuse for observability. Frameworks like LangChain and CrewAI add abstraction layers that cost more than they save in production. If you must pick a framework, LangGraph is the cleanest option.
Should I use LangChain for an AI agent?
Not for agents. LangChain is fine for RAG pipelines with many retrievers. For agents, the agent loop is about sixty lines of code. Writing it yourself with the Anthropic or OpenAI SDK gives you full control over retries, validation, and cost guards. LangChain's frequent breaking changes make production maintenance painful.
Is CrewAI good for multi-agent systems?
CrewAI has clean demos and a clear role-based mental model. In practice, most multi-agent systems collapse into a single well-prompted agent with the right tools. Try the single-agent version first. If you genuinely need separate agents, subagents called as tools are usually a better pattern than CrewAI crews.
What is the difference between Claude Code and the Claude Agent SDK?
Claude Code is a finished product. It is an engineering agent that runs in your terminal and works on your codebase. The Claude Agent SDK is the underlying framework that powers Claude Code. Use Claude Code if you want a coding agent today. Use the Agent SDK if you are building a custom agent that needs hooks, MCP integration, and subagents.
Do I need observability for an AI agent?
Yes, from day one. Every agent I have shipped that did not have observability got it retrofitted after the first production incident. Langfuse is the tool I use. It traces every LLM call, tool call, latency, and cost so you can debug exactly what the agent did and when. Thirty minutes to set up. Non-negotiable past prototype.
Genevieve Claire
Genevieve Claire
Founder, Formaum — Claude Code Expert & Full-Stack AI Engineer

Builds bespoke AI automation systems for multi-location operations. Previously EA Sports FIFA ($7B franchise) and Film/TV VFX on Skyfall, Avengers, Game of Thrones. Based in Vancouver, BC.