How to Build an AI Agent That Runs in Production (Not Just a Demo)

A production AI agent has four layers: an LLM brain that reasons, a tool layer that lets it act on the real world, a memory layer that gives it context across runs, and an observability layer that tells you when something broke at 3am. Skip any one of those and you have a demo, not a system. I'm Genevieve Claire and I build AI agents for multi-location operations at Formaum. Here is the architecture, the stack, and the build process I use when an agent has to actually run on a Tuesday at 3am when nobody is watching.

The difference between a demo agent and a production agent

Most tutorials show you a script. You run it in a notebook, it calls an LLM, it prints a result. That is not an agent. That is a function call wrapped in a loop.

A production agent has to handle the things tutorials skip. The API times out. The tool returns null. The user asks for something out of scope. The context window fills up halfway through a multi-step task. The agent hallucinates a tool name. The database is locked. The customer is angry. The cron fires at 3am and nobody is awake to babysit it.

The difference between a demo and a production agent is not the model. It is everything around the model.

Architecture: the 4 layers

Every production agent I ship has the same four layers. The stack inside each layer changes by client. The layers do not.

1. The LLM brain

This is the reasoning engine. Today I default to Claude Sonnet 4.6 for most agent work and Claude Opus when the task needs deeper reasoning. The brain decides what to do next. It does not do the doing.

Your job here is to write the system prompt, define the agent's scope, and pick the model that matches the task. Mechanical work goes to a cheap model. Judgment goes to a smart one. Mixing tiers inside one agent is a cost lever most people miss.

2. The tool layer

An agent without tools is just a chatbot. Tools are the functions the LLM can call: send an SMS, write to a CRM, query a database, book a meeting, charge a card.

Every tool needs three things: a tight schema so the model cannot pass garbage, a real error return so the model knows when it failed, and idempotency so a retry does not double-charge a customer. Tools are where most production agents quietly leak money.

3. Memory

Memory is what the agent remembers between turns and between runs. Two kinds matter. Short-term memory is the current conversation or task state, held in the context window. Long-term memory is what the agent recalls about a customer, an account, or a past run.

For long-term memory I use Supabase with pgvector for semantic recall and a regular Postgres table for structured facts the agent should never get wrong. Names, account IDs, dollar amounts, and policy rules belong in structured memory, not in a vector store. Vector stores hallucinate quietly. Tables do not.

4. Observability

This is the layer most people skip and the one that decides whether the agent survives. You need to see every prompt, every tool call, every cost, every failure, in real time, searchable, without SSHing into a server.

I use Langfuse for trace logging. Every agent run gets a trace. Every tool call is a span. Every cost is tagged. When a client tells me "the bot did something weird yesterday at 4pm," I open Langfuse, filter to that window, and read the actual conversation. No guessing. No log diving.

Pick the stack

Here is the stack I ship on. It is opinionated. It is not the only stack that works. It is the one I have shipped to production enough times to trust.

Brain: Claude API (Sonnet 4.6 default, Opus for high-judgment tasks, Haiku for mechanical sub-agents)
Runtime: Trigger.dev for scheduled jobs, retries, queueing, and long-running tasks. This is the part that makes "runs at 3am" actually true.
Database and memory: Supabase. Postgres for structured facts, pgvector for semantic recall.
Observability: Langfuse for traces, costs, and prompt versioning.
External tools: whatever the client already runs on. GoHighLevel, Twilio, Stripe, Google Workspace, ClickUp, Slack. The agent goes to the data. The data does not move for the agent.

For one-off internal agents I use Claude Code directly with custom skills. For client-facing production systems I move to Trigger.dev so the schedule, retries, and queueing are not my problem at 2am.

The 5-step build process

This is the order I build in. Skipping a step is how you ship a demo by accident.

Step 1: Define one job

One agent. One job. "Recover no-show leads in the first 24 hours" is a job. "Be a helpful assistant" is not. If you cannot write the job in one sentence with a measurable outcome, you are not ready to build.

Step 2: Map the tools

List every action the agent needs to take. For each one, find or write the API. Define the schema. Define what success looks like. Define what failure looks like. Most agents fail because a tool was assumed to exist and did not, or returned a shape the model could not parse.

Step 3: Write the system prompt with explicit guardrails

The system prompt tells the agent who it is, what it can do, what it cannot do, and how to escalate. Be specific. "Never promise a refund" is a guardrail. "Be careful" is not. I keep system prompts under 2,000 tokens and version every change in Langfuse.

Step 4: Build the loop with observability from day one

Wire up the LLM call, the tool dispatcher, and the trace logger in the first pass. Do not bolt observability on later. You cannot debug what you cannot see, and you will need to debug from day one.

Step 5: Eval before launch

Build a test set of 20 to 50 real cases pulled from the client's actual history. Run the agent against them. Read every trace. Find the edge cases. Patch the prompt. Re-run. An agent that passes 18 of 20 real cases is closer to production than one that nailed a polished demo.

Guardrails: what catches the agent before production breaks

Guardrails are the difference between an agent you can sleep through and an agent you have to babysit.

Schema validation on every tool input. If the model passes a malformed argument, the tool refuses to run and tells the agent to try again. This catches hallucinated parameters before they hit your database.
Allow-list for destructive actions. The agent can read freely. The agent cannot delete, charge, or send to anyone not on the approved list without an explicit human step.
Cost cap per run. Hard ceiling on tokens spent in a single trace. If the agent loops, the cap kills it before the bill does.
Escalation path. When the agent does not know, it routes to a human in Slack with the full context. Confident wrong answers are the most expensive failure mode.
Replayable traces. Every run can be re-run from any step. When a client says "this was wrong," I replay, patch, and verify in minutes, not hours.

A real example

One of my clients runs a multi-location operation across five sites. Before the agent, missed leads sat in a spreadsheet until someone got to them, usually a day late. The recovery rate on a 24-hour-old lead is roughly half of what it is on a one-hour-old lead. Real money was leaking every night.

The agent I built has one job: when a lead misses an intake step, re-engage them by SMS in under five minutes, book them into the calendar if they reply, and escalate to a human if the conversation goes off-script. It runs on Trigger.dev with a webhook trigger from the CRM. The brain is Claude Sonnet. Memory is in Supabase. Every trace is in Langfuse.

In the first month it recovered enough leads to cover its build cost three times over. Nobody on the team had to watch it. That is the bar. Recovered revenue against build cost is the only ROI math that matters.

Common mistakes

Picking the framework before the problem. LangChain, CrewAI, AutoGen, the OpenAI Agents SDK. They are tools, not strategies. Pick the framework after you have written the system prompt and listed the tools, not before.
One giant agent instead of small specialised ones. A do-everything agent is harder to prompt, harder to debug, and harder to control. Split by job. Let a router decide which agent runs.
No eval set. Shipping without a test set is shipping blind. You will not know it broke until a customer tells you.
Skipping observability. If you cannot read every trace in production, you do not own the system. The model owns it.
Treating memory as one thing. Vector store for structured facts is how you get the customer's name wrong. Structured table for fuzzy recall is how you get nothing back. Pick the right tool per fact type.
Building before defining the failure mode. What does this agent do when it does not know? Write that answer before you write the system prompt.

The agents that survive are not the ones with the smartest model. They are the ones with the tightest tools, the cleanest guardrails, and the trace logs you can actually read at 3am. Build for the 3am Tuesday, not the demo.

Run on a stack that's holding you back?

Book a 45-minute discovery call. I'll map what moves, what stays, and what makes sense for your operation.

Book a call

Frequently Asked Questions

What is the easiest way to build an AI agent in 2026?

Start with the Claude API and a single tool. Write a system prompt that defines one job. Add a trace logger like Langfuse from the first commit. Get one job working end to end before adding a framework. Frameworks help once you have more than one agent. Before that they add complexity you do not need.

Do I need LangChain or CrewAI to build an AI agent?

No. For a single-purpose production agent, calling the LLM API directly is faster, cheaper, and easier to debug. Frameworks earn their weight when you have multiple agents that share tools, memory, and routing logic. Pick the framework after the problem, not before.

How much does it cost to build and run an AI agent?

Build cost varies with scope. A focused production agent typically lands between $6K and $25K depending on tool surface area and integrations. Running cost is usually dominated by tokens. A well-tuned agent runs on a few cents to a few dollars per task. The ROI math that matters is recovered revenue against build cost, not absolute spend.

What is the difference between an AI agent and a chatbot?

A chatbot answers questions. An agent does a job. The agent has tools it can call, memory it can use, and a runtime that lets it take multi-step actions across systems. A chatbot stops at the reply. An agent keeps going until the task is done.

How do I make sure an AI agent does not break in production?

Four things. Schema-validate every tool input. Allow-list every destructive action. Cap cost per run. Log every trace to an observability tool like Langfuse. Then build an eval set of 20 to 50 real cases and run it before every prompt change. Most production failures are caught by the guardrails, not the model.

Genevieve Claire

Founder, Formaum — Claude Code Expert & Full-Stack AI Engineer

Builds bespoke AI automation systems for multi-location operations. Previously EA Sports FIFA ($7B franchise) and Film/TV VFX on Skyfall, Avengers, Game of Thrones. Based in Vancouver, BC.