AI Agent Architecture: The 6 Layers Every Production System Needs

A production AI agent architecture has six layers: an LLM brain, a tool layer, a memory layer, an observability layer, an orchestration runtime, and an escalation path to a human. Drop any one of these and the system breaks in production. I am Genevieve Claire, an AI automation engineer at Formaum, and this is the reference architecture I ship to multi-location operations clients.

The honest reality: most AI agent architecture diagrams are theoretical

Search this term and you get the same boxes redrawn a hundred ways. Perception. Reasoning. Action. A loop in the middle. ReAct. Plan-and-execute. Neural-symbolic. The diagrams are correct and useless.

They describe what an agent is. They do not describe what an agent needs to run for a year without a human babysitting it. That is a different document.

I architect, build, and deploy AI agents for businesses with real revenue and real operational drag. Background is EA Sports FIFA and film production. Coordination problems at scale are the work. What follows is what actually ships.

The 6-layer production architecture

Here is the stack. Every layer has a job. Every layer has a tool I default to.

┌─────────────────────────────────────────────────┐
│  6. ESCALATION    Human-in-loop, confidence     │
├─────────────────────────────────────────────────┤
│  5. ORCHESTRATION Trigger.dev runtime, queues   │
├─────────────────────────────────────────────────┤
│  4. OBSERVABILITY Langfuse, audit logs, traces  │
├─────────────────────────────────────────────────┤
│  3. MEMORY        Postgres + pgvector, episodic │
├─────────────────────────────────────────────────┤
│  2. TOOLS         MCP servers, function calls   │
├─────────────────────────────────────────────────┤
│  1. BRAIN         LLM (Claude, GPT, routed)     │
└─────────────────────────────────────────────────┘

The order matters. The brain calls tools. Tools read and write memory. Every call and every write emits a trace. Traces feed observability. Observability and confidence checks trigger escalation. Orchestration runs the whole loop on a schedule or a webhook. Cut any layer and the system stops being production-grade.

Layer 1: The LLM brain

The model is not the architecture. It is one component. I default to Claude Sonnet for most agent work. I switch to Opus when the task needs deep reasoning and to Haiku for mechanical classification.

The rule I live by: pick the cheapest model that can do the job. A Haiku call (or a GPT-4o mini call, if you route across providers) costs a fraction of a Sonnet call. If the agent is sorting an email into one of six labels, that is a Haiku job. If the agent is writing a client-facing reply, that is a Sonnet job. If the agent is making a strategic call, that is an Opus job.

Most production systems I see route between two models. A small fast model handles the 90% case. A bigger model handles the 10% where reasoning matters. Build the router on day one.
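
A minimal sketch of such a router, in Python for brevity. The task kinds, the routing table, and the tier labels are assumptions made up for illustration; map the tiers to whatever model IDs your provider exposes.

```python
from dataclasses import dataclass

# Hypothetical tier labels, not real model IDs.
TIERS = {"small": "haiku", "medium": "sonnet", "large": "opus"}

# Assumed routing table: the cheapest tier expected to handle each kind of work.
ROUTES = {"classify": "small", "draft_reply": "medium", "strategy": "large"}


@dataclass(frozen=True)
class Task:
    kind: str                      # e.g. "classify", "draft_reply", "strategy"
    customer_facing: bool = False  # True if the output goes straight to a customer


def pick_model(task: Task) -> str:
    """Return the cheapest tier assumed adequate for the task."""
    # Unknown kinds of work default to the middle tier rather than the cheapest.
    tier = ROUTES.get(task.kind, "medium")
    # Assumed policy: customer-facing output never comes from the smallest tier.
    if task.customer_facing and tier == "small":
        tier = "medium"
    return TIERS[tier]
```

Keeping the routing table as plain data means the 90/10 split can be tuned from traces later without touching the agent code.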

Layer 2: The tool layer

An agent without tools is a chatbot. The tool layer is what makes it act in the world.

Three things matter here. First, the interface. I use MCP servers where they exist and function calling where they do not. MCP gives me a clean contract between the model and the system. Second, the allow-list. Every tool the agent can call is explicit. No agent should ever have unrestricted access to a production system just because a connector exists. Third, retry logic. Tools fail. Networks blip. The agent needs to know the difference between a transient failure, which is worth retrying, and a real one, which is not.

A real example from a client build. The agent has access to GoHighLevel, Twilio, the CRM database, and a Slack channel. It cannot access the billing system. It cannot send mass SMS. It cannot delete contacts. The allow-list is shorter than the tool list. That is the point.
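
The allow-list and the retry rule can be sketched together in one dispatcher. This is an illustration under assumptions, not the client code: `TransientError`, `ToolNotAllowed`, `call_tool`, and the tool names in the usage note are all hypothetical.

```python
import time
from typing import Any, Callable


class TransientError(Exception):
    """Worth retrying: timeouts, rate limits, network blips."""


class ToolNotAllowed(Exception):
    """The agent asked for a tool outside its allow-list."""


def call_tool(
    name: str,
    args: dict[str, Any],
    registry: dict[str, Callable[..., Any]],
    allowed: frozenset[str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> Any:
    """Run one tool call, enforcing the allow-list and retrying transient failures."""
    # The allow-list is checked before the registry, so a connector that
    # exists but is not allowed is still refused.
    if name not in allowed:
        raise ToolNotAllowed(name)
    tool = registry[name]
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(**args)
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted, surface the failure
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
        # Any other exception is treated as a real failure and propagates at once.
```

With a registry that contains both `send_sms` and `delete_contact` but an allow-list of only `{"send_sms"}`, a call to `delete_contact` raises `ToolNotAllowed` without ever reaching the connector.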

Layer 3: Memory

This is where most architectures fail. They build short-term memory into the context window and call it done. That works for a chat demo. It does not work for an agent that runs for a year.

My default memory pattern:

┌────────────────────────────────────────┐
│  Short-term: context window            │
│  Long-term: Postgres (structured)      │
│  Semantic: pgvector (embeddings)       │
│  Episodic: events table (timestamped)  │
└────────────────────────────────────────┘

Structured Postgres holds the things the agent needs to know precisely: who the contact is, what stage they are at, what was last sent. Vector storage holds the things the agent needs to retrieve by meaning: past conversations, knowledge base entries, transcripts. An events table records every agent action with a timestamp.

I do not use a dedicated vector DB unless the data volume justifies it. Postgres with pgvector handles most client workloads cleanly. One database. One backup. One mental model.
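
A minimal sketch of the episodic events table. It uses Python's built-in `sqlite3` purely so the example runs without a server; in the stack described here the same table would live in Postgres next to the pgvector columns. The table name, columns, and helper functions are assumptions for illustration.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Any, Optional


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the (hypothetical) agent_events table."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS agent_events (
               id         INTEGER PRIMARY KEY,
               contact_id TEXT NOT NULL,
               action     TEXT NOT NULL,
               payload    TEXT NOT NULL,   -- JSON-encoded details of the action
               created_at TEXT NOT NULL    -- ISO 8601 timestamp in UTC
           )"""
    )
    return db


def record_event(db: sqlite3.Connection, contact_id: str, action: str,
                 payload: dict[str, Any]) -> None:
    """Append one timestamped agent action. Rows are never updated."""
    db.execute(
        "INSERT INTO agent_events (contact_id, action, payload, created_at) "
        "VALUES (?, ?, ?, ?)",
        (contact_id, action, json.dumps(payload),
         datetime.now(timezone.utc).isoformat()),
    )
    db.commit()


def last_action(db: sqlite3.Connection, contact_id: str) -> Optional[str]:
    """Answer "what was last sent?" precisely, without asking the model."""
    row = db.execute(
        "SELECT action FROM agent_events WHERE contact_id = ? "
        "ORDER BY id DESC LIMIT 1",
        (contact_id,),
    ).fetchone()
    return row[0] if row else None
```

The append-only shape is the design choice that matters: the agent's history is queried, not reconstructed from prompts.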

Layer 4: Observability

If you cannot see what your agent is doing, you do not have an agent. You have a black box that occasionally costs you money.

I run Langfuse on every production agent. Every LLM call is traced. Every tool call is logged. Every decision the agent makes has a record. When something goes wrong, and something always goes wrong, I can answer three questions in under a minute: what did the agent see, what did it decide, what did it do.

The audit log is also a compliance artifact. For clients in regulated industries this is non-negotiable. Even for clients who are not regulated, the audit log is how I debug at 3am without waking anyone up.
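
To make the "saw, decided, did" record concrete, here is a minimal tracing sketch. It deliberately does not use the Langfuse SDK: `TRACE` is a plain in-memory list standing in for a real tracing backend, and `traced` and `classify_intent` are hypothetical names.

```python
import functools
import time
from typing import Any, Callable

# Stand-in sink. In production each record would go to a tracing backend.
TRACE: list[dict[str, Any]] = []


def traced(step: str) -> Callable:
    """Decorator that records inputs, output or error, and timing for one step."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            record: dict[str, Any] = {
                "step": step,
                "input": {"args": args, "kwargs": kwargs},  # what the step saw
            }
            try:
                record["output"] = fn(*args, **kwargs)      # what it decided or did
                return record["output"]
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so failed steps are traced too.
                record["ms"] = (time.perf_counter() - start) * 1000
                TRACE.append(record)
        return wrapper
    return decorator


@traced("classify_intent")
def classify_intent(message: str) -> str:
    # Placeholder for an LLM call, keyword-matched so the sketch runs offline.
    return "booking" if "book" in message.lower() else "other"
```

Because the record is appended in `finally`, a step that raises still leaves a trace with its input and error, which is the case that matters most when debugging.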

Layer 5: Orchestration

The runtime is the part nobody talks about and the part that breaks first.

Your agent does not run in a notebook. It runs on a schedule, or on a webhook, or in response to an event. It needs retries. It needs queues. It needs durable state. It needs to recover from a partial failure without re-running the whole job.

I use Trigger.dev for almost every production agent I ship. The reasons: durable execution, native retry logic, real queues, version pinning, and a UI my clients can read. When a job fails, I can re-run from the failed step. When a job takes ten minutes, the runtime does not time out at five.

The alternatives are AWS Lambda + Step Functions, Temporal, or a hand-rolled queue. They all work. Trigger.dev is the fastest path to production for the team sizes I build for.
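
The core idea behind "re-run from the failed step" can be sketched in a few lines. This is not the Trigger.dev API and it is not durable: the `state` dict stands in for the persisted checkpoints a real runtime keeps, and `run_job` is a hypothetical name.

```python
from typing import Any, Callable

# A step is a name plus a function that receives the results of earlier steps.
Step = tuple[str, Callable[[dict[str, Any]], Any]]


def run_job(steps: list[Step], state: dict[str, Any]) -> dict[str, Any]:
    """Run steps in order, skipping any step whose result is already in state."""
    for name, fn in steps:
        if name in state:
            continue  # completed in an earlier attempt, do not repeat it
        # If fn raises, state keeps every earlier result, so calling run_job
        # again with the same state resumes at this step.
        state[name] = fn(state)
    return state
```

Running the job a second time with the same `state` after a failure repeats only the failed step and everything after it. Steps that already sent an SMS or wrote a row are not executed twice, which is the property a cron job cannot give you.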

Layer 6: Escalation

This is the layer that turns an agent from a demo into a system a business can trust.

Every agent I ship has a confidence threshold. Below the threshold, the agent does not act. It escalates. It posts to a Slack channel, drafts an action for a human to approve, or pauses the workflow until a person signs off.

The trigger is not always model confidence. Sometimes it is rule-based. Any action that sends money escalates. Any action that contacts a VIP customer escalates. Any action the agent has not seen before escalates.

The point is that the human is not a fallback for when the agent fails. The human is a designed-in layer of the architecture. The agent knows what it is allowed to do alone and what it has to ask about.
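
A minimal sketch of the escalation gate, combining the rule-based triggers with the confidence threshold. The field names, the `0.8` threshold, and the `should_escalate` helper are assumptions for illustration; the real threshold has to be tuned per client from traces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str                  # e.g. "send_sms", "issue_refund"
    confidence: float          # 0.0 to 1.0, however the system scores it
    moves_money: bool = False
    vip_contact: bool = False


def should_escalate(
    action: Action,
    seen_kinds: frozenset[str],
    threshold: float = 0.8,    # assumed value, not a recommendation
) -> tuple[bool, str]:
    """Return (escalate?, reason). Rules are checked before confidence."""
    if action.moves_money:
        return True, "moves money"
    if action.vip_contact:
        return True, "VIP contact"
    if action.kind not in seen_kinds:
        return True, "unseen action kind"
    if action.confidence < threshold:
        return True, "low confidence"
    return False, "ok"
```

Checking the rules first means a high-confidence refund still goes to a human. The reason string is what gets posted to the Slack channel, so the reviewer knows why they are being asked.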

A real example architecture

Here is an anonymized client system. Multi-location service business. The agent qualifies inbound leads, books discovery calls, and follows up with no-shows.

WEBHOOK (form submission)
        │
        ▼
  Trigger.dev job
        │
        ├─► Haiku: classify lead intent
        │
        ├─► Postgres: fetch contact history
        │
        ├─► Sonnet: draft response
        │
        ├─► Confidence check
        │       ├─► high  → Twilio send SMS
        │       └─► low   → Slack escalate
        │
        ├─► Langfuse: log trace
        │
        └─► Postgres: write event

Six layers. One job. Runs hundreds of times a day. Cost per execution is a few cents. The business owner sees the Slack channel and the Langfuse dashboard. That is the entire interface they need.

Common architecture mistakes

The mistakes I see in production audits are the same ones every time.

Skipping observability. The team ships an agent without traces because traces feel like overhead. Three months later they cannot debug a single failure. Add observability on day one.

Treating memory as an afterthought. Stuffing everything into the context window works until it does not. Design the memory schema before you write the first prompt.

No allow-list on tools. The agent has access to everything because that was easier during development. The first time it does something it should not, you find out the hard way.

No orchestration runtime. The agent runs in a cron job on a single server. When the server reboots, jobs are lost. When a job takes too long, it gets killed. Use a real runtime.

No escalation path. The agent acts on every decision regardless of confidence. The first low-confidence action that goes to a customer is the one you have to apologize for.

Picking the framework before the architecture. LangGraph, CrewAI, Claude Agent SDK, custom code. The framework is downstream of the architecture. Decide what the six layers look like first, then pick the framework that fits.

The bottom line

An AI agent architecture is not a diagram of perception and reasoning loops. It is a six-layer stack with a job at every layer and a tool at every layer. Brain, tools, memory, observability, orchestration, escalation. Build all six on day one or build them in production after something breaks.

I build these systems end-to-end for clients running multi-location operations. If you have an agent that works in a demo but breaks in production, the answer is almost always a missing layer.

Running on a stack that's holding you back?

Book a 45-minute discovery call. I'll map what moves, what stays, and what makes sense for your operation.


Frequently Asked Questions

What are the layers of an AI agent architecture?

A production AI agent architecture has six layers: the LLM brain that reasons, the tool layer that lets the agent act, the memory layer that holds structured and semantic context, the observability layer that traces every call, the orchestration runtime that runs the loop reliably, and the escalation path that hands off to a human when confidence is low.

Do I need a vector database for an AI agent?

Not always. Postgres with the pgvector extension handles most production workloads cleanly. A dedicated vector database is only worth the operational overhead when the data volume or query latency requires it. One database with both structured and semantic storage is simpler to operate.

What is the best orchestration runtime for AI agents?

Trigger.dev is my default for client agents because of durable execution, native retries, real queues, and a UI clients can read. Temporal and AWS Step Functions work too. The point is that an agent needs a real runtime, not a cron job on a server, or partial failures will cost you.

How do I add observability to an AI agent?

Use Langfuse or a similar tracing tool from day one. Every LLM call, every tool call, every decision should be logged with inputs, outputs, and timing. When the agent does something wrong, the trace tells you what it saw, what it decided, and what it did. Without that, you cannot debug a production agent.

When should an AI agent escalate to a human?

Build an escalation layer for two cases. Confidence-based: when the model is uncertain, do not act, draft for review. Rule-based: any action with high impact, like sending money, contacting a VIP, or modifying production data, escalates regardless of confidence. The human is a designed-in layer, not a fallback for when things break.

Genevieve Claire

Founder, Formaum — Claude Code Expert & Full-Stack AI Engineer

Builds bespoke AI automation systems for multi-location operations. Previously EA Sports FIFA ($7B franchise) and Film/TV VFX on Skyfall, Avengers, Game of Thrones. Based in Vancouver, BC.