
Agentic AI Architecture: Memory, Tools, Control Loops & Multi-Agent Orchestration

TopicTrick Team


What Makes a System "Agentic"?

The word "agentic" describes a spectrum of autonomy:

| Level | Description | Example |
| --- | --- | --- |
| 0 — Static | Single prompt → single response | ChatGPT one-shot answer |
| 1 — Tool Use | LLM calls tools, gets results, responds | LLM calls a weather API |
| 2 — RAG | LLM retrieves context before responding | LLM searches documents |
| 3 — Multi-Step | LLM plans and executes multiple steps | LLM researches, drafts, reviews |
| 4 — Autonomous | LLM decides its own tool sequence, recovers from errors | AI coding agent that writes, tests, debugs |
| 5 — Multi-Agent | Multiple specialised LLMs collaborate with handoffs | Researcher + Coder + Reviewer agents |

Level 3+ requires deliberate architectural design — naive implementations at this level hallucinate tool calls, loop infinitely, and produce unauditable results.


The ReAct Control Loop: Think → Act → Observe

ReAct (Reasoning + Acting) is the foundational pattern for agentic control:

python
from anthropic import Anthropic

client = Anthropic()

def run_agent(task: str, tools: list[dict], max_iterations: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    iteration = 0
    
    while iteration < max_iterations:
        iteration += 1
        
        # THINK: LLM reasons about what to do next
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        
        # Check stop condition:
        if response.stop_reason == "end_turn":
            # Extract final text response — agent is done
            # (default to "" in case the final turn contains no text block)
            return next((b.text for b in response.content if hasattr(b, "text")), "")
        
        # ACT: Extract tool calls from response
        tool_uses = [b for b in response.content if b.type == "tool_use"]
        
        if not tool_uses:
            break  # No tools called, no end — unexpected state
        
        # Append assistant's reasoning + tool calls to conversation:
        messages.append({"role": "assistant", "content": response.content})
        
        # OBSERVE: Execute each tool and collect results
        tool_results = []
        for tool_use in tool_uses:
            result = execute_tool(tool_use.name, tool_use.input)  # Your tool router
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": str(result),
            })
        
        # Feed observations back to LLM:
        messages.append({"role": "user", "content": tool_results})
    
    # Max iterations reached — return partial result with warning
    return f"[Agent reached max_iterations={max_iterations}] Last state: {messages[-1]}"

The max_iterations guard is non-negotiable — without it, infinite loops burn tokens and money.


The Four Memory Layers

Effective agents require different memory systems for different timescales:

Practical implementation:

python
class AgentMemory:
    def __init__(self, vector_store, session_store):
        self.vector_store = vector_store   # Long-term semantic memory (Pinecone/Chroma)
        self.session_store = session_store  # Episodic memory (Redis/database)
        self.working_memory: list = []     # In-context window (messages list)
    
    def remember(self, content: str, metadata: dict):
        """Store to long-term semantic memory."""
        embedding = embed(content)  # embed() = your embedding model of choice
        self.vector_store.upsert(embedding, content, metadata)
    
    def recall(self, query: str, k: int = 5) -> list[str]:
        """Retrieve relevant long-term memories."""
        embedding = embed(query)
        results = self.vector_store.query(embedding, k=k)
        return [r.content for r in results]
    
    def inject_relevant_context(self, task: str) -> str:
        """Build context-enriched system prompt from long-term memory."""
        memories = self.recall(task)
        if memories:
            return f"Relevant past context:\n" + "\n".join(f"- {m}" for m in memories)
        return ""
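To make the layering concrete, here is a self-contained toy sketch of the long-term semantic layer. `toy_embed`, `cosine`, and `ToyVectorStore` are illustrative stand-ins for a real embedding model and vector database (Pinecone/Chroma), not production components:

```python
# Toy sketch of long-term semantic memory: bag-of-words "embeddings"
# plus cosine similarity stand in for a real embedder + vector DB.
import math
from dataclasses import dataclass, field

def toy_embed(text: str) -> dict[str, float]:
    """Bag-of-words 'embedding' -- stand-in for a real embedding model."""
    vec: dict[str, float] = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class ToyVectorStore:
    items: list = field(default_factory=list)

    def upsert(self, embedding, content, metadata):
        self.items.append((embedding, content, metadata))

    def query(self, embedding, k: int = 5):
        # Rank stored memories by similarity to the query embedding:
        scored = sorted(self.items, key=lambda it: cosine(embedding, it[0]), reverse=True)
        return [content for _, content, _ in scored[:k]]

store = ToyVectorStore()
store.upsert(toy_embed("user prefers metric units"), "user prefers metric units", {})
store.upsert(toy_embed("project deadline is Friday"), "project deadline is Friday", {})

print(store.query(toy_embed("what units does the user prefer?"), k=1))
# ['user prefers metric units']
```

The same `upsert`/`query` shape maps directly onto real vector stores, so swapping the toy pieces for an embedding API and Chroma/Pinecone changes nothing structural.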

Tool Calling: Architecture and Security

Tools are the agent's interface to the real world. Every tool must be:

  1. Defined with a precise schema (the LLM must understand inputs/outputs)
  2. Idempotent where possible (safe to retry on timeout)
  3. Sandboxed (execution cannot escape its security boundary)
python
# Tool schema (sent to the LLM):
tools = [
    {
        "name": "run_python",
        "description": "Execute Python code and return stdout/stderr. Use for calculations, data processing.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {
                    "type": "string",
                    "description": "Python code to execute. No imports of os, subprocess, sys."
                }
            },
            "required": ["code"]
        }
    },
    {
        "name": "web_search",
        "description": "Search the web for current information. Returns top 5 result snippets.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_results": {"type": "integer", "default": 5}
            },
            "required": ["query"]
        }
    }
]

# Sandboxed code execution (using E2B or Firecracker VMs):
import shlex

import e2b

def execute_tool(name: str, inputs: dict) -> str:
    if name == "run_python":
        with e2b.Sandbox() as sandbox:
            # Runs in isolated microVM — cannot access host filesystem
            result = sandbox.process.start_and_wait(
                "python3 -c " + shlex.quote(inputs["code"])  # quote so arbitrary code survives shell parsing
            )
            return result.stdout or result.stderr
    elif name == "web_search":
        return search_api.query(inputs["query"], inputs.get("max_results", 5))
    raise ValueError(f"Unknown tool: {name}")
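Schema definition only helps if you enforce it. Below is a minimal stdlib sketch of validating a tool call against its `input_schema` before execution, catching hallucinated tool names and parameters; a production system might use the `jsonschema` library instead. The `tools` list here is a trimmed local copy for illustration:

```python
# Minimal "validate before execute" guard: check a proposed tool call
# against its declared input_schema. Stdlib-only sketch.
JSON_TYPES = {"string": str, "integer": int, "number": (int, float),
              "boolean": bool, "object": dict, "array": list}

def validate_tool_call(name: str, inputs: dict, tools: list[dict]) -> list[str]:
    """Return a list of validation errors (empty list means the call is valid)."""
    schema = next((t["input_schema"] for t in tools if t["name"] == name), None)
    if schema is None:
        return [f"unknown tool: {name}"]  # hallucinated tool name
    errors = []
    for req in schema.get("required", []):
        if req not in inputs:
            errors.append(f"missing required parameter: {req}")
    for key, value in inputs.items():
        prop = schema.get("properties", {}).get(key)
        if prop is None:
            errors.append(f"unexpected parameter: {key}")  # hallucinated parameter
        elif not isinstance(value, JSON_TYPES[prop["type"]]):
            errors.append(f"{key}: expected {prop['type']}, got {type(value).__name__}")
    return errors

tools = [{"name": "web_search",
          "input_schema": {"type": "object",
                           "properties": {"query": {"type": "string"},
                                          "max_results": {"type": "integer"}},
                           "required": ["query"]}}]

print(validate_tool_call("web_search", {"query": "MCP spec"}, tools))  # []
print(validate_tool_call("search_web", {"query": "MCP spec"}, tools))  # ['unknown tool: search_web']
```

Rejected calls should be fed back to the LLM as tool results describing the error, so the agent can self-correct instead of crashing.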

The Model Context Protocol (MCP)

MCP (Anthropic, 2024) is an open standard for connecting LLM agents to tools, data sources, and services through a unified protocol:

text
Without MCP:                        With MCP:
Each agent integrates tools         Agent connects to MCP servers
individually (bespoke code):        (standardised protocol):

Agent → custom Slack code           Agent → MCP Client → MCP Server (Slack)
Agent → custom DB code              Agent → MCP Client → MCP Server (PostgreSQL)
Agent → custom Calendar code        Agent → MCP Client → MCP Server (Google Calendar)

Tools are not reusable across        Any MCP-compatible agent can use
agents or frameworks                any MCP server — plug and play

In 2026, major IDEs (VS Code, JetBrains), cloud providers, and SaaS tools publish MCP servers, creating an ecosystem of standardised agent tools.


Multi-Agent Patterns: Supervisor and Swarm

Supervisor Pattern: A coordinator agent routes tasks to specialised subagents:

python
# Supervisor decides which specialist to call:
AGENTS = {
    "researcher":  ResearchAgent(),   # Web search + summarisation
    "coder":       CoderAgent(),      # Code generation + execution
    "writer":      WriterAgent(),     # Long-form content generation
    "reviewer":    ReviewerAgent(),   # Quality checking + critique
}

def supervisor(task: str) -> str:
    # Supervisor LLM decides the workflow:
    plan = supervisor_llm.plan(task)  
    # e.g.: ["researcher", "coder", "reviewer"]
    
    context = task
    for agent_name in plan:
        agent = AGENTS[agent_name]
        result = agent.run(context)
        context = f"{context}\n\nAgent ({agent_name}) output:\n{result}"
    
    return context
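To see the data flow without any LLM calls, here is a runnable toy version of the supervisor loop: stub agents are plain functions and the plan is hard-coded where a supervisor LLM would normally decide it. All names and outputs are illustrative:

```python
# Runnable toy supervisor: stub agents are plain functions; the plan is
# fixed here, standing in for a supervisor LLM's routing decision.
AGENTS = {
    "researcher": lambda ctx: "Found 3 sources on topic X.",
    "coder":      lambda ctx: "Wrote analysis script (42 lines).",
    "reviewer":   lambda ctx: "Approved: sources cited, code runs.",
}

def supervisor(task: str, plan: list[str]) -> str:
    context = task
    for agent_name in plan:
        result = AGENTS[agent_name](context)  # each specialist sees all prior output
        context = f"{context}\n\nAgent ({agent_name}) output:\n{result}"
    return context

transcript = supervisor("Summarise topic X", ["researcher", "coder", "reviewer"])
print(transcript)
```

The accumulating `context` string is the simplest handoff mechanism; real systems typically pass structured messages so each specialist gets only the context it needs.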

Swarm Pattern: Agents autonomously hand off to each other based on context — no central coordinator:

python
# Each agent decides if it should transfer to another agent:
def research_agent(messages, context):
    response = llm.call(messages, tools=[web_search, transfer_to_coder])
    if response.tool == "transfer_to_coder":
        return coder_agent(messages + [response], context)
    return response.content
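A runnable sketch of that handoff, with rule-based decisions standing in for the LLM calls (the keyword check and agent names are illustrative, not a real routing policy):

```python
# Toy swarm handoff: each agent inspects the conversation and either
# answers itself or transfers control to a peer. No central coordinator.
def research_agent(messages: list[str]) -> str:
    messages.append("researcher: gathered requirements")
    if any("code" in m for m in messages):   # stands in for the LLM deciding to hand off
        return coder_agent(messages)
    return "researcher: summary delivered"

def coder_agent(messages: list[str]) -> str:
    messages.append("coder: implemented feature")
    return "coder: done"

print(research_agent(["user: please write code for a parser"]))  # coder: done
print(research_agent(["user: summarise this paper"]))            # researcher: summary delivered
```

Note the shared `messages` list travels with the handoff, so the receiving agent inherits the full conversation — the defining trait of the swarm pattern.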

Failure Modes and Reliability Engineering

| Failure Mode | Cause | Mitigation |
| --- | --- | --- |
| Infinite loop | Agent retries failing tool forever | `max_iterations` limit + exponential backoff |
| Hallucinated tool calls | LLM invents tool names/parameters | Strict tool schema validation before execution |
| Context window overflow | Long tasks fill the context window | Episodic memory summarisation, sliding window |
| Goal drift | Agent forgets original task after many steps | Inject original goal into every iteration prompt |
| Irreversible actions | Agent deletes files, sends emails | Confirmation step for destructive tools |
| Token cost explosion | Complex tasks with many tool calls | Budget limits (`max_tokens_per_task`), cost alerts |
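The retry-with-backoff mitigation for failing tools can be sketched as follows; function names and delay values are illustrative:

```python
# Bounded retry with exponential backoff: a failing tool is retried a few
# times with growing delays, then the failure is surfaced to the agent as
# an observation instead of looping forever.
import time

def call_with_backoff(tool_fn, inputs: dict, max_retries: int = 3,
                      base_delay: float = 0.01) -> str:
    for attempt in range(max_retries):
        try:
            return tool_fn(inputs)
        except Exception as exc:
            if attempt == max_retries - 1:
                # Give the LLM the failure as a tool result so it can re-plan:
                return f"[tool failed after {max_retries} attempts: {exc}]"
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, 0.04s...

calls = {"n": 0}
def flaky_tool(inputs):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("upstream timeout")
    return "ok"

print(call_with_backoff(flaky_tool, {}))  # ok
```

Returning the failure as a string observation, rather than raising, keeps the ReAct loop alive: the agent can choose a different tool or report the problem to the user.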

Frequently Asked Questions

When should I use an agent vs a fixed workflow? Use a fixed workflow (LangGraph conditional edges, prompt chaining) when the steps are known and the sequence is predictable. Use an agent when the steps are unknown upfront and the LLM needs to decide its own strategy based on intermediate results. Agents are powerful but expensive and hard to debug — always prefer the simpler fixed workflow unless genuine autonomy is required.

How do I evaluate if my agent is actually working correctly? Agentic evaluation requires trajectory evaluation — not just the final answer, but whether the agent took reasonable steps to get there. Tools like LangSmith, Braintrust, or custom evaluation harnesses let you record agent traces (all tool calls, reasoning steps, observations) and score them. Minimum viable evaluation: a test suite of representative tasks with expected outcomes, run after every code change.


Key Takeaway

Agentic AI architecture is where software engineering and AI research converge. The ReAct loop, multi-layer memory, sandboxed tool execution, and multi-agent orchestration are not optional refinements — they are load-bearing architectural components that determine whether your agent is reliable or a liability. As LLMs become more capable in 2026, the bottleneck shifts from model intelligence to system design: the quality of your tool schemas, your memory retrieval strategy, your failure handling, and your evaluation framework are what separate production-grade agents from impressive demos.

Read next: RAG Architecture Patterns: Building Knowledge-Grounded AI →


Part of the Software Architecture Hub — comprehensive guides from architectural foundations to advanced distributed systems patterns.