AI Agents

An AI Agent is an autonomous system that perceives its environment, reasons about its goals, makes decisions, and takes actions to achieve those goals. Unlike simple LLMs that generate text in response to a prompt, agents can interact with tools, APIs, databases, file systems, and other systems—enabling complex, multi-step workflows like researching topics, writing and executing code, managing infrastructure, or orchestrating business processes.

The emergence of capable LLMs as reasoning engines has transformed AI agents from academic curiosities (simple rule-based bots) into practical production systems. This chapter covers agent architectures, tool use, the Model Context Protocol (MCP), memory systems, planning strategies, multi-agent collaboration, and production best practices.


1. Agent Fundamentals

What Makes an Agent?

An agent is more than just an LLM. The key distinction is the action loop — an agent iteratively reasons and acts until a goal is achieved:

Component | LLM (Chat) | AI Agent
----------|------------|---------
Input | Single prompt | Goal + environment state
Processing | Single forward pass | Multiple reasoning + action cycles
Output | Text response | Actions + observations + final answer
Tools | None | Functions, APIs, code execution
Memory | Context window only | Short-term + long-term memory
Autonomy | Responds to each prompt | Operates independently until goal met

Agent Characteristics

  • Autonomy: Operates independently with minimal human intervention. Can decide what steps to take, which tools to use, and when to ask for help.
  • Reactivity: Responds to changes in the environment (new data, tool failures, user corrections).
  • Proactiveness: Takes initiative to achieve goals, anticipating what information or actions are needed.
  • Tool Use: Interacts with external systems through well-defined interfaces.
  • Memory: Maintains context across interactions, learns from past experiences.
  • Reflection: Evaluates its own performance and adjusts strategies.

The Agent Loop

Every agent follows a variation of this fundamental loop:

def agent_loop(goal: str, max_steps: int, memory, environment) -> str:
    state = initialize_state(goal)

    for step in range(max_steps):
        # 1. PERCEIVE: Gather current context
        context = gather_context(state, memory, environment)

        # 2. REASON: Decide what to do next
        thought, action = llm.reason(context)

        # 3. ACT: Execute the chosen action
        if action.type == "final_answer":
            return action.content

        observation = execute_action(action)

        # 4. OBSERVE: Process the result
        state.add(thought, action, observation)

        # 5. REFLECT: Evaluate progress (optional)
        if should_reflect(state):
            reflection = llm.reflect(state)
            state.add_reflection(reflection)

    return "Goal not achieved within step limit."

2. Agent Architectures

ReAct (Reasoning + Acting)

The most widely used agent pattern, ReAct interleaves reasoning traces with tool actions:

Thought: I need to find the current weather in Paris.
Action: search_web("current weather in Paris")
Observation: Temperature: 15°C, Partly cloudy, Humidity: 65%
Thought: I now have the weather information. Let me format a response.
Action: final_answer("The current weather in Paris is 15°C and partly cloudy with 65% humidity.")

Pseudocode (ReAct Agent):

class ReActAgent:
    llm: LanguageModel
    tools: dict[str, Tool]
    system_prompt: str

    def execute(self, goal: str, max_iterations: int = 10) -> str:
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": goal}
        ]

        for iteration in range(max_iterations):
            # LLM generates thought + action (or final answer)
            response = self.llm.generate(messages, tools=self.tools)

            if response.has_tool_call:
                # Execute the tool
                tool_name = response.tool_call.name
                tool_args = response.tool_call.arguments
                tool = self.tools[tool_name]

                try:
                    observation = tool.execute(**tool_args)
                except Exception as e:
                    observation = f"Error: {str(e)}"

                # Add action + observation to conversation
                messages.append({"role": "assistant", "content": response.text, "tool_call": response.tool_call})
                messages.append({"role": "tool", "content": observation, "tool_call_id": response.tool_call.id})
            else:
                # Final answer
                return response.text

        return "Maximum iterations reached."

Plan-and-Execute

Separates planning from execution. First create a plan, then execute each step:

class PlanAndExecuteAgent:
    planner_llm: LanguageModel    # Creates the plan
    executor_llm: LanguageModel   # Executes each step
    tools: dict[str, Tool]

    def execute(self, goal: str) -> str:
        # Step 1: Create a plan
        plan = self.planner_llm.generate(
            f"Create a step-by-step plan to achieve: {goal}\n"
            f"Available tools: {list(self.tools.keys())}\n"
            "Return a numbered list of steps."
        )
        steps = parse_plan(plan)

        # Step 2: Execute each step
        results = []
        for i, step in enumerate(steps):
            context = f"Goal: {goal}\nPlan: {plan}\n"
            context += f"Completed: {results}\n"
            context += f"Current step ({i+1}/{len(steps)}): {step}"

            result = self.execute_step(step, context)
            results.append(result)

            # Re-plan if needed (step failed or new information)
            if should_replan(result, steps[i+1:]):
                remaining_steps = self.replan(goal, results, steps[i+1:])
                steps = steps[:i+1] + remaining_steps

        # Step 3: Synthesize final answer
        return self.synthesize(goal, results)

Benefits: Better for complex, multi-step tasks. The plan provides a roadmap that can be adjusted. Drawbacks: Planning overhead, plan may become stale as new information emerges.

Reflexion

The agent reflects on past actions and failures to improve future performance:

class ReflexionAgent:
    llm: LanguageModel
    tools: dict[str, Tool]
    reflection_memory: list[str]  # Past reflections

    def execute(self, goal: str, max_trials: int = 3) -> str:
        for trial in range(max_trials):
            # Include past reflections in context
            context = f"Goal: {goal}\n"
            if self.reflection_memory:
                context += "Lessons from past attempts:\n"
                for r in self.reflection_memory:
                    context += f"- {r}\n"

            # Execute with ReAct
            result = self.react_execute(goal, context)

            # Evaluate result
            if self.evaluate(goal, result):
                return result

            # Reflect on failure
            reflection = self.llm.generate(
                f"Your attempt to achieve '{goal}' produced: {result}\n"
                f"This was not satisfactory. Reflect on what went wrong "
                f"and what you should do differently next time."
            )
            self.reflection_memory.append(reflection)

        return "Failed after maximum trials."

Architecture Comparison

Architecture | Planning | Reflection | Best For | Complexity
-------------|----------|------------|----------|-----------
ReAct | Implicit (step-by-step) | None | Simple to moderate tasks | Low
Plan-and-Execute | Explicit upfront plan | Re-planning | Complex multi-step tasks | Medium
Reflexion | Implicit + learned | Explicit reflection loop | Tasks requiring iteration | Medium
Tree of Thoughts | Explores multiple paths | Evaluates branches | Problems with multiple strategies | High
LATS (Language Agent Tree Search) | Monte Carlo tree search | Value function | Optimal action sequences | Very high

3. Tools and Function Calling

Agents interact with the world through tools — functions that can be invoked to perform actions or retrieve information.

Function Calling

Modern LLMs support structured function calling, where the model outputs a structured JSON object specifying which function to call and with what arguments.

Tool Definition (OpenAI/Anthropic format):

{
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "search_web",
                "description": "Search the web for current information on any topic. Use this when you need up-to-date information that might not be in your training data.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "The search query. Be specific and include relevant keywords."
                        },
                        "max_results": {
                            "type": "integer",
                            "description": "Maximum number of results to return",
                            "default": 5
                        }
                    },
                    "required": ["query"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "execute_python",
                "description": "Execute Python code in a sandboxed environment. Use for calculations, data processing, or generating visualizations.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "code": {
                            "type": "string",
                            "description": "Python code to execute"
                        }
                    },
                    "required": ["code"]
                }
            }
        }
    ]
}
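
To make the round-trip concrete, here is a hedged sketch using the OpenAI Python client (the model name is a placeholder and search_web is a stub you would implement; other providers follow the same pattern with slightly different message shapes):

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def search_web(query: str, max_results: int = 5) -> str:
    ...  # stub: call your search backend here

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)  # `tools` = the JSON above
msg = response.choices[0].message

if msg.tool_calls:
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
    result = search_web(**args)
    # Feed the observation back so the model can produce a grounded answer
    messages.append(msg)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)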

Common Tool Categories

Category | Examples | Use Cases
---------|----------|----------
Web Search | Tavily, Serper, Brave Search API | Current information, fact-checking
Code Execution | Python interpreter, Jupyter, sandboxed environments | Calculations, data processing, testing
File Operations | Read/write files, directory listing | Document processing, code modification
Database | SQL queries, vector search, graph queries | Data retrieval, knowledge base access
API Integration | REST/GraphQL calls, webhooks | External service interaction
Browser | Playwright, Puppeteer, browser automation | Web scraping, form filling, testing
Communication | Email, Slack, Teams APIs | Notifications, collaboration
Computation | Calculator, unit conversion, statistics | Mathematical operations

Tool Execution Safety

class SafeToolExecutor:
    allowed_tools: set[str]
    confirmation_required: set[str]  # Tools that need human approval
    rate_limiter: RateLimiter
    sandbox: Sandbox

    def execute(self, tool_call: ToolCall) -> str:
        # 1. Validate tool exists and is allowed
        if tool_call.name not in self.allowed_tools:
            return f"Error: Tool '{tool_call.name}' is not available."

        # 2. Rate limiting
        if not self.rate_limiter.allow(tool_call.name):
            return "Error: Rate limit exceeded. Try again later."

        # 3. Validate parameters
        if not validate_params(tool_call.name, tool_call.arguments):
            return f"Error: Invalid parameters for {tool_call.name}."

        # 4. Human confirmation for sensitive tools
        if tool_call.name in self.confirmation_required:
            approved = request_human_approval(tool_call)
            if not approved:
                return "Action cancelled by user."

        # 5. Execute in sandbox
        try:
            result = self.sandbox.execute(tool_call.name, tool_call.arguments)
            return truncate_if_needed(result, max_length=10000)
        except TimeoutError:
            return "Error: Tool execution timed out."
        except Exception as e:
            return f"Error: {str(e)}"

Tool Design Best Practices

  1. Clear descriptions: Tool descriptions are part of the LLM prompt. Write them as if explaining to a colleague. Include examples of when to use the tool.
  2. Well-defined schemas: Use precise JSON Schema with descriptions for each parameter. Constrain types and provide defaults.
  3. Error messages: Return clear, actionable error messages so the agent can recover.
  4. Idempotency: Where possible, make tools idempotent (safe to retry).
  5. Output format: Return structured data that's easy for the LLM to parse. Truncate large outputs.
  6. Granularity: Prefer specific, focused tools over monolithic ones. search_documentation is better than do_anything.
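
Practices 1 and 2 can be partially automated by deriving the schema from a typed function. A minimal sketch (the type-to-schema mapping is deliberately simplistic and covers only scalar types):

import inspect

def to_tool_schema(fn) -> dict:
    """Derive an OpenAI-style tool definition from a typed Python function."""
    type_map = {str: "string", int: "integer", float: "number", bool: "boolean"}
    sig = inspect.signature(fn)
    properties, required = {}, []
    for name, param in sig.parameters.items():
        properties[name] = {"type": type_map.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
        else:
            properties[name]["default"] = param.default
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {"type": "object", "properties": properties, "required": required},
        },
    }

This keeps the description (the docstring) and the schema next to the code they describe, so the two are less likely to drift out of sync.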

4. Model Context Protocol (MCP)

Model Context Protocol (MCP) is an open standard (introduced by Anthropic in 2024) that standardizes how AI applications interact with external tools, data sources, and systems. MCP provides a universal interface that replaces the need for custom integrations with each tool or data source.

Why MCP?

Without MCP, every AI application must build custom integrations:

Before MCP:
  App A → Custom integration → Tool 1
  App A → Custom integration → Tool 2
  App B → Different integration → Tool 1
  App B → Different integration → Tool 2
  (N apps × M tools = N×M integrations)

With MCP:
  App A → MCP Client → MCP Server → Tool 1
  App B → MCP Client → MCP Server → Tool 2
  (N apps × 1 protocol + M servers = N + M integrations)

MCP Architecture

MCP uses a client-server model with three roles:

  • MCP Host: The AI application (e.g., Claude Desktop, Cursor, custom app) that serves as the user-facing interface.
  • MCP Client: Resides within the host, manages connections to MCP servers, translates between the host's needs and the MCP protocol.
  • MCP Server: A lightweight program that exposes capabilities through the standard MCP interface. Each server provides access to specific tools or data sources.

MCP Capabilities

MCP servers can expose three types of capabilities:

1. Resources

Read-only data sources that the client can access:

{
    "resources": [
        {
            "uri": "file:///path/to/codebase",
            "name": "Project Source Code",
            "description": "The current project's source code repository",
            "mimeType": "text/plain"
        },
        {
            "uri": "postgres://localhost/mydb/users",
            "name": "Users Table",
            "description": "Database table containing user information"
        }
    ]
}

Resources support:

  • Static resources: Known URIs listed upfront.
  • Dynamic resources: Resources discovered via templates (e.g., file:///{path}).
  • Subscriptions: Clients can subscribe to resource changes.

2. Tools

Functions that the server can execute:

{
    "tools": [
        {
            "name": "query_database",
            "description": "Execute a read-only SQL query against the database",
            "inputSchema": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "SQL SELECT query to execute"
                    }
                },
                "required": ["query"]
            }
        }
    ]
}

3. Prompts

Pre-defined prompt templates that guide LLM interactions:

{
    "prompts": [
        {
            "name": "code_review",
            "description": "Review code for bugs and improvements",
            "arguments": [
                {
                    "name": "code",
                    "description": "The code to review",
                    "required": true
                },
                {
                    "name": "language",
                    "description": "Programming language",
                    "required": false
                }
            ]
        }
    ]
}

MCP Transport

MCP supports two transport mechanisms:

Transport | Protocol | Use Case
----------|----------|---------
stdio | Standard input/output | Local servers, same machine
HTTP + SSE | HTTP for requests, Server-Sent Events for notifications | Remote servers, network access
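
On the wire, MCP messages are JSON-RPC 2.0 over either transport. A sketch of the exchange for invoking a tool, shown here as Python dicts with illustrative values:

# Client → server: invoke a tool
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "query_database", "arguments": {"query": "SELECT count(*) FROM users"}},
}

# Server → client: the matching result
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "42"}]},
}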

MCP Server Example

A simple MCP server that provides file system access:

class FileSystemMCPServer:
    name: str = "filesystem"
    version: str = "1.0.0"
    root_path: str

    # List available tools
    def list_tools(self) -> list[Tool]:
        return [
            Tool(
                name="read_file",
                description="Read the contents of a file",
                input_schema={"path": {"type": "string", "description": "File path relative to root"}}
            ),
            Tool(
                name="write_file",
                description="Write content to a file",
                input_schema={
                    "path": {"type": "string"},
                    "content": {"type": "string"}
                }
            ),
            Tool(
                name="list_directory",
                description="List files in a directory",
                input_schema={"path": {"type": "string"}}
            )
        ]

    # List available resources
    def list_resources(self) -> list[Resource]:
        return [
            Resource(
                uri=f"file://{self.root_path}",
                name="Project Root",
                description="Root directory of the project"
            )
        ]

    # Execute a tool
    def call_tool(self, name: str, arguments: dict) -> str:
        match name:
            case "read_file":
                path = resolve_path(self.root_path, arguments["path"])
                validate_path_within_root(path, self.root_path)  # Security!
                return read_file(path)
            case "write_file":
                path = resolve_path(self.root_path, arguments["path"])
                validate_path_within_root(path, self.root_path)
                write_file(path, arguments["content"])
                return "File written successfully."
            case "list_directory":
                path = resolve_path(self.root_path, arguments["path"])
                return list_files(path)
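
For comparison, the same server written against the official MCP Python SDK, which derives tool schemas from type hints and docstrings. This is a sketch using the SDK's FastMCP helper; exact details may vary across SDK versions:

from pathlib import Path
from mcp.server.fastmcp import FastMCP

ROOT = Path("/path/to/project").resolve()
mcp = FastMCP("filesystem")

def safe_path(path: str) -> Path:
    resolved = (ROOT / path).resolve()
    if not resolved.is_relative_to(ROOT):  # security: reject path traversal
        raise ValueError("Path escapes project root")
    return resolved

@mcp.tool()
def read_file(path: str) -> str:
    """Read the contents of a file relative to the project root."""
    return safe_path(path).read_text()

@mcp.tool()
def list_directory(path: str = ".") -> list[str]:
    """List files in a directory relative to the project root."""
    return [p.name for p in safe_path(path).iterdir()]

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default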

MCP vs. Traditional Function Calling

Aspect | Traditional Function Calling | MCP
-------|------------------------------|----
Scope | Single application | Cross-application, standardized
Discovery | Hardcoded tool definitions | Dynamic server/tool discovery
Security | App-specific implementation | Protocol-level security and permissions
Extensibility | Requires code changes | Add new servers without app changes
Ecosystem | Fragmented, vendor-specific | Open standard, growing ecosystem
State Management | Custom per tool | Standardized resource subscriptions

MCP Ecosystem

Growing ecosystem of MCP servers:

  • File systems: Local and remote file access.
  • Databases: PostgreSQL, SQLite, MongoDB connectors.
  • APIs: GitHub, Slack, Jira, Google Drive.
  • Search: Web search, documentation search.
  • Development: Git operations, code analysis, testing.
  • Knowledge: Wikipedia, Arxiv, documentation sites.

5. Memory Systems

Agents need memory to maintain context across interactions, learn from experience, and handle tasks that span multiple sessions.

Memory Types

Short-term Memory (Working Memory)

Information available during the current task:

  • Conversation History: Recent messages between user and agent.
  • Scratchpad: Intermediate results, thoughts, observations from current task.
  • Context Window: Limited by the LLM's context window (2K-200K tokens).

Challenge: Context windows are finite. As conversations grow, older information must be compressed or dropped.

Long-term Memory

Persistent information across sessions:

  • Episodic Memory: Records of past interactions and their outcomes. "Last time the user asked about X, they wanted Y format."
  • Semantic Memory: Factual knowledge and learned patterns. Stored in vector databases for retrieval.
  • Procedural Memory: Learned workflows and tool usage patterns. "To deploy code, first run tests, then create a PR, then merge."

Memory Implementation Patterns

class AgentMemory:
    # Short-term
    conversation: list[Message]
    scratchpad: list[str]

    # Long-term
    episodic_store: VectorDatabase       # Past interactions
    semantic_store: VectorDatabase       # Facts and knowledge
    entity_store: dict[str, EntityInfo]  # Tracked entities

    def add_interaction(self, user_msg: str, agent_response: str, outcome: str):
        # Add to conversation (short-term)
        self.conversation.append(Message("user", user_msg))
        self.conversation.append(Message("assistant", agent_response))

        # Store in episodic memory (long-term)
        episode = f"User asked: {user_msg}\nAgent did: {agent_response}\nOutcome: {outcome}"
        embedding = embed(episode)
        self.episodic_store.add(embedding, metadata={"timestamp": now(), "outcome": outcome})

        # Extract and update entities
        entities = extract_entities(user_msg + agent_response)
        for entity in entities:
            self.entity_store[entity.name] = entity

    def get_relevant_context(self, current_query: str, max_tokens: int) -> str:
        context_parts = []
        remaining_tokens = max_tokens

        # 1. Recent conversation (highest priority)
        recent = self.conversation[-10:]
        recent_text = format_messages(recent)
        context_parts.append(("Recent conversation", recent_text))
        remaining_tokens -= count_tokens(recent_text)

        # 2. Relevant past episodes
        query_embedding = embed(current_query)
        episodes = self.episodic_store.search(query_embedding, top_k=5)
        episode_text = format_episodes(episodes)
        if count_tokens(episode_text) < remaining_tokens * 0.3:
            context_parts.append(("Relevant past interactions", episode_text))
            remaining_tokens -= count_tokens(episode_text)

        # 3. Relevant entities
        mentioned_entities = extract_entities(current_query)
        entity_text = format_entities(mentioned_entities, self.entity_store)
        context_parts.append(("Known entities", entity_text))

        return compile_context(context_parts)

    def summarize_if_needed(self):
        # Compress old conversation history
        if count_tokens(self.conversation) > MAX_CONVERSATION_TOKENS:
            old_messages = self.conversation[:-10]
            summary = llm.generate(f"Summarize this conversation:\n{format_messages(old_messages)}")
            self.conversation = [Message("system", f"Previous conversation summary: {summary}")] + self.conversation[-10:]

Memory Management Strategies

Strategy | Approach | Pros | Cons
---------|----------|------|-----
Full History | Keep all messages in context | Complete context | Exceeds token limits quickly
Sliding Window | Keep last N messages | Simple, predictable | Loses old context
Summary + Recent | Summarize old, keep recent detailed | Balance of context and recency | Summary may lose details
RAG Memory | Embed and retrieve relevant history | Scales indefinitely | Retrieval may miss relevant context
Entity Memory | Track key entities and their states | Efficient, structured | Limited to entity-centric info
Hybrid | Combine multiple strategies | Best overall | Most complex

6. Planning and Reasoning

Chain-of-Thought (CoT)

Encourage the model to reason step-by-step before acting:

System: Think through problems step by step before taking action.

User: How many R's are in "strawberry"?

Agent:
Thought: Let me count the R's in "strawberry" letter by letter.
s-t-r-a-w-b-e-r-r-y
Position 3: r
Position 8: r
Position 9: r
There are 3 R's in "strawberry".

Tree of Thoughts (ToT)

Explore multiple reasoning paths and evaluate which is most promising:

class TreeOfThoughts:
    llm: LanguageModel
    evaluator: LanguageModel
    branching_factor: int = 3
    max_depth: int = 4

    def solve(self, problem: str) -> str:
        root = ThoughtNode(state=problem)
        best_solution = self.search(root, depth=0)
        return best_solution

    def search(self, node: ThoughtNode, depth: int) -> str:
        if depth >= self.max_depth:
            return node.state

        # Generate multiple next thoughts
        candidates = []
        for _ in range(self.branching_factor):
            next_thought = self.llm.generate(
                f"Given the current state:\n{node.state}\n"
                "What is a promising next step?"
            )
            candidates.append(ThoughtNode(state=node.state + "\n" + next_thought))

        # Evaluate candidates (assumes the evaluator replies with a bare number)
        scores = []
        for candidate in candidates:
            score = self.evaluator.generate(
                f"Rate the promise of this reasoning path (1-10):\n{candidate.state}"
            )
            scores.append(float(score))

        # Pursue the best candidate (greedy descent)
        best_idx = argmax(scores)
        return self.search(candidates[best_idx], depth + 1)

Task Decomposition

Break complex goals into manageable sub-tasks:

def decompose_task(goal: str) -> list[SubTask]:
    prompt = f"""Break down this goal into a sequence of concrete, actionable sub-tasks.
Each sub-task should be achievable with available tools.

Goal: {goal}

Sub-tasks (numbered list):"""

    plan = llm.generate(prompt)
    return parse_subtasks(plan)

Reasoning Patterns Comparison

Pattern | Description | Strengths | When to Use
--------|-------------|-----------|------------
Direct | Single LLM call | Fast, cheap | Simple tasks
CoT | Step-by-step reasoning | Better accuracy | Math, logic, analysis
ReAct | Reasoning + tool use | Grounded in real data | Tasks needing external info
ToT | Explore multiple paths | Finds optimal solutions | Creative problem solving
Reflexion | Learn from failures | Improves over trials | Complex, error-prone tasks
Plan-Execute | Upfront planning | Structured approach | Multi-step workflows

7. Multi-Agent Systems

Why Multiple Agents?

Complex tasks often benefit from specialization. Multiple agents can:

  • Divide labor: Each agent specializes in a domain (coding, research, review).
  • Parallel execution: Independent sub-tasks run simultaneously.
  • Debate and verify: Agents check each other's work.
  • Scale: Add agents as complexity grows.

Multi-Agent Patterns

Hierarchical (Manager-Worker)

A manager agent delegates tasks to specialized worker agents:

class ManagerAgent:
    workers: dict[str, Agent]  # role → agent
    llm: LanguageModel

    def execute(self, goal: str) -> str:
        # Plan and delegate
        plan = self.llm.generate(
            f"Break this goal into tasks and assign to available workers: "
            f"{list(self.workers.keys())}\nGoal: {goal}"
        )
        assignments = parse_assignments(plan)  # list of (role, task) pairs

        # Execute assignments
        results = {}
        for role, task in assignments:
            worker = self.workers[role]
            results[role] = worker.execute(task)

        # Synthesize results
        synthesis = self.llm.generate(
            f"Synthesize these results into a final answer:\n"
            f"Goal: {goal}\nResults: {results}"
        )
        return synthesis

Collaborative (Peer-to-Peer)

Agents communicate directly, sharing information and coordinating:

class CollaborativeSystem:
    agents: list[Agent]
    shared_memory: SharedMemory

    def execute(self, goal: str, max_rounds: int = 5) -> str:
        self.shared_memory.set("goal", goal)

        for round_num in range(max_rounds):
            for agent in self.agents:
                # Each agent reads shared state and contributes
                context = self.shared_memory.get_relevant(agent.role)
                contribution = agent.execute(context)
                self.shared_memory.add(agent.role, contribution)

            # Check if goal is achieved
            if self.is_goal_achieved():
                return self.shared_memory.get_final_answer()

        return self.shared_memory.get_best_answer()

Debate / Adversarial

Agents argue different positions, and a judge selects the best:

class DebateSystem:
    proposer: Agent
    critic: Agent
    judge: Agent

    def execute(self, question: str) -> str:
        # Proposer generates initial answer
        proposal = self.proposer.execute(f"Answer: {question}")

        # Critic challenges the answer
        critique = self.critic.execute(
            f"Critically evaluate this answer. Find flaws, missing info, or errors:\n"
            f"Question: {question}\nAnswer: {proposal}"
        )

        # Proposer revises based on critique
        revised = self.proposer.execute(
            f"Revise your answer based on this critique:\n"
            f"Original: {proposal}\nCritique: {critique}"
        )

        # Judge selects best answer
        final = self.judge.execute(
            f"Select the best answer and explain why:\n"
            f"Original: {proposal}\nRevised: {revised}\nCritique: {critique}"
        )

        return final

Agent Frameworks

Framework | Language | Architecture | Key Features | Best For
----------|----------|--------------|--------------|---------
LangChain | Python/JS | Modular chains + agents | Extensive tool library, memory, callbacks | General-purpose agents
LangGraph | Python/JS | Graph-based workflows | Stateful, cyclical agent graphs, checkpointing | Complex agent workflows
LlamaIndex | Python | Data framework + agents | RAG-focused, query engines, data connectors | RAG-centric agents
AutoGen | Python | Multi-agent conversations | Code execution, group chat, flexible patterns | Multi-agent collaboration
CrewAI | Python | Role-based agents | Tasks, roles, delegation, process management | Team-based automation
Semantic Kernel | C#/Python/Java | Enterprise SDK | Microsoft ecosystem, planners, plugins | Enterprise applications
Haystack | Python | Pipeline-based | Modular components, production-ready | Production RAG + agents
Pydantic AI | Python | Type-safe agents | Strong typing, validation, testing support | Type-safe agent development

Framework Selection Guide

  • Starting out: LangChain or CrewAI for rapid prototyping.
  • Complex workflows: LangGraph for stateful, graph-based agent flows.
  • RAG-focused: LlamaIndex for data-heavy applications.
  • Multi-agent: AutoGen or CrewAI for collaborative agent systems.
  • Enterprise: Semantic Kernel for Microsoft stack integration.
  • Production: LangGraph or Haystack for robust, scalable deployments.

8. Agent Evaluation

Metrics

Metric | What It Measures | How to Compute
-------|------------------|---------------
Task Success Rate | % of goals fully achieved | Binary: goal met or not
Partial Completion | How much of the goal was achieved | Rubric-based scoring
Steps to Completion | Efficiency (fewer steps = better) | Count reasoning + action steps
Tool Accuracy | Correct tool selection and parameters | Compare to gold-standard tool calls
Cost | Total token usage and API costs | Sum prompt + completion tokens
Latency | Time from query to final answer | Wall-clock time
Error Recovery | Ability to recover from tool failures | Count successful recoveries

Evaluation Approaches

  1. Unit tests: Test individual tool calls with expected inputs/outputs.
  2. Integration tests: Test complete agent workflows on predefined scenarios.
  3. Benchmarks: Use established benchmarks (SWE-bench for coding, WebArena for web tasks, GAIA for general agents).
  4. Human evaluation: Have humans rate agent outputs for quality, correctness, and helpfulness.
  5. LLM-as-judge: Use a strong LLM to evaluate agent performance at scale.
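
A minimal harness sketch combining approaches 1-2 with the metrics above (the agent interface and the per-scenario checks are hypothetical):

import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    goal: str
    check: Callable[[str], bool]  # returns True if the goal was met

def evaluate(agent, scenarios: list[Scenario]) -> dict:
    successes, latencies = 0, []
    for s in scenarios:
        start = time.perf_counter()
        answer = agent.execute(s.goal)  # hypothetical agent interface
        latencies.append(time.perf_counter() - start)
        successes += s.check(answer)
    return {
        "task_success_rate": successes / len(scenarios),
        "mean_latency_s": sum(latencies) / len(latencies),
    }

scenarios = [Scenario(goal="What is 17 * 23?", check=lambda a: "391" in a)]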

9. Production Best Practices

Reliability

  1. Max iterations: Always set a maximum number of agent steps to prevent infinite loops.
  2. Timeouts: Set timeouts on tool execution to prevent hanging.
  3. Retry logic: Implement retries with exponential backoff for transient failures (see the sketch after this list).
  4. Fallbacks: When the agent fails, fall back to simpler approaches or ask for human help.
  5. Checkpointing: Save agent state periodically so long tasks can resume after failures.
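
A minimal sketch of point 3, with jitter so retries don't synchronize across agents (which exceptions count as transient is an assumption; tune the set per tool):

import random
import time

def with_retries(fn, max_attempts: int = 4, base_delay: float = 0.5):
    """Call fn(), retrying assumed-transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError):  # assumed transient
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# usage: observation = with_retries(lambda: tool.execute(**tool_args))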

Safety

  1. Tool permissions: Whitelist allowed tools per agent. Read-only agents shouldn't have write tools.
  2. Human-in-the-loop: Require approval for high-impact actions (database writes, financial transactions, emails).
  3. Sandboxing: Run code execution in isolated containers with resource limits.
  4. Input validation: Validate all tool inputs before execution. Prevent path traversal, SQL injection, etc.
  5. Output filtering: Check agent outputs for sensitive information, harmful content, or PII leakage.
  6. Audit logging: Log all agent actions, tool calls, and decisions for debugging and compliance.

Cost Control

  1. Model selection: Use smaller models for simple reasoning steps, larger models for complex decisions. Implement model routing.
  2. Token budgets: Set per-task and per-session token limits.
  3. Caching: Cache tool results and LLM responses for repeated operations (see the sketch after this list).
  4. Batching: Batch similar tool calls when possible.
  5. Early termination: Stop when the agent has enough information to answer, don't over-retrieve.
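
A minimal sketch of point 3: key the cache on the tool name plus canonicalized arguments. This is only safe for deterministic, read-only tools; TTLs and invalidation are omitted:

import hashlib
import json

_cache: dict[str, str] = {}

def cached_execute(tool, name: str, args: dict) -> str:
    # Canonical JSON makes {"a": 1, "b": 2} and {"b": 2, "a": 1} hit the same entry
    key = hashlib.sha256(json.dumps([name, args], sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = tool.execute(**args)
    return _cache[key]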

Observability

  1. Structured logging: Log each step (thought, action, observation) with timestamps and metadata (see the sketch after this list).
  2. Tracing: Use tools like LangSmith, Arize Phoenix, or OpenTelemetry for end-to-end traces.
  3. Metrics: Track success rate, latency, cost, and error rates per agent type.
  4. Dashboards: Visualize agent performance trends over time.
  5. Alerting: Alert on anomalies (sudden increase in failures, cost spikes, latency degradation).
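
A sketch of point 1 using only the standard library: one JSON object per line per step, so traces can be grepped locally or shipped to any log pipeline:

import json
import logging
import time

logger = logging.getLogger("agent")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_step(step: int, thought: str, action: str, observation: str, **meta):
    logger.info(json.dumps({
        "ts": time.time(),
        "step": step,
        "thought": thought,
        "action": action,
        "observation": observation[:500],  # truncate large tool outputs
        **meta,
    }))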

Scaling

  1. Async execution: Run tool calls asynchronously when they're independent (see the sketch after this list).
  2. Queue-based: Use message queues for long-running agent tasks.
  3. Stateless design: Store agent state externally (database, cache) so any worker can resume.
  4. Horizontal scaling: Run multiple agent instances behind a load balancer.
  5. Resource limits: Cap concurrent agent executions per user and globally.
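
A sketch of point 1 with asyncio: independent tool calls fan out concurrently and results return in input order (call_tool is a hypothetical async tool wrapper):

import asyncio

async def call_tool(name: str, args: dict) -> str:
    ...  # hypothetical: async tool invocation

async def gather_evidence(queries: list[str]) -> list[str]:
    # Independent searches run concurrently instead of sequentially
    tasks = [call_tool("search_web", {"query": q}) for q in queries]
    return await asyncio.gather(*tasks)

# usage: results = asyncio.run(gather_evidence(["topic A", "topic B"]))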