AI Agents¶
An AI Agent is an autonomous system that perceives its environment, reasons about its goals, makes decisions, and takes actions to achieve those goals. Unlike simple LLMs that generate text in response to a prompt, agents can interact with tools, APIs, databases, file systems, and other systems—enabling complex, multi-step workflows like researching topics, writing and executing code, managing infrastructure, or orchestrating business processes.
The emergence of capable LLMs as reasoning engines has transformed AI agents from academic curiosities (simple rule-based bots) into practical production systems. This chapter covers agent architectures, tool use, the Model Context Protocol (MCP), memory systems, planning strategies, multi-agent collaboration, and production best practices.
1. Agent Fundamentals¶
What Makes an Agent?¶
An agent is more than just an LLM. The key distinction is the action loop — an agent iteratively reasons and acts until a goal is achieved:
| Component | LLM (Chat) | AI Agent |
|---|---|---|
| Input | Single prompt | Goal + environment state |
| Processing | Single forward pass | Multiple reasoning + action cycles |
| Output | Text response | Actions + observations + final answer |
| Tools | None | Functions, APIs, code execution |
| Memory | Context window only | Short-term + long-term memory |
| Autonomy | Responds to each prompt | Operates independently until goal met |
Agent Characteristics¶
- Autonomy: Operates independently with minimal human intervention. Can decide what steps to take, which tools to use, and when to ask for help.
- Reactivity: Responds to changes in the environment (new data, tool failures, user corrections).
- Proactiveness: Takes initiative to achieve goals, anticipating what information or actions are needed.
- Tool Use: Interacts with external systems through well-defined interfaces.
- Memory: Maintains context across interactions, learns from past experiences.
- Reflection: Evaluates its own performance and adjusts strategies.
The Agent Loop¶
Every agent follows a variation of this fundamental loop:
function agent_loop(goal: str, max_steps: int) -> str
state = initialize_state(goal)
for step in range(max_steps):
// 1. PERCEIVE: Gather current context
context = gather_context(state, memory, environment)
// 2. REASON: Decide what to do next
thought, action = llm.reason(context)
// 3. ACT: Execute the chosen action
if action.type == "final_answer":
return action.content
observation = execute_action(action)
// 4. OBSERVE: Process the result
state.add(thought, action, observation)
// 5. REFLECT: Evaluate progress (optional)
if should_reflect(state):
reflection = llm.reflect(state)
state.add_reflection(reflection)
return "Goal not achieved within step limit."
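The loop above can also be sketched as runnable code. The `fake_reason` model and single `lookup` tool below are stand-ins invented for illustration; a real agent would call an LLM and dispatch into a registered tool set:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    type: str          # "tool" or "final_answer"
    content: str = ""

@dataclass
class State:
    goal: str
    history: list = field(default_factory=list)  # (thought, action, observation) tuples

def fake_reason(state: State) -> tuple[str, Action]:
    """Stand-in for llm.reason(): look the fact up once, then answer."""
    if not state.history:
        return ("I should look up the fact first.", Action("tool", "lookup"))
    # Answer with the last observation gathered.
    return ("I have what I need.", Action("final_answer", state.history[-1][2]))

def execute_action(action: Action) -> str:
    # Single stubbed tool; a real agent would route on action.content.
    return "Paris is the capital of France."

def agent_loop(goal: str, max_steps: int = 5) -> str:
    state = State(goal)
    for _ in range(max_steps):
        thought, action = fake_reason(state)      # REASON
        if action.type == "final_answer":
            return action.content
        observation = execute_action(action)      # ACT
        state.history.append((thought, action, observation))  # OBSERVE
    return "Goal not achieved within step limit."
```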
2. Agent Architectures¶
ReAct (Reasoning + Acting)¶
The most widely used agent pattern. Interleaves reasoning traces with tool actions:
Thought: I need to find the current weather in Paris.
Action: search_web("current weather in Paris")
Observation: Temperature: 15°C, Partly cloudy, Humidity: 65%
Thought: I now have the weather information. Let me format a response.
Action: final_answer("The current weather in Paris is 15°C and partly cloudy with 65% humidity.")
Pseudocode (ReAct Agent):
class ReActAgent:
llm: LanguageModel
tools: dict[str, Tool]
system_prompt: str
function execute(goal: str, max_iterations: int = 10) -> str
messages = [
{"role": "system", "content": self.system_prompt},
{"role": "user", "content": goal}
]
for iteration in range(max_iterations):
// LLM generates thought + action (or final answer)
response = self.llm.generate(messages, tools=self.tools)
if response.has_tool_call:
// Execute the tool
tool_name = response.tool_call.name
tool_args = response.tool_call.arguments
tool = self.tools[tool_name]
try:
observation = tool.execute(**tool_args)
except Exception as e:
observation = f"Error: {str(e)}"
// Add action + observation to conversation
messages.append({"role": "assistant", "content": response.text, "tool_call": response.tool_call})
messages.append({"role": "tool", "content": observation, "tool_call_id": response.tool_call.id})
else:
// Final answer
return response.text
return "Maximum iterations reached."
Plan-and-Execute¶
Separates planning from execution. First create a plan, then execute each step:
class PlanAndExecuteAgent:
planner_llm: LanguageModel // Creates the plan
executor_llm: LanguageModel // Executes each step
tools: dict[str, Tool]
function execute(goal: str) -> str
// Step 1: Create a plan
plan = self.planner_llm.generate(
f"Create a step-by-step plan to achieve: {goal}\n"
f"Available tools: {list(self.tools.keys())}\n"
f"Return a numbered list of steps."
)
steps = parse_plan(plan)
// Step 2: Execute each step
results = []
i = 0
while i < len(steps):  // index-based loop so the plan can be revised mid-run
step = steps[i]
context = f"Goal: {goal}\nPlan: {plan}\n"
context += f"Completed: {results}\n"
context += f"Current step ({i+1}/{len(steps)}): {step}"
result = self.execute_step(step, context)
results.append(result)
// Re-plan if needed (step failed or new information)
if should_replan(result, steps[i+1:]):
remaining_steps = self.replan(goal, results, steps[i+1:])
steps = steps[:i+1] + remaining_steps
i += 1
// Step 3: Synthesize final answer
return self.synthesize(goal, results)
Benefits: Better for complex, multi-step tasks. The plan provides a roadmap that can be adjusted. Drawbacks: Planning overhead, plan may become stale as new information emerges.
Reflexion¶
Agent reflects on past actions and failures to improve future performance:
class ReflexionAgent:
llm: LanguageModel
tools: dict[str, Tool]
reflection_memory: list[str] // Past reflections
function execute(goal: str, max_trials: int = 3) -> str
for trial in range(max_trials):
// Include past reflections in context
context = f"Goal: {goal}\n"
if self.reflection_memory:
context += f"Lessons from past attempts:\n"
for r in self.reflection_memory:
context += f"- {r}\n"
// Execute with ReAct
result = self.react_execute(goal, context)
// Evaluate result
if self.evaluate(goal, result):
return result
// Reflect on failure
reflection = self.llm.generate(
f"Your attempt to achieve '{goal}' produced: {result}\n"
f"This was not satisfactory. Reflect on what went wrong "
f"and what you should do differently next time."
)
self.reflection_memory.append(reflection)
return "Failed after maximum trials."
Architecture Comparison¶
| Architecture | Planning | Reflection | Best For | Complexity |
|---|---|---|---|---|
| ReAct | Implicit (step-by-step) | None | Simple to moderate tasks | Low |
| Plan-and-Execute | Explicit upfront plan | Re-planning | Complex multi-step tasks | Medium |
| Reflexion | Implicit + learned | Explicit reflection loop | Tasks requiring iteration | Medium |
| Tree of Thoughts | Explores multiple paths | Evaluates branches | Problems with multiple strategies | High |
| LATS (Language Agent Tree Search) | Monte Carlo tree search | Value function | Optimal action sequences | Very High |
3. Tools and Function Calling¶
Agents interact with the world through tools — functions that can be invoked to perform actions or retrieve information.
Function Calling¶
Modern LLMs support function calling, where the model outputs a structured JSON object specifying which function to call and with what arguments.
Tool Definition (OpenAI/Anthropic format):
{
"tools": [
{
"type": "function",
"function": {
"name": "search_web",
"description": "Search the web for current information on any topic. Use this when you need up-to-date information that might not be in your training data.",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query. Be specific and include relevant keywords."
},
"max_results": {
"type": "integer",
"description": "Maximum number of results to return",
"default": 5
}
},
"required": ["query"]
}
}
},
{
"type": "function",
"function": {
"name": "execute_python",
"description": "Execute Python code in a sandboxed environment. Use for calculations, data processing, or generating visualizations.",
"parameters": {
"type": "object",
"properties": {
"code": {
"type": "string",
"description": "Python code to execute"
}
},
"required": ["code"]
}
}
}
]
}
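Before dispatching a tool call, the agent runtime should validate the model's arguments against the declared schema. Below is a minimal sketch covering only required keys and primitive types; `validate_args` and `TYPE_MAP` are names invented here, and a production system would use a full JSON Schema validator:

```python
# Simplified JSON Schema check: required keys and primitive types only.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_args(schema: dict, args: dict) -> list[str]:
    """Return a list of validation errors (empty list means valid)."""
    errors = []
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required parameter: {key}")
    for key, value in args.items():
        if key not in props:
            errors.append(f"unexpected parameter: {key}")
            continue
        expected = TYPE_MAP.get(props[key].get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"{key}: expected {props[key]['type']}")
    return errors

search_schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "max_results": {"type": "integer", "default": 5},
    },
    "required": ["query"],
}
```

Returning errors to the model as tool output, rather than raising, lets the agent correct its own call on the next step.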
Common Tool Categories¶
| Category | Examples | Use Cases |
|---|---|---|
| Web Search | Tavily, Serper, Brave Search API | Current information, fact-checking |
| Code Execution | Python interpreter, Jupyter, sandboxed environments | Calculations, data processing, testing |
| File Operations | Read/write files, directory listing | Document processing, code modification |
| Database | SQL queries, vector search, graph queries | Data retrieval, knowledge base access |
| API Integration | REST/GraphQL calls, webhooks | External service interaction |
| Browser | Playwright, Puppeteer, browser automation | Web scraping, form filling, testing |
| Communication | Email, Slack, Teams APIs | Notifications, collaboration |
| Computation | Calculator, unit conversion, statistics | Mathematical operations |
Tool Execution Safety¶
class SafeToolExecutor:
allowed_tools: set[str]
confirmation_required: set[str] // Tools that need human approval
rate_limiter: RateLimiter
sandbox: Sandbox
function execute(tool_call: ToolCall) -> str
// 1. Validate tool exists and is allowed
if tool_call.name not in self.allowed_tools:
return f"Error: Tool '{tool_call.name}' is not available."
// 2. Rate limiting
if not self.rate_limiter.allow(tool_call.name):
return "Error: Rate limit exceeded. Try again later."
// 3. Validate parameters
if not validate_params(tool_call.name, tool_call.arguments):
return f"Error: Invalid parameters for {tool_call.name}."
// 4. Human confirmation for sensitive tools
if tool_call.name in self.confirmation_required:
approved = request_human_approval(tool_call)
if not approved:
return "Action cancelled by user."
// 5. Execute in sandbox
try:
result = self.sandbox.execute(tool_call.name, tool_call.arguments)
return truncate_if_needed(result, max_length=10000)
except TimeoutError:
return "Error: Tool execution timed out."
except Exception as e:
return f"Error: {str(e)}"
Tool Design Best Practices¶
- Clear descriptions: Tool descriptions are part of the LLM prompt. Write them as if explaining to a colleague. Include examples of when to use the tool.
- Well-defined schemas: Use precise JSON Schema with descriptions for each parameter. Constrain types and provide defaults.
- Error messages: Return clear, actionable error messages so the agent can recover.
- Idempotency: Where possible, make tools idempotent (safe to retry).
- Output format: Return structured data that's easy for the LLM to parse. Truncate large outputs.
- Granularity: Prefer specific, focused tools over monolithic ones. search_documentation is better than do_anything.
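One way to keep descriptions and schemas consistent with these practices is to derive the tool spec from the function itself. A hedged sketch, where `tool_spec` is a hypothetical helper and the type mapping covers only primitives:

```python
import inspect

def tool_spec(fn) -> dict:
    """Derive an OpenAI-style tool spec from a function's signature and docstring."""
    py_to_json = {str: "string", int: "integer", float: "number", bool: "boolean"}
    props, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        props[name] = {"type": py_to_json.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
        else:
            props[name]["default"] = param.default
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {"type": "object", "properties": props, "required": required},
        },
    }

def search_documentation(query: str, max_results: int = 5) -> str:
    """Search the project documentation. Use for questions about APIs or configuration."""
    ...
```

Because the docstring becomes the description the LLM sees, writing it "as if explaining to a colleague" pays off directly.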
4. Model Context Protocol (MCP)¶
Model Context Protocol (MCP) is an open standard (introduced by Anthropic in 2024) that standardizes how AI applications interact with external tools, data sources, and systems. MCP provides a universal interface that replaces the need for custom integrations with each tool or data source.
Why MCP?¶
Without MCP, every AI application must build custom integrations:
Before MCP:
App A → Custom integration → Tool 1
App A → Custom integration → Tool 2
App B → Different integration → Tool 1
App B → Different integration → Tool 2
(N apps × M tools = N×M integrations)
With MCP:
App A → MCP Client → MCP Server → Tool 1
App B → MCP Client → MCP Server → Tool 2
(N apps × 1 protocol + M servers = N + M integrations)
MCP Architecture¶
MCP uses a client-server model with three roles:
- MCP Host: The AI application (e.g., Claude Desktop, Cursor, custom app) that serves as the user-facing interface.
- MCP Client: Resides within the host, manages connections to MCP servers, translates between the host's needs and the MCP protocol.
- MCP Server: A lightweight program that exposes capabilities through the standard MCP interface. Each server provides access to specific tools or data sources.
MCP Capabilities¶
MCP servers can expose three types of capabilities:
1. Resources¶
Read-only data sources that the client can access:
{
"resources": [
{
"uri": "file:///path/to/codebase",
"name": "Project Source Code",
"description": "The current project's source code repository",
"mimeType": "text/plain"
},
{
"uri": "postgres://localhost/mydb/users",
"name": "Users Table",
"description": "Database table containing user information"
}
]
}
Resources support:
- Static resources: Known URIs listed upfront.
- Dynamic resources: Resources discovered via templates (e.g., file:///{path}).
- Subscriptions: Clients can subscribe to resource changes.
2. Tools¶
Functions that the server can execute:
{
"tools": [
{
"name": "query_database",
"description": "Execute a read-only SQL query against the database",
"inputSchema": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "SQL SELECT query to execute"
}
},
"required": ["query"]
}
}
]
}
3. Prompts¶
Pre-defined prompt templates that guide LLM interactions:
{
"prompts": [
{
"name": "code_review",
"description": "Review code for bugs and improvements",
"arguments": [
{
"name": "code",
"description": "The code to review",
"required": true
},
{
"name": "language",
"description": "Programming language",
"required": false
}
]
}
]
}
MCP Transport¶
MCP supports two transport mechanisms:
| Transport | Protocol | Use Case |
|---|---|---|
| stdio | Standard input/output | Local servers, same machine |
| HTTP + SSE | HTTP for requests, Server-Sent Events for notifications | Remote servers, network access |
MCP Server Example¶
A simple MCP server that provides file system access:
class FileSystemMCPServer:
name: str = "filesystem"
version: str = "1.0.0"
root_path: str
// List available tools
function list_tools() -> list[Tool]
return [
Tool(
name="read_file",
description="Read the contents of a file",
input_schema={"path": {"type": "string", "description": "File path relative to root"}}
),
Tool(
name="write_file",
description="Write content to a file",
input_schema={
"path": {"type": "string"},
"content": {"type": "string"}
}
),
Tool(
name="list_directory",
description="List files in a directory",
input_schema={"path": {"type": "string"}}
)
]
// List available resources
function list_resources() -> list[Resource]
return [
Resource(
uri=f"file://{self.root_path}",
name="Project Root",
description="Root directory of the project"
)
]
// Execute a tool
function call_tool(name: str, arguments: dict) -> str
match name:
case "read_file":
path = resolve_path(self.root_path, arguments["path"])
validate_path_within_root(path, self.root_path) // Security!
return read_file(path)
case "write_file":
path = resolve_path(self.root_path, arguments["path"])
validate_path_within_root(path, self.root_path)
write_file(path, arguments["content"])
return "File written successfully."
case "list_directory":
path = resolve_path(self.root_path, arguments["path"])
return list_files(path)
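The validate_path_within_root check flagged above deserves spelling out, since path traversal is the classic file-server vulnerability. A minimal sketch that combines resolution and validation in one helper (names and signature assumed):

```python
import os

def validate_path_within_root(path: str, root: str) -> str:
    """Resolve symlinks and '..' segments, then reject paths escaping the root."""
    resolved = os.path.realpath(os.path.join(root, path))
    root_resolved = os.path.realpath(root)
    # commonpath collapses to the shared prefix; it must be the root itself.
    if os.path.commonpath([resolved, root_resolved]) != root_resolved:
        raise PermissionError(f"path escapes server root: {path}")
    return resolved
```

Resolving before comparing is the important part: a naive `startswith` check on the raw string is defeated by `..` segments and symlinks.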
MCP vs. Traditional Function Calling¶
| Aspect | Traditional Function Calling | MCP |
|---|---|---|
| Scope | Single application | Cross-application, standardized |
| Discovery | Hardcoded tool definitions | Dynamic server/tool discovery |
| Security | App-specific implementation | Protocol-level security and permissions |
| Extensibility | Requires code changes | Add new servers without app changes |
| Ecosystem | Fragmented, vendor-specific | Open standard, growing ecosystem |
| State Management | Custom per tool | Standardized resource subscriptions |
MCP Ecosystem¶
Growing ecosystem of MCP servers:
- File systems: Local and remote file access.
- Databases: PostgreSQL, SQLite, MongoDB connectors.
- APIs: GitHub, Slack, Jira, Google Drive.
- Search: Web search, documentation search.
- Development: Git operations, code analysis, testing.
- Knowledge: Wikipedia, Arxiv, documentation sites.
5. Memory Systems¶
Agents need memory to maintain context across interactions, learn from experience, and handle tasks that span multiple sessions.
Memory Types¶
Short-term Memory (Working Memory)¶
Information available during the current task:
- Conversation History: Recent messages between user and agent.
- Scratchpad: Intermediate results, thoughts, observations from current task.
- Context Window: All of the above must fit within the model's context window (roughly 2K-200K tokens, depending on the model).
Challenge: Context windows are finite. As conversations grow, older information must be compressed or dropped.
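The standard mitigation is trimming history to a token budget. A sketch using a crude characters-per-token heuristic (a production system would use the model's actual tokenizer; `trim_to_budget` is a name invented here):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token. Use a real tokenizer in production.
    return max(1, len(text) // 4)

def trim_to_budget(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most recent messages that fit the budget; always keep system messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for msg in reversed(rest):            # walk newest-first
        cost = estimate_tokens(msg["content"])
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))  # restore chronological order
```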
Long-term Memory¶
Persistent information across sessions:
- Episodic Memory: Records of past interactions and their outcomes. "Last time the user asked about X, they wanted Y format."
- Semantic Memory: Factual knowledge and learned patterns. Stored in vector databases for retrieval.
- Procedural Memory: Learned workflows and tool usage patterns. "To deploy code, first run tests, then create a PR, then merge."
Memory Implementation Patterns¶
class AgentMemory:
// Short-term
conversation: list[Message]
scratchpad: list[str]
// Long-term
episodic_store: VectorDatabase // Past interactions
semantic_store: VectorDatabase // Facts and knowledge
entity_store: dict[str, EntityInfo] // Tracked entities
function add_interaction(user_msg: str, agent_response: str, outcome: str)
// Add to conversation (short-term)
self.conversation.append(Message("user", user_msg))
self.conversation.append(Message("assistant", agent_response))
// Store in episodic memory (long-term)
episode = f"User asked: {user_msg}\nAgent did: {agent_response}\nOutcome: {outcome}"
embedding = embed(episode)
self.episodic_store.add(embedding, metadata={"timestamp": now(), "outcome": outcome})
// Extract and update entities
entities = extract_entities(user_msg + agent_response)
for entity in entities:
self.entity_store[entity.name] = entity
function get_relevant_context(current_query: str, max_tokens: int) -> str
context_parts = []
remaining_tokens = max_tokens
// 1. Recent conversation (highest priority)
recent = self.conversation[-10:]
recent_text = format_messages(recent)
context_parts.append(("Recent conversation", recent_text))
remaining_tokens -= count_tokens(recent_text)
// 2. Relevant past episodes
query_embedding = embed(current_query)
episodes = self.episodic_store.search(query_embedding, top_k=5)
episode_text = format_episodes(episodes)
if count_tokens(episode_text) < remaining_tokens * 0.3:
context_parts.append(("Relevant past interactions", episode_text))
remaining_tokens -= count_tokens(episode_text)
// 3. Relevant entities
mentioned_entities = extract_entities(current_query)
entity_text = format_entities(mentioned_entities, self.entity_store)
context_parts.append(("Known entities", entity_text))
return compile_context(context_parts)
function summarize_if_needed()
// Compress old conversation history
if count_tokens(self.conversation) > MAX_CONVERSATION_TOKENS:
old_messages = self.conversation[:-10]
summary = llm.generate(f"Summarize this conversation:\n{format_messages(old_messages)}")
self.conversation = [Message("system", f"Previous conversation summary: {summary}")] + self.conversation[-10:]
Memory Management Strategies¶
| Strategy | Approach | Pros | Cons |
|---|---|---|---|
| Full History | Keep all messages in context | Complete context | Exceeds token limits quickly |
| Sliding Window | Keep last N messages | Simple, predictable | Loses old context |
| Summary + Recent | Summarize old, keep recent detailed | Balance of context and recency | Summary may lose details |
| RAG Memory | Embed and retrieve relevant history | Scales indefinitely | Retrieval may miss relevant context |
| Entity Memory | Track key entities and their states | Efficient, structured | Limited to entity-centric info |
| Hybrid | Combine multiple strategies | Best overall | Most complex |
6. Planning and Reasoning¶
Chain-of-Thought (CoT)¶
Encourage the model to reason step-by-step before acting:
System: Think through problems step by step before taking action.
User: How many R's are in "strawberry"?
Agent:
Thought: Let me count the R's in "strawberry" letter by letter.
s-t-r-a-w-b-e-r-r-y
Position 3: r
Position 8: r
Position 9: r
There are 3 R's in "strawberry".
Tree of Thoughts (ToT)¶
Explore multiple reasoning paths and evaluate which is most promising:
class TreeOfThoughts:
llm: LanguageModel
evaluator: LanguageModel
branching_factor: int = 3
max_depth: int = 4
function solve(problem: str) -> str
root = ThoughtNode(state=problem)
best_solution = self.search(root, depth=0)
return best_solution
function search(node: ThoughtNode, depth: int) -> str
if depth >= self.max_depth:
return node.state
// Generate multiple next thoughts
candidates = []
for _ in range(self.branching_factor):
next_thought = self.llm.generate(
f"Given the current state:\n{node.state}\n"
f"What is a promising next step?"
)
candidates.append(ThoughtNode(state=node.state + "\n" + next_thought))
// Evaluate candidates
scores = []
for candidate in candidates:
score = self.evaluator.generate(
f"Rate the promise of this reasoning path (1-10):\n{candidate.state}"
)
scores.append(float(score))
// Pursue the best candidate
best_idx = argmax(scores)
return self.search(candidates[best_idx], depth + 1)
Task Decomposition¶
Break complex goals into manageable sub-tasks:
function decompose_task(goal: str) -> list[SubTask]
prompt = f"""Break down this goal into a sequence of concrete, actionable sub-tasks.
Each sub-task should be achievable with available tools.
Goal: {goal}
Sub-tasks (numbered list):"""
plan = llm.generate(prompt)
return parse_subtasks(plan)
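The parse_subtasks step has to tolerate whatever list format the LLM actually produces. A simple regex-based sketch that accepts both "1." and "1)" numbering:

```python
import re

def parse_subtasks(plan: str) -> list[str]:
    """Extract numbered items ('1. ...' or '2) ...') from an LLM-written plan."""
    tasks = []
    for line in plan.splitlines():
        m = re.match(r"\s*\d+[.)]\s+(.*\S)", line)
        if m:
            tasks.append(m.group(1))
    return tasks
```

A more robust alternative is to ask the model for JSON output and parse that, falling back to regex when parsing fails.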
Reasoning Patterns Comparison¶
| Pattern | Description | Strengths | When to Use |
|---|---|---|---|
| Direct | Single LLM call | Fast, cheap | Simple tasks |
| CoT | Step-by-step reasoning | Better accuracy | Math, logic, analysis |
| ReAct | Reasoning + tool use | Grounded in real data | Tasks needing external info |
| ToT | Explore multiple paths | Finds optimal solutions | Creative problem solving |
| Reflexion | Learn from failures | Improves over trials | Complex, error-prone tasks |
| Plan-Execute | Upfront planning | Structured approach | Multi-step workflows |
7. Multi-Agent Systems¶
Why Multiple Agents?¶
Complex tasks often benefit from specialization. Multiple agents can:
- Divide labor: Each agent specializes in a domain (coding, research, review).
- Parallel execution: Independent sub-tasks run simultaneously.
- Debate and verify: Agents check each other's work.
- Scale: Add agents as complexity grows.
Multi-Agent Patterns¶
Hierarchical (Manager-Worker)¶
A manager agent delegates tasks to specialized worker agents:
class ManagerAgent:
workers: dict[str, Agent] // role → agent
llm: LanguageModel
function execute(goal: str) -> str
// Plan and delegate
plan = self.llm.generate(
f"Break this goal into tasks and assign to available workers: "
f"{list(self.workers.keys())}\nGoal: {goal}"
)
assignments = parse_assignments(plan)
// Execute assignments
results = {}
for role, task in assignments:
worker = self.workers[role]
results[role] = worker.execute(task)
// Synthesize results
synthesis = self.llm.generate(
f"Synthesize these results into a final answer:\n"
f"Goal: {goal}\nResults: {results}"
)
return synthesis
Collaborative (Peer-to-Peer)¶
Agents communicate directly, sharing information and coordinating:
class CollaborativeSystem:
agents: list[Agent]
shared_memory: SharedMemory
function execute(goal: str, max_rounds: int = 5) -> str
self.shared_memory.set("goal", goal)
for round in range(max_rounds):
for agent in self.agents:
// Each agent reads shared state and contributes
context = self.shared_memory.get_relevant(agent.role)
contribution = agent.execute(context)
self.shared_memory.add(agent.role, contribution)
// Check if goal is achieved
if self.is_goal_achieved():
return self.shared_memory.get_final_answer()
return self.shared_memory.get_best_answer()
Debate / Adversarial¶
Agents argue different positions, and a judge selects the best:
class DebateSystem:
proposer: Agent
critic: Agent
judge: Agent
function execute(question: str) -> str
// Proposer generates initial answer
proposal = self.proposer.execute(f"Answer: {question}")
// Critic challenges the answer
critique = self.critic.execute(
f"Critically evaluate this answer. Find flaws, missing info, or errors:\n"
f"Question: {question}\nAnswer: {proposal}"
)
// Proposer revises based on critique
revised = self.proposer.execute(
f"Revise your answer based on this critique:\n"
f"Original: {proposal}\nCritique: {critique}"
)
// Judge selects best answer
final = self.judge.execute(
f"Select the best answer and explain why:\n"
f"Original: {proposal}\nRevised: {revised}\nCritique: {critique}"
)
return final
Popular Agent Frameworks¶
| Framework | Language | Architecture | Key Features | Best For |
|---|---|---|---|---|
| LangChain | Python/JS | Modular chains + agents | Extensive tool library, memory, callbacks | General-purpose agents |
| LangGraph | Python/JS | Graph-based workflows | Stateful, cyclical agent graphs, checkpointing | Complex agent workflows |
| LlamaIndex | Python | Data framework + agents | RAG-focused, query engines, data connectors | RAG-centric agents |
| AutoGen | Python | Multi-agent conversations | Code execution, group chat, flexible patterns | Multi-agent collaboration |
| CrewAI | Python | Role-based agents | Tasks, roles, delegation, process management | Team-based automation |
| Semantic Kernel | C#/Python/Java | Enterprise SDK | Microsoft ecosystem, planners, plugins | Enterprise applications |
| Haystack | Python | Pipeline-based | Modular components, production-ready | Production RAG + agents |
| Pydantic AI | Python | Type-safe agents | Strong typing, validation, testing support | Type-safe agent development |
Framework Selection Guide¶
- Starting out: LangChain or CrewAI for rapid prototyping.
- Complex workflows: LangGraph for stateful, graph-based agent flows.
- RAG-focused: LlamaIndex for data-heavy applications.
- Multi-agent: AutoGen or CrewAI for collaborative agent systems.
- Enterprise: Semantic Kernel for Microsoft stack integration.
- Production: LangGraph or Haystack for robust, scalable deployments.
8. Agent Evaluation¶
Metrics¶
| Metric | What It Measures | How to Compute |
|---|---|---|
| Task Success Rate | % of goals fully achieved | Binary: goal met or not |
| Partial Completion | How much of the goal was achieved | Rubric-based scoring |
| Steps to Completion | Efficiency (fewer steps = better) | Count reasoning + action steps |
| Tool Accuracy | Correct tool selection and parameters | Compare to gold-standard tool calls |
| Cost | Total token usage and API costs | Sum prompt + completion tokens |
| Latency | Time from query to final answer | Wall-clock time |
| Error Recovery | Ability to recover from tool failures | Count successful recoveries |
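Several of these metrics fall out of a simple log of agent runs. A sketch with an assumed `RunRecord` shape and a flat per-token price for cost estimation (real pricing differs by model and by prompt vs. completion tokens):

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    success: bool
    steps: int
    prompt_tokens: int
    completion_tokens: int
    latency_s: float

def summarize_runs(runs: list[RunRecord], cost_per_1k: float = 0.01) -> dict:
    """Aggregate success rate, efficiency, latency, and estimated cost."""
    n = len(runs)
    total_tokens = sum(r.prompt_tokens + r.completion_tokens for r in runs)
    return {
        "success_rate": sum(r.success for r in runs) / n,
        "avg_steps": sum(r.steps for r in runs) / n,
        "avg_latency_s": sum(r.latency_s for r in runs) / n,
        "est_cost": total_tokens / 1000 * cost_per_1k,
    }
```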
Evaluation Approaches¶
- Unit tests: Test individual tool calls with expected inputs/outputs.
- Integration tests: Test complete agent workflows on predefined scenarios.
- Benchmarks: Use established benchmarks (SWE-bench for coding, WebArena for web tasks, GAIA for general agents).
- Human evaluation: Have humans rate agent outputs for quality, correctness, and helpfulness.
- LLM-as-judge: Use a strong LLM to evaluate agent performance at scale.
9. Production Best Practices¶
Reliability¶
- Max iterations: Always set a maximum number of agent steps to prevent infinite loops.
- Timeouts: Set timeouts on tool execution to prevent hanging.
- Retry logic: Implement retries with exponential backoff for transient failures.
- Fallbacks: When the agent fails, fall back to simpler approaches or ask for human help.
- Checkpointing: Save agent state periodically so long tasks can resume after failures.
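Retry with exponential backoff, for example, is only a few lines. A sketch with jitter and an injectable `sleep` so it can be tested without waiting (names assumed):

```python
import random
import time

def retry_with_backoff(fn, max_attempts: int = 4, base_delay: float = 0.5, sleep=time.sleep):
    """Call fn(), retrying on exception with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Double the delay each attempt, with random jitter to avoid thundering herds.
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            sleep(delay)
```

In practice, retry only on errors known to be transient (timeouts, rate limits), not on validation failures.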
Safety¶
- Tool permissions: Whitelist allowed tools per agent. Read-only agents shouldn't have write tools.
- Human-in-the-loop: Require approval for high-impact actions (database writes, financial transactions, emails).
- Sandboxing: Run code execution in isolated containers with resource limits.
- Input validation: Validate all tool inputs before execution. Prevent path traversal, SQL injection, etc.
- Output filtering: Check agent outputs for sensitive information, harmful content, or PII leakage.
- Audit logging: Log all agent actions, tool calls, and decisions for debugging and compliance.
Cost Control¶
- Model selection: Use smaller models for simple reasoning steps, larger models for complex decisions. Implement model routing.
- Token budgets: Set per-task and per-session token limits.
- Caching: Cache tool results and LLM responses for repeated operations.
- Batching: Batch similar tool calls when possible.
- Early termination: Stop when the agent has enough information to answer; don't over-retrieve.
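A tool-result cache can be as simple as a dict keyed by tool name plus canonicalized arguments, with a TTL so stale results expire. A sketch (`ToolCache` is a name invented here; caching like this is only safe for idempotent, read-only tools):

```python
import json
import time

class ToolCache:
    """Cache tool results keyed by (tool name, canonicalized arguments)."""

    def __init__(self, ttl_s: float = 300.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._store = {}  # key -> (timestamp, result)

    def _key(self, tool: str, args: dict) -> str:
        # sort_keys makes {"a":1,"b":2} and {"b":2,"a":1} hit the same entry
        return tool + ":" + json.dumps(args, sort_keys=True)

    def get_or_call(self, tool: str, args: dict, fn):
        key = self._key(tool, args)
        hit = self._store.get(key)
        now = self.clock()
        if hit and now - hit[0] < self.ttl_s:
            return hit[1]                 # fresh cached result
        result = fn()                     # miss or expired: call the tool
        self._store[key] = (now, result)
        return result
```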
Observability¶
- Structured logging: Log each step (thought, action, observation) with timestamps and metadata.
- Tracing: Use tools like LangSmith, Arize Phoenix, or OpenTelemetry for end-to-end traces.
- Metrics: Track success rate, latency, cost, and error rates per agent type.
- Dashboards: Visualize agent performance trends over time.
- Alerting: Alert on anomalies (sudden increase in failures, cost spikes, latency degradation).
Scaling¶
- Async execution: Run tool calls asynchronously when they're independent.
- Queue-based: Use message queues for long-running agent tasks.
- Stateless design: Store agent state externally (database, cache) so any worker can resume.
- Horizontal scaling: Run multiple agent instances behind a load balancer.
- Resource limits: Cap concurrent agent executions per user and globally.