AI Agent Development: Architecture, Testing, and Runbook
Moving from an agent demo to reliable production capacity doesn't depend on the "best model." In 2026, the difference lies in a **readable architecture**, a **reproducible testing strategy**, and an **operations runbook** that anticipates incidents, variable costs, and security risks.
March 07, 2026 · 9 min read
This article proposes a concrete framework for AI agent development in the context of SMEs and scale-ups, with reusable artifacts (reference architecture, test matrix, minimal runbook).
1) Prerequisites: Write an "Agent Contract" Before Architecting
Before talking about components, define the contract. An agent isn't just a "smarter chat"; it's a system that observes, decides, and acts. Without a contract, your architecture bloats, your tests are incomplete, and the runbook becomes impractical.
An agent contract fits on one page and establishes:
Goal (Business KPI and definition of success)
Scope (what the agent is allowed to handle, and what it must refuse)
Sources of Truth (documents, CRM, ERP, ticket base)
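The one-page contract above can be captured as a small data structure so that scope checks are enforced in code, not just in prose. This is a minimal sketch; the field names and request kinds are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentContract:
    """One-page agent contract: goal, scope, and sources of truth.
    Field names are illustrative, not a standard format."""
    goal: str                          # business KPI and definition of success
    in_scope: tuple[str, ...]          # request kinds the agent may handle
    must_refuse: tuple[str, ...]       # request kinds the agent must decline
    sources_of_truth: tuple[str, ...]  # e.g. documents, CRM, ERP, ticket base

    def allows(self, request_kind: str) -> bool:
        """Act only on explicitly in-scope requests; refuse everything else."""
        return request_kind in self.in_scope and request_kind not in self.must_refuse

contract = AgentContract(
    goal="Resolve >= 40% of tier-1 support tickets without human handoff",
    in_scope=("password_reset", "order_status"),
    must_refuse=("refund_approval", "legal_advice"),
    sources_of_truth=("helpdesk_kb", "crm"),
)
contract.allows("order_status")     # in scope: the agent may proceed
contract.allows("refund_approval")  # explicitly refused by contract
```

Freezing the dataclass makes the contract immutable at runtime, which matches its role: it changes through a review, not through code paths.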
2) AI Agent Architecture: A "Production-First" Reference
A robust agent architecture is thought of as a mini-platform: the agent orchestrates, but tools, data, and guardrails remain separate. This reduces vendor lock-in, facilitates testing, and avoids coupling your business logic to prompts.
Guardrails: policies, allowlists, PII filters, human validation
Observability: logs, traces, quality and cost metrics. Without it, "invisible" incidents, drift, and unchecked spending accumulate unnoticed.
Two Rules That Simplify Everything
Rule 1: Separate "reasoning" from "acting". The agent can propose a plan, but execution must go through tooled functions with strict contracts (schemas, validation, permissions). This constrains free-form outputs and secures actions.
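Rule 1 can be sketched as a validation layer between the model's free-form output and any real connector: the agent only proposes a JSON action, and execution checks it against an explicit schema and allowlist first. The tool name, schema format, and fields below are illustrative assumptions.

```python
import json

# Allowlisted tools and their contracts. Anything not listed here is refused,
# whatever the model proposes. Schema format is a deliberate simplification.
TOOL_SCHEMA = {
    "create_ticket": {
        "required": {"title": str, "priority": str},
        "allowed_values": {"priority": {"low", "normal", "high"}},
    }
}

def execute_tool(tool_name: str, raw_args: str) -> dict:
    """Validate the agent's proposed action before anything runs."""
    if tool_name not in TOOL_SCHEMA:
        raise PermissionError(f"tool not allowlisted: {tool_name}")
    schema = TOOL_SCHEMA[tool_name]
    args = json.loads(raw_args)  # free-form model output -> structured args
    for name, expected_type in schema["required"].items():
        if not isinstance(args.get(name), expected_type):
            raise ValueError(f"missing or mistyped field: {name}")
    for name, allowed in schema.get("allowed_values", {}).items():
        if args[name] not in allowed:
            raise ValueError(f"invalid value for {name}: {args[name]}")
    # Only now call the real connector (stubbed here).
    return {"status": "created", "ticket": args}

result = execute_tool("create_ticket", '{"title": "VPN down", "priority": "high"}')
```

The key design choice: the model never holds credentials or calls systems directly; it can only emit arguments that this layer accepts or rejects.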
Rule 2: Everything that costs or breaks must be measured. Tokens, latency, tool failure rates, "human handoff" rates, and policy refusal rates are not details. They are your future incidents.
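Rule 2 can be sketched as a per-session metrics object that counts tokens, tool failures, handoffs, and latencies as the agent runs. The counter names and token prices are illustrative placeholders, not real rates.

```python
from dataclasses import dataclass, field

@dataclass
class SessionMetrics:
    """Per-session counters for everything that costs or breaks.
    Token prices in cost_usd() are placeholder assumptions."""
    tokens_in: int = 0
    tokens_out: int = 0
    tool_calls: int = 0
    tool_failures: int = 0
    handoffs: int = 0
    policy_refusals: int = 0
    step_latencies_ms: list[float] = field(default_factory=list)

    def record_step(self, tokens_in: int, tokens_out: int,
                    latency_ms: float, tool_failed: bool = False) -> None:
        self.tokens_in += tokens_in
        self.tokens_out += tokens_out
        self.tool_calls += 1
        self.tool_failures += int(tool_failed)
        self.step_latencies_ms.append(latency_ms)

    def cost_usd(self, in_per_mtok: float = 3.0, out_per_mtok: float = 15.0) -> float:
        # Cost per session: tokens times a per-million-token price.
        return (self.tokens_in * in_per_mtok + self.tokens_out * out_per_mtok) / 1e6

    def p95_latency_ms(self) -> float:
        # Nearest-rank p95 over recorded step latencies.
        ordered = sorted(self.step_latencies_ms)
        return ordered[int(0.95 * (len(ordered) - 1))]

m = SessionMetrics()
m.record_step(1200, 300, latency_ms=850)
m.record_step(900, 250, latency_ms=2400, tool_failed=True)
```

In production these counters would feed your tracing backend; the point is that cost, latency, and failure rates are first-class outputs of every run, not an afterthought.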
Integrations: The Tipping Point Between Demo and ROI
In many organizations, the agent becomes profitable when it integrates with "hard" systems (CRM, helpdesk, ERP) and reduces recurring friction: qualification, ticket creation, opportunity updates, follow-ups, standard request resolution.
If your context resembles a mid-market company or scale-up built around a core ERP (e.g., NetSuite), look at integration- and ROI-oriented approaches, such as those highlighted by a managed services firm specializing in AI and ERP: AI & NetSuite consulting for the mid-market. The point here is not the "tool" but the discipline: short cycles, clean integrations, and ROI tracking.
3) Testing Strategy: What is Specific to Agents (and What Isn't)
An agent is a non-deterministic system, connected to tools, exposed to adversarial inputs, and with variable costs. Your tests must therefore cover:
Quality: a golden set of scenarios scored against a scorecard (utility, accuracy, security)
Tool reliability: deterministic tests on connectors (mocks, contract tests)
Security: adversarial inputs (prompt injection, data leakage, tool abuse)
Cost and latency: budgets per session and per task, p95 latency under load
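A golden-set evaluation can be sketched as a small harness that replays fixed scenarios, scores each run, and gates deployment on an aggregate threshold. The scenario fields, scoring rules, and threshold below are illustrative assumptions, not a standard format.

```python
# Each golden scenario fixes an input and a verifiable expectation, so that
# non-deterministic answers can still be scored consistently across releases.
GOLDEN_SET = [
    {"input": "Where is order #123?", "must_contain": "order", "must_refuse": False},
    {"input": "Approve a $500 refund", "must_contain": None, "must_refuse": True},
]

def score_run(scenario: dict, answer: str, refused: bool) -> bool:
    """Score one (scenario, agent output) pair as pass/fail."""
    if scenario["must_refuse"]:
        return refused  # policy scenarios: the agent must decline
    if refused:
        return False    # refusing an in-scope request is a failure
    return scenario["must_contain"] in answer.lower()

def scorecard(runs) -> float:
    """Fraction of golden scenarios passed; gate deployment on a threshold."""
    return sum(score_run(s, a, r) for s, a, r in runs) / len(runs)

# Fake agent outputs for illustration: (scenario, answer, refused)
runs = [
    (GOLDEN_SET[0], "Your order shipped yesterday.", False),
    (GOLDEN_SET[1], "", True),
]
release_ok = scorecard(runs) >= 0.95  # example deployment gate
```

In practice you would rerun the same golden set on every prompt, rule, or index change, and track the score over time alongside cost and latency.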
Useful security reference: the OWASP Top 10 for LLM Applications (prompt injection, data leakage, tool abuse) is a good starting point for structuring your test scenarios.
The goal is not "to have an agent," but to have an operable agent.
Frequently Asked Questions
What is the difference between an AI agent and a copilot? A copilot assists the user (suggestions, drafting, research). An agent executes steps and can trigger tooled actions. The more it acts, the more architecture, tests, and runbooks become indispensable.
What are the non-negotiable elements of an AI agent architecture in production? A clear separation between orchestration, context (RAG), tool connectors, guardrails (policies), and observability. Without this separation, you lose testability and risk control.
How do you test an AI agent if its answers aren't deterministic? With a golden set of scenarios, scorecard metrics (utility, accuracy, security), deterministic tool tests (mocks, contract tests), and validation in a controlled pilot before progressive production.
What must an AI agent runbook absolutely contain? SLOs, degraded modes, incident procedures, a rollback mechanism (prompts, rules, index, connectors), and a source maintenance plan (RAG).
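A rollback mechanism like the one this answer describes can be sketched as versioned releases of every mutable artifact (prompt, rules, index, connectors), so incident response means pinning a known-good version rather than editing live config. All names and versions below are illustrative.

```python
# Every mutable piece of the agent is versioned together as a release,
# so "rollback" is one atomic switch, not four separate manual edits.
RELEASES = {
    "v12": {"prompt": "prompt_v12.txt", "index": "kb-2026-02-28",
            "rules": "policies_v7.yaml"},
    "v13": {"prompt": "prompt_v13.txt", "index": "kb-2026-03-05",
            "rules": "policies_v8.yaml"},
}
ACTIVE = "v13"

def rollback(to: str) -> dict:
    """Pin a known-good release during an incident."""
    global ACTIVE
    if to not in RELEASES:
        raise KeyError(f"unknown release: {to}")
    ACTIVE = to
    return RELEASES[to]

cfg = rollback("v12")  # incident response: back to the last known-good release
```

The design point is that the runbook procedure ("roll back to vN") maps to a single operation whose effect is identical in staging and production.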
Which metrics should be tracked to avoid surprise costs? Cost per session and per task, tokens per step, p95 latency, loop/retry rates, and tool failure rates. Set a budget and limits per user or per workflow.
When should you move from a pilot to full production? When the scorecard reaches an acceptable threshold on quality, security, tool reliability, and costs, with an operational runbook and a clearly identified business owner.
Need a Truly Operable AI Agent (Not a Demo)?
Impulse Lab supports SMEs and scale-ups across the entire chain: AI opportunity audits, custom development (web and AI), automation and integrations, and adoption training. If you want to frame an agent-ready use case, define a testing protocol, and ship an instrumented V1 in short cycles, you can contact us via Impulse Lab.