Agents: When a Single Prompt Isn’t Enough
Why serious automation needs agents, not just clever prompts—and how to design, monitor, and ship robust agentic workflows using planners, tools, memory, and guardrails.

A single, well‑crafted prompt can do a lot. In the previous PractiqAI article we treated prompts as specifications: instruction, constraints, context, examples, output format.
But at some point you hit a wall:
- “Book me a train, pick the cheapest sensible option, check my calendar, and send a summary to my manager.”
- “Watch this mailbox, triage incoming messages, draft replies, and raise an alert if anything smells like fraud.”
- “Crawl a knowledge base, build a plan to answer the user’s question, and then keep monitoring for changes.”
You can’t squeeze all of that into one prompt and hope for the best. You don’t just want “text out”; you want behavior over time, with tools, memory, and guardrails.
That’s where agents live.
OpenAI’s platform now treats agents as systems that can independently accomplish tasks for users, orchestrated through the Responses API, built‑in tools (web search, file search, computer use), and the Agents SDK plus observability. This article is a practical tour of that world: what agents add beyond prompts, how to keep them bounded and safe, and what patterns to start with.
What agents add beyond prompts
A prompt is one big “think step” plus an answer. An agent is more like a mini‑application that happens to think with an LLM.
The core upgrades you get when you move from “single prompt” to “agent” are:
- Multi‑step behavior: The agent can plan, act, observe results, and then choose what to do next. This can look like “decide which tools to call, call them, re‑plan, repeat”.
- Tool use: Instead of only predicting text, agents can call APIs and built‑in tools like web search, file search, or computer use for browser/desktop control.
- State and memory: The agent carries state across steps (and sometimes across sessions), so it can remember decisions, cache results, and avoid re‑doing expensive work.
- Autonomy with guardrails: The system can keep running without a human on each step, as long as it stays within your safety, cost, and time limits.
- Observability: Agent frameworks (like the Agents SDK and Agent Builder) give you traces, metrics, and dashboards so you can see what actually happened and why.
If you’ve used ChatGPT’s Operator or ChatGPT agent features—where the model uses a browser or its own computer to complete complex tasks—you’ve seen this pattern in the wild: it plans, clicks, scrolls, reads, re‑plans, and only then hands you a result.
On PractiqAI, you can think of each task (plus “judge” model) as a tiny, evaluated loop already: you propose a prompt, the model acts, a judge checks whether the criteria are met, and you iterate. That’s agentic behavior in miniature—just with humans still doing the planning and tool selection.
Agents take those loops and embed them into code and workflows so that machines can own more of the “decide → act → check → repeat” cycle.
Planner–executor basics (finite loop > infinite loop)
Most production agents boil down to some form of planner–executor:
- The planner decides what should happen next (“search docs”, “run calculation”, “ask user to clarify”).
- The executor carries out that action (call a tool, run a sub‑LLM call, hit a database), then hands the results back to the planner.
You can implement this visually with Agent Builder (nodes for each step, branching by outcome) or in code with the Agents SDK, which basically gives you a structured way to write these loops.
Under the hood, a minimal planner–executor in code might look like:
// Assumed helpers (not shown): callPlannerLLM decides the next step,
// executeTool carries it out and returns the result.
type AgentState = {
  goal: string;
  steps: string[];
  context: Record<string, unknown>;
};

const MAX_STEPS = 20; // hard ceiling so the loop is always finite

async function runAgentTask(initialGoal: string) {
  const state: AgentState = { goal: initialGoal, steps: [], context: {} };

  for (let i = 0; i < MAX_STEPS; i++) {
    const plan = await callPlannerLLM(state);               // planner: decide what happens next
    if (plan.status === "done") break;                      // goal satisfied → stop

    const toolResult = await executeTool(plan.next_action); // executor: carry it out
    state.steps.push(plan.next_action.description);
    state.context[plan.next_action.id] = toolResult;        // cache the result for later steps
  }
  return state;
}

This is intentionally boring—and that’s a good thing.
Prefer boring loops
When in doubt, choose a finite, predictable loop with explicit steps over a “clever” infinite loop that promises magic. You’ll debug and ship the boring one.
The key ideas:
- Finite loop: Always have a maximum number of iterations (MAX_STEPS), a time limit, or both. Never rely on “the model will stop itself”.
- Explicit state: Keep one clear state object that includes the goal, past steps, and any cached results.
- Planner contract: The planner LLM gets the state and must respond in a constrained format: either "done" or an explicit next_action with a tool name and arguments.
- Executor isolation: The executor doesn’t improvise; it just runs what the planner asked for, validates arguments, and returns results.
The Agents SDK and cookbook examples lean heavily into this pattern, with features like “background mode” for long‑running tasks and “parallel agents” when multiple subtasks can safely run at the same time.
Once you grasp this planner–executor loop, most of the rest of agent design is about: which tools are exposed, how state is managed, and when the loop must stop.
Tool selection & safeguards
Giving an agent tools is like handing a junior colleague a corporate credit card and a VPN login: potentially powerful, definitely risky.
OpenAI’s agent stack gives you both built‑in tools (web search, file search, computer use, etc.) and custom tools you define yourself. You then wire these into an agent via the Responses API, Agent Builder, or the Agents SDK.
The design questions are:
- Which tools should this agent even see?
- Use tool allowlists per agent (or even per task type). Your “customer support summarizer” probably needs file search and internal APIs, but not Stripe refunds or computer use.
- Consider separate agents for risky tools: one agent can assemble a refund proposal, and a separate, more tightly guarded workflow can actually call the payments API.
- How does the agent choose tools?
- The model can pick tools dynamically based on the user’s goal and current state. This is where clear tool schemas matter: name, description, argument types, and constraints.
- For example, web search should have descriptions like “Use when you need current or external information not present in context,” to avoid pointless calls.
- What safeguards sit around tool calls?
- Static policy checks in code (e.g., blocks on certain parameter values, rate limiting, user‑role checks before executing a tool).
- Dynamic model checks, where a separate “safety” or “policy” model evaluates a proposed tool call before it runs. OpenAI’s internal agent systems use layered safety techniques, and the “new tools for building agents” launch highlights observability and policy as first‑class concerns.
- For dangerous primitives like computer use (a model controlling a virtual machine via clicks and keystrokes), keep it isolated to non‑production environments or behind explicit user approvals.
- How do tools fail safely?
- Every tool should define how it signals errors: return types, retry strategies, and what counts as “fatal” vs “soft” failure.
- The planner LLM should see these errors in structured form (“rate_limit_exceeded”, “not_found”) so it can react intelligently instead of hallucinating success (see the sketch after this list).
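Here's one way those pieces might fit together: an allowlist of tool definitions plus an executor that validates arguments and returns structured results instead of throwing raw exceptions at the planner. Everything here (searchKnowledgeBase included) is an illustrative sketch, not part of the Agents SDK or Responses API:

type ToolResult =
  | { ok: true; data: unknown }
  | { ok: false; error: "rate_limit_exceeded" | "not_found" | "forbidden" | "tool_error"; retryable: boolean };

type ToolDefinition = {
  name: string;
  description: string;                       // tells the planner when to use it
  validateArgs: (args: unknown) => boolean;  // static policy check before execution
  run: (args: unknown) => Promise<ToolResult>;
};

// Per-agent allowlist: this agent simply cannot see risky tools like refunds or computer use.
const SUPPORT_AGENT_TOOLS: Record<string, ToolDefinition> = {
  file_search: {
    name: "file_search",
    description: "Use to look up internal KB articles when the answer is not already in context.",
    validateArgs: (args) => typeof (args as any)?.query === "string",
    run: async (args) => ({ ok: true, data: await searchKnowledgeBase((args as any).query) }), // assumed helper
  },
};

async function executeTool(action: { tool: string; args: unknown }): Promise<ToolResult> {
  const tool = SUPPORT_AGENT_TOOLS[action.tool];
  if (!tool) return { ok: false, error: "forbidden", retryable: false };                // not on the allowlist
  if (!tool.validateArgs(action.args)) return { ok: false, error: "tool_error", retryable: false };
  try {
    return await tool.run(action.args);
  } catch {
    return { ok: false, error: "tool_error", retryable: true };                         // soft failure; the planner decides what to do
  }
}

The important property is that every failure mode comes back as data the planner can reason about, not as a crash or a silently swallowed exception.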
In practice, you’ll often be stricter than you think you need. It’s trivial to loosen a constraint; it’s much harder to unwind a production agent that just spent the night spamming every customer with “test emails”.
Memory: what to persist (and what not to)
Once agents can act over time, you need to decide what they remember.
At minimum, you’ll juggle three layers of memory:
- Ephemeral run state – The “scratchpad” for a single task run: plan, steps taken, tool results, partial outputs.
- Session memory – Things relevant within a user session or ongoing conversation (preferences, unresolved subtasks).
- Long‑term memory – Knowledge that survives across sessions and tasks (documents, historical tickets, decisions).
OpenAI’s stack gives you pieces for this too: file search and vector stores for durable knowledge; background tasks and agent traces for long‑running workflows.
The real art is deciding what not to keep:
- Avoid raw PII unless absolutely necessary. Store identifiers and references instead of free‑text user content when you can. For example, store user_id and ticket_id, not the full email body, and use file search or a database to fetch content on demand.
- Summarize aggressively. Long chat histories and audit logs don’t fit comfortably in a prompt. Use the model to produce tight summaries (with links back to the raw data) and keep those as your “working memory”.
- Separate “knowledge” from “logs”. User‑facing knowledge (docs, FAQs, KB articles) belongs in your retrieval system. Detailed traces of what the agent did (every tool call) belong in your observability system, not in the model’s context window.
- Define retention policies. Even aside from regulation, it’s simply useful to decide: run state is kept for hours, session memory for days/weeks, long‑term knowledge until explicitly pruned.
A good heuristic: persist what future tasks can productively reuse (a robust summary, a structured decision, a link to a canonical record), and keep everything else as short‑lived telemetry.
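If it helps to see those layers side by side, here is one possible shape for them. The fields and retention hints are assumptions to adapt, not recommendations:

// Ephemeral run state: the scratchpad for a single task run, discarded within hours.
type RunState = {
  goal: string;
  steps: string[];
  toolResults: Record<string, unknown>;
};

// Session memory: lives for days or weeks, tied to one user or conversation.
type SessionMemory = {
  userId: string;                  // identifier, not raw PII or full message bodies
  preferences: Record<string, string>;
  openSubtasks: string[];
  expiresAt: Date;
};

// Long-term memory: summaries and pointers, kept until explicitly pruned.
type LongTermRecord = {
  summary: string;                 // tight, model-generated summary
  sourceRefs: string[];            // links back to canonical records (ticket_id, doc URL, KB article)
};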
Stop conditions & human checkpoints
A naive agent is a toddler with espresso: infinite energy, no idea when to stop.
You need explicit stop conditions baked into the code and often backed by the model’s own self‑assessment.
Common stop conditions include:
- Goal satisfied: The planner model signals "done" and returns a structured summary of the result (ideally with evidence and links).
- Budget exceeded: You’ve hit a ceiling on tokens, steps, or wall‑clock time. For example, “at most 20 LLM calls or 2 minutes of runtime.”
- No progress: The last N steps look the same (looping on the same tools, or the planner keeps re‑asking the same question).
- Risk threshold: A safety model or rules engine flags the task as too risky (“user may be trying to bypass policy”, “tool call targets a forbidden resource”).
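In code, these checks tend to collapse into one small guard that runs before every planner call. A minimal sketch, assuming you track a few counters during the run; the thresholds are examples, not recommendations:

type Budget = { maxSteps: number; maxTokens: number; maxMillis: number };

type RunStats = {
  steps: number;
  tokensUsed: number;
  startedAt: number;          // Date.now() at the start of the run
  recentActions: string[];    // e.g. "web_search: cheapest train to Berlin"
};

function shouldStop(stats: RunStats, budget: Budget): string | null {
  if (stats.steps >= budget.maxSteps) return "budget_exceeded:steps";
  if (stats.tokensUsed >= budget.maxTokens) return "budget_exceeded:tokens";
  if (Date.now() - stats.startedAt >= budget.maxMillis) return "budget_exceeded:time";

  // "No progress": the last three actions are effectively identical.
  const last = stats.recentActions.slice(-3);
  if (last.length === 3 && new Set(last).size === 1) return "no_progress";

  return null; // keep going
}

Whatever reason shouldStop returns should go straight into your telemetry; "why did this run stop?" is usually the first question you ask when debugging.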
Human checkpoints are just stop conditions with a person in the middle. Some typical patterns:
- Pre‑approval: Before the agent calls write‑side tools (payment APIs, CRM updates, code deployment), it generates a plan and asks a human to approve it.
- Post‑approval: The agent executes a change but holds it in a draft or “staging” state until a human clicks “publish”.
- Exception routing: The agent handles the happy path autonomously but escalates ambiguous or high‑risk cases to a human queue.
Agent Builder makes this very concrete: your workflow is a graph of nodes, and you can insert manual approval nodes or error branches. In a code‑first setup with the Agents SDK, you express the same idea with branching logic around your planner output.
If you come from PractiqAI’s world of judged tasks, a human checkpoint is like exposing the judge’s verdict before committing a side effect. Instead of “points or no points”, the outcome is “okay to charge this card” or “nope, hand this over to a human”.
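A pre‑approval checkpoint, for instance, can be as small as refusing to execute write‑side tools until someone signs off. A sketch with a hypothetical requestHumanApproval helper that posts the proposed action to a review queue:

// Hypothetical approval gate; requestHumanApproval is an assumed helper, not an SDK call.
const WRITE_SIDE_TOOLS = new Set(["issue_refund", "update_crm", "send_email"]);

async function executeWithCheckpoint(action: { tool: string; args: unknown; description: string }) {
  if (WRITE_SIDE_TOOLS.has(action.tool)) {
    const verdict = await requestHumanApproval({
      summary: action.description,
      tool: action.tool,
      args: action.args,
    });
    if (verdict !== "approved") {
      // Surfaces to the planner as a normal structured error, not a crash.
      return { ok: false as const, error: "forbidden" as const, retryable: false };
    }
  }
  return executeTool(action);   // read-only tools flow straight through
}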
Telemetry: task success and safety events
Once agents run without constant supervision, you need telemetry that answers, “Is this thing actually working?”
OpenAI’s “new tools for building agents” launch explicitly calls out integrated observability tools and tracing for agent workflows. Cookbook examples show how to hook agent runs into evaluation and tracing systems like Langfuse as well.
Useful telemetry typically covers:
1. Task‑level metrics
Per “run” of the agent:
- Was the task successful? (Binary or graded score.)
- How many LLM calls, tool calls, and tokens did it use?
- How long did it take from start to finish?
- Which tools were used (and how often)?
This is the agent‑equivalent of PractiqAI’s “task passed, here’s what you learned, here’s your score” moment.
2. Step‑level traces
For each step in the planner–executor loop:
- The planner prompt and response (sanitized).
- The tool chosen, its arguments, and its raw result.
- Any safety or policy checks triggered at that step.
This is what you need to debug weird behavior (“why did it call web search five times in a row?”).
3. Safety and policy events
- Content filtering hits (e.g., when the model tries to respond with disallowed content).
- Blocked tool calls (e.g., attempts to access protected resources).
- Rate‑limit errors and retries.
You don’t want to drown in logs, so pick a small set of health metrics first: success rate, cost per successful task, average latency, and the number of safety events. Then layer on richer traces only for slow, expensive, or failing runs.
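Computing those health metrics is a thin layer over your run records. For example, assuming you log one record per agent run (field names are illustrative):

type RunRecord = {
  success: boolean;
  costUsd: number;
  latencyMs: number;
  safetyEvents: number;
};

function healthMetrics(runs: RunRecord[]) {
  if (runs.length === 0) return null;
  const successes = runs.filter((r) => r.success).length;
  return {
    successRate: successes / runs.length,
    costPerSuccess: runs.reduce((sum, r) => sum + r.costUsd, 0) / Math.max(successes, 1),
    avgLatencyMs: runs.reduce((sum, r) => sum + r.latencyMs, 0) / runs.length,
    safetyEventCount: runs.reduce((sum, r) => sum + r.safetyEvents, 0),
  };
}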
Cost & latency control (parallelism, caching)
Agents can explode your token bill if left unchecked. The good news: OpenAI’s newer models and tools give you levers to manage both cost and latency.
The GPT‑5.1 release, for example, introduced adaptive reasoning (spend fewer tokens on easy tasks, more on hard ones) and extended prompt caching (keep shared context in cache for up to 24 hours at 90% cheaper input cost).
Here are practical moves:
- Right‑size reasoning effort. For GPT‑5.1, you can choose reasoning_effort values like "none", "low", "medium", and "high". Fast agents default to "none" for simple routing or formatting and only use higher effort for genuinely hard sub‑tasks.
- Exploit parallelism where safe. Some workflows are naturally parallel. The Agents SDK cookbook includes a “Parallel Agents” example that shows multiple agents working in parallel on independent tasks before merging results.
- Great candidates: calling multiple microservices, evaluating multiple documents, generating variants for A/B testing.
- Bad candidates: steps that depend tightly on each other’s outputs, or that modify shared state.
- Use caching for shared context. If each run of your agent needs the same big blob of instructions or knowledge, place it behind prompt caching (where supported), or store it in a vector store and fetch only relevant slices. Prompt caching on GPT‑5.1 can dramatically reduce cost for long‑lived sessions or repeated workflows.
- Impose budgets per task and per user (see the sketch after this list):
- Stop an agent run that burns more than X tokens or takes longer than Y seconds.
- Log budget breaches so you can spot prompts or users that invite pathological behavior.
- Use background mode strategically. Some tasks don’t need instant answers. A background‑mode agent can chew through expensive work more slowly but cheaply, while your front‑end responds quickly with “working on it” status. The Agents SDK docs and building‑agents track cover background mode patterns.
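As promised above, here is what the per‑user budget idea can look like in practice: one function that records spend and logs every breach so you can see which prompts or users invite pathological runs. The storage and ceiling here are stand‑ins for your real systems:

const DAILY_USER_BUDGET_USD = 5;                 // example ceiling, tune per product
const dailySpend = new Map<string, number>();    // userId -> USD spent today (in-memory stand-in)

function chargeRun(userId: string, taskId: string, costUsd: number): boolean {
  const spent = (dailySpend.get(userId) ?? 0) + costUsd;
  dailySpend.set(userId, spent);

  if (spent > DAILY_USER_BUDGET_USD) {
    // Budget breach: stop (or defer) the run and leave a trail for later analysis.
    console.warn(JSON.stringify({ event: "budget_breach", userId, taskId, spentUsd: spent }));
    return false;
  }
  return true;  // run may proceed
}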
Think of cost/latency control as another safety system. The agent’s job isn’t “keep thinking forever”; it’s “solve the task within a reasonable budget”.
Debugging agents trace‑by‑trace
Debugging a single prompt is often just eyeballing the input and output.
Debugging an agent is more like debugging a distributed system: you need traces.
Modern OpenAI tooling gives you agent traces—a hierarchical view of the overall run, each planner call, each tool call, and all the intermediate states. The “New tools for building agents” announcement highlights integrated observability so you can log, visualize, and analyze workflow execution.
When you’re debugging, look for patterns like:
- Loops: The planner keeps proposing the same action (e.g., “search the web” with slightly different queries).
- Tool misuse: The agent calls the wrong tool for a job (using file search when information is already in context).
- Over‑thinking: Simple tasks are using high reasoning effort, lots of tokens, or multiple tools when one would do.
- Silent failures: A tool returns an error and the planner ignores it, hallucinating success instead.
A simple, repeatable workflow:
- Capture a failing or expensive run from telemetry.
- Inspect the trace: planner prompts, tool calls, results.
- Identify the root cause (prompt too vague, tool description misleading, missing stop condition).
- Fix the spec, not just the symptom: tighten tool descriptions, adjust planner instructions, or change the control loop.
- Re‑run the same scenario and compare traces before/after.
PractiqAI’s “judge model” approach is actually a debugging asset here: if you design your agents around verifiable outputs (checklists, schemas, business rules), you get crisp signals when something goes wrong instead of vague “this felt weird” incidents.
Evaluation methods for agents
How do you know an agent is good enough to ship… and keep shipping as models and tools change?
OpenAI’s Evals API and cookbook cover several evaluation patterns specifically for agents and tool‑using workflows, including structured outputs, tool evaluation, and end‑to‑end agent behavior.
You can borrow a few battle‑tested ideas:
1. Golden tasks with judges
Prepare a set of realistic tasks:
- Input: user goal + initial context.
- Expected: structured rubric or reference answer.
Let the agent run, then have a separate judge model score the result using that rubric (“correctness”, “policy compliance”, “effort saved”). This is basically how PractiqAI trains you on prompting: tasks with objective criteria and an automated judge.
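A golden‑task suite doesn't need to be fancy: a list of cases plus a judge call. A minimal sketch, assuming a judgeWithRubric helper that wraps whatever judge model you use, and reusing runAgentTask from earlier (initial‑context wiring elided):

type GoldenTask = {
  goal: string;
  rubric: string[];    // e.g. ["cites the correct policy doc", "no unsupported claims"]
};

async function evalGoldenTasks(tasks: GoldenTask[]): Promise<number> {
  const scores: number[] = [];
  for (const task of tasks) {
    const result = await runAgentTask(task.goal);              // run the agent end to end
    const score = await judgeWithRubric(result, task.rubric);  // assumed judge helper, returns 0..1
    scores.push(score);
  }
  return scores.reduce((a, b) => a + b, 0) / Math.max(tasks.length, 1);
}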
2. Tool‑specific evals
For each tool (or group of tools):
- Feed scenarios that should use the tool, and check the agent actually calls it with reasonable arguments.
- Feed scenarios where the tool should not be used (e.g., when the answer is already in context) and check it’s avoided.
The cookbook’s “Evals API Use‑case – Tools Evaluation” walk‑through shows how to do this with Responses and the agentic stack.
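The same harness extends naturally to tool checks: run the scenario, then inspect which tools showed up in the step log. A sketch under the assumption that each step description names the tool it used:

type ToolEvalCase = {
  goal: string;
  mustCall?: string;     // a tool that should be used for this scenario
  mustNotCall?: string;  // a tool that should be avoided (e.g. the answer is already in context)
};

async function evalToolUse(cases: ToolEvalCase[]): Promise<number> {
  let passed = 0;
  for (const c of cases) {
    const run = await runAgentTask(c.goal);
    const called = run.steps.join(" | ");   // step descriptions from the planner-executor loop
    const okMust = !c.mustCall || called.includes(c.mustCall);
    const okMustNot = !c.mustNotCall || !called.includes(c.mustNotCall);
    if (okMust && okMustNot) passed++;
  }
  return passed / Math.max(cases.length, 1);
}

Argument checks (did it call the tool with reasonable inputs?) are the natural next step once the call/no-call cases pass.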
3. Safety evals
Design red‑team scenarios:
- Prompts that push the agent to break policy.
- Attempts to trick the agent into calling dangerous tools or exfiltrating data.
Run them regularly and ensure your guardrails fire correctly. Capture these cases as regression tests so future changes don’t reopen old holes.
4. Live traffic sampling
Once deployed, sample a fraction of real agent runs:
- Have judges (models or humans) score quality, safety, and user satisfaction.
- Track drift over time—after a model upgrade, after adding a tool, after tweaking prompts.
Evaluation for agents is less about a single “accuracy” number and more about a dashboard: success rate, cost per success, time‑to‑success, and safety incident rate. When those trend in the right direction, you’re winning.
Good starter agent patterns
Let’s make this concrete. If you want to move beyond single prompts into agents, here are patterns that play nicely with the OpenAI stack (Agents SDK, Agent Builder, Responses, built‑in tools).
1. “Tool‑using copilot”
What it does: Enhances a single task (like answering support questions or analyzing documents) by selectively calling tools: web search, file search, or a handful of APIs.
Why it’s good:
- Small conceptual leap from “chat with tools”.
- Clear success metrics and low risk.
How to build it:
- Start with the Responses API and a single agent definition: one model, a small set of tools, no complex planning logic.
- Use file search for your knowledge base and web search for fresh external info.
2. Planner + worker for structured tasks
What it does: Handles tasks that need decomposition: “Analyze these 10 documents, group them by theme, then draft a summary report.”
Pattern:
- Planner agent: breaks the task into steps (“chunk docs → summarize chunks → cluster → draft report”).
- Worker agent(s): execute the mechanical steps (summarize a chunk, run a specific API).
You can implement this in the Agents SDK, or visually in Agent Builder with a graph of nodes corresponding to planning and execution.
3. Retrieval‑first knowledge agent
What it does: Answers questions using internal documents, with web search only as a fallback.
Pattern:
- First step is always a retrieval tool call (file search, vector store).
- The planner inspects retrieved passages and decides: “answer from docs”, “ask a clarifying question”, or “call web search”.
- Judge the final answer against reference docs for hallucination‑resistance.
This is ideal for internal knowledge bases, policy bots, or documentation copilots.
4. Voice or call‑center agent
What it does: Handles voice calls: listens, reasons, acts, responds. OpenAI’s voice agents guides show how to combine the Agents SDK, real‑time models, and telephony to build this.
Pattern:
- Real‑time speech‑to‑speech or speech‑to‑text pipeline.
- Under the hood, the same planner–executor loop with tools, plus real‑time constraints (low latency, graceful handoff to human agents).
Start narrow (e.g., “reset my password” flows) with strict stop conditions and human handoff.
5. Multi‑agent “committee” for complex decisions
OpenAI’s cookbook includes a multi‑agent portfolio collaboration example: multiple agents playing roles like “risk analyst”, “portfolio optimizer”, and “compliance checker”.
You can adapt this whenever:
- The task naturally decomposes into distinct expert roles.
- You want explicit checks and balances (e.g., a “critic” agent reviewing another agent’s work).
This pattern is best as a later step once you’ve nailed single‑agent workflows; otherwise, you risk creating a very polite but expensive argument between three LLMs.
Starter references (for when you actually build)
If you want to go from “nice article” to “running agent”, these official resources are the shortest path:
- OpenAI Agents overview & Responses docs – Conceptual overview of agents, Responses API, built‑in tools, and observability.
- Agents SDK docs & track – Code‑first agents, planner–executor patterns, background mode, multi‑agent examples, and integration with external tracing providers.
- Agent Builder quickstarts – Visual canvas for agentic workflows; guides and videos on designing node‑based flows without writing orchestration code.
- Practical guide to building agents (PDF) – A deeper, more architectural look at agent workflows and trade‑offs, including contrasts between visual builders and code‑first SDKs.
Pair those with PractiqAI‑style deliberate practice—clear tasks, judges, and feedback loops—and you’ll not only understand agents, you’ll actually be able to design them, debug them, and ship them.
Because in the end, an agent is just your old friend the prompt… wrapped in loops, tools, memory, and guardrails, and finally given a job to do.

Paweł Brzuszkiewicz
PractiqAI Team
PractiqAI designs guided drills and feedback loops that make learning with AI feel like muscle memory training. Follow along for product notes and workflow ideas from the team.