How to Ship Progressive Agents
A progressive agent is an LLM-powered workflow that gains agency in controlled steps. You start with a narrow, observable workflow. Then you add planning, tool use, memory, branching, retries, and approval gates only when the previous stage has passed tests in realistic conditions.
This is different from launching a general chatbot and hoping prompts keep it in bounds. A progressive agent has a defined job, known inputs, allowed tools, measurable outputs, and a rollout path. Each new capability needs an eval, a trace, and a rollback plan.
For example, a support refund agent should not begin with full CRM write access. A safer progression looks like this:
- Classify the ticket and suggest a refund policy section.
- Draft a reply using approved policy text.
- Read order details through a restricted tool.
- Recommend a refund decision with citations.
- Submit the refund only after reviewer approval.
- Auto-submit refunds under $25 after passing live evals for several weeks.
The agent becomes more useful over time, but each step is tied to evidence. That is the core shipping pattern.
Start with a bounded workflow
Before you add agent behavior, define the workflow as a normal engineering system. Write down the job, inputs, outputs, failure modes, and owner. If you cannot describe the workflow without using the word “agent,” it is probably too vague.
A good first version should answer these questions:
- What decision is the model allowed to make? Example: classify a support ticket as refund, billing, account, or technical.
- What decision is outside scope? Example: issuing a refund, changing a subscription, or promising a credit.
- What data can it read? Example: ticket text, order amount, plan type, and refund policy.
- What data can it write? Example: draft reply text and internal notes only.
- What is a bad output? Example: refund approval without policy support, invented order data, or missing citation.
- Who reviews failures? Example: support ops reviews flagged traces every weekday morning.
This first version often looks closer to a static agent than an open-ended planner. That is fine. Static workflows are easier to evaluate, debug, and deploy. You can add more flexible behavior later.
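The answers to those questions can be written down as data so that scope is enforced by code rather than remembered by the team. The following is a minimal sketch in Python, not a prescribed implementation; `WorkflowSpec`, `SUPPORT_TRIAGE`, `validate_output`, and the field names are hypothetical and chosen only to mirror the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowSpec:
    """Hypothetical description of what the first version may decide and write."""
    allowed_labels: frozenset
    writable_fields: frozenset
    owner: str


SUPPORT_TRIAGE = WorkflowSpec(
    allowed_labels=frozenset({"refund", "billing", "account", "technical"}),
    writable_fields=frozenset({"draft_reply", "internal_note"}),
    owner="support-ops",
)


def validate_output(spec: WorkflowSpec, output: dict) -> list:
    """Return scope violations; an empty list means the output is in bounds."""
    problems = []
    label = output.get("label")
    if label not in spec.allowed_labels:
        problems.append(f"label out of scope: {label!r}")
    # Every field other than the label must be explicitly writable.
    for name in output:
        if name != "label" and name not in spec.writable_fields:
            problems.append(f"write out of scope: {name}")
    return problems
```

With this shape, a model output that tries to set a refund amount is rejected before it reaches any downstream system, and the rejection can be logged as a failure for the owner to review.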
Use an agent capability matrix
A capability matrix helps your team decide what to ship now, what to gate, and what to reject. It also prevents vague debates about whether the agent is “ready.”
| Capability | Current status | Required eval | Permission level | Rollback action |
|---|---|---|---|---|
| Ticket classification | Enabled in production | 95% accuracy on 1,000 labeled tickets | Read ticket only | Return to rule-based routing |
| Draft support reply | Enabled for reviewers | 90% policy citation match | Write draft only | Disable generated drafts |
| Read order details | Beta cohort | Zero unauthorized field access in traces | Read restricted fields | Remove order lookup tool |
| Recommend refund | Internal testing | Less than 2% incorrect recommendations | Recommend only | Hide recommendation panel |
| Submit refund | Blocked | Live shadow test for 30 days | Write financial action | Keep reviewer approval required |
Update this matrix during each release review. If a capability does not have an eval and rollback action, it should not ship.
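The rule that a capability without an eval and a rollback action should not ship is easy to check mechanically. This sketch assumes the matrix is kept as structured data; the `Capability` class and its field names are illustrative, and the second row is a made-up incomplete entry used only to show the check failing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Capability:
    """One row of the capability matrix (field names are illustrative)."""
    name: str
    status: str
    required_eval: Optional[str]
    permission_level: str
    rollback_action: Optional[str]


def missing_requirements(cap: Capability) -> List[str]:
    """List what a capability still needs before it can be considered for release."""
    missing = []
    if not cap.required_eval:
        missing.append("required_eval")
    if not cap.rollback_action:
        missing.append("rollback_action")
    return missing


MATRIX = [
    Capability("Ticket classification", "Enabled in production",
               "95% accuracy on 1,000 labeled tickets", "Read ticket only",
               "Return to rule-based routing"),
    # Hypothetical incomplete row, included only to show the check failing.
    Capability("Summarize account history", "Proposed", None, "Read account", None),
]

# Names of capabilities that should not ship yet.
NOT_READY = [cap.name for cap in MATRIX if missing_requirements(cap)]
```

Passing this check is necessary, not sufficient: the eval still has to be run and met before the release review approves the capability.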
Design the before and after workflow
Progressive agents work best when the team can compare the old workflow with the proposed agent workflow. Keep the comparison simple enough that product, support, security, and engineering can all review it.
Before
- User submits a support ticket.
- Routing rules assign a queue.
- Support agent reads the policy page.
- Support agent opens the order record.
- Support agent drafts a response.
- Reviewer approves refunds above $50.
- Support agent sends the response.
After initial agent rollout
- User submits a support ticket.
- LLM classifies the ticket and cites the reason.
- Workflow retrieves the relevant policy section.
- Agent drafts a response with policy citations.
- Reviewer accepts, edits, or rejects the draft.
- Rejected drafts enter the eval dataset.
- Accepted drafts are logged with trace metadata.
Notice that the first rollout does not require tool writes, long-horizon planning, or autonomous action. It removes repetitive work while preserving reviewer control over customer-facing output.
Add tools after evals, not before
Tool access makes an agent more capable and harder to debug. Each tool should have a reason to exist, a strict input schema, a permission level, and trace coverage. Avoid giving the model a broad internal API wrapper. Use narrow tools with names that describe a single action.
| Tool | Allowed use | Blocked use | Approval needed | Logging requirement |
|---|---|---|---|---|
| get_order_summary | Read order total, date, status, and product names | Read payment details or full address | No | Log ticket ID, order ID, fields returned |
| get_refund_policy | Retrieve approved policy text by product and region | Retrieve draft policy documents | No | Log policy version and section ID |
| create_refund_recommendation | Write an internal recommendation | Submit refund to payment system | No | Log rationale, citations, confidence, reviewer decision |
| submit_refund | Submit approved refund under configured limits | Submit refund without approval or outside policy | Yes | Log approver, amount, policy citation, final status |
If you are using the OpenAI Agents SDK, keep tool definitions small and trace every tool call. The model should not decide permissions. Your application code should enforce permissions before and after tool execution.
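Enforcing permissions before and after tool execution can be sketched in plain Python, independent of any SDK. This example does not use the OpenAI Agents SDK API; `fetch_order`, `ToolPermissionError`, and the function signatures are assumptions made for illustration, and the tool names and field lists come from the table above.

```python
from typing import Callable, Optional

ALLOWED_ORDER_FIELDS = frozenset({"total", "date", "status", "product_names"})


class ToolPermissionError(Exception):
    """Raised when a tool call falls outside what the application allows."""


def fetch_order(order_id: str) -> dict:
    """Stand-in for a real order service; returns more fields than the agent may see."""
    return {
        "total": 42.00,
        "date": "2025-02-10",
        "status": "delivered",
        "product_names": ["annual_subscription"],
        "payment_details": "placeholder-sensitive-value",
    }


def get_order_summary(order_id: str, fetch: Callable[[str], dict] = fetch_order) -> dict:
    """Read tool: filter after execution so blocked fields never reach the model."""
    record = fetch(order_id)
    return {k: v for k, v in record.items() if k in ALLOWED_ORDER_FIELDS}


def submit_refund(amount: float, approver: Optional[str], limit: float) -> dict:
    """Write tool: check approval and the configured limit before execution."""
    if not approver:
        raise ToolPermissionError("submit_refund requires reviewer approval")
    if amount > limit:
        raise ToolPermissionError(f"amount {amount} exceeds configured limit {limit}")
    # A real implementation would call the payment system here.
    return {"status": "submitted", "amount": amount, "approver": approver}
```

The model never sees the permission logic. It can request `submit_refund`, but the application decides whether the call runs, and a refused call becomes a trace entry rather than a financial action.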
Pick the right agent pattern for the stage
Progressive does not mean every release needs more autonomy. Sometimes the correct next step is better retrieval, cleaner schemas, or stricter evals.
Use a simple routing workflow when the agent only needs to classify and send work to the right path. Use a planner only when the task requires sequencing. Use a dynamic workflow only after you have enough trace data to know where fixed paths are breaking down.
- Stage 1: Static workflow. Fixed steps, narrow prompts, no tool writes. Best for first production release.
- Stage 2: Tool-assisted workflow. The model can call read-only tools. Best when answers require private or changing data.
- Stage 3: Plan and execute workflow. The model creates a plan, then executes approved steps. Best for multi-step tasks with clear success criteria. See plan-and-execute agents for the core pattern.
- Stage 4: Dynamic workflow. The agent chooses paths based on state, tool results, and previous outcomes. Use this when fixed flows create too many dead ends. Read more about dynamic agents before shipping this pattern.
Most teams should spend more time in stages 1 and 2 than they expect. Those stages produce the data you need for safer planning and branching later.
Build evals for each new capability
An eval should map to a specific risk. Do not use one generic “quality” score for the whole agent. A support refund agent needs separate evals for classification, policy citation, tool selection, refund recommendation, tone, and refusal behavior.
A practical eval suite might include:
- Golden dataset: 500 to 2,000 labeled examples taken from real production cases, cleaned for privacy.
- Regression set: 50 to 200 cases that previously failed or caused reviewer edits.
- Adversarial set: 100 cases with missing data, angry users, policy exceptions, prompt injection, and conflicting instructions.
- Tool-use tests: Cases that require the correct tool, no tool, or a refusal to call a tool.
- End-to-end traces: Full workflow runs that verify final output, intermediate decisions, and tool arguments.
Set release gates before you run the eval. For example:
| Eval | Minimum to ship | Blocker condition | Owner |
|---|---|---|---|
| Ticket classification | 95% accuracy | Any class below 90% | ML engineer |
| Policy citation | 92% exact section match | Any invented citation | Support ops |
| Tool selection | 98% correct call or no-call | Unauthorized tool attempt | Backend engineer |
| Refund recommendation | 98% safe recommendation rate | Incorrect approval above $50 | Product owner |
| Prompt injection resistance | 100% refusal on known injection set | Any instruction override accepted | Security reviewer |
These numbers are examples, not universal targets. A code migration agent, medical intake agent, or finance workflow may need stricter gates. The key is to make the gate explicit before the team sees the result.
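One way to keep gates explicit is to commit them as code before the eval runs. The sketch below assumes each eval reports a score and a count of blocker hits; `Gate`, `EvalResult`, and `failed_gates` are hypothetical names, and the thresholds are the example values from the table.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Gate:
    name: str
    minimum: float  # minimum score to ship, as a fraction between 0 and 1


@dataclass(frozen=True)
class EvalResult:
    score: float
    blocker_hits: int  # number of cases that matched the blocker condition


GATES = [
    Gate("ticket_classification", 0.95),
    Gate("policy_citation", 0.92),
    Gate("tool_selection", 0.98),
]


def failed_gates(gates: List[Gate], results: Dict[str, EvalResult]) -> List[str]:
    """Return the gates that block the release; an empty list means ship."""
    failures = []
    for gate in gates:
        result = results.get(gate.name)
        if result is None:
            # A missing eval blocks by default.
            failures.append(f"{gate.name}: no result")
        elif result.blocker_hits > 0:
            # A blocker hit overrides any score.
            failures.append(f"{gate.name}: blocker condition hit")
        elif result.score < gate.minimum:
            failures.append(f"{gate.name}: {result.score:.2%} below {gate.minimum:.2%}")
    return failures
```

Note that a blocker hit fails the gate even when the aggregate score clears the minimum, which matches the intent of a single invented citation blocking a release.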
Trace every step
Progressive agents need trace data because failures often happen between steps. The final answer may look fine while the agent used the wrong policy version, called an unnecessary tool, or ignored a low-confidence classifier result.
A useful trace should include:
- Prompt version and model version
- Input payload with sensitive fields redacted
- Retrieved documents and versions
- Tool calls, arguments, return values, and latency
- Intermediate model decisions
- Evaluator results
- Reviewer edits and final action
- User-visible output
A compact trace log can look like this:
{
  "trace_id": "tr_82f41",
  "workflow": "refund_agent_v2",
  "prompt_version": "refund_draft_2025_02_14",
  "model": "gpt-4.1",
  "ticket_type": "refund_request",
  "steps": [
    {
      "name": "classify_ticket",
      "output": "refund_request",
      "confidence": 0.97,
      "eval": "pass"
    },
    {
      "name": "get_refund_policy",
      "tool_args": {
        "region": "US",
        "product": "annual_subscription"
      },
      "policy_version": "2025-02-01",
      "eval": "pass"
    },
    {
      "name": "get_order_summary",
      "tool_args": {
        "order_id": "ord_redacted"
      },
      "fields_returned": ["total", "date", "status", "product_names"],
      "eval": "pass"
    },
    {
      "name": "draft_response",
      "citations": ["refund_policy.section_3.2"],
      "eval": "pass"
    }
  ],
  "reviewer_action": "edited",
  "reviewer_edit_reason": "tone_too_formal"
}
This format gives engineers enough detail to reproduce failures. It also gives product and operations teams enough context to decide whether the next rollout stage is ready.
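Traces in this shape can also be audited automatically. The sketch below assumes a trace has already been parsed into a Python dict with the keys shown in the sample; `audit_trace` and `ALLOWED_FIELDS` are hypothetical names, and the allowed field list is taken from the `get_order_summary` tool described earlier.

```python
# Fields each read tool is allowed to return, keyed by step name.
ALLOWED_FIELDS = {
    "get_order_summary": {"total", "date", "status", "product_names"},
}


def audit_trace(trace: dict) -> list:
    """Return findings for failed step evals and unauthorized fields in tool results."""
    findings = []
    for step in trace.get("steps", []):
        name = step.get("name", "<unnamed>")
        if step.get("eval") != "pass":
            findings.append(f"{name}: eval={step.get('eval')}")
        allowed = ALLOWED_FIELDS.get(name)
        if allowed is not None:
            extra = sorted(set(step.get("fields_returned", [])) - allowed)
            if extra:
                findings.append(f"{name}: unauthorized fields {extra}")
    return findings
```

Running a check like this over every trace is one way to back a gate such as "zero unauthorized field access in traces" with evidence instead of spot checks.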
Use reviewer edits as training and eval data
Reviewer edits are one of the best data sources for progressive agents. Capture the original output, edited output, edit reason, and final decision. Then route the example into the right dataset.
Use a small taxonomy for edit reasons:
- Wrong policy: The response cites or applies the wrong rule.
- Missing data: The response should have used order, account, or ticket details.
- Bad tone: The response is too formal, vague, cold, or apologetic.
- Unsafe action: The agent recommends an action outside policy.
- Hallucination: The response includes unsupported facts.
- Unclear reasoning: The recommendation lacks enough evidence for approval.
Do not dump all edits into one fine-tuning bucket. Some edits indicate prompt changes. Some indicate missing retrieval data. Some indicate a product policy gap. Some should become regression tests. Label first, then decide the fix.
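Labeling first and deciding the fix second can be sketched as a routing step. The mapping below is an assumption for illustration only: the reason codes follow the taxonomy above, but the destination names (`retrieval_review`, `prompt_review`, `regression_set`, `manual_triage`) and the choice of destination for each reason are hypothetical and should reflect how your team actually triages.

```python
from collections import defaultdict

# Hypothetical mapping from edit reason code to the first place to look for a fix.
REASON_TO_QUEUE = {
    "wrong_policy": "retrieval_review",
    "missing_data": "retrieval_review",
    "bad_tone": "prompt_review",
    "unclear_reasoning": "prompt_review",
    "unsafe_action": "regression_set",
    "hallucination": "regression_set",
}


def route_edits(edits: list) -> dict:
    """Group reviewer edits by destination; unknown reasons go to manual triage."""
    queues = defaultdict(list)
    for edit in edits:
        queue = REASON_TO_QUEUE.get(edit.get("reason"), "manual_triage")
        queues[queue].append(edit["trace_id"])
    return dict(queues)
```

Sending unknown reasons to manual triage keeps unlabeled edits visible instead of silently dropping them or folding them into a fine-tuning set.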
Roll out in small cohorts
Progressive agents should ship through cohorts, not one global release. Start with internal users, then a small beta group, then a limited production cohort. Keep the ability to disable specific capabilities without turning off the whole workflow.
A rollout sequence can look like this:
- Internal shadow mode: Run the agent on historical tickets. Do not show output to support agents.
- Live shadow mode: Run the agent on live tickets. Compare output with actual support decisions.
- Reviewer assist: Show drafts and recommendations to trained reviewers only.
- Limited production: Use the agent for one queue, one region, or one product line.
- Expanded production: Add more queues after evals and trace review pass.
- Conditional automation: Allow narrow actions under strict thresholds, such as refunds under $25 with exact policy match.
Each phase should have a stop condition. Example stop conditions include an unsafe recommendation rate of 2% or higher, more than 5 unauthorized tool attempts in 24 hours, or a citation failure in any high-risk category.
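The example stop conditions can be evaluated from counters that your monitoring already collects. This is a minimal sketch: `stop_reasons` and its parameter names are hypothetical, and treating the 2% figure as an at-or-above threshold is an assumption.

```python
def stop_reasons(
    unsafe_recommendations: int,
    total_recommendations: int,
    unauthorized_tool_attempts_24h: int,
    high_risk_citation_failures: int,
) -> list:
    """Return the stop conditions that fired; an empty list means the phase may continue."""
    reasons = []
    # Integer comparison avoids floating-point edge cases at exactly 2%.
    if total_recommendations > 0 and unsafe_recommendations * 100 >= 2 * total_recommendations:
        reasons.append("unsafe recommendation rate at or above 2%")
    if unauthorized_tool_attempts_24h > 5:
        reasons.append("more than 5 unauthorized tool attempts in 24 hours")
    if high_risk_citation_failures > 0:
        reasons.append("citation failure in a high-risk category")
    return reasons
```

A non-empty result should pause the current phase and page the capability owner; it should not wait for the next scheduled review.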
Keep rollback boring
Rollback should be a normal path, not an emergency project. Build feature flags at the capability level. You should be able to disable tool writes while keeping classification and draft generation live.
Useful flags include:
- agent_enabled: Turns the whole workflow on or off.
- draft_generation_enabled: Controls user-visible or reviewer-visible drafts.
- read_tools_enabled: Allows read-only tool calls.
- write_tools_enabled: Allows state-changing actions.
- planner_enabled: Allows model-generated plans.
- auto_action_enabled: Allows actions without reviewer approval under configured limits.
Pair these flags with clear ownership. If nobody knows who can disable a risky capability at 2 a.m., the rollout plan is incomplete.
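Capability-level flags can be modeled as a small object that the workflow consults before each step. This sketch uses the flag names listed above; the `AgentFlags` class, its defaults, and the rule that automatic actions also require write tools are assumptions rather than a required design.

```python
from dataclasses import dataclass


@dataclass
class AgentFlags:
    """In-memory flags; a real system would load these from a flag service or config."""
    agent_enabled: bool = True
    draft_generation_enabled: bool = True
    read_tools_enabled: bool = True
    write_tools_enabled: bool = False
    planner_enabled: bool = False
    auto_action_enabled: bool = False

    def allows(self, capability: str) -> bool:
        """The master switch gates everything; auto actions also need write tools."""
        if not self.agent_enabled:
            return False
        if capability == "auto_action":
            return self.write_tools_enabled and self.auto_action_enabled
        # Unknown capability names are denied by default.
        return bool(getattr(self, f"{capability}_enabled", False))
```

Because each capability is checked separately, turning off `write_tools_enabled` stops refunds and automatic actions while drafts and classification keep running.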
Use a release checklist
Before each new capability ships, run a short release review. Keep it close to engineering reality. The goal is to catch missing evals, missing traces, and unclear permissions.
- The workflow has a named owner.
- The capability matrix is updated.
- Prompt versions are pinned.
- Model versions are pinned or change-controlled.
- Tool schemas are reviewed and tested.
- Permissions are enforced in application code.
- Eval thresholds are documented before the run.
- Regression tests pass.
- Prompt injection tests pass.
- Trace logging covers every model call and tool call.
- Reviewer feedback is captured with reason codes.
- Feature flags exist for rollback.
- Dashboards track volume, latency, cost, failure rate, and unsafe outputs.
- Support and operations teams know what changed.
- The next review date is scheduled.
Common mistakes to avoid
Adding planning too early
Planning adds value when the agent has to sequence actions. It adds risk when the task is a fixed path with a few conditional branches. If a decision tree handles 95% of cases, start there and use the model for classification, extraction, and drafting.
Giving broad tool access
A tool called update_customer is too broad for most early agents. Split it into narrow tools, such as update_shipping_address, add_internal_note, and create_refund_recommendation. Narrow tools are easier to test and safer to monitor.
Evaluating only the final answer
Final-answer evals miss many agent failures. Test the plan, retrieved context, tool choice, tool arguments, citations, and final response. A correct answer produced through an unsafe path should still fail.
Skipping live shadow mode
Offline evals help, but they rarely cover every production pattern. Live shadow mode shows latency, missing fields, policy drift, tool errors, and real user phrasing before the agent affects the workflow.
Confusing confidence with permission
A high model confidence score should not grant write access. Permissions should come from policy, user role, workflow stage, eval status, and application code checks.
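The separation between confidence and permission is easiest to see in code. In this sketch, `ActionContext` and `may_auto_submit` are hypothetical names, the $25 limit comes from the refund example earlier in the article, and the confidence value is carried only so it can be logged.

```python
from dataclasses import dataclass

AUTO_REFUND_LIMIT = 25.00  # refunds under this amount may qualify for automation


@dataclass(frozen=True)
class ActionContext:
    amount: float
    exact_policy_match: bool
    eval_gate_passed: bool
    auto_action_enabled: bool
    model_confidence: float  # recorded for traces, deliberately unused below


def may_auto_submit(ctx: ActionContext) -> bool:
    """Permission comes from flags, eval status, policy match, and limits only."""
    return (
        ctx.auto_action_enabled
        and ctx.eval_gate_passed
        and ctx.exact_policy_match
        and ctx.amount < AUTO_REFUND_LIMIT
    )
```

A request with 0.99 confidence and a failed eval gate is denied, while a request with modest confidence that satisfies every application check is allowed.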
A practical shipping pattern
If you are starting this week, use this sequence:
- Pick one workflow with clear business value and frequent repetition.
- Write the capability matrix for the first three releases.
- Build a static version with pinned prompts and no write tools.
- Create a golden dataset with at least 200 real examples, then grow it toward the 500 to 2,000 range as production cases accumulate.
- Add evals for classification, citation, unsafe output, and format.
- Run the agent in shadow mode and collect traces.
- Ship reviewer assist for one queue or team.
- Capture reviewer edits with reason codes.
- Add one read-only tool after evals show the need for it.
- Review trace failures weekly and update datasets.
This pattern gives your team a reliable path to more capable agents without pretending every workflow is ready for broad autonomy. The best progressive agents earn each new permission through tests, traces, and controlled production use.
PromptLayer helps AI teams manage prompts, run evals, trace agent workflows, and track production behavior as agents gain new capabilities. If you are building progressive agents, create an account at https://dashboard.promptlayer.com/create-account.