How to Build With the OpenAI Responses API

May 29, 2026

The OpenAI Responses API gives you one interface for model calls, tool use, multimodal inputs, structured outputs, and agent-style workflows. If you are building LLM features in production, treat it as an application runtime boundary, not as a simple rename of Chat Completions.

The main implementation shift is state. A response can produce output, tool calls, reasoning metadata, and a response ID you may need in the next turn. Your app should decide what state to keep, what to replay, what to trace, and what to validate before any tool runs.

1. Start with the right mental model

In Chat Completions, many apps were built around one request containing a full message array:

system + user + assistant + tool messages => next assistant message

With the Responses API, you should think in terms of response objects and output items:

instructions + input + tools + previous_response_id => response with output items

That difference matters when you build agents. Your code needs to inspect the response, detect tool calls, execute tools, send tool outputs back, and continue until the model returns a final answer or your app hits a safety limit.
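
As a rough illustration of "inspecting the response", the sketch below splits output items into tool calls and final text. The helper name `splitOutputItems` and the sample object are hypothetical; the item shapes (`function_call` items with `call_id`, `name`, and `arguments`, and `message` items holding `output_text` parts) are assumed to match what the API returns.

```javascript
// Hypothetical helper: separate tool calls from assistant text in one response.
function splitOutputItems(response) {
  const functionCalls = [];
  const textParts = [];

  for (const item of response.output ?? []) {
    if (item.type === "function_call") {
      functionCalls.push({
        callId: item.call_id,
        name: item.name,
        rawArguments: item.arguments, // still a JSON string at this point
      });
    } else if (item.type === "message") {
      for (const part of item.content ?? []) {
        if (part.type === "output_text") textParts.push(part.text);
      }
    }
    // Other item types (for example reasoning items) are ignored in this sketch.
  }

  return {
    functionCalls,
    text: textParts.join(""),
    isFinal: functionCalls.length === 0,
  };
}

// Hand-written sample standing in for a real API response.
const sampleResponse = {
  id: "resp_example",
  output: [
    {
      type: "function_call",
      call_id: "call_1",
      name: "get_order_status",
      arguments: '{"order_id":"ord_12345"}',
    },
  ],
};

const outputSummary = splitOutputItems(sampleResponse);
console.log(outputSummary.isFinal, outputSummary.functionCalls.length);
```

A loop can branch on `isFinal`: run tools when it is false, return `text` when it is true.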

If you already use OpenAI through PromptLayer, connect your calls through the OpenAI integration so you can trace request inputs, model outputs, latency, cost, prompt versions, and evaluation results in one place.

2. Make your first Responses API call

A minimal call usually needs a model, instructions, and input. Keep the instructions separate from user-controlled text. Do not place system-level rules inside the user prompt.

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

const response = await client.responses.create({
  model: "gpt-4.1-mini",
  instructions: "You are a concise support assistant. Answer with accurate, practical steps.",
  input: "How do I reset my API key?",
});

console.log(response.output_text);
console.log(response.id);

Store the response ID if the next request should continue from this state. Also store your own application-level conversation ID, user ID, prompt version, and trace ID. Do not rely on provider state as your only source of truth.

3. Handle response IDs deliberately

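The previous_response_id field lets you continue from an earlier response without sending the full conversation every time. This can reduce request size and simplify multi-step flows. Two details matter here: the instructions from the previous response are not carried over, so send them again on each call, and the earlier turns in the chain still count as billed input tokens.

const followUp = await client.responses.create({
  model: "gpt-4.1-mini",
  // Instructions are not inherited from the previous response, so resend them.
  instructions: "You are a concise support assistant. Answer with accurate, practical steps.",
  previous_response_id: response.id,
  input: "Can you give me the exact steps for the dashboard?",
});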

Use this carefully. Your production system should answer these questions before you ship:

  • What response ID maps to this user session? Store it with your own conversation record.
  • When do you reset state? Reset when the user starts a new task, changes tenant, changes permissions, or crosses a security boundary.
  • What do you replay for debugging? Keep traces with the effective prompt, input, tool calls, tool outputs, model, latency, and final output.
  • What happens if provider state is unavailable? Keep enough local state to recover or gracefully restart the workflow.
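
One way to encode the reset rules above is a small function that decides whether a stored response ID may be reused. This is a sketch only: the record fields (`lastResponseId`, `tenantId`, `userId`, `permissionsVersion`) and the `startsNewTask` flag are hypothetical names for data your own application would keep.

```javascript
// Hypothetical: decide whether the next request may continue from stored state.
// `record` is your own conversation row; `context` describes the incoming request.
function nextRequestState(record, context) {
  const crossedBoundary =
    record.tenantId !== context.tenantId ||
    record.userId !== context.userId ||
    record.permissionsVersion !== context.permissionsVersion ||
    context.startsNewTask === true;

  if (crossedBoundary || !record.lastResponseId) {
    // Start fresh: do not send previous_response_id at all.
    return { previousResponseId: undefined, reset: true };
  }
  return { previousResponseId: record.lastResponseId, reset: false };
}

const storedRecord = {
  lastResponseId: "resp_1",
  tenantId: "tenant_a",
  userId: "user_1",
  permissionsVersion: 3,
};

const sameSession = nextRequestState(storedRecord, {
  tenantId: "tenant_a",
  userId: "user_1",
  permissionsVersion: 3,
});
const otherTenant = nextRequestState(storedRecord, {
  tenantId: "tenant_b",
  userId: "user_1",
  permissionsVersion: 3,
});
console.log(sameSession.reset, otherTenant.reset);
```

Keeping this decision in one function makes the reset policy easy to review and test separately from the model call.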

For long conversations, be intentional about what context the model sees. Important facts can get buried when prompts grow, which can cause lost-in-the-middle behavior. Summarize old turns, keep current task details near the end, and store durable facts outside the prompt when possible.
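
The context-shaping advice above might look like the following when you send explicit input instead of relying on provider state. It is a minimal sketch: `buildCompactInput` and its parameters are hypothetical, the summary is assumed to come from your own summarization step, and the input is assumed to be accepted as an array of role/content messages.

```javascript
// Hypothetical: build a compact input array from a summary, recent turns, and the current task.
function buildCompactInput({ summary, turns, currentTask, keepLast = 4 }) {
  const recentTurns = keepLast > 0 ? turns.slice(-keepLast) : [];
  const input = [];

  if (summary) {
    // The summary is derived from user content, so it stays out of trusted instructions.
    input.push({
      role: "user",
      content: `Summary of earlier conversation (generated by the app): ${summary}`,
    });
  }

  for (const turn of recentTurns) {
    input.push({ role: turn.role, content: turn.content });
  }

  // Current task details go last so they sit near the end of the context.
  input.push({ role: "user", content: currentTask });
  return input;
}

const olderTurns = [
  { role: "user", content: "My order is ord_12345." },
  { role: "assistant", content: "Thanks, I found it." },
  { role: "user", content: "It was meant to be a gift." },
  { role: "assistant", content: "Noted." },
];

const compactInput = buildCompactInput({
  summary: "Customer asked about order ord_12345, which is a gift.",
  turns: olderTurns,
  currentTask: "Can I change the delivery address?",
  keepLast: 2,
});
console.log(compactInput.length);
```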

4. Build the tool-calling loop

Most agentic apps need a loop:

  1. Send user input and available tools to the model.
  2. Inspect the response output.
  3. If the model requested a tool, validate the arguments.
  4. Run the tool with your app permissions.
  5. Send the tool result back using the tool call ID.
  6. Repeat until the model returns a final answer or you hit a limit.

Here is a simplified JavaScript example using a function tool:

const tools = [
  {
    type: "function",
    name: "get_order_status",
    description: "Get the current status of a customer order.",
    parameters: {
      type: "object",
      additionalProperties: false,
      properties: {
        order_id: {
          type: "string",
          description: "The order ID, such as ord_12345"
        }
      },
      required: ["order_id"]
    }
  }
];

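// Keep trusted instructions in one place so every call in the loop can resend them.
const instructions = "Help users check order status. Use tools when needed.";

let response = await client.responses.create({
  model: "gpt-4.1-mini",
  instructions,
  input: "Where is order ord_12345?",
  tools
});

for (let step = 0; step < 5; step++) {
  // Collect any function calls the model requested in this response.
  const functionCalls = response.output.filter(
    item => item.type === "function_call"
  );

  // No tool calls means the model produced its final answer.
  if (functionCalls.length === 0) {
    console.log(response.output_text);
    break;
  }

  const toolOutputs = [];

  for (const call of functionCalls) {
    // Arguments arrive as a JSON string. Validate them before use (see section 5).
    const args = JSON.parse(call.arguments);

    const result = await getOrderStatus({
      orderId: args.order_id
    });

    // Link each result to its request through call_id.
    toolOutputs.push({
      type: "function_call_output",
      call_id: call.call_id,
      output: JSON.stringify(result)
    });
  }

  // Instructions are not carried over with previous_response_id, so pass them again.
  response = await client.responses.create({
    model: "gpt-4.1-mini",
    instructions,
    previous_response_id: response.id,
    input: toolOutputs,
    tools
  });
}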

Add a hard step limit. A value between 3 and 8 is enough for many customer support, internal search, code assistance, and data lookup workflows. If the loop hits the limit, return a controlled failure response and log the trace for review.
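
A sketch of the step limit with a controlled failure could look like this. The function `runToolLoop`, its result shape, and the fallback message are hypothetical; `client` is assumed to expose `responses.create` like the SDK client used above, and `handlers` is a Map from tool name to a function you own.

```javascript
// Hypothetical wrapper: bounded tool loop that returns a controlled failure at the limit.
async function runToolLoop({ client, request, handlers, maxSteps = 5 }) {
  let nextRequest = request;

  for (let step = 1; step <= maxSteps; step++) {
    const response = await client.responses.create(nextRequest);
    const calls = response.output.filter((item) => item.type === "function_call");

    if (calls.length === 0) {
      return { ok: true, text: response.output_text, steps: step };
    }

    const outputs = [];
    for (const call of calls) {
      let result;
      try {
        const handler = handlers.get(call.name);
        if (!handler) throw new Error(`Unknown tool: ${call.name}`);
        result = await handler(JSON.parse(call.arguments));
      } catch {
        // Give the model a safe error, never a stack trace.
        result = { error: "tool_failed" };
      }
      outputs.push({
        type: "function_call_output",
        call_id: call.call_id,
        output: JSON.stringify(result),
      });
    }

    // Resend model, instructions, and tools; only the input and previous ID change.
    nextRequest = { ...request, previous_response_id: response.id, input: outputs };
  }

  // Limit reached: stop calling the model and surface a controlled failure.
  return {
    ok: false,
    text: "Sorry, I could not complete that request. Please try again.",
    steps: maxSteps,
  };
}
```

The caller can log the trace whenever `ok` is false and show `text` to the user either way.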

5. Validate tool arguments before execution

Tool schemas guide the model, but they are not a security boundary. Validate every argument in your application before calling a database, queue, browser, payment API, file system, or internal service.

For example, if a tool expects an order ID, do more than check that it is a string:

  • Validate the shape, such as ord_ followed by digits or a UUID.
  • Check that the signed-in user has access to that order.
  • Reject extra fields if your tool does not need them.
  • Apply rate limits for expensive or sensitive tools.
  • Return safe errors to the model, not stack traces or secrets.

import { z } from "zod";

const OrderStatusArgs = z.object({
  order_id: z.string().regex(/^ord_[0-9]+$/)
}).strict();

function parseOrderStatusArgs(rawArguments) {
  let parsed;

  try {
    parsed = JSON.parse(rawArguments);
  } catch {
    throw new Error("Invalid JSON tool arguments");
  }

  return OrderStatusArgs.parse(parsed);
}

Use separate validation for model output and tool output. A tool result can also contain unsafe or irrelevant text, especially if it comes from search results, documents, tickets, emails, or customer-submitted content.
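
For the tool-output side, one possible approach is to pass the model only an allowlist of primitive fields, with long strings truncated. Everything here is an assumption for illustration: the helper names, the 2,000-character budget, and the field names in the sample result.

```javascript
// Assumed budget for any single string field returned to the model.
const MAX_FIELD_CHARS = 2000;

// Hypothetical: keep only allowlisted primitive fields from a tool result.
function toModelSafeOutput(result, allowedFields) {
  const safe = {};
  for (const field of allowedFields) {
    if (!Object.prototype.hasOwnProperty.call(result, field)) continue;
    const value = result[field];
    if (typeof value === "string") {
      safe[field] = value.slice(0, MAX_FIELD_CHARS);
    } else if (typeof value === "number" || typeof value === "boolean") {
      safe[field] = value;
    }
    // Nested objects and other types are dropped in this sketch.
  }
  return JSON.stringify(safe);
}

// Hypothetical: a fixed, non-revealing error payload for the model.
function toModelSafeError() {
  // Log the real error in your own system; the model only sees this.
  return JSON.stringify({ error: "tool_failed" });
}

const rawToolResult = {
  status: "shipped",
  eta_days: 2,
  internal_notes: "Do not disclose: flagged account",
};
const safeToolOutput = toModelSafeOutput(rawToolResult, ["status", "eta_days"]);
console.log(safeToolOutput);
```

An allowlist is used instead of a blocklist so that new internal fields stay hidden unless someone explicitly exposes them.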

6. Keep instructions out of user input

System and developer instructions belong in the API fields designed for them. User input should contain user-controlled content only.

A weak pattern looks like this:

input: `
You are a billing assistant.
Never reveal internal policy.
User question: ${userMessage}
`

A safer pattern separates trusted instructions from untrusted input:

await client.responses.create({
  model: "gpt-4.1-mini",
  instructions: `
You are a billing assistant.
Follow company policy.
Do not reveal private internal notes.
Ask for clarification if the account cannot be identified.
`,
  input: userMessage
});

This separation will not solve prompt injection by itself, but it makes your intent clearer, improves trace quality, and reduces accidental instruction mixing. For tool-using agents, also include tool-specific rules in trusted instructions, such as “Do not call refund tools unless the user is authenticated and the account status has been checked.”
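
Because instructions alone cannot enforce a rule like the refund example, a matching check in application code is a reasonable complement. The sketch below is hypothetical throughout: the tool name `issue_refund`, the policy fields, and the session flags are illustrative only.

```javascript
// Hypothetical per-tool policy table, enforced in code regardless of what the model asks for.
const TOOL_POLICIES = new Map([
  ["get_order_status", { requiresAuth: true, requiresAccountCheck: false }],
  ["issue_refund", { requiresAuth: true, requiresAccountCheck: true }],
]);

function authorizeToolCall(toolName, session) {
  const policy = TOOL_POLICIES.get(toolName);
  if (!policy) return { allowed: false, reason: "unknown_tool" };
  if (policy.requiresAuth && !session.authenticated) {
    return { allowed: false, reason: "not_authenticated" };
  }
  if (policy.requiresAccountCheck && !session.accountStatusChecked) {
    return { allowed: false, reason: "account_not_checked" };
  }
  return { allowed: true };
}

const refundDecision = authorizeToolCall("issue_refund", {
  authenticated: true,
  accountStatusChecked: false,
});
console.log(refundDecision.allowed, refundDecision.reason);
```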

7. Use structured outputs when your app needs a contract

If downstream code expects JSON, request structured output instead of asking the model to “return valid JSON” in a sentence. Use a schema with strict fields, then validate the returned value in your code.

const response = await client.responses.create({
  model: "gpt-4.1-mini",
  instructions: "Classify the support ticket.",
  input: "I was charged twice for my subscription.",
  text: {
    format: {
      type: "json_schema",
      name: "ticket_classification",
      schema: {
        type: "object",
        additionalProperties: false,
        properties: {
          category: {
            type: "string",
            enum: ["billing", "bug", "account", "feature_request"]
          },
          priority: {
            type: "string",
            enum: ["low", "medium", "high"]
          },
          summary: {
            type: "string"
          }
        },
        required: ["category", "priority", "summary"]
      },
      strict: true
    }
  }
});

Still validate the parsed object. Treat the schema as a model constraint and your application validator as the final gate.
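
A minimal application-side gate for the schema above might look like this. It is a hand-rolled sketch with a hypothetical function name and error codes; a schema library such as zod, used earlier in this article, would work equally well.

```javascript
const TICKET_CATEGORIES = ["billing", "bug", "account", "feature_request"];
const TICKET_PRIORITIES = ["low", "medium", "high"];

// Hypothetical validator: returns { ok: true, value } or { ok: false, error }.
function parseTicketClassification(outputText) {
  let value;
  try {
    value = JSON.parse(outputText);
  } catch {
    return { ok: false, error: "invalid_json" };
  }

  if (value === null || typeof value !== "object" || Array.isArray(value)) {
    return { ok: false, error: "not_an_object" };
  }

  // Exactly the three expected keys, no more and no fewer.
  const keys = Object.keys(value).sort().join(",");
  if (keys !== "category,priority,summary") {
    return { ok: false, error: "unexpected_fields" };
  }
  if (!TICKET_CATEGORIES.includes(value.category)) {
    return { ok: false, error: "invalid_category" };
  }
  if (!TICKET_PRIORITIES.includes(value.priority)) {
    return { ok: false, error: "invalid_priority" };
  }
  if (typeof value.summary !== "string" || value.summary.trim() === "") {
    return { ok: false, error: "invalid_summary" };
  }
  return { ok: true, value };
}

const ticketCheck = parseTicketClassification(
  '{"category":"billing","priority":"high","summary":"Charged twice."}'
);
console.log(ticketCheck.ok);
```

In the example above you would call it as `parseTicketClassification(response.output_text)` and branch on `ok`.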

8. Add retries, timeouts, and cancellation

Production LLM calls fail in normal ways: network errors, rate limits, model timeouts, overloaded dependencies, and tool failures. Your app should handle these cases without hanging the user request.

  • Set a request timeout. For interactive UI flows, 15 to 45 seconds is a reasonable starting range.
  • Retry transient failures. Retry 429, 500, 502, 503, and 504 responses with exponential backoff.
  • Do not retry unsafe tool side effects blindly. For refunds, purchases, emails, database writes, and ticket updates, use idempotency keys.
  • Cancel abandoned work. If the user closes the session or starts a new request, cancel or ignore stale responses.
  • Return controlled errors. Give users a clear next step instead of exposing provider errors.
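
To illustrate the idempotency point, the sketch below runs a side-effecting action at most once per key. The in-memory Map is an assumption made for brevity; a real deployment would need a durable, shared store. The key format (conversation ID plus tool call ID) is also hypothetical.

```javascript
// Assumption: in-memory store for illustration only. Use a durable store in production.
const actionResults = new Map();

// Hypothetical: run `action` once per key; repeated or concurrent calls share the result.
function runOnce(idempotencyKey, action) {
  if (actionResults.has(idempotencyKey)) {
    return actionResults.get(idempotencyKey);
  }
  const promise = Promise.resolve()
    .then(action)
    .catch((error) => {
      // Forget failed attempts so a later retry can run the action again.
      actionResults.delete(idempotencyKey);
      throw error;
    });
  actionResults.set(idempotencyKey, promise);
  return promise;
}

// Hypothetical key: one side effect per tool call within a conversation.
function toolCallKey(conversationId, callId) {
  return `${conversationId}:${callId}`;
}
```

Storing the promise instead of the resolved value means two overlapping retries of the same tool call wait on one execution.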

A simple retry setup might use 2 retries with backoff delays around 500 ms and 1,500 ms for interactive calls. Batch jobs can tolerate more retries and longer backoff.
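
The retry setup described above can be sketched as a small wrapper. The helper `withRetries` is hypothetical, and it assumes thrown errors carry an HTTP `status` property. The official Node SDK also exposes `maxRetries` and `timeout` client options, which may be enough on their own for simple cases.

```javascript
const RETRYABLE_STATUS = new Set([429, 500, 502, 503, 504]);

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical: retry transient failures with a fixed backoff schedule.
// delaysMs of [500, 1500] means at most 2 retries (3 attempts in total).
async function withRetries(fn, { delaysMs = [500, 1500], sleep = defaultSleep } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      const retryable = RETRYABLE_STATUS.has(error?.status);
      // Errors without a retryable status (including 4xx) are rethrown immediately.
      if (!retryable || attempt >= delaysMs.length) throw error;
      await sleep(delaysMs[attempt]);
    }
  }
}
```

A call site might wrap only the model request, for example `withRetries(() => client.responses.create(request))`, and keep side-effecting tools outside the wrapper unless they use idempotency keys.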

9. Trace every step of an agent run

An agent run is harder to debug than a single completion. You need visibility into each model call, selected tool, tool arguments, tool result, final answer, cost, and latency.

At minimum, capture:

  • Model name and parameters
  • Prompt or instruction version
  • User input, with sensitive data redacted where needed
  • Response ID and previous response ID
  • Output items, including tool calls
  • Validated tool arguments
  • Tool execution result or error class
  • Total steps in the loop
  • Final user-visible answer
  • Evaluation results, if the run is part of a test set
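
A per-step trace record covering most of the fields above might be assembled like this. The record layout, helper names, and redaction patterns are assumptions for illustration, and the two regular expressions are deliberately simple rather than a complete redaction strategy.

```javascript
// Hypothetical, deliberately simple redaction of emails and long digit runs.
function redact(text) {
  return text
    .replace(/[\w.+-]+@[\w-]+\.[\w.-]+/g, "[email]")
    .replace(/\b\d{13,16}\b/g, "[card]");
}

// Hypothetical trace record for one model call inside an agent run.
function buildTraceStep({ response, previousResponseId, promptVersion, userInput, toolResults, startedAt, finishedAt }) {
  return {
    model: response.model,
    promptVersion,
    responseId: response.id,
    previousResponseId: previousResponseId ?? null,
    userInput: redact(userInput),
    toolCalls: response.output
      .filter((item) => item.type === "function_call")
      .map((item) => ({ callId: item.call_id, name: item.name })),
    toolResults, // for example [{ callId, ok, errorClass }]
    latencyMs: finishedAt - startedAt,
  };
}

const traceStep = buildTraceStep({
  response: {
    id: "resp_2",
    model: "gpt-4.1-mini",
    output: [{ type: "function_call", call_id: "call_9", name: "get_order_status", arguments: "{}" }],
  },
  previousResponseId: "resp_1",
  promptVersion: "support-v3",
  userInput: "Contact jane@example.com about card 4242424242424242",
  toolResults: [{ callId: "call_9", ok: true }],
  startedAt: 1000,
  finishedAt: 1850,
});
console.log(traceStep.userInput);
```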

If you are using OpenAI’s agent tooling, the OpenAI Agents SDK integration can help your team inspect agent traces and compare behavior across prompt versions.

10. Test with evals before you ship

Manual testing with 10 happy-path prompts is not enough for an agentic workflow. Build an evaluation set that matches real production traffic.

For a support agent, include cases like:

  • Simple answer with no tool needed
  • Tool lookup with valid user permissions
  • Tool lookup where the user lacks access
  • Prompt injection inside a ticket, email, or document
  • Ambiguous user request
  • Missing account ID or order ID
  • Tool timeout
  • Malformed tool result
  • Long conversation with an old important detail
  • User asks for an action the agent must refuse

Score the run on concrete criteria. For example:

  • Task success: Did the agent solve the user’s request?
  • Tool correctness: Did it call the right tool with valid arguments?
  • Permission handling: Did it avoid exposing data across accounts?
  • Format compliance: Did it return the expected schema?
  • Latency: Did the run finish under your product target, such as 8 seconds for chat or 60 seconds for background work?

Run these evals when you change prompts, models, tools, schemas, retrieval settings, or state handling. A small suite of 50 realistic cases can catch regressions that casual testing will miss.
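
The scoring criteria above can be turned into code with very little machinery. In this sketch the case fields (`expectedTool`, `mustRefuse`, `maxLatencyMs`) and the run fields (`toolCalls`, `refused`, `latencyMs`) are hypothetical; your own harness would produce the run objects by executing the agent.

```javascript
// Hypothetical scoring of one recorded agent run against one eval case.
function scoreRun(testCase, run) {
  const calledTools = run.toolCalls.map((call) => call.name);
  const toolCorrect =
    testCase.expectedTool === null
      ? calledTools.length === 0
      : calledTools.includes(testCase.expectedTool);
  const refusedCorrectly = testCase.mustRefuse ? run.refused === true : run.refused !== true;
  const withinLatency = run.latencyMs <= testCase.maxLatencyMs;

  return {
    name: testCase.name,
    toolCorrect,
    refusedCorrectly,
    withinLatency,
    passed: toolCorrect && refusedCorrectly && withinLatency,
  };
}

// Fraction of cases that passed every check.
function passRate(scores) {
  if (scores.length === 0) return 0;
  return scores.filter((score) => score.passed).length / scores.length;
}

const evalScores = [
  scoreRun(
    { name: "valid lookup", expectedTool: "get_order_status", mustRefuse: false, maxLatencyMs: 8000 },
    { toolCalls: [{ name: "get_order_status" }], refused: false, latencyMs: 3200 }
  ),
  scoreRun(
    { name: "no access", expectedTool: null, mustRefuse: true, maxLatencyMs: 8000 },
    { toolCalls: [{ name: "get_order_status" }], refused: false, latencyMs: 2100 }
  ),
];
console.log(passRate(evalScores));
```

Tracking the per-criterion flags, not only `passed`, makes it easier to see whether a regression came from tool selection, refusals, or latency.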

11. Avoid common implementation mistakes

Treating the Responses API as a drop-in rename

Do not migrate by swapping endpoint names and leaving the rest of your architecture unchanged. Update your parsing, state handling, tracing, tool loop, and error handling for response objects and output items.

Ignoring response IDs

If your app uses multi-turn state, store response IDs intentionally. If your app should be stateless, avoid accidental continuation from old responses.

Trusting tool arguments

The model can produce invalid, stale, or unsafe arguments. Validate them before execution and check permissions inside the tool handler.

Mixing trusted instructions with user content

Keep instructions in trusted fields. Put user text in input. Treat retrieved documents, web pages, tickets, and emails as untrusted content too.

Skipping retries and timeouts

Every production path needs bounded runtime. Add timeouts, retries for transient errors, and safe behavior for partial failures.

Shipping without traces or evals

If you cannot inspect what the agent did, you cannot reliably fix it. Trace every step and test against realistic cases before rollout.

12. Production checklist

  • Use separate fields for trusted instructions and user input.
  • Store your own conversation ID, response ID, prompt version, and trace ID.
  • Set a max step count for tool loops.
  • Validate all tool arguments with strict schemas.
  • Check user permissions inside every sensitive tool.
  • Use idempotency keys for side-effecting actions.
  • Set request timeouts and retry transient failures.
  • Log response IDs, tool calls, tool outputs, latency, and final answers.
  • Build evals from real user requests and edge cases.
  • Compare new prompts and models before rolling changes into production.

The Responses API is a strong foundation for agentic LLM apps when you build around its actual shape: response state, output items, tool calls, structured outputs, and traceable execution. The teams that get reliable results usually treat prompts, tools, evals, and observability as part of the same system.


PromptLayer helps AI engineering teams manage prompts, trace OpenAI Responses API runs, evaluate changes, and debug agent workflows before they reach users. Create a free account at https://dashboard.promptlayer.com/create-account.
