Structured Outputs: Constraining LLMs to Produce Machine-Readable Formats Like JSON

Structured outputs refer to the practice of constraining a large language model to produce responses in a predefined, machine-readable format—such as JSON, XML, or a specific programming class—rather than generating free-form text.

Structured outputs refer to the practice of constraining a large language model to produce responses in a predefined, machine-readable format—such as JSON, XML, or a specific programming class—rather than generating free-form text. This ensures that the model's response adheres strictly to a provided schema, making the data immediately usable by downstream software systems without the need for complex parsing or post-processing.

If you have ever tried to get a language model to return a clean JSON object, you know the struggle. You write a carefully worded prompt, provide examples, and explicitly instruct the model to output nothing but JSON. Nine times out of ten, it works. But on that tenth request, the model decides to be helpful and prepends the JSON with "Here is the data you requested:" or wraps it in markdown backticks.

That helpfulness breaks the parser, crashes the pipeline, and ruins the developer's day. Language models are probabilistic text generators. They predict the next most likely token based on their training data. Because they have seen countless examples of JSON wrapped in markdown blocks during training, those backticks are statistically likely to appear, regardless of how sternly you worded your prompt.

Structured outputs solve this problem by moving the constraint out of the prompt and into the generation process itself. This shift represents a fundamental maturation in how we build with AI, turning unpredictable text generators into reliable software components.

‍

The Evolution of Reliability

The journey toward reliable model outputs has gone through several distinct phases, each attempting to solve the parsing problem with increasing levels of sophistication.

In the early days, developers relied entirely on prompt engineering. This involved writing increasingly elaborate instructions, sometimes threatening the model with negative consequences if it deviated from the requested format. This approach was cheap but brittle. According to OpenAI, prompt-based formatting for complex schemas was only about 35.9% reliable (OpenAI, 2024). Developers had to build elaborate retry loops, catching parsing errors and sending the error back to the model with a plea to fix its formatting.

The next evolution was JSON mode, introduced by OpenAI in late 2023 and quickly adopted by other providers. JSON mode guaranteed that the model would produce a valid JSON object. This was a massive improvement over parsing raw text, but it had a critical limitation: it ensured the syntax was valid, but it did not guarantee that the structure matched what the developer actually needed. The model might return valid JSON, but it could invent new keys, omit required fields, or use the wrong data types. It solved the syntax problem but left the semantic structure unresolved.

The current standard is true structured outputs. This approach enforces strict adherence to a provided JSON Schema. If the schema requires a string called "company_name" and an integer called "employee_count", the model will output exactly that structure, every single time. When strict schema adherence is enabled, reliability jumps to 100% (OpenAI, 2024). The model is no longer just generating text; it is fulfilling a strict data contract.

‍

How the Constraints Work

There are three primary ways to achieve structured outputs today, ranging from simple API parameters to deep interventions in the model's inference process.

The most accessible method is using native provider APIs. Major model providers now offer built-in support for structured outputs. Developers supply a JSON Schema alongside their prompt, and the provider's infrastructure ensures the response matches that schema. This is the easiest path for teams building on hosted models, though implementation details vary significantly between providers. Under the hood, providers are doing the heavy lifting to ensure the model cannot deviate from the schema.

For teams hosting their own open-weight models, constrained decoding is the most powerful approach. This technique manipulates the token generation process at inference time. As the model prepares to generate the next token, a finite state machine evaluates which tokens would be valid according to the required schema. The system builds a mask vector that assigns a probability of zero to any token that would violate the schema.

If the schema requires a boolean value, the only valid next tokens are "true" or "false". The model is physically prevented from generating anything else. This approach, popularized by libraries like Outlines and SGLang, guarantees 100% compliance even with smaller, less capable models (Cooper, 2024). It can also speed up generation, as the system can skip calculating probabilities for tokens that are structurally invalid. If the schema requires a closing bracket, the system can simply insert it without asking the model to predict it.

The mechanics of constrained decoding are fascinating. When a language model generates text, it outputs a probability distribution over its entire vocabulary (often 50,000 to 100,000 tokens) for the next word. In unconstrained generation, the system samples from this entire distribution. In constrained decoding, the finite state machine acts as a filter. If the current state of the JSON generation requires a number, the filter masks out all tokens that represent letters or punctuation, leaving only digits. The model then samples from this restricted subset. This means the model doesn't have to "learn" how to format JSON; the formatting is enforced mathematically at the sampling layer.

The third approach is fine-tuning. This involves training a model specifically on structured input-output pairs. While this requires significant upfront investment in dataset preparation and compute, it results in a model that naturally gravitates toward the desired structure without requiring large schemas in the prompt context. This is particularly useful for high-volume tasks where prompt token costs need to be minimized.

‍

The Provider Landscape

The ecosystem for structured outputs is currently fragmented, with different model providers taking different approaches to implementation. This fragmentation is one of the primary challenges for developers building multi-model applications.

Provider	JSON Mode	Function Calling	Strict Schema Enforcement
OpenAI	Yes	Yes	Yes (via `response_format`)
Anthropic	Yes	Yes	No (only via tool use)
Google Gemini	Yes	Yes	No (only via function calling)
Mistral	Yes	Yes	No

‍

OpenAI currently offers the most direct implementation, allowing developers to pass a strict JSON Schema directly in the response_format parameter for standard chat completions (OpenAI, 2024). This makes it incredibly straightforward to request structured data without pretending the model is calling a tool.

Anthropic and Google Gemini take a different path. They do not currently offer a standalone structured output parameter for standard text generation. Instead, developers must use the provider's function calling or tool use APIs to achieve the same result (Vellum, 2024). The developer defines a "dummy tool" with the desired schema and forces the model to call that tool. The arguments the model passes to the tool become the structured output. While effective, this is a workaround rather than a native feature.

This fragmentation has led to the rise of abstraction libraries like Instructor, Pydantic AI, and BAML. These frameworks sit between the developer and the model providers, allowing developers to define their desired output structure using standard programming classes (like Pydantic models in Python or Zod in TypeScript). The framework then translates that class into whatever specific API format the chosen provider requires, handling validation and automatic retries if the model makes a mistake. This allows teams to switch models without rewriting their parsing logic.

‍

Structured Outputs vs. Function Calling

Because some providers require using function calling APIs to get structured outputs, the two concepts are frequently confused. While they use similar underlying mechanics—specifically, JSON Schema enforcement—they serve different architectural purposes.

‍Function calling is used when the model needs to interact with external systems. The model generates the parameters needed to execute a tool, the application executes the tool, and the result is fed back to the model so it can continue its reasoning process. It is an intermediate step in a larger workflow. The model is making a decision about what action to take next.

‍Structured outputs are used when the structured data is the final goal. There is no tool to execute. The application simply needs the model's response formatted in a specific way so it can be saved to a database, rendered in a user interface, or passed to a downstream software component (Vellum, 2024). The model is not taking an action; it is formatting information.

If the model is deciding what to do, use function calling. If the model is just formatting what it knows, use structured outputs.

‍

The Engineering Impact

The shift toward structured outputs has fundamentally changed how teams build AI applications. It moves LLMs from being unpredictable text generators to reliable software components. Industry analysts have recognized this shift, with Thoughtworks recently moving structured outputs to the "Adopt" category in their Technology Radar, calling it a "sensible default for applications that consume LLM responses programmatically" (Thoughtworks, 2026).

One of the most immediate benefits is type safety. In traditional software engineering, type safety ensures that a function receives the exact data types it expects. Structured outputs bring this concept to generative AI. Developers no longer need to write defensive parsing logic or implement complex retry loops to handle malformed responses. The data contract is enforced at the generation layer. This drastically reduces the amount of boilerplate code required to integrate an LLM into a larger application.

This reliability enables zero-validation data pipelines. When a system extracts information from thousands of unstructured documents—such as pulling vendor names and totals from scanned invoices—the engineering team can trust that the resulting data will fit their database schema perfectly. This eliminates the need for human review of the data structure, allowing pipelines to run fully autonomously. It also means that downstream analytics tools and dashboards won't break because a model decided to output a string instead of an integer.

Structured outputs also improve explicit refusals. When a model refuses to answer a prompt due to safety filters, it typically returns a conversational apology. With structured outputs, developers can design schemas that include a specific "refusal" field. If the model triggers a safety filter, it populates that field, allowing the application to handle the refusal programmatically rather than trying to parse an apology string (OpenAI, 2024). This makes it much easier to build robust user experiences that gracefully handle edge cases.

Interestingly, enforcing structure can also enhance the model's reasoning capabilities. When developers require the model to output a "reasoning_steps" array before it outputs a "final_answer" field, they force the model to think through a problem systematically. Because the structure is guaranteed, the application can easily strip away the reasoning steps and present only the final answer to the user. This is a structured implementation of the chain-of-thought prompting technique, and it often yields significantly better results than asking the model for the answer directly.

‍

Categories of Constraints

While JSON is the most common format, structured outputs can enforce a wide variety of constraints depending on the application's needs (Dataiku, 2024).

The simplest constraint is a predefined set of options. For example, an application performing sentiment analysis might require the model to output exactly one of three strings: "positive", "negative", or "neutral". The model is prevented from outputting "somewhat positive" or "mixed".

More complex constraints involve regular expressions. A schema might require the model to output a string that matches the format of an email address or a phone number. If the model attempts to generate a string that violates the regex, the generation is blocked.

The most advanced constraints involve formal grammars. This is particularly useful for code generation. A constrained decoding system can ensure that the model only generates syntactically valid SQL queries or Python functions. It won't guarantee that the code does what the user wants, but it guarantees that the code will compile without syntax errors.

‍

The Connective Tissue of Modern AI

As the industry moves toward complex, multi-agent systems, structured outputs have become the essential connective tissue. When multiple AI agents collaborate on a task, they cannot communicate reliably using free-form text. They need strict data contracts to pass information back and forth.

Consider an AI software factory like Sgai. When a developer agent finishes writing a block of code, it needs to pass that code, along with metadata about dependencies and testing requirements, to a reviewer agent. That handoff requires a strict data contract. The developer agent must produce a structured output that the reviewer agent's input parser expects. Without guaranteed structure, the multi-agent system would collapse under the weight of parsing errors. The agents would spend all their time trying to understand each other's formatting rather than doing actual work.

Similarly, documentation automation platforms like Doc Holiday rely on structured outputs to translate raw code commits into formatted release notes and API documentation. The system needs to know exactly where the summary ends and the technical details begin. With a strict schema enforced, the platform can reliably generate documentation that fits perfectly into a company's existing templates.

Structured outputs represent the maturation of generative AI. They are the mechanism that allows language models to stop acting like chatbots and start acting like reliable infrastructure. When the burden of formatting moves from the prompt to the generation engine, developers can finally build AI applications with the same level of predictability they expect from traditional software.