LLM Structured Outputs: Moving Beyond Raw Text in Production

For two years, I wrote regex that tried to extract JSON from GPT responses. It worked most of the time. “Most of the time” is not a phrase that belongs in a production system.

LLM structured outputs changed how I integrate language models into production pipelines. Not because they’re new — OpenAI shipped them in August 2024, and similar capabilities from other providers followed. Because they finally made the thing I’d been doing badly for two years actually work.

This article is about what I learned moving a production classification pipeline from raw text parsing to strict schema enforcement. The tradeoffs, the failure modes, and the decisions I’d make differently.

Why raw text from LLMs breaks production pipelines

Here’s what the “working” version of our ticket classification pipeline looked like before structured outputs.

The flow was: prompt the model to classify the ticket and return JSON with specific fields, parse the response as JSON, and if parsing failed, strip markdown fences, strip prose preamble, strip trailing commentary, then try again. If that failed, ask the model to “fix the JSON” — another API call. If that failed, log the ticket to a manual review queue.
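A minimal sketch of that fallback chain, reconstructed for illustration rather than copied from the original service. The function names, the fence regex, and the `ask_model_to_fix` callable are all hypothetical stand-ins.

```python
import json
import re
from typing import Callable, Optional

# Matches a fenced block opened and closed by three or more backticks.
FENCE_RE = re.compile(r"`{3,}(?:json)?\s*(.*?)\s*`{3,}", re.DOTALL | re.IGNORECASE)


def strip_to_json(text: str) -> str:
    """Best-effort cleanup: drop markdown fences, then any prose around the outermost braces."""
    fenced = FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1)
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start : end + 1]
    return text


def try_load(candidate: str) -> Optional[dict]:
    """Parse a candidate string, accepting only a JSON object."""
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def parse_ticket(
    raw: str, ask_model_to_fix: Optional[Callable[[str], str]] = None
) -> Optional[dict]:
    """Return the parsed ticket, or None when it should go to manual review."""
    # Step 1: parse as-is. Step 2: parse after string cleanup.
    for candidate in (raw, strip_to_json(raw)):
        parsed = try_load(candidate)
        if parsed is not None:
            return parsed
    # Step 3: a second API call asking the model to repair its own output.
    if ask_model_to_fix is not None:
        return try_load(strip_to_json(ask_model_to_fix(raw)))
    # Step 4: give up; the caller logs the ticket for manual review.
    return None
```

Every branch in this sketch exists only to compensate for output the model was never forced to get right, which is the part structured outputs remove.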

The parse failure rate was around 12% on the first call. A typical response opened with “Here’s the classified ticket:”, followed by fenced JSON, followed by “Let me know if you need anything else!”, and our parser had to pretend the surrounding prose didn’t exist. Sometimes we’d get three backticks. Sometimes four. Sometimes the model would decide to explain its reasoning inside the JSON, breaking the schema.

The real cost wasn’t the failure rate. It was the defensive code I wrote around it. Our parsing layer grew to 400 lines of string cleanup, markdown stripping, and regex fallbacks. Every time we hit a new edge case, someone added a line. It was a second codebase of brittle heuristics living inside our classification service.

JSON mode vs LLM structured outputs — what actually changed

There’s a distinction here that matters, and most teams conflate them.

JSON mode forces the model to return syntactically valid JSON. That’s it. It guarantees you can parse the response. It does not guarantee the JSON matches what you asked for. You can request {priority: "high" | "medium" | "low"} and get back {priority_level: "urgent"}. Valid JSON, wrong schema.
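To make the gap concrete, here is a small sketch of a check that a JSON-mode response can pass at the parsing step and still fail at the schema step. The `check_priority_payload` helper is hypothetical; the field name and allowed values come from the example above.

```python
import json

ALLOWED_PRIORITIES = ("high", "medium", "low")


def check_priority_payload(raw: str) -> list[str]:
    """Return a list of schema problems; an empty list means the payload matches."""
    data = json.loads(raw)  # JSON mode only guarantees that this line succeeds
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    if "priority" not in data:
        problems.append("missing field: priority")
    elif data["priority"] not in ALLOWED_PRIORITIES:
        problems.append(f"invalid priority: {data['priority']!r}")
    extra = sorted(set(data) - {"priority"})
    if extra:
        problems.append(f"unexpected fields: {extra}")
    return problems


# Valid JSON, wrong schema: parsing succeeds, the schema check does not.
print(check_priority_payload('{"priority_level": "urgent"}'))
```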

Structured outputs enforce a schema you provide. The model is constrained at generation time to produce tokens that conform to your schema. Not “usually.” Always, provided the response runs to completion.

This is not a small difference. JSON mode solved the parsing problem. Structured outputs solved the schema problem. Different problems, different costs.

The constrained decoding works by filtering the logits at each generation step to only allow tokens that can lead to a valid completion under the schema. OpenAI’s documentation on structured outputs describes the mechanics. It’s not magic. The model is being forced through a grammar at inference time.
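A toy version of that filtering step, to show the idea rather than any provider's implementation. The vocabulary, the fixed scores in `toy_logits`, and the three allowed values are all made up for illustration; a real system compiles the whole JSON Schema into a grammar instead of checking string prefixes.

```python
import math

ALLOWED_VALUES = ["high", "medium", "low"]  # what the schema permits
VOCAB = ["ur", "gent", "hi", "gh", "med", "ium", "lo", "w"]  # toy sub-word tokens


def toy_logits(prefix: str) -> dict[str, float]:
    """Stand-in for a model. It ignores the prefix and would love to spell 'urgent'."""
    return {"ur": 3.0, "gent": 2.5, "hi": 2.0, "gh": 1.5,
            "med": 1.0, "ium": 0.9, "lo": 0.5, "w": 0.4}


def mask_logits(prefix: str, logits: dict[str, float]) -> dict[str, float]:
    """Set to -inf every token that cannot extend the prefix toward an allowed value."""
    masked = {}
    for token, score in logits.items():
        can_continue = any(v.startswith(prefix + token) for v in ALLOWED_VALUES)
        masked[token] = score if can_continue else -math.inf
    return masked


def constrained_decode() -> str:
    """Greedy decoding over the masked logits until an allowed value is complete."""
    prefix = ""
    while prefix not in ALLOWED_VALUES:
        masked = mask_logits(prefix, toy_logits(prefix))
        prefix += max(masked, key=masked.get)
    return prefix


print(constrained_decode())  # prints "high"
```

The toy model's favourite token is "ur", but it is masked out at the first step, so the best remaining choice ("hi") wins and only "gh" can follow it.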

The practical result: I stopped writing parsing heuristics. The 400 lines of defensive cleanup got deleted in a single PR.

Figure: Before and after of the ticket classification pipeline. JSON mode with regex fallbacks vs strict schema enforcement with LLM structured outputs.

Schema design matters more than you think

When I first switched to structured outputs, I made a predictable mistake. I took my existing Pydantic models — the ones we used for internal validation after parsing — and sent them as the schema. Done. Shipped.

Output quality dropped.

The schema is part of the prompt. The model sees it. A schema designed for runtime validation is not the same as a schema designed to guide a language model. I had fields like ticket_category: str with 47 possible values documented in a separate constants file. The model didn’t have that constants file. It invented categories that made sense contextually but weren’t in our taxonomy.

What worked: enums instead of free-text strings wherever the value space was bounded. Literal["billing", "technical", "account", "other"] instead of str. Field descriptions that explain what the field is for, not what type it is. Nested structures that mirror how I’d describe the output to a human, not how I’d normalize it in a database.

The schema is documentation. The model reads it. Write it for the model.
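As an illustration, here is roughly what a model-facing schema along those lines could look like, written as a plain JSON Schema dict so it runs without dependencies. The category values come from the text above; the `summary` field, the wording of every description, and the `free_text_fields` helper are my own assumptions, not the production schema.

```python
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {
            "type": "string",
            # JSON Schema equivalent of Literal["billing", "technical", "account", "other"]
            "enum": ["billing", "technical", "account", "other"],
            "description": (
                "Which team should own this ticket. "
                "Use 'other' when none of the named teams clearly applies."
            ),
        },
        "priority": {
            "type": "string",
            "enum": ["high", "medium", "low"],
            "description": "How quickly a human must respond, judged by customer impact.",
        },
        "summary": {  # hypothetical field, included to show a deliberate free-text value
            "type": "string",
            "description": "One sentence describing the customer's problem in plain language.",
        },
    },
    # Strict modes such as OpenAI's expect every field to be required
    # and additional properties to be disallowed.
    "required": ["category", "priority", "summary"],
    "additionalProperties": False,
}


def free_text_fields(schema: dict) -> list[str]:
    """List string fields with no enum: the places where the model can still invent values."""
    return [
        name
        for name, spec in schema["properties"].items()
        if spec.get("type") == "string" and "enum" not in spec
    ]


print(free_text_fields(TICKET_SCHEMA))  # prints ['summary']
```

A check like `free_text_fields` is one cheap way to review a schema before shipping it: each name it returns should be free text on purpose, not a bounded value that was left as a bare string.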

The other thing that mattered: keeping schemas shallow. Deeply nested schemas (objects containing arrays of objects containing objects) are harder for the model to fill out consistently and harder for you to validate when something goes wrong. Flatten when you can. Use references only when reuse genuinely helps.

Handling partial outputs and validation failures

Structured outputs don’t eliminate failures. They change what failures look like.

The failures I see now are semantic, not structural. The model returns valid JSON matching the schema, but the values are wrong. It picks priority: "low" for a ticket that’s clearly urgent. It categorizes something correctly but hallucinates a suggested_action that doesn’t exist in our action library.

This is a different debugging problem than malformed JSON. Malformed JSON is obvious — your parser throws. A wrong priority classification is invisible unless you’re watching for it.

I covered this pattern in more depth in “LLM output validation in production.” Schema validation catches structural failures. Semantic validation is a separate layer. Structured outputs don’t change that. They move the failure surface.

What I do now:

Structural validation is free with structured outputs. The schema guarantees it. No code needed.

Semantic validation is a separate layer. For every structured output, I have a validation step that checks values against runtime constraints. Does this category exist? Is this action in our allowlist? Is this timestamp in a reasonable range?

Failures route to a review queue, not a blind retry loop. If a semantic validation fails, I don’t resend the same request. I send it to manual review. Retrying with identical inputs usually produces the same wrong answer with different words.
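A sketch of what that semantic layer might look like, assuming the structured output arrives as a dict that already matches the schema. The allowlist contents, the `ValidationResult` type, and the `route` function are hypothetical names for illustration.

```python
from dataclasses import dataclass, field

KNOWN_CATEGORIES = {"billing", "technical", "account", "other"}
# Hypothetical action library; in practice this would be loaded at runtime.
ACTION_ALLOWLIST = {"refund", "escalate_to_engineering", "reset_password"}


@dataclass
class ValidationResult:
    ok: bool
    reasons: list[str] = field(default_factory=list)


def validate_semantics(output: dict) -> ValidationResult:
    """Check schema-valid values against runtime constraints the schema cannot know."""
    reasons = []
    if output.get("category") not in KNOWN_CATEGORIES:
        reasons.append(f"unknown category: {output.get('category')!r}")
    if output.get("suggested_action") not in ACTION_ALLOWLIST:
        reasons.append(f"action not in allowlist: {output.get('suggested_action')!r}")
    return ValidationResult(ok=not reasons, reasons=reasons)


def route(output: dict, review_queue: list) -> str:
    """Accept valid outputs; queue everything else for a human instead of retrying."""
    result = validate_semantics(output)
    if result.ok:
        return "auto"
    review_queue.append({"output": output, "reasons": result.reasons})
    return "manual_review"
```

Recording the rejection reasons alongside the output keeps the review queue useful as a source of new validation rules.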

The other failure mode worth planning for: truncation. The model runs out of tokens mid-response and you get a partial output. With free-form text this is obvious. With structured outputs, you can still get truncation if your max_tokens is too low for the schema you requested. The schema constraint doesn’t raise the output token limit, so the response can stop mid-object and fail to parse. I’ve seen teams assume the schema guarantees a complete object and get burned.
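One way to guard against this is to check why generation stopped before trusting the payload. The sketch below assumes an OpenAI-style response where a `finish_reason` of `"length"` means the output token cap was hit; other providers name this differently. The `parse_if_complete` helper is hypothetical.

```python
import json
from typing import Optional


def parse_if_complete(content: str, finish_reason: str) -> Optional[dict]:
    """Return the parsed output only when generation finished normally."""
    if finish_reason == "length":
        # Hit the output token cap: the JSON is cut off, so don't try to use it.
        return None
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

A `None` here is better treated as a configuration signal (raise max_tokens or shrink the schema) than as a case for the semantic review queue.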

Figure: The three validation layers around LLM structured outputs. What the schema catches for free, what you still have to build, and what still needs humans.

Retry strategies that actually work

For the 1-2% of cases where structured outputs return semantically wrong data, retrying naively makes things worse. The model tends to repeat itself.

What works:

Different model, same prompt. If a cheap model returns a wrong answer on a borderline case, escalating to a more capable model on retry catches most of the errors. This is cheaper than you’d think because it only kicks in on the small fraction of cases that fail validation.

Same model, modified prompt. Including the failed output in the retry prompt (“this answer was rejected because X, try again”) works about 60% of the time in my pipeline. Not great, but useful.

Human escalation. For the remaining cases, manual review. A 98% automation rate with 2% human review beats a 100% automation rate that’s wrong 5% of the time and nobody notices.

What doesn’t work: retrying with the exact same inputs. I tested this. Same inputs produce the same wrong answer more than 80% of the time, even with temperature above zero. LLM sampling varies less than people assume, especially when the schema constrains the output space.
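The strategies above can be combined into a single escalation ladder. This is a sketch under assumptions: the models are passed in as plain callables, the validator returns a rejection reason or `None`, and the ordering (modified prompt first, stronger model second) is my choice for illustration, not something the pipeline described here prescribes.

```python
from typing import Callable, Optional, Tuple

Model = Callable[[str], dict]                # prompt -> structured output
Validator = Callable[[dict], Optional[str]]  # output -> rejection reason, or None if fine


def classify_with_escalation(
    prompt: str,
    cheap_model: Model,
    strong_model: Model,
    validate: Validator,
) -> Tuple[str, Optional[dict]]:
    """Return (route, output). The route names the step that produced the answer."""
    first = cheap_model(prompt)
    reason = validate(first)
    if reason is None:
        return "cheap_model", first

    # Same model, modified prompt: say what was rejected and why.
    retry_prompt = (
        f"{prompt}\n\nYour previous answer was rejected because {reason}. "
        f"Previous answer: {first}. Try again."
    )
    second = cheap_model(retry_prompt)
    if validate(second) is None:
        return "cheap_model_retry", second

    # Different model, same prompt: escalate to the more capable model.
    third = strong_model(prompt)
    if validate(third) is None:
        return "strong_model", third

    # Nothing passed validation: hand off to a human.
    return "manual_review", None
```

Note that no step resends the original prompt to the original model unchanged, which is the one retry the section above rules out.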

The cost of strict schema enforcement

This part isn’t free.

LLM structured outputs with schema enforcement are slower. In my pipeline, the latency overhead is around 15-20% compared to unconstrained generation. For short outputs this is negligible. For longer structured responses with nested fields, it adds up.

There’s also a token cost. The schema has to be processed by the model on every call. If your schema is 800 tokens and you’re making millions of calls, that’s real money. I pay this cost and think it’s worth it for my use case. I wouldn’t pay it for low-stakes flows where occasional parsing failures are acceptable.
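A back-of-the-envelope way to size that cost. The 800-token schema is from the paragraph above; the call volume and the price per million input tokens are placeholder assumptions, so substitute your provider's actual rate.

```python
def schema_overhead_cost(
    schema_tokens: int, calls: int, usd_per_million_input_tokens: float
) -> float:
    """Input-token spend attributable to resending the schema on every call."""
    return schema_tokens * calls / 1_000_000 * usd_per_million_input_tokens


# 800-token schema, 5 million calls, at an assumed $0.50 per million input tokens.
print(schema_overhead_cost(800, 5_000_000, 0.50))  # prints 2000.0
```

The number scales linearly with schema size, which is one more argument for keeping schemas short and shallow.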

The other tradeoff: less flexibility. Once you commit to a schema, changing it means a coordinated release. The schema is now part of your API contract. I’ve shipped schemas that were too rigid, had to change them, and had to migrate every consumer that parsed the output. Plan the schema like you’d plan a database migration. That’s what it is.

For teams still evaluating whether structured outputs fit their workload, I’d point them to the same framework I use for the rest of my LLM decisions, covered in “What I learned the first time I ran an LLM in production” and “Prompt engineering is not a skill, it’s a process.” The short version: use structured outputs when you’re parsing the response programmatically. Don’t use them when the output is going to a human.

Before and after from the production pipeline

The ticket classification pipeline I migrated has been running on structured outputs for about eight months now. Here’s what changed.

Before (JSON mode + regex fallbacks):

First-call parse failure rate: 12%
Parsing code: 400 lines
Time team spent debugging parser edge cases: ~4 hours/week
Manual review queue: 8% of tickets
Average latency: 820ms

After (structured outputs with strict schema):

First-call parse failure rate: 0% (by definition)
Parsing code: 40 lines (schema definition and Pydantic validation)
Time team spent debugging parser edge cases: 0
Manual review queue: 3% of tickets (semantic failures only)
Average latency: 960ms

The latency went up. Everything else got better. We kept the manual review queue because some tickets are genuinely ambiguous, and a 3% review rate caught by semantic validation is healthier than a 0% review rate that silently misroutes tickets.

The team time freed up got redirected to improving the classification schema and adding semantic validation checks. Work that actually moves the needle on quality instead of patching around parser failures.

When I wouldn’t use structured outputs

There are cases where I leave structured outputs off.

Free-form generation where the output is displayed to a human. Chat responses, summaries, drafted emails. Schema enforcement adds latency and buys nothing.

Exploratory prompting during development. When I’m still figuring out what I want the model to return, a rigid schema makes iteration slower. I add the schema after the behavior is stable.

Low-stakes flows where occasional parser failures are cheaper than the latency overhead. Internal tools, batch jobs that can retry, cases where a human is already in the loop.

The heuristic: if I’m programmatically routing on the response, structured outputs. If a human is reading it, raw text.

The deeper point about LLM structured outputs

The thing about LLM structured outputs that took me longest to internalize: they’re not a parsing convenience. They’re a design constraint you add to your system. Once you commit to a schema, you’re committing to a contract between your code and the model’s output. That contract is worth maintaining. It’s also work.

Raw text output felt flexible because it was. Every edge case I hit, I wrote another regex. It was flexible in the sense that duct tape is flexible — you can patch anything, but you’re not building anything permanent.

Structured outputs force you to think upfront about what you actually want the model to produce. That’s the real value. The parsing guarantee is a side effect.

Shipping LLM-powered features in production? I’d like to hear what your failure modes look like. Find me on LinkedIn.
