Pydantic Output Parsing for LLMs
Using PydanticOutputParser to enforce typed, validated JSON from LLM responses — turning probabilistic text generation into reliable structured data at pipeline boundaries.
The Problem
LLMs return text. Pipelines need data. The gap between those two is where things break.
Ask a model to return JSON and it will — most of the time. Sometimes it wraps it in markdown code fences. Sometimes it adds a preamble (“Here is the JSON you requested:”). Sometimes it renames a field. Sometimes it returns valid JSON that doesn’t match your schema. None of these are predictable, and all of them crash a pipeline that was expecting a clean dict.
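To see why this hurts, consider the most common failure: the model wraps its JSON in markdown code fences with a preamble. A naive `json.loads` on the raw reply fails even though the payload inside is perfectly valid (a minimal stdlib sketch):

```python
import json

# A typical "almost JSON" LLM reply: preamble plus markdown code fences
reply = 'Here is the JSON you requested:\n```json\n{"marks": 5}\n```'

def try_parse(text: str):
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

assert try_parse(reply) is None                    # naive parse fails outright
assert try_parse('{"marks": 5}') == {"marks": 5}   # the embedded JSON is fine
```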
PydanticOutputParser turns the schema definition into part of the prompt, and then validates the response against it before your code ever sees it.
The Setup
Define the schema as a Pydantic model:
from pydantic import BaseModel, model_validator
from typing import List, Optional

class Question(BaseModel):
    question_number: Optional[str] = None
    question_letter: Optional[str] = None
    question_text: str
    marks: int

    @model_validator(mode="after")
    def check_question_identifier(self):
        number = self.question_number
        letter = self.question_letter
        if (number is None and letter is None) or (number is not None and letter is not None):
            raise ValueError(
                "A question must have either question_number or question_letter, but not both."
            )
        return self
class ExtractionOutput(BaseModel):
    semester: Optional[str] = None
    course: Optional[str] = None
    subject: Optional[str] = None
    paper_code: Optional[str] = None
    scheme: Optional[str] = None
    total_marks: Optional[int] = None
    sections: List[Section]  # Section groups Questions; defined alongside
Wire up the parser:
from langchain.output_parsers import PydanticOutputParser
extraction_output_parser = PydanticOutputParser(pydantic_object=ExtractionOutput)
The parser exposes format instructions via get_format_instructions(), which get embedded in the prompt template. The chain:
chain = extract_exam_data_prompt | gpt4_turbo_llm | extraction_output_parser
result: ExtractionOutput = chain.invoke({"paper_text": state["paper_text"]})
result comes out as a typed ExtractionOutput instance, not a raw dict. Downstream code accesses result.sections, result.semester — IDE autocomplete, type checking, the works.
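Under the hood, the format instructions are essentially the model's JSON schema rendered into the prompt. A rough standalone sketch of the idea (LangChain's actual wording differs, and NoteTopic here is an illustrative stand-in, not a model from this pipeline):

```python
import json
from pydantic import BaseModel

class NoteTopic(BaseModel):  # illustrative mini-schema
    topic: str
    key_concepts: list[str]

# Roughly what get_format_instructions() contributes: the model's
# JSON schema, embedded as formatting guidance in the prompt
schema = json.dumps(NoteTopic.model_json_schema(), indent=2)
prompt = (
    "Answer with JSON conforming to this schema:\n"
    f"{schema}\n\n"
    "Text to analyze: ..."
)

assert '"topic"' in prompt and '"key_concepts"' in prompt
```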
The model_validator Pattern
The Question model’s validator enforces a domain rule that a schema type alone can’t express: each question is identified by either a number (for regular questions) or a letter (for case study sub-questions) — never both, never neither.
@model_validator(mode="after")
def check_question_identifier(self):
    number = self.question_number
    letter = self.question_letter
    if (number is None and letter is None) or (number is not None and letter is not None):
        raise ValueError("...")
    return self
Without this, the model would happily return a question with both fields filled in, or neither — and the downstream note generator would have to handle those cases defensively everywhere. Pushing the constraint into the Pydantic model centralizes it. If the LLM violates it, the parser raises before the bad data propagates.
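The validator in action (redefining a trimmed Question here so the snippet runs standalone; the condition is compressed to an equality check but enforces the same rule):

```python
from typing import Optional
from pydantic import BaseModel, ValidationError, model_validator

class Question(BaseModel):  # trimmed copy of the model above
    question_number: Optional[str] = None
    question_letter: Optional[str] = None
    question_text: str
    marks: int

    @model_validator(mode="after")
    def check_question_identifier(self):
        # Both None or both set -> same truthiness -> reject
        if (self.question_number is None) == (self.question_letter is None):
            raise ValueError("need exactly one of question_number / question_letter")
        return self

Question(question_number="2", question_text="Define osmosis.", marks=5)  # ok

def rejected(**kwargs) -> bool:
    try:
        Question(**kwargs)
        return False
    except ValidationError:
        return True

assert rejected(question_text="Define osmosis.", marks=5)     # neither set
assert rejected(question_number="2", question_letter="a",
                question_text="Define osmosis.", marks=5)     # both set
```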
One Parser Per Stage
Each pipeline stage has its own output model:
class ContextFlowOutput(BaseModel):
    topic: str
    key_concepts: List[str]
    note_help: str

class FirstPassNoteFlowOutput(BaseModel):
    first_pass_note: str

class TeacherEvaluateFlowOutput(BaseModel):
    final_answer: str
Even for single-field outputs like FirstPassNoteFlowOutput, wrapping in a model is worth it. The parser validates that the field exists and isn’t null. Without it, a model that returns {"first_pass_note": null} because it had nothing to say would pass silently and produce an empty note.
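A quick demonstration of that guardrail: because first_pass_note is typed str (not Optional), Pydantic rejects a null value at the parse boundary:

```python
from pydantic import BaseModel, ValidationError

class FirstPassNoteFlowOutput(BaseModel):
    first_pass_note: str

def is_valid(payload: dict) -> bool:
    try:
        FirstPassNoteFlowOutput.model_validate(payload)
        return True
    except ValidationError:
        return False

assert is_valid({"first_pass_note": "Photosynthesis converts light to energy."})
assert not is_valid({"first_pass_note": None})  # null note caught here, not downstream
```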
What to Watch
Retries on parse failure. The parser raises OutputParserException when it can’t parse the response. LangChain has a RetryWithErrorOutputParser that feeds the error back to the model and asks it to try again. For production pipelines, wire this up — particularly for stages where the prompt is complex and the model occasionally drifts.
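The retry idea is simple enough to sketch without LangChain. This hand-rolled equivalent feeds the validation error back into the prompt, the same loop RetryWithErrorOutputParser automates; call_llm here is a hypothetical stand-in that fails once, then corrects itself:

```python
from pydantic import BaseModel, ValidationError

class ContextFlowOutput(BaseModel):
    topic: str

# Hypothetical LLM stand-in: a mis-shaped reply first, then a corrected one
_responses = iter(['{"subject": "algebra"}', '{"topic": "algebra"}'])

def call_llm(prompt: str) -> str:  # placeholder, not a real model call
    return next(_responses)

def parse_with_retry(prompt: str, max_attempts: int = 2) -> ContextFlowOutput:
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            return ContextFlowOutput.model_validate_json(raw)
        except ValidationError as err:
            # Feed the error back, as RetryWithErrorOutputParser does
            prompt = f"{prompt}\n\nYour previous reply failed validation:\n{err}\nTry again."
    raise RuntimeError("no valid output after retries")

result = parse_with_retry("Extract the topic as JSON.")
assert result.topic == "algebra"
```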
Format instructions bloat. The injected format instructions add tokens to every request. For a simple one-field output, the format instruction overhead can exceed the actual content. For those cases, consider JsonOutputParser (less strict) or structured output via the model’s native function/tool-calling API instead.
Pydantic v1 vs v2. LangChain’s PydanticOutputParser historically expected Pydantic v1 models. If you’re on Pydantic v2, use mode="after" in validators (as above) and check which version LangChain is configured for. Mixing v1 validators with v2 runtime causes subtle failures.
What’s Next
- Switch extraction to OpenAI's native structured output / function-calling API. It is more reliable than asking the model to produce JSON via prompt instructions, since decoding is constrained to schema-valid JSON rather than the model merely being asked for it
- Add OutputFixingParser as a fallback for the generation stages, which produce longer outputs that are more prone to formatting drift