LOGBOOK LOG-306
EXPLORING · SOFTWARE
LLM · PIPELINE · PROMPT-ENGINEERING · LANGCHAIN · AI · REVIEW · QUALITY · COST-OPTIMIZATION

Multi-Stage LLM Review Pipelines

Separating generation from evaluation in LLM pipelines — how a generator → teacher → clarity → reviewer chain produces better outputs than a single-shot prompt, and what happens when you start disabling stages for cost.

The Single-Shot Problem

A single prompt asking an LLM to “write high-quality study notes for this exam question” will produce acceptable output most of the time. The failure mode isn’t obvious errors — it’s quiet incompleteness. The model writes confidently, produces well-structured text, and misses two of the five key concepts the question was really about.

You don’t catch that unless you know the topic well enough to check. A student relying on those notes won’t.

The Pipeline Architecture

The answer is to separate generation from evaluation. Different roles, different prompts, different models if needed.

Context Builder → First Pass Note Builder → Teacher Evaluator → Clarity Booster → Review Note Builder

Each stage has a narrow, defined job:

Context Builder — Given the raw question, infer the topic, identify key concepts, and write instructional guidance for the downstream note generator.

# Returns: { topic, key_concepts, note_help }
chain = context_builder_prompt | openai_llm | context_flow_output_parser
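
As a concrete sketch, here is how stage 1 might be assembled with LangChain's LCEL pipe syntax and a Pydantic output schema. The ContextFlow model, the prompt wording, and the gpt-4o choice are illustrative assumptions; only the field names and the prompt | llm | parser shape come from the pipeline itself.

from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

# Output schema for stage 1; field names match the contract above.
class ContextFlow(BaseModel):
    topic: str = Field(description="inferred topic of the exam question")
    key_concepts: list[str] = Field(description="concepts the notes must cover")
    note_help: str = Field(description="guidance for the downstream note generator")

context_flow_output_parser = PydanticOutputParser(pydantic_object=ContextFlow)

# Illustrative prompt text; the real template lives in the project.
context_builder_prompt = ChatPromptTemplate.from_template(
    "Infer the topic of this exam question, list the key concepts it tests, "
    "and write guidance for a note generator.\n{format_instructions}\n"
    "Question: {question}"
).partial(format_instructions=context_flow_output_parser.get_format_instructions())

openai_llm = ChatOpenAI(model="gpt-4o")  # model choice is an assumption

chain = context_builder_prompt | openai_llm | context_flow_output_parser
# chain.invoke({"question": ...}) returns a ContextFlow; .model_dump() gives
# the dict that gets merged into the per-question state.

The other stages follow the same prompt | llm | parser shape, each with its own schema and model.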

First Pass Note Builder — Generate the actual study notes using the context from stage 1 as scaffolding.

# Uses: question, topic, key_concepts, note_help, mark, word_limit
# Returns: { first_pass_note }
chain = first_pass_notes_builder_prompt | gemini_llm | first_pass_note_flow_output_parser

Teacher Evaluator — Review the first pass note. Check every key concept from stage 1. If any are missing or underdeveloped, fill them in. If the depth doesn’t match the mark weight, fix that too.

# Uses: question, topic, key_concepts, first_pass_note, mark, word_limit
# Returns: { final_answer }
chain = teacher_review_prompt | openai_llm | teacher_evaluate_note_output_parser
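
The checklist behavior lives in this stage's prompt. Below is a plausible sketch of the template, using exactly the input fields listed above; the wording is illustrative, not the project's.

from langchain_core.prompts import ChatPromptTemplate

# Illustrative wording; the input variables match the stage contract above.
teacher_review_prompt = ChatPromptTemplate.from_template(
    "You are a teacher reviewing study notes for an exam question.\n"
    "Question ({mark} marks, {word_limit}-word limit): {question}\n"
    "Topic: {topic}\n"
    "Key concepts that must all be covered:\n{key_concepts}\n\n"
    "Notes under review:\n{first_pass_note}\n\n"
    "Check every key concept. Rewrite the notes to fill in any concept that is "
    "missing or underdeveloped, and match the depth to the mark weight."
)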

Clarity Booster (disabled) — Simplify language without losing content. Targets students who struggle with academic jargon.

Review Note Builder (disabled) — Final accuracy check. Catches vague statements, missing examples, confusing parts.

Why Separation Works

The generator and the evaluator are adversarial by design. The generator optimizes for fluency and coherence — it produces text that sounds complete. The evaluator optimizes for coverage against a checklist. These objectives are in tension, which is exactly the point.

A single model asked to both generate and evaluate its own output will rationalize the gaps. It’s the same reason authors make poor copy editors for their own work: the mental model of what should be there overrides what’s actually there.

Splitting the roles makes the evaluation prompt independent. The teacher evaluator doesn’t know what the generator was “trying” to write — it only sees what the generator actually produced and checks that text against the key concepts from stage 1. That independence is what makes the check meaningful.

The Cost Tradeoff

Running five LLM calls per question on a 20-question paper is 100 API calls per pipeline run. At GPT-4o or Gemini Pro rates, that’s significant for a prototype with no monetization.

The v2 iteration disabled the Clarity Booster and Review Note Builder:

def process_question(state: State, question: QuestionState) -> QuestionState:
    context_result = context_builder({"question": question["question"]})
    question.update(context_result)

    first_pass_result = first_pass_note_builder(question)
    question.update(first_pass_result)

    teacher_result = teacher_evaluate_note_builder(question)
    question.update(teacher_result)

    # clarity_result = clarity_booster_note_builder(question)  # disabled
    # review_result = review_note_builder(question)            # disabled

    return question

Three calls per question instead of five. The teacher evaluator is the quality bottleneck — if notes are good enough after that stage, the last two stages add polish but not substance. For a prototype, that’s the right cut.
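
At the paper level, a hypothetical driver loop looks like this; the questions key on State is an assumption, since the source doesn't show the paper-level schema.

# 3 calls per question x 20 questions = 60 API calls per run, down from 100.
for q in state["questions"]:  # assumed key on State
    process_question(state, q)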

What Each Stage Actually Does to Output Quality

In practice across a test set of exam questions:

Stage removed          Quality impact
Context Builder        Significant — generator produces generic notes without topic scaffolding
Teacher Evaluator      Significant — key concept coverage drops noticeably
Clarity Booster        Minor — academic language remains but still readable
Review Note Builder    Marginal — catches occasional vague statements

This is the order to re-enable stages if cost allows: teacher evaluator first (already enabled), review second, clarity third.

What to Watch

Error compounding. If the Context Builder misidentifies the topic or misses a key concept, the error propagates through every downstream stage. The first pass note will be off. The teacher evaluator will evaluate against the wrong checklist. The final output will be confidently wrong. Stage 1 is the most important place to get right.

The disabled stages aren’t gone. They’re commented out, not deleted. The state schema still has second_pass_note and reviewed_note fields. When they’re re-enabled, the fields populate and the downstream storage captures them. The architecture is already wired for five stages — cost is the only reason two are off.
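
Collecting the field names from the stage contracts above, the per-question state plausibly looks like the sketch below; the project's actual definition isn't shown, so treat this as a reconstruction.

from typing import TypedDict

class QuestionState(TypedDict, total=False):
    # inputs
    question: str
    mark: int
    word_limit: int
    # stage 1: Context Builder
    topic: str
    key_concepts: list[str]
    note_help: str
    # stage 2: First Pass Note Builder
    first_pass_note: str
    # stage 3: Teacher Evaluator
    final_answer: str
    # stages 4 and 5: disabled, but kept in the schema
    second_pass_note: str
    reviewed_note: str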

Prompt coupling. The teacher evaluator’s prompt references {key_concepts} from stage 1. If stage 1’s output schema changes, stage 3’s prompt breaks. Schema changes need to propagate across all dependent stages — this is the main maintenance cost of a multi-stage design.
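
One cheap mitigation, assuming the Pydantic schema sketched in stage 1: a startup assertion that every variable the teacher prompt expects is still produced somewhere upstream. This guard is hypothetical, not from the project.

# Fail fast if stage 1's output schema drifts away from what stage 3 expects.
upstream = set(ContextFlow.model_fields) | {"question", "first_pass_note", "mark", "word_limit"}
missing = set(teacher_review_prompt.input_variables) - upstream
assert not missing, f"teacher prompt expects fields no stage produces: {missing}"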

What’s Next

  • Enable the Review Note Builder and benchmark output quality against the three-stage version on a held-out set
  • Add a scoring stage: ask a final model to rate the note on a 1–5 scale against the key concepts — use this as a gate before writing to the database rather than storing everything unconditionally (sketched after this list)
  • Explore caching: if the same question appears in multiple exam papers (common for MBA courses), the first pass note and context can be reused
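
The scoring gate could be as small as the sketch below. Both scoring_chain and save_note_to_db are hypothetical names; the 1–5 scale and the gate-before-write idea come from the bullet above.

# Hypothetical gate: only persist notes the scoring model rates highly enough.
result = scoring_chain.invoke(question)  # assumed chain, built like the others
if int(result["score"]) >= 4:            # threshold is a placeholder
    save_note_to_db(question)            # assumed storage helper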