Mark-Aware Prompting — Scaling LLM Output Depth by Context
Encoding exam mark weight as a prompt variable to control note depth and word count — the same question at 2 marks and 15 marks needs a fundamentally different answer, and the model needs to know which one it's writing.
The Insight
An exam question like “What is motivation?” means completely different things depending on how many marks it carries. At 2 marks: a one-line definition. At 15 marks: definition, theory overview, multiple frameworks, examples, comparisons, summary. The underlying topic is identical. The expected depth is not.
A note generator that ignores mark weight produces notes that are either bloated (for 2-mark questions) or superficial (for 15-mark questions). Both fail the student.
The Implementation
A simple lookup function maps marks to word limits:
```python
def get_word_limit(mark):
    if mark == 1:
        return 50
    elif mark == 2:
        return 75
    elif mark == 5:
        return 200
    elif mark == 10:
        return 500
    elif mark == 15:
        return 700
    else:
        return 150  # fallback for unexpected values
```
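A few example calls show the tiers and the fallback in action:

```python
get_word_limit(2)   # 75
get_word_limit(15)  # 700
get_word_limit(7)   # 150 -- an unmapped mark falls through to the fallback
```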
Both the mark value and the word limit flow into the prompt as variables:
```python
first_pass_notes_builder_prompt = ChatPromptTemplate.from_template("""
...
- Assume this is a {mark}-mark question.
- Adapt note depth based on mark value:
  - 3–5 marks: Focus on definitions, key terms, and 1–2 bullet points of explanation.
  - 6–10 marks: Add subtopics, brief theory explanations, 1–2 examples.
  - 10+ marks: Provide structured detail, 2+ examples, comparisons, and a brief recap.
- Keep it under {word_limit} words
...
""")

chain.invoke({
    "question": state["question"],
    "mark": mark,
    "word_limit": word_limit,
    ...
})
```
The model gets two signals: the numerical mark (which it uses for the “adapt note depth” rubric) and the word limit (a hard constraint on output length).
Why Two Signals Instead of One
You could pass just the word limit and let the model infer depth. Or just the mark and let it infer length. Passing both gives the model a calibration point: it knows that a 500-word limit for a 10-mark question is intentional, not an error or a truncation. This reduces hallucinated length mismatches where the model writes 100 words for a 10-mark question because it “felt” like enough.
The qualitative depth rubric in the prompt (“Add subtopics, brief theory explanations, 1–2 examples”) also matters separately from the word count. A model can technically produce 500 words of bullet-point definitions. The rubric steers it toward the right kind of content at each level.
The Same Pattern, Generalized
The underlying idea is context-aware prompting: any dimension of the expected output that varies by context should be passed as an explicit variable rather than left to the model’s implicit inference.
Other applications of the same pattern:
| Context variable | What it controls |
|---|---|
| Audience level (beginner / expert) | Technical depth, jargon usage |
| Output format (slack / email / report) | Length, tone, structure |
| Time constraint (2-min read / deep dive) | Compression vs. expansion |
| Domain (medical / legal / general) | Disclaimer requirements, precision |
The model can make reasonable inferences about all of these from context clues, but explicit variables are more reliable than inference, especially across a large batch where individual questions may not provide enough context for the model to calibrate correctly.
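A rough sketch of the same pattern in the LangChain style used above, here for a hypothetical summarization task (the prompt wording, variable names, and values are illustrative, not from this project):

```python
from langchain_core.prompts import ChatPromptTemplate

# Hypothetical prompt: every context dimension is an explicit variable,
# not something the model has to infer from the input text.
context_aware_prompt = ChatPromptTemplate.from_template("""
Summarize the following content.

- Audience level: {audience} (beginner -> plain language; expert -> full jargon)
- Output format: {output_format} (slack -> short and casual; report -> structured sections)
- Time budget: {time_budget} (2-minute read -> compress; deep dive -> expand)
- Domain: {domain} (medical/legal -> include required disclaimers, be precise with claims)

Content:
{content}
""")

# Fill the variables explicitly instead of relying on context clues.
context_aware_prompt.invoke({
    "audience": "beginner",
    "output_format": "slack",
    "time_budget": "2-minute read",
    "domain": "general",
    "content": "…",
})
```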
What to Watch
Magic number fragility. The get_word_limit function is a simple lookup with a hardcoded fallback. If an exam paper has a 7-mark question, it falls through to 150 words — probably wrong. A continuous function (e.g. word_limit = mark * 45) would generalize better, though it loses the intentionality of hand-tuned values per tier.
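One middle ground, sketched here as a suggestion rather than the current implementation, keeps the hand-tuned tiers as anchor points and interpolates between them, so a 7-mark question lands between the 5-mark and 10-mark limits instead of hitting the fallback:

```python
# Hand-tuned (mark, word_limit) pairs from get_word_limit, reused as anchors.
WORD_LIMIT_ANCHORS = [(1, 50), (2, 75), (5, 200), (10, 500), (15, 700)]

def get_word_limit_interpolated(mark):
    """Linearly interpolate between hand-tuned anchors; clamp at the ends."""
    if mark <= WORD_LIMIT_ANCHORS[0][0]:
        return WORD_LIMIT_ANCHORS[0][1]
    if mark >= WORD_LIMIT_ANCHORS[-1][0]:
        return WORD_LIMIT_ANCHORS[-1][1]
    for (lo_m, lo_w), (hi_m, hi_w) in zip(WORD_LIMIT_ANCHORS, WORD_LIMIT_ANCHORS[1:]):
        if lo_m <= mark <= hi_m:
            return round(lo_w + (mark - lo_m) / (hi_m - lo_m) * (hi_w - lo_w))

# get_word_limit_interpolated(7) -> 320, instead of the 150-word fallback
```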
The teacher evaluation stage also needs the mark. Both the note builder and the reviewer receive {mark} and {word_limit}. If the reviewer doesn’t know the mark weight, it might “fix” a correctly concise 2-mark note by expanding it — the opposite of what’s needed. Every stage that touches content depth needs the context variables, not just the generation stage.
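For illustration, a reviewer prompt that threads the same variables through might look like this; the wording, the teacher_review_prompt name, and review_chain are assumptions, not the project's actual code:

```python
teacher_review_prompt = ChatPromptTemplate.from_template("""
...
- You are reviewing notes for a {mark}-mark question.
- Judge depth against the mark value: a concise answer is correct for a
  low-mark question; do not pad it toward {word_limit} words.
- Flag notes that exceed {word_limit} words or miss the depth expected at {mark} marks.
...
""")

review_chain.invoke({
    "question": state["question"],
    "notes": notes,
    "mark": mark,
    "word_limit": word_limit,
})
```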
What’s Next
- A/B test the depth rubric: does the qualitative rubric (“Add subtopics…”) actually improve notes vs. word-limit-only prompting? Score outputs manually on a held-out set
- Add a section variable (A/B/C): Section A questions in MBA papers are typically short-answer, Section C is always the case study. Different sections have different structural expectations beyond just mark weight; a rough sketch follows below
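A sketch of that extension, building on the prompt excerpt above (the section rubric lines are illustrative guesses, not a tested prompt):

```python
# Hypothetical extension: pass the paper section alongside mark and word limit.
first_pass_notes_builder_prompt = ChatPromptTemplate.from_template("""
...
- Assume this is a {mark}-mark question from Section {section}.
- Section A: short-answer style; stick to definitions and key terms.
- Section C: case-study style; apply the relevant theory to the given scenario.
- Keep it under {word_limit} words
...
""")
```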