Mark-Aware Prompting — Scaling LLM Output Depth by Context
Encoding exam mark weight as a prompt variable to control note depth and word count — the same question at 2 marks and 15 marks needs a fundamentally different answer, and the model needs to know which one it's writing.
The Insight
An exam question like “What is motivation?” means completely different things depending on how many marks it carries. At 2 marks: a one-line definition. At 15 marks: definition, theory overview, multiple frameworks, examples, comparisons, summary. The underlying topic is identical. The expected depth is not.
A note generator that ignores mark weight produces notes that are either bloated (for 2-mark questions) or superficial (for 15-mark questions). Both fail the student.
The Implementation
A simple lookup function maps marks to word limits:
```python
def get_word_limit(mark):
    if mark == 1:
        return 50
    elif mark == 2:
        return 75
    elif mark == 5:
        return 200
    elif mark == 10:
        return 500
    elif mark == 15:
        return 700
    else:
        return 150  # fallback for unexpected values
```
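A few example calls show the tiers and the fallback in action:

```python
get_word_limit(2)   # 75
get_word_limit(15)  # 700
get_word_limit(7)   # 150 -- an unmapped mark falls through to the fallback
```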
Both the mark value and the word limit flow into the prompt as variables:
```python
first_pass_notes_builder_prompt = ChatPromptTemplate.from_template("""
...
- Assume this is a {mark}-mark question.
- Adapt note depth based on mark value:
  - 3–5 marks: Focus on definitions, key terms, and 1–2 bullet points of explanation.
  - 6–10 marks: Add subtopics, brief theory explanations, 1–2 examples.
  - 10+ marks: Provide structured detail, 2+ examples, comparisons, and a brief recap.
- Keep it under {word_limit} words
...
""")

chain.invoke({
    "question": state["question"],
    "mark": mark,
    "word_limit": word_limit,
    ...
})
```
The model gets two signals: the numerical mark (which it uses for the “adapt note depth” rubric) and the word limit (a hard constraint on output length).
Why Two Signals Instead of One
You could pass just the word limit and let the model infer depth. Or just the mark and let it infer length. Passing both gives the model a calibration point: it knows that a 500-word limit for a 10-mark question is intentional, not an error or a truncation. This reduces hallucinated length mismatches where the model writes 100 words for a 10-mark question because it “felt” like enough.
The qualitative depth rubric in the prompt (“Add subtopics, brief theory explanations, 1–2 examples”) also matters separately from the word count. A model can technically produce 500 words of bullet-point definitions. The rubric steers it toward the right kind of content at each level.
The Same Pattern, Generalized
The underlying idea is context-aware prompting: any dimension of the expected output that varies by context should be passed as an explicit variable rather than left to the model’s implicit inference.
Other applications of the same pattern:
| Context variable | What it controls |
|---|---|
| Audience level (beginner / expert) | Technical depth, jargon usage |
| Output format (slack / email / report) | Length, tone, structure |
| Time constraint (2-min read / deep dive) | Compression vs. expansion |
| Domain (medical / legal / general) | Disclaimer requirements, precision |
The model can make reasonable inferences about all of these from context clues, but explicit variables are more reliable than inference, especially across a large batch where individual questions may not provide enough context for the model to calibrate correctly.
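A rough sketch of the same pattern in the LangChain style used above, here for a hypothetical summarization task (the prompt wording, variable names, and values are illustrative, not from this project):

```python
from langchain_core.prompts import ChatPromptTemplate

# Hypothetical prompt: every context dimension is an explicit variable,
# not something the model has to infer from the input text.
context_aware_prompt = ChatPromptTemplate.from_template("""
Summarize the following content.

- Audience level: {audience} (beginner -> plain language; expert -> full jargon)
- Output format: {output_format} (slack -> short and casual; report -> structured sections)
- Time budget: {time_budget} (2-minute read -> compress; deep dive -> expand)
- Domain: {domain} (medical/legal -> include required disclaimers, be precise with claims)

Content:
{content}
""")

# Fill the variables explicitly instead of relying on context clues.
context_aware_prompt.invoke({
    "audience": "beginner",
    "output_format": "slack",
    "time_budget": "2-minute read",
    "domain": "general",
    "content": "…",
})
```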
What to Watch
Magic number fragility. The get_word_limit function is a simple lookup with a hardcoded fallback. If an exam paper has a 7-mark question, it falls through to 150 words — probably wrong. A continuous function (e.g. word_limit = mark * 45) would generalize better, though it loses the intentionality of hand-tuned values per tier.
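One middle ground, sketched here as a suggestion rather than the current implementation, keeps the hand-tuned tiers as anchor points and interpolates between them, so a 7-mark question lands between the 5-mark and 10-mark limits instead of hitting the fallback:

```python
# Hand-tuned (mark, word_limit) pairs from get_word_limit, reused as anchors.
WORD_LIMIT_ANCHORS = [(1, 50), (2, 75), (5, 200), (10, 500), (15, 700)]

def get_word_limit_interpolated(mark):
    """Linearly interpolate between hand-tuned anchors; clamp at the ends."""
    if mark <= WORD_LIMIT_ANCHORS[0][0]:
        return WORD_LIMIT_ANCHORS[0][1]
    if mark >= WORD_LIMIT_ANCHORS[-1][0]:
        return WORD_LIMIT_ANCHORS[-1][1]
    for (lo_m, lo_w), (hi_m, hi_w) in zip(WORD_LIMIT_ANCHORS, WORD_LIMIT_ANCHORS[1:]):
        if lo_m <= mark <= hi_m:
            return round(lo_w + (mark - lo_m) / (hi_m - lo_m) * (hi_w - lo_w))

# get_word_limit_interpolated(7) -> 320, instead of the 150-word fallback
```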
The teacher evaluation stage also needs the mark. Both the note builder and the reviewer receive {mark} and {word_limit}. If the reviewer doesn’t know the mark weight, it might “fix” a correctly concise 2-mark note by expanding it — the opposite of what’s needed. Every stage that touches content depth needs the context variables, not just the generation stage.
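For illustration, a reviewer prompt that threads the same variables through might look like this; the wording, the teacher_review_prompt name, and review_chain are assumptions, not the project's actual code:

```python
teacher_review_prompt = ChatPromptTemplate.from_template("""
...
- You are reviewing notes for a {mark}-mark question.
- Judge depth against the mark value: a concise answer is correct for a
  low-mark question; do not pad it toward {word_limit} words.
- Flag notes that exceed {word_limit} words or miss the depth expected at {mark} marks.
...
""")

review_chain.invoke({
    "question": state["question"],
    "notes": notes,
    "mark": mark,
    "word_limit": word_limit,
})
```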
What’s Next
- A/B test the depth rubric: does the qualitative rubric (“Add subtopics…”) actually improve notes vs. word-limit-only prompting? Score outputs manually on a held-out set
- Add a section variable (A/B/C): Section A questions in MBA papers are typically short-answer, Section C is always the case study. Different sections have different structural expectations beyond just mark weight; a rough sketch follows below
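A sketch of that extension, building on the prompt excerpt above (the section rubric lines are illustrative guesses, not a tested prompt):

```python
# Hypothetical extension: pass the paper section alongside mark and word limit.
first_pass_notes_builder_prompt = ChatPromptTemplate.from_template("""
...
- Assume this is a {mark}-mark question from Section {section}.
- Section A: short-answer style; stick to definitions and key terms.
- Section C: case-study style; apply the relevant theory to the given scenario.
- Keep it under {word_limit} words
...
""")
```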