LLM Quantization and GGUF
What quantization does to model weights, how GGUF packages them for local inference, and how to reason about the quality-vs-size tradeoff when picking a variant.
The Problem Quantization Solves
Llama 3 8B in full precision (float32) needs ~32 GB of VRAM. In half precision (float16, the training default), ~16 GB. A MacBook with 16 GB unified memory can’t run it — the model doesn’t fit alongside the OS and other processes.
Quantization compresses the model by reducing the numerical precision of its weights. Instead of 32 bits per weight, use 8. Or 4. Or a mixed scheme. The model shrinks; quality degrades slightly; it runs on consumer hardware.
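The arithmetic is worth making explicit. A quick sketch (weights only; the KV cache, activations, and runtime overhead come on top of these numbers):

```python
# Rough weight-memory estimate for an 8B-parameter model.
# Ignores KV cache, activations, and runtime overhead, so real usage is higher.
params = 8.0e9

for name, bits in [("float32", 32), ("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name:>8}: ~{gb:.0f} GB of weights")
# float32: ~32 GB, float16: ~16 GB, 8-bit: ~8 GB, 4-bit: ~4 GB
```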
What Gets Quantized
A transformer is mostly matrix multiplications — attention QKV projections, feed-forward layers. Those matrices contain billions of floating-point weights.
Full precision: each weight is a 32-bit float, representing values across a continuous range.
Quantized: each weight is mapped to a lower-precision format. For 4-bit quantization, that means 16 possible values per weight (2⁴). The mapping isn't uniform: a scale (and sometimes an offset) is fitted per block of weights to minimize the error for that block's actual value distribution.
Activations (the intermediate values produced during inference) are typically kept at float16 or bfloat16 even when weights are quantized: at inference time the weights are de-quantized on the fly (or the activations are quantized to match inside the matmul kernel), and the result comes back as an ordinary floating-point activation.
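A minimal sketch of the round trip, assuming a naive symmetric 4-bit scheme with one scale per 32-weight block (the real GGUF formats are more elaborate, but the shape of the operation is the same):

```python
import numpy as np

BLOCK = 32  # weights per block; each block gets its own scale

def quantize_block(w):
    """Map a block of float weights to 4-bit integers plus one float scale."""
    scale = np.abs(w).max() / 7.0          # 4-bit signed range is roughly [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_block(q, scale):
    """Recover approximate float weights for the matmul at inference time."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=BLOCK).astype(np.float32)

q, scale = quantize_block(w)
w_hat = dequantize_block(q, scale)
print("max error:", np.abs(w - w_hat).max())   # small, but not zero
```

The per-block scale is the point: 16 levels can track both large and small weights in the same matrix because each block gets its own range.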
GGUF
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and Ollama to store quantized models. It’s a single binary file containing:
- Model architecture metadata (layer counts, head counts, context length, etc.)
- Tokenizer vocabulary and merge rules
- All quantized weight tensors
Before GGUF there was GGML, which required separate tokenizer files and had brittle versioning. GGUF is self-contained — one file, one model.
Ollama stores these in ~/.ollama/models/. When you run ollama pull llama3.1:8b, you’re downloading a GGUF file.
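The fixed-size header at the start of every GGUF file can be inspected directly. A minimal sketch (the path is a placeholder; under ~/.ollama/models/ the weights sit as sha256-named blobs, and the key-value metadata section that follows the header is easier to read with the gguf Python package from the llama.cpp repo):

```python
import struct

# Placeholder path: point this at any GGUF file, e.g. a large blob
# under ~/.ollama/models/blobs/
path = "model.gguf"

with open(path, "rb") as f:
    magic = f.read(4)                                # b"GGUF" for a valid file
    version, = struct.unpack("<I", f.read(4))        # format version (uint32)
    tensor_count, = struct.unpack("<Q", f.read(8))   # number of weight tensors (uint64)
    kv_count, = struct.unpack("<Q", f.read(8))       # number of metadata key-value pairs (uint64)

print(magic, version, tensor_count, kv_count)
```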
Quantization Variants — The Naming Scheme
Ollama model tags encode the quantization level:
- llama3.1:8b → Q4_K_M (default, Ollama-chosen)
- llama3.1:8b-q8_0 → 8-bit quantization
- llama3.1:8b-q4_K_M → explicit Q4_K_M
- llama3.1:8b-fp16 → half precision, no quantization
The format: Q{bits}_{type}_{size}
| Variant | Bits | Size (8B) | Notes |
|---|---|---|---|
| fp16 | 16 | ~16 GB | Reference quality; needs high VRAM |
| Q8_0 | 8 | ~8.5 GB | Near-identical to fp16; good if you have the RAM |
| Q6_K | 6 | ~6.1 GB | Excellent quality, significant size saving |
| Q5_K_M | 5 | ~5.3 GB | Strong quality-size balance |
| Q4_K_M | 4 | ~4.8 GB | Ollama default; sweet spot for most machines |
| Q4_0 | 4 | ~4.5 GB | Older 4-bit scheme; Q4_K_M is better |
| Q3_K_M | 3 | ~3.5 GB | Noticeable degradation; use only if RAM-constrained |
| Q2_K | 2 | ~2.7 GB | Heavy degradation; last resort |
The _K suffix means K-quant — a more sophisticated quantization scheme that groups weights into blocks and fits a per-block scale. More expensive to compute, but much better quality for the same bit budget than naive uniform quantization.
_S (small), _M (medium), and _L (large) distinguish variants within the same bit level: the larger variants keep more of the sensitive tensors (certain attention and feed-forward projections) at higher precision, so the file is slightly bigger and the quality slightly better.
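Those per-block scales are also why the file sizes above run higher than the nominal bit count alone would suggest. A rough accounting sketch, with illustrative block and scale widths rather than the exact Q4_K layout:

```python
# Effective bits per weight for a simple 4-bit scheme with one float16 scale
# and one float16 offset per 32-weight block. Illustrative numbers, not the
# exact Q4_K layout (K-quants also quantize the per-block scales themselves,
# which brings the overhead back down).
block_size = 32
weight_bits = 4
scale_bits = 16
offset_bits = 16

effective_bpw = weight_bits + (scale_bits + offset_bits) / block_size
print(effective_bpw)                           # 5.0 bits per weight, not 4.0
print(8.0e9 * effective_bpw / 8 / 1e9, "GB")   # ~5 GB of weights for an 8B model
```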
Practical Decision
For a MacBook with 16 GB unified memory running Ollama:
- Q4_K_M (4.8 GB): use this. Runs with room for OS and other processes. Quality is ~97–98% of fp16 on most benchmarks.
- Q8_0 (8.5 GB): if you want maximum quality and aren’t running anything else. Tight on 16 GB.
- Q3_K_M and below: only if you’re running a larger model (70B) and have no other choice.
The degradation at Q4_K_M is real but subtle — it shows up in edge cases: precise arithmetic, multi-step reasoning chains, rare vocabulary. For general question-answering, note synthesis, and search (Zion’s use cases), it’s indistinguishable from fp16.
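If you want that rule of thumb as code, here is a hypothetical helper (pick_variant is not part of Ollama or Zion; the tags and sizes mirror the table above, and the headroom figure is an assumption to tune for your machine):

```python
# Hypothetical helper, not part of Ollama or Zion: map free memory to a tag.
VARIANTS = [                       # (Ollama tag, approx size in GB), best quality first
    ("llama3.1:8b-fp16", 16.0),
    ("llama3.1:8b-q8_0", 8.5),
    ("llama3.1:8b-q6_K", 6.1),
    ("llama3.1:8b-q4_K_M", 4.8),
    ("llama3.1:8b-q3_K_M", 3.5),
]

def pick_variant(total_ram_gb: float, headroom_gb: float) -> str:
    """Return the highest-quality variant that fits after reserving headroom."""
    budget = total_ram_gb - headroom_gb
    for tag, size_gb in VARIANTS:
        if size_gb <= budget:
            return tag
    return VARIANTS[-1][0]         # smallest option as a last resort

# 16 GB machine, ~10 GB reserved for the OS, browser, and dev tooling (a guess):
print(pick_variant(16.0, headroom_gb=10.0))    # llama3.1:8b-q4_K_M
```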
Why This Matters for Zion
Zion uses two models:
- llama3.1:8b for generation (Q4_K_M by default via Ollama)
- nomic-embed-text for embedding (also quantized; ~274 MB)
Both run concurrently when Zion is active. Combined VRAM/RAM use: ~5–6 GB. On a 16 GB machine, this leaves headroom for Zellij, a browser, and the rest of the dev environment. That’s the practical reason Q4_K_M is the right default — not just size, but co-residency with everything else you’re running.
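As a sketch of what that co-residency looks like in practice, here is a minimal example against Ollama's local HTTP API on its default port 11434 (illustrative only, not Zion's actual code; the prompt strings are placeholders):

```python
import json
from urllib.request import Request, urlopen

OLLAMA = "http://localhost:11434"   # Ollama's default local endpoint

def post(path: str, payload: dict) -> dict:
    req = Request(OLLAMA + path, data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())

# Generation model (Q4_K_M by default when pulled as llama3.1:8b)
answer = post("/api/generate", {
    "model": "llama3.1:8b",
    "prompt": "Summarize why Q4_K_M is a reasonable default quantization.",
    "stream": False,
})
print(answer["response"])

# Embedding model
emb = post("/api/embeddings", {
    "model": "nomic-embed-text",
    "prompt": "quantization tradeoffs",
})
print(len(emb["embedding"]))   # vector dimensionality
```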