Running a Local LLM with llama.cpp
Building llama.cpp from source, downloading Qwen2.5-Coder, quantizing to Q4_K_M, and serving it locally on a MacBook Air M1.
Goal
Run a code-capable LLM fully offline on a MacBook Air M1 — no API keys, no cloud, no latency tax.
Building llama.cpp
Cloned the repo and built from source. Installed cmake via Homebrew first:
brew install cmake
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
Produces binaries at build/bin/ — notably llama-server and llama-quantize.
Downloading a Model
Used hf (Hugging Face CLI, also available via brew) to pull the base model weights:
hf download Qwen/Qwen2.5-Coder-7B --local-dir .
Converting to GGUF
The conversion script is inside the llama.cpp repo. Needs Python deps installed first:
pip install torch transformers sentencepiece protobuf numpy
python3 convert_hf_to_gguf.py . \
--outfile qwen2.5-coder-7b-f16.gguf \
--outtype f16
This produces a 16-bit float GGUF — lossless but large.
Quantizing
Quantization compresses the model by reducing weight precision. Q4_K_M is a sweet spot: good quality, ~4 bits per weight, roughly 4–5× smaller than f16.
build/bin/llama-quantize \
qwen2.5-coder-7b-f16.gguf \
qwen2.5-coder-7b-Q4_K_M.gguf \
Q4_K_M
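As a sanity check on that "4–5× smaller" claim, a rough size estimate in Python (a back-of-envelope sketch; the ~4.85 bits per weight for Q4_K_M is an approximation that folds in the per-block scales):

params = 7.6e9  # approximate parameter count of Qwen2.5-Coder-7B
print(f"f16:    {params * 2 / 2**30:.1f} GiB")         # 2 bytes per weight -> ~14 GiB
print(f"Q4_K_M: {params * 4.85 / 8 / 2**30:.1f} GiB")  # ~4.85 bits per weight -> ~4.3 GiB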
First Attempt — 7B Was Too Slow
Served the 7B model:
build/bin/llama-server \
--model qwen2.5-coder-7b-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 8192
Sent a test prompt via curl; the request timed out with no response. The M1 Air doesn't have enough memory bandwidth to run a 7B model comfortably.
Switching to 3B — Pre-quantized GGUF
Instead of converting and quantizing manually, downloaded a pre-built Q4_K_M GGUF directly:
hf download bartowski/Qwen2.5-Coder-3B-Instruct-GGUF \
--include "Qwen2.5-Coder-3B-Instruct-Q4_K_M.gguf" \
--local-dir ./gguf
bartowski maintains high-quality GGUF builds for popular models; grabbing one skips the conversion and quantization steps entirely.
Serving with Metal GPU Offload
The key flag: --n-gpu-layers 99 offloads all layers to Apple's Metal GPU. Without it, everything runs on the CPU and generation is much slower.
build/bin/llama-server \
--model ./gguf/Qwen2.5-Coder-3B-Instruct-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 4096 \
--n-gpu-layers 99
Testing
The server exposes an OpenAI-compatible API — same endpoint format, drop-in replacement:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"messages": [{"role": "user", "content": "Write a Python hello world"}]
}'
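The same server also works with the official openai Python package (assuming it's installed via pip install openai). The API key is a dummy value since llama-server doesn't check one by default, and the model name is mostly cosmetic because the server answers with whatever model it loaded:

from openai import OpenAI

# Point the standard OpenAI client at the local llama-server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "Write a Python hello world"}],
)
print(resp.choices[0].message.content)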
Got a clean response. ~28 tokens/sec on the 3B model — fast enough to be usable.
Numbers
From the response timing data:
- Prompt processing: ~12 tokens/sec
- Generation: ~28 tokens/sec
- Context size: 4096 tokens (as configured)
What I Noticed
The quantization step is where most of the magic is: it shrinks the 7B's ~14GB f16 file to roughly 4GB, and the 3B's to ~2GB, while keeping most of the quality. The K in Q4_K_M refers to the k-quant method, which is smarter than naive rounding: it groups weights into blocks and stores a scale and minimum per block, chosen to keep the rounding error small. The M marks the medium-size variant of that scheme.
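To make the grouping idea concrete, here is a toy sketch of block-wise 4-bit quantization in Python. It is a simplified illustration only, not the actual Q4_K_M format, which uses super-blocks with separate 6-bit scales and minimums picked to minimize error:

import numpy as np

def quantize_blocks(weights, block_size=32):
    # Group weights into blocks, keep one f32 scale per block,
    # and round each weight to a signed 4-bit integer.
    blocks = weights.reshape(-1, block_size)
    scales = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 1e-12) / 7
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_blocks(q, scales):
    # Reverse the mapping: integer code times its block's scale.
    return (q * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_blocks(w)
print("mean abs error after 4-bit round trip:", np.abs(w - dequantize_blocks(q, s)).mean())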
The 7B → 3B jump isn't just about fitting in memory. The M1 Air has 8GB of unified memory shared between the CPU and GPU, and a Q4_K_M 7B model at ~4.1GB should fit. But generation speed is bounded by memory bandwidth, not capacity: every weight has to be streamed through memory for each generated token, so the 3B at ~1.9GB both leaves headroom for macOS and the KV cache and moves far fewer bytes per token.
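A rough back-of-envelope makes the bandwidth point concrete (assuming the M1's published ~68 GB/s unified memory bandwidth, that every weight is read once per generated token, and ignoring KV-cache and activation traffic):

bandwidth_gb_s = 68  # M1 spec, approximate
for name, size_gb in [("7B Q4_K_M", 4.1), ("3B Q4_K_M", 1.9)]:
    print(f"{name}: ~{bandwidth_gb_s / size_gb:.0f} tokens/sec upper bound")
# Roughly 17 tok/s for the 7B and 36 tok/s for the 3B, consistent with the ~28 tok/s observed.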