Running a Local LLM with llama.cpp
Building llama.cpp from source, downloading Qwen2.5-Coder, quantizing to Q4_K_M, and serving it locally on a MacBook Air M1.
Goal
Run a code-capable LLM fully offline on a MacBook Air M1 — no API keys, no cloud, no latency tax.
Building llama.cpp
Cloned the repo and built from source. Installed cmake via Homebrew first:
brew install cmake
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
Produces binaries at build/bin/ — notably llama-server and llama-quantize.
Downloading a Model
Used hf (Hugging Face CLI, also available via brew) to pull the base model weights:
hf download Qwen/Qwen2.5-Coder-7B --local-dir .
Converting to GGUF
The conversion script is inside the llama.cpp repo. Needs Python deps installed first:
pip install torch transformers sentencepiece protobuf numpy
python3 convert_hf_to_gguf.py . \
--outfile qwen2.5-coder-7b-f16.gguf \
--outtype f16
This produces a 16-bit float GGUF — lossless but large.
Quantizing
Quantization compresses the model by reducing weight precision. Q4_K_M is a sweet spot: good quality, ~4 bits per weight, roughly 4–5× smaller than f16.
build/bin/llama-quantize \
qwen2.5-coder-7b-f16.gguf \
qwen2.5-coder-7b-Q4_K_M.gguf \
Q4_K_M
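As a sanity check on that "4–5× smaller" claim, a rough size estimate in Python (a back-of-envelope sketch; the ~4.85 bits per weight for Q4_K_M is an approximation that folds in the per-block scales):

params = 7.6e9  # approximate parameter count of Qwen2.5-Coder-7B
print(f"f16:    {params * 2 / 2**30:.1f} GiB")         # 2 bytes per weight -> ~14 GiB
print(f"Q4_K_M: {params * 4.85 / 8 / 2**30:.1f} GiB")  # ~4.85 bits per weight -> ~4.3 GiB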
First Attempt — 7B Was Too Slow
Served the 7B model:
build/bin/llama-server \
--model qwen2.5-coder-7b-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 8192
Sent a test prompt via curl; the request timed out with no response. The M1 Air doesn't have enough memory bandwidth to run a 7B model comfortably.
Switching to 3B — Pre-quantized GGUF
Instead of converting and quantizing manually, downloaded a pre-built Q4_K_M GGUF directly:
hf download bartowski/Qwen2.5-Coder-3B-Instruct-GGUF \
--include "Qwen2.5-Coder-3B-Instruct-Q4_K_M.gguf" \
--local-dir ./gguf
bartowski maintains high-quality GGUF builds for popular models; grabbing one skips the conversion and quantization steps entirely.
Serving with Metal GPU Offload
The key flag: --n-gpu-layers 99 offloads all layers to Apple's Metal GPU. Without it, everything runs on the CPU and generation is much slower.
build/bin/llama-server \
--model ./gguf/Qwen2.5-Coder-3B-Instruct-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 4096 \
--n-gpu-layers 99
Testing
The server exposes an OpenAI-compatible API — same endpoint format, drop-in replacement:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"messages": [{"role": "user", "content": "Write a Python hello world"}]
}'
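The same server also works with the official openai Python package (assuming it's installed via pip install openai). The API key is a dummy value since llama-server doesn't check one by default, and the model name is mostly cosmetic because the server answers with whatever model it loaded:

from openai import OpenAI

# Point the standard OpenAI client at the local llama-server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "Write a Python hello world"}],
)
print(resp.choices[0].message.content)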
Got a clean response. ~28 tokens/sec on the 3B model — fast enough to be usable.
Numbers
From the response timing data:
- Prompt processing: ~12 tokens/sec
- Generation: ~28 tokens/sec
- Context size: 4096 tokens (as configured)
What I Noticed
The quantization step is where most of the magic is: it shrinks the 7B's ~14GB f16 file to roughly 4GB, and the 3B's to ~2GB, while keeping most of the quality. The K in Q4_K_M refers to the k-quant method, which is smarter than naive rounding: it groups weights into blocks and stores a scale and minimum per block, chosen to keep the rounding error small. The M marks the medium-size variant of that scheme.
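To make the grouping idea concrete, here is a toy sketch of block-wise 4-bit quantization in Python. It is a simplified illustration only, not the actual Q4_K_M format, which uses super-blocks with separate 6-bit scales and minimums picked to minimize error:

import numpy as np

def quantize_blocks(weights, block_size=32):
    # Group weights into blocks, keep one f32 scale per block,
    # and round each weight to a signed 4-bit integer.
    blocks = weights.reshape(-1, block_size)
    scales = np.maximum(np.abs(blocks).max(axis=1, keepdims=True), 1e-12) / 7
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_blocks(q, scales):
    # Reverse the mapping: integer code times its block's scale.
    return (q * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_blocks(w)
print("mean abs error after 4-bit round trip:", np.abs(w - dequantize_blocks(q, s)).mean())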
The 7B → 3B jump isn't just about fitting in memory. The M1 Air has 8GB of unified memory shared between the CPU and GPU, and a Q4_K_M 7B model at ~4.1GB should fit. But generation speed is bounded by memory bandwidth, not capacity: every weight has to be streamed through memory for each generated token, so the 3B at ~1.9GB both leaves headroom for macOS and the KV cache and moves far fewer bytes per token.
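A rough back-of-envelope makes the bandwidth point concrete (assuming the M1's published ~68 GB/s unified memory bandwidth, that every weight is read once per generated token, and ignoring KV-cache and activation traffic):

bandwidth_gb_s = 68  # M1 spec, approximate
for name, size_gb in [("7B Q4_K_M", 4.1), ("3B Q4_K_M", 1.9)]:
    print(f"{name}: ~{bandwidth_gb_s / size_gb:.0f} tokens/sec upper bound")
# Roughly 17 tok/s for the 7B and 36 tok/s for the 3B, consistent with the ~28 tok/s observed.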