Setting Up Ollama + Llama 3.1:8b Locally
First run of a local LLM — installing Ollama, pulling llama3.1:8b, and running inference entirely on-device.
What is Ollama
Ollama is a runtime for running large language models locally — think Docker, but for LLMs. It handles model downloads, quantized weights, and memory management, and it exposes a local REST API at http://localhost:11434. No cloud, no API keys, no data leaving the machine.
Installation
On macOS:
brew install ollama
Start the server:
ollama serve
This runs the server in the foreground; to run it as a background service instead, use brew services start ollama. On first launch it creates ~/.ollama/ as the model store.
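A quick sanity check that the server is up: the root endpoint returns a plain status string. A minimal probe in Python, assuming the default port:

import requests

# The bare root endpoint replies with a short status string
# ("Ollama is running") when the server is listening.
resp = requests.get("http://localhost:11434")
print(resp.status_code, resp.text)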
Pulling Llama 3.1:8b
ollama pull llama3.1:8b
Downloads the GGUF-quantized weights (~4.7 GB for the Q4_K_M variant). The quantization level is part of the tag rather than something Ollama picks at runtime; other quantizations are published as separate tags (e.g. llama3.1:8b-instruct-q8_0).
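To confirm the pull landed, the /api/tags endpoint lists every model in the local store, equivalent to running ollama list. A small check in Python:

import requests

# /api/tags returns the locally installed models with their sizes (in bytes).
for m in requests.get("http://localhost:11434/api/tags").json()["models"]:
    print(m["name"], f'{m["size"] / 1e9:.1f} GB')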
Once pulled, run it directly:
ollama run llama3.1:8b
Drops into an interactive chat session in the terminal. The first response took a couple of seconds while the model loaded into memory; subsequent turns were fast.
REST API
Ollama exposes a local API, so you can call it from code:
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b","prompt":"Explain entropy in one paragraph.","stream":false}'
Setting "stream": false makes the request block until the full completion is ready, which is useful for scripting. The default is streaming, where the response arrives as newline-delimited JSON chunks.
Python equivalent:
import requests

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.1:8b",
    "prompt": "Explain entropy in one paragraph.",
    "stream": False,
})
print(resp.json()["response"])
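And a sketch of consuming the default streaming mode, where each line of the response is a JSON chunk carrying a partial "response" and a "done" flag:

import json
import requests

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.1:8b",
    "prompt": "Explain entropy in one paragraph.",
    "stream": True,
}, stream=True)

# Each line is one JSON object; "response" holds the next token(s),
# and the final chunk sets "done": true.
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    print(chunk["response"], end="", flush=True)
    if chunk.get("done"):
        break
print()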
That’s the full stack — model running locally, callable from any script.
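One knob worth knowing: by default the model is unloaded about five minutes after the last request, which is why the first call after a pause is slow again. The keep_alive field on a request controls that window. A small sketch, relying on the documented behavior that an empty prompt just loads the model without generating:

import requests

# Pre-load the model and keep it resident for 30 minutes.
# An empty prompt loads the weights without generating anything.
requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.1:8b",
    "prompt": "",
    "keep_alive": "30m",
})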