Setting Up Ollama + Llama 3.1:8b Locally
First run of a local LLM — installing Ollama, pulling llama3.1:8b, and running inference entirely on-device.
What is Ollama
Ollama is a runtime for running large language models locally — think Docker, but for LLMs. It handles model downloads, quantized weights, and memory management, and it exposes a local REST API at http://localhost:11434. No cloud, no API keys, no data leaving the machine.
Installation
On macOS:
brew install ollama
Start the server:
ollama serve
This runs the server in the foreground; to run it as a background service instead, use brew services start ollama. On first launch it creates ~/.ollama/ as the model store.
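A quick sanity check that the server is up: the root endpoint returns a plain status string. A minimal probe in Python, assuming the default port:

import requests

# The bare root endpoint replies with a short status string
# ("Ollama is running") when the server is listening.
resp = requests.get("http://localhost:11434")
print(resp.status_code, resp.text)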
Pulling Llama 3.1:8b
ollama pull llama3.1:8b
Downloads the GGUF-quantized weights (~4.7 GB for the Q4_K_M variant). The quantization level is part of the tag rather than something Ollama picks at runtime; other quantizations are published as separate tags (e.g. llama3.1:8b-instruct-q8_0).
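To confirm the pull landed, the /api/tags endpoint lists every model in the local store, equivalent to running ollama list. A small check in Python:

import requests

# /api/tags returns the locally installed models with their sizes (in bytes).
for m in requests.get("http://localhost:11434/api/tags").json()["models"]:
    print(m["name"], f'{m["size"] / 1e9:.1f} GB')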
Once pulled, run it directly:
ollama run llama3.1:8b
Drops into an interactive chat session in the terminal. The first response took a couple of seconds while the model loaded into memory; subsequent turns were fast.
REST API
Ollama exposes a local API, so you can call it from code:
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3.1:8b","prompt":"Explain entropy in one paragraph.","stream":false}'
Setting "stream": false makes the request block until the full completion is ready, which is useful for scripting. The default is streaming, where the response arrives as newline-delimited JSON chunks.
Python equivalent:
import requests

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.1:8b",
    "prompt": "Explain entropy in one paragraph.",
    "stream": False,
})
print(resp.json()["response"])
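And a sketch of consuming the default streaming mode, where each line of the response is a JSON chunk carrying a partial "response" and a "done" flag:

import json
import requests

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.1:8b",
    "prompt": "Explain entropy in one paragraph.",
    "stream": True,
}, stream=True)

# Each line is one JSON object; "response" holds the next token(s),
# and the final chunk sets "done": true.
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    print(chunk["response"], end="", flush=True)
    if chunk.get("done"):
        break
print()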
That’s the full stack — model running locally, callable from any script.
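One knob worth knowing: by default the model is unloaded about five minutes after the last request, which is why the first call after a pause is slow again. The keep_alive field on a request controls that window. A small sketch, relying on the documented behavior that an empty prompt just loads the model without generating:

import requests

# Pre-load the model and keep it resident for 30 minutes.
# An empty prompt loads the weights without generating anything.
requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.1:8b",
    "prompt": "",
    "keep_alive": "30m",
})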