Goose AI Agent with a Local LLM
Wiring Goose (Block's open-source AI agent) to the local llama.cpp server running Qwen2.5-Coder-3B, and hitting the limits of small models on tool-use tasks.
Goal
Connect Goose — Block’s open-source AI agent — to the local llama.cpp server instead of OpenAI. Run real agentic tasks fully offline.
Installation
```shell
brew install block-goose-cli
```
Configuration
Ran `goose configure`. Selected OpenAI as the provider (llama.cpp exposes an OpenAI-compatible API), skipped the API key, and pointed the host and base path at the local server:

```
OPENAI_HOST: http://localhost:8080
OPENAI_BASE_PATH: v1/chat/completions
```
Selected model: Qwen2.5-Coder-3B-Instruct-Q4_K_M.gguf
Config saved to ~/.config/goose/config.yaml.
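For reference, a sketch of what the saved config ends up looking like, using the values above (field names and layout may vary slightly between Goose versions):

```yaml
# ~/.config/goose/config.yaml (approximate shape)
GOOSE_PROVIDER: openai
GOOSE_MODEL: Qwen2.5-Coder-3B-Instruct-Q4_K_M.gguf
OPENAI_HOST: http://localhost:8080
OPENAI_BASE_PATH: v1/chat/completions
```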
First Test
Started a session with goose session and gave it a straightforward task:
Write a python script that takes in a number and multiply it by 1729 and return the answer
Goose returned a TypeScript block via `execute_typescript` — not Python, and no file was written. Time to respond: ~6s.
Two problems:
- Wrong language (TypeScript instead of Python)
- Tool call formatted as text output, not actually executed
Being More Explicit
Tried a more directive prompt:
Write a Python script to a file called multiply.py that takes a number as input, multiplies it by 1729, and prints the result. Use the write_file tool to save it.
The model responded in 5.15s with:
```json
{
  "name": "write",
  "arguments": {
    "path": "./multiply.py",
    "content": "def multiply_number():\n number = int(input('Enter a number: '))\n result = number * 1729\n print('Result:', result)"
  }
}
```
Better — correct language, correct tool name, correct file path, correct number. One issue remains: the file still wasn’t on disk — Goose printed the tool call as text output rather than executing it.
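For what it's worth, the script the model generated does work once you write it to disk by hand. Here it is, lightly restructured so the multiplication lives in a plain function instead of being buried behind `input()` (the restructuring is mine, not the model's output):

```python
# multiply.py -- the model's generated script, restructured slightly
# so the core arithmetic is a directly callable function.

def multiply_by_1729(number: int) -> int:
    """Multiply the given number by 1729, as the prompt asked."""
    return number * 1729

def main() -> None:
    number = int(input("Enter a number: "))
    print("Result:", multiply_by_1729(number))

if __name__ == "__main__":
    main()
```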
What’s Actually Happening
The 3B model can format a tool call correctly when the prompt is explicit enough. It knows the shape of the answer. But it doesn’t close the loop — it outputs the JSON as a response rather than signaling to the Goose runtime that it wants to invoke a tool.
Larger models understand they’re operating inside an agentic loop: format the call → wait for execution → continue. At 3B, that distinction breaks down. The model treats tool-use syntax as just another text pattern to complete, not as an action to take.
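A minimal sketch of the dispatch step makes the failure concrete. In an OpenAI-compatible loop, a tool only runs when the call arrives in the structured `tool_calls` field of the response; identical JSON printed inside the plain-text `content` field is just text and never executes. (This is an illustrative dispatcher, not Goose's actual implementation; the `write` tool here is a stand-in.)

```python
import json
from pathlib import Path

# Hypothetical tool registry: name -> callable.
TOOLS = {
    "write": lambda path, content: Path(path).write_text(content),
}

def dispatch(message: dict) -> str:
    """One turn of an agent loop: execute structured tool calls,
    pass everything else through as plain text."""
    # Structured tool call: the runtime actually invokes the tool.
    for call in message.get("tool_calls") or []:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        fn(**args)
        return f"executed {call['function']['name']}"
    # Anything in `content` -- even perfectly formatted tool-call
    # JSON -- is rendered to the user as text. This branch is where
    # the 3B model's reply lands.
    return f"text: {message.get('content', '')}"
```

The 3B model produces the right JSON but puts it in `content`, so it falls through to the text branch and nothing is executed.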
What Worked, What Didn’t
| Task | Result |
|---|---|
| Vague prompt — language unspecified | Wrong language (TypeScript), no file written |
| Explicit prompt — language + tool + filename | Correct Python, correct number, correct tool format — but file not written |
What’s Next
- Try a 7B model — needs more than 8GB RAM headroom, may require a machine with more unified memory
- Try a fine-tuned tool-use model at 3B (e.g. `functionary` or a Hermes variant) — smaller models trained specifically for structured output sometimes punch above their weight