Goose AI Agent with a Local LLM
Wiring Goose (Block's open-source AI agent) to the local llama.cpp server running Qwen2.5-Coder-3B, and hitting the limits of small models on tool-use tasks.
Goal
Connect Goose — Block’s open-source AI agent — to the local llama.cpp server instead of OpenAI. Run real agentic tasks fully offline.
Installation
```shell
brew install block-goose-cli
```
Configuration
Ran `goose configure`. Selected OpenAI as the provider (llama.cpp exposes an OpenAI-compatible API), skipped the API key, and pointed the host and base path at the local server:

```
OPENAI_HOST: http://localhost:8080
OPENAI_BASE_PATH: v1/chat/completions
```
Selected model: Qwen2.5-Coder-3B-Instruct-Q4_K_M.gguf
Config saved to ~/.config/goose/config.yaml.
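For reference, a sketch of what the saved config ends up looking like, using the values above (field names and layout may vary slightly between Goose versions):

```yaml
# ~/.config/goose/config.yaml (approximate shape)
GOOSE_PROVIDER: openai
GOOSE_MODEL: Qwen2.5-Coder-3B-Instruct-Q4_K_M.gguf
OPENAI_HOST: http://localhost:8080
OPENAI_BASE_PATH: v1/chat/completions
```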
First Test
Started a session with goose session and gave it a straightforward task:
Write a python script that takes in a number and multiply it by 1729 and return the answer
Goose returned a TypeScript block via `execute_typescript` — not Python, and no file was written. Time to respond: ~6s.
Two problems:
- Wrong language (TypeScript instead of Python)
- Tool call formatted as text output, not actually executed
Being More Explicit
Tried a more directive prompt:
Write a Python script to a file called multiply.py that takes a number as input, multiplies it by 1729, and prints the result. Use the write_file tool to save it.
The model responded in 5.15s with:
```json
{
  "name": "write",
  "arguments": {
    "path": "./multiply.py",
    "content": "def multiply_number():\n number = int(input('Enter a number: '))\n result = number * 1729\n print('Result:', result)"
  }
}
```
Better — correct language, correct tool name, correct file path, correct number. One issue remains: the file still wasn’t on disk — Goose printed the tool call as text output rather than executing it.
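For what it's worth, the script the model generated does work once you write it to disk by hand. Here it is, lightly restructured so the multiplication lives in a plain function instead of being buried behind `input()` (the restructuring is mine, not the model's output):

```python
# multiply.py -- the model's generated script, restructured slightly
# so the core arithmetic is a directly callable function.

def multiply_by_1729(number: int) -> int:
    """Multiply the given number by 1729, as the prompt asked."""
    return number * 1729

def main() -> None:
    number = int(input("Enter a number: "))
    print("Result:", multiply_by_1729(number))

if __name__ == "__main__":
    main()
```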
What’s Actually Happening
The 3B model can format a tool call correctly when the prompt is explicit enough. It knows the shape of the answer. But it doesn’t close the loop — it outputs the JSON as a response rather than signaling to the Goose runtime that it wants to invoke a tool.
Larger models understand they’re operating inside an agentic loop: format the call → wait for execution → continue. At 3B, that distinction breaks down. The model treats tool-use syntax as just another text pattern to complete, not as an action to take.
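A minimal sketch of the dispatch step makes the failure concrete. In an OpenAI-compatible loop, a tool only runs when the call arrives in the structured `tool_calls` field of the response; identical JSON printed inside the plain-text `content` field is just text and never executes. (This is an illustrative dispatcher, not Goose's actual implementation; the `write` tool here is a stand-in.)

```python
import json
from pathlib import Path

# Hypothetical tool registry: name -> callable.
TOOLS = {
    "write": lambda path, content: Path(path).write_text(content),
}

def dispatch(message: dict) -> str:
    """One turn of an agent loop: execute structured tool calls,
    pass everything else through as plain text."""
    # Structured tool call: the runtime actually invokes the tool.
    for call in message.get("tool_calls") or []:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        fn(**args)
        return f"executed {call['function']['name']}"
    # Anything in `content` -- even perfectly formatted tool-call
    # JSON -- is rendered to the user as text. This branch is where
    # the 3B model's reply lands.
    return f"text: {message.get('content', '')}"
```

The 3B model produces the right JSON but puts it in `content`, so it falls through to the text branch and nothing is executed.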
What Worked, What Didn’t
| Task | Result |
|---|---|
| Vague prompt — language unspecified | Wrong language (TypeScript), no file written |
| Explicit prompt — language + tool + filename | Correct Python, correct number, correct tool format — but file not written |
What’s Next
- Try a 7B model — needs more than 8GB RAM headroom, may require a machine with more unified memory
- Try a fine-tuned tool-use model at 3B (e.g. `functionary` or a Hermes variant) — smaller models trained specifically for structured output sometimes punch above their weight