Adapter Fusion and the Local Ollama Runtime
Fusing LoRA weights and compiling to GGUF completes the loop, bringing a fine-tuned structural model back to local runtimes for offline validation.
A fine-tuned adapter is a set of delta weights, useless without its base. To run the model outside the Python development environment, these adapters must be fused back into the base model's weights. The final stage of the Gemmaiku pipeline merges the LoRA parameters, exports the result to a GGUF container, and orchestrates the runtime through Ollama and OpenWebUI. This local loop is critical: it validates whether the fine-tuned constraints hold up when the model is exposed to adversarial queries.
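To make the idea of fusing delta weights concrete, here is a toy sketch in pure Python. It assumes the standard LoRA formulation, in which a low-rank product B·A, multiplied by a scaling factor, is added to the base weight matrix. The matrices, the scale value, and the helper names are made up for illustration and are not taken from the Gemmaiku pipeline.

```python
# Toy LoRA fusion: W_fused = W + scale * (B @ A).
# All shapes and values are illustrative assumptions, not real model weights.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def add_scaled(x, y, scale):
    """Return x + scale * y, element by element."""
    return [[xi + scale * yi for xi, yi in zip(xr, yr)] for xr, yr in zip(x, y)]

def fuse(w, b, a, scale):
    """Fold the low-rank update B @ A into the base weight matrix."""
    return add_scaled(w, matmul(b, a), scale)

W = [[1.0, 0.0], [0.0, 1.0]]   # base weight (2x2)
B = [[1.0], [2.0]]             # adapter factor B (2x1, rank 1)
A = [[0.5, -0.5]]              # adapter factor A (1x2, rank 1)
SCALE = 2.0                    # hypothetical scaling factor

W_fused = fuse(W, B, A, SCALE)

x = [[3.0], [4.0]]             # one input column vector

# Path 1: base layer plus a separate adapter branch (how training runs).
adapter_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), SCALE)

# Path 2: a single matmul against the fused weights (what gets exported).
fused_out = matmul(W_fused, x)

print(W_fused, adapter_out, fused_out)
```

Both paths produce the same output, which is why the fused model no longer needs the adapter files or the training framework at inference time.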
Weight Fusion and Compilation
The mlx_lm.fuse module merges the adapter weights back into the base Gemma 3 parameters. The output is a complete directory of weights in SafeTensors format. To run in Ollama, the fused model must be converted into a single GGUF file with convert_hf_to_gguf.py from the llama.cpp repository. The weights are exported as 16-bit floats rather than quantized further, to preserve precision in the tiny 270M- and 1B-parameter models. At this scale, aggressive 4-bit quantization degrades the fragile stylistic alignment, causing the model to output broken lines or stutter.
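The two steps above can be sketched as a pair of command builders. This is a minimal sketch, not the pipeline's actual script: the base model ID, the adapter directory, the fused output directory, and the llama.cpp checkout location are all assumed placeholders. The code only assembles and prints the commands, so it can be inspected before anything is run.

```python
import shlex

def fuse_command(base_model, adapter_dir, fused_dir):
    """Build the mlx_lm.fuse call that merges LoRA adapters into the base weights."""
    return [
        "mlx_lm.fuse",
        "--model", base_model,
        "--adapter-path", adapter_dir,
        "--save-path", fused_dir,
    ]

def gguf_command(convert_script, fused_dir, gguf_path, outtype="f16"):
    """Build the llama.cpp conversion call; f16 keeps 16-bit float weights."""
    return [
        "python", convert_script, fused_dir,
        "--outfile", gguf_path,
        "--outtype", outtype,
    ]

# Every path and model ID below is a placeholder assumption.
fuse_cmd = fuse_command("google/gemma-3-1b-it", "adapters", "fused_model")
gguf_cmd = gguf_command(
    "llama.cpp/convert_hf_to_gguf.py", "fused_model", "gemmaiku-1b.gguf"
)

for cmd in (fuse_cmd, gguf_cmd):
    print(shlex.join(cmd))
```

To execute the steps rather than print them, each list could be passed to subprocess.run(cmd, check=True), so that a failed fuse stops the conversion from running on stale weights.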
Enforcing the Modelfile Boundary
A custom Modelfile provides the system prompt and configuration parameters needed to keep the model in character. Ollama reads this configuration to instantiate the model:
FROM ./gemmaiku-1b.gguf
SYSTEM "You are a specialized Haiku bot. You only speak in 3 lines of 5-7-5 syllables. No chatter."
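# Gemma has no separate system turn, so the SYSTEM text is folded into the user turn.
# A custom TEMPLATE that never references .System would silently drop it.
TEMPLATE """<start_of_turn>user
{{ if .System }}{{ .System }} {{ end }}{{ .Prompt }}<end_of_turn>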
<start_of_turn>model
"""
PARAMETER stop "<end_of_turn>"
PARAMETER temperature 0.6
PARAMETER repeat_penalty 1.1
The system prompt acts as a reinforcement layer, but the SFT training does the heavy lifting. Queried without the SFT weights, the base model abandons the system prompt as soon as it is pushed. With the fused SFT weights, the 5-7-5 pattern is baked into the model’s token probabilities and no longer depends on the prompt alone.
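One way to check this claim is to send the same prompt to the base model and to the fused model through Ollama's local REST API. The sketch below assumes Ollama is listening on its default port and that the fused model was registered under the hypothetical name gemmaiku. The base model tag gemma3:1b is likewise an assumption about what is installed locally.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model, prompt, system=None, temperature=0.6):
    """Assemble a non-streaming /api/generate request body."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }
    if system is not None:
        # Overrides the SYSTEM text from the Modelfile for this request only.
        payload["system"] = system
    return payload

def generate(payload, url=OLLAMA_URL, timeout=60):
    """POST the payload to a running Ollama server and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]

HAIKU_SYSTEM = (
    "You are a specialized Haiku bot. "
    "You only speak in 3 lines of 5-7-5 syllables. No chatter."
)
PROMPT = "Write a long essay on the Roman Empire."

# Same prompt, two targets: base model with only the system prompt,
# and the fused model (hypothetical name) with its Modelfile defaults.
base_payload = build_payload("gemma3:1b", PROMPT, system=HAIKU_SYSTEM)
fused_payload = build_payload("gemmaiku", PROMPT)

# With a server running, compare the two replies:
# print(generate(base_payload))
# print(generate(fused_payload))
```

The generate calls are left commented out because they need a live server. Keeping the payload construction separate makes the comparison reproducible across prompts.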
Testing Adversarial Resistance
Once the GGUF model is loaded into Ollama and routed to OpenWebUI via Docker, it faces adversarial testing. A typical test is to ask the model to break character: write a long essay, explain a complex code block, or provide a simple yes/no response. The model displays a striking rigidity. When asked to write a long essay on the Roman Empire, it compresses the entire rise and fall into seventeen syllables. When asked for a simple yes or no, it pads the response with two extra lines of description to maintain the 5-7-5 count. The SFT training has effectively overwritten its ability to produce standard prose.
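Adversarial runs like these are easier to compare when each reply is scored automatically. The following is a rough sketch of such a check, not the method used in this project. It counts syllables with a vowel-group heuristic, which miscounts many English words, so its verdicts should be treated as a first-pass filter rather than ground truth.

```python
import re

def count_syllables(word):
    """Heuristic syllable count: vowel groups, minus a trailing silent 'e'."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a final 'e' as silent unless the word ends in 'le' (e.g. "little").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def line_syllables(line):
    """Sum the heuristic syllable counts of all words on one line."""
    return sum(count_syllables(word) for word in line.split())

def check_haiku(text):
    """Report line count, per-line syllables, and whether the shape is 5-7-5."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    counts = [line_syllables(ln) for ln in lines]
    return {"lines": len(lines), "syllables": counts, "is_575": counts == [5, 7, 5]}

sample = "An old silent pond\nA frog jumps into the pond\nSplash! Silence again"
print(check_haiku(sample))
print(check_haiku("Yes."))
```

Running every adversarial reply through check_haiku and logging the failures gives a simple adherence rate that can be tracked across prompts and sampling settings.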
What Next
- Build a local router model to intercept queries, deciding whether to steer them to the haiku bot or standard prose models.
- Test the throughput limits of the Docker-encapsulated OpenWebUI setup compared to raw command-line inference.
- Benchmark how higher temperatures (above 0.8) degrade the 5-7-5 constraint adherence during continuous generation.