Squeezing Haikus into Gemma 3 via Apple Silicon — Bench

Small language models are notoriously bad at counting. A 270-million parameter model has no intrinsic concept of phonetics; it sees characters as token IDs and transitions between them using raw probability. Asking it to write a haiku is an exercise in frustration because the model cannot hear the words it produces. To force Google’s Gemma 3 models to speak strictly in the 5-7-5 syllable structure of classical haikus, the token distribution must be constrained through supervised fine-tuning. This requires a perfectly clean dataset and a local training loop that does not melt the hardware.

Local Metal-Accelerated SFT

Training transformer models usually requires renting remote Nvidia GPUs, but Apple’s MLX framework opens a path for local execution on unified memory. The training script uses the mlx-tune library to run Parameter-Efficient Fine-Tuning on an M1 GPU. We load the base instruction-tuned model and configure a LoRA adapter. The adapter targets all major projection matrices: q_proj, k_proj, v_proj, o_proj in the self-attention mechanism, and gate_proj, up_proj, down_proj in the feed-forward network. We set the rank to 16 and alpha to 32. The M1 chip does not have fans; the aluminum chassis of the laptop warms up as the GPU runs at 100% capacity.

Loss Curves and Overfitting

The 270-million parameter model adapts quickly but overfits easily. At iteration 1, the validation loss stands at 7.885, reflecting the model’s complete lack of alignment with the target conversational format. By iteration 200, validation loss drops to 1.217. It reaches 0.455 at iteration 600 and bottoms out around 0.181 at iteration 1400. Toward the end of the 1500-step run, the validation loss slightly ticks up to 0.231, suggesting the limits of the adapter’s capacity to absorb the stylistic constraint without degrading general text representation. For the larger Gemma 3 1B model, the extra parameters provide more headroom, allowing the loss to stabilize more smoothly over 2000 steps. The smaller 270M model represents the absolute lower bound of structural alignment: it either hits the 5-7-5 syllable target exactly or collapses into incoherent repeating tokens when the temperature rises above 0.6.

What Next

Merge the trained LoRA adapter weights back into the base Gemma 3 layers using mlx_lm.fuse.
Convert the fused SafeTensors weights to 16-bit GGUF files to preserve model precision.
Write a custom Ollama Modelfile to set strict temperature parameters and template boundaries for inference.