Vector Embeddings for Semantic Search
How retrieval embeddings differ from generation embeddings, and the query→embed→rank pipeline that powers semantic search in Zion.
Two Kinds of Embeddings
The how-llms-work-llama3 note covers generation embeddings: token ID → 4096-dimensional vector, used internally as the model processes a prompt. Those vectors are intermediate state — the model produces them to think, not to store.
Retrieval embeddings are different. A retrieval embedding model takes an entire piece of text — a sentence, a paragraph, a document — and produces a single fixed-size vector that represents its meaning. The goal isn’t next-token prediction; it’s to encode meaning as a point in space such that similar meanings land near each other.
Zion uses nomic-embed-text for this. It takes a bench note and returns a 768-dimensional vector. Every note becomes a point in 768-dimensional space.
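Roughly what that call looks like against Ollama’s local HTTP API (a sketch, assuming the default localhost:11434 endpoint; `embedNote` is an illustrative name, not Zion’s actual code):

```ts
// Sketch: embed one piece of text via Ollama's local /api/embeddings endpoint.
// Assumes Ollama is running on its default port with nomic-embed-text pulled.
async function embedNote(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  if (!res.ok) throw new Error(`embedding request failed: ${res.status}`);
  const { embedding } = (await res.json()) as { embedding: number[] };
  return embedding; // 768 numbers for nomic-embed-text
}
```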
The Geometry of Meaning
Retrieval embeddings work because meaning is geometric. Train an embedding model on enough text and it learns that:
- “transformer architecture” and “attention mechanism” land near each other
- “fermi estimation” and “order of magnitude” cluster together
- “rust ownership” and “borrow checker” are close; “rust ownership” and “fermi estimation” are far apart
This isn’t hand-coded. It’s learned from co-occurrence patterns across massive corpora — words and phrases that appear in similar contexts get similar vectors.
The distance metric is cosine similarity — the angle between two vectors, not their magnitude:
cos(θ) = (A · B) / (|A| × |B|)
Range: −1 (opposite) to 1 (identical). In practice, well-matched documents score 0.85–0.98; unrelated ones score below 0.5.
Why cosine and not Euclidean distance? Euclidean distance is sensitive to vector magnitude, and a long document can produce a larger-magnitude vector than a short one. Cosine similarity normalises magnitude out, so a sentence and a paragraph about the same topic score high regardless of length.
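In code, the formula is a few lines (a minimal sketch; the function name is illustrative):

```ts
// Cosine similarity: the angle between two vectors, magnitude normalised out.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];   // A · B
    normA += a[i] * a[i]; // |A|²
    normB += b[i] * b[i]; // |B|²
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```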
The Semantic Search Pipeline
Zion’s search flow:
1. Index time (once, or on /reindex):
- Read all bench notes from src/content/bench/
- For each note: call nomic-embed-text via Ollama → 768-dim vector
- Store: { slug, title, vector } in embeddings cache (JSON file)
2. Query time (on /find "query"):
- Embed the query string → 768-dim vector
- Compute cosine similarity against every cached note vector
- Rank by similarity score
- Return top-k results
The entire vector store fits in memory for ~200 notes (~200 × 768 × 4 bytes ≈ 600 KB). No vector database needed at this scale — a flat file and a dot-product loop are fast enough, as the sketch below shows.
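A sketch of the query path under those assumptions, composing the `embedNote` and `cosineSimilarity` sketches above (the `embeddings.json` filename and `topK` name are illustrative):

```ts
import { readFileSync } from "node:fs";

// Shape of one cached entry, matching the { slug, title, vector } record above.
interface CachedNote {
  slug: string;
  title: string;
  vector: number[];
}

// Score every cached note against the query vector, rank, return the top k.
function topK(queryVec: number[], cache: CachedNote[], k = 5) {
  return cache
    .map((note) => ({
      slug: note.slug,
      title: note.title,
      score: cosineSimilarity(queryVec, note.vector),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Illustrative usage, assuming a JSON cache written at index time:
const cache: CachedNote[] = JSON.parse(readFileSync("embeddings.json", "utf8"));
const results = topK(await embedNote("how does attention work"), cache);
```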
Why nomic-embed-text
Two properties matter for a local retrieval model:
Asymmetric retrieval. The query (“how does attention work”) and the document (“a transformer layer computes Q·K/√d…”) are semantically related but phrased differently. A good embedding model is trained with query/document pairs, not just document/document similarity. nomic-embed-text uses a contrastive training objective — it learns to pull (query, relevant-doc) pairs together and push (query, irrelevant-doc) pairs apart.
Runs locally. 768 dimensions, small model, Ollama-compatible. The alternative — sending notes to an API for embedding — would mean every /reindex leaks your entire second brain to a third party.
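One practical consequence of the asymmetric training: nomic’s model card recommends task prefixes on each input (search_query: for queries, search_document: for documents) so the model knows which side of the pair it is embedding. A sketch, reusing embedNote from above:

```ts
// nomic-embed-text is trained with task prefixes; per its model card, queries
// and documents should be embedded with different ones.
const docVec = await embedNote("search_document: " + "a transformer layer computes Q·K/√d…");
const queryVec = await embedNote("search_query: " + "how does attention work");
```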
Semantic Search vs Keyword Search
| | Keyword | Semantic |
|---|---|---|
| Query: “power of ten” | Finds notes containing the phrase | Finds notes about order-of-magnitude reasoning even when the exact phrase is absent |
| Query: “how rust handles memory” | Misses notes that say “ownership” not “memory” | Finds ownership, borrow checker, lifetimes — same concept, different words |
| Speed | O(n) text scan | O(n) dot products — same complexity, but parallelisable |
| Failure mode | Synonym blindness | Metaphor confusion, false positives on tangentially related topics |
Semantic search doesn’t replace keyword search — it complements it. For Zion’s use case (finding relevant notes to synthesise), semantic is the right default. For finding a specific slug or title, grep is faster.
The Limitation
Retrieval embeddings compress a document into one vector. That compression is lossy. A long note with multiple topics (say, a bench note that covers both circuit design and cost estimation) gets one vector — an average of its topics. It may score moderately on multiple queries rather than highly on any.
Chunking addresses this: split the document into paragraphs, embed each chunk separately, and store a chunk → parent mapping. When a query matches a chunk, you retrieve the parent note. More vectors, better retrieval. Zion currently embeds whole notes — a known tradeoff for simplicity.
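A sketch of what chunked indexing could look like (the blank-line splitter and chunkId scheme are illustrative, not Zion’s actual code; embedNote is the earlier sketch):

```ts
// One vector per paragraph, each remembering which note it came from.
interface ChunkEntry {
  parentSlug: string; // the bench note this chunk belongs to
  chunkId: number;    // position within the note
  vector: number[];
}

// Split a note into paragraph chunks and embed each separately.
// Splitting on blank lines is a naive stand-in for a real chunker.
async function indexChunks(slug: string, body: string): Promise<ChunkEntry[]> {
  const paragraphs = body.split(/\n\s*\n/).filter((p) => p.trim().length > 0);
  const entries: ChunkEntry[] = [];
  for (let i = 0; i < paragraphs.length; i++) {
    entries.push({ parentSlug: slug, chunkId: i, vector: await embedNote(paragraphs[i]) });
  }
  return entries;
}

// At query time: rank chunks exactly as before, then map the top chunks back
// to their parentSlug and return the (deduplicated) parent notes.
```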