Vector Embeddings for Semantic Search
How retrieval embeddings differ from generation embeddings, and the query→embed→rank pipeline that powers semantic search in Zion.
Two Kinds of Embeddings
The how-llms-work-llama3 note covers generation embeddings: token ID → 4096-dimensional vector, used internally as the model processes a prompt. Those vectors are intermediate state — the model produces them to think, not to store.
Retrieval embeddings are different. A retrieval embedding model takes an entire piece of text — a sentence, a paragraph, a document — and produces a single fixed-size vector that represents its meaning. The goal isn’t next-token prediction; it’s to encode meaning as a point in space such that similar meanings land near each other.
Zion uses nomic-embed-text for this. It takes a bench note and returns a 768-dimensional vector. Every note becomes a point in 768-dimensional space.
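Roughly what that call looks like against Ollama’s local HTTP API (a sketch, assuming the default localhost:11434 endpoint; `embedNote` is an illustrative name, not Zion’s actual code):

```ts
// Sketch: embed one piece of text via Ollama's local /api/embeddings endpoint.
// Assumes Ollama is running on its default port with nomic-embed-text pulled.
async function embedNote(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  if (!res.ok) throw new Error(`embedding request failed: ${res.status}`);
  const { embedding } = (await res.json()) as { embedding: number[] };
  return embedding; // 768 numbers for nomic-embed-text
}
```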
The Geometry of Meaning
Retrieval embeddings work because meaning is geometric. Train an embedding model on enough text and it learns that:
- “transformer architecture” and “attention mechanism” land near each other
- “fermi estimation” and “order of magnitude” cluster together
- “rust ownership” and “borrow checker” are close; “rust ownership” and “fermi estimation” are far apart
This isn’t hand-coded. It’s learned from co-occurrence patterns across massive corpora — words and phrases that appear in similar contexts get similar vectors.
The distance metric is cosine similarity — the angle between two vectors, not their magnitude:
cos(θ) = (A · B) / (|A| × |B|)
Range: −1 (opposite) to 1 (identical). In practice, well-matched documents score 0.85–0.98; unrelated ones score below 0.5.
Why cosine and not Euclidean distance? Euclidean distance is sensitive to vector magnitude, and a long document can produce a larger-magnitude vector than a short one. Cosine similarity normalises magnitude out, so a sentence and a paragraph about the same topic score high regardless of length.
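In code, the formula is a few lines (a minimal sketch; the function name is illustrative):

```ts
// Cosine similarity: the angle between two vectors, magnitude normalised out.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];   // A · B
    normA += a[i] * a[i]; // |A|²
    normB += b[i] * b[i]; // |B|²
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```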
The Semantic Search Pipeline
Zion’s search flow:
1. Index time (once, or on /reindex):
- Read all bench notes from src/content/bench/
- For each note: call nomic-embed-text via Ollama → 768-dim vector
- Store: { slug, title, vector } in embeddings cache (JSON file)
2. Query time (on /find "query"):
- Embed the query string → 768-dim vector
- Compute cosine similarity against every cached note vector
- Rank by similarity score
- Return top-k results
The entire vector store fits in memory for ~200 notes (~200 × 768 × 4 bytes ≈ 600 KB). No vector database needed at this scale — a flat file and a dot-product loop are fast enough, as the sketch below shows.
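A sketch of the query path under those assumptions, composing the `embedNote` and `cosineSimilarity` sketches above (the `embeddings.json` filename and `topK` name are illustrative):

```ts
import { readFileSync } from "node:fs";

// Shape of one cached entry, matching the { slug, title, vector } record above.
interface CachedNote {
  slug: string;
  title: string;
  vector: number[];
}

// Score every cached note against the query vector, rank, return the top k.
function topK(queryVec: number[], cache: CachedNote[], k = 5) {
  return cache
    .map((note) => ({
      slug: note.slug,
      title: note.title,
      score: cosineSimilarity(queryVec, note.vector),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Illustrative usage, assuming a JSON cache written at index time:
const cache: CachedNote[] = JSON.parse(readFileSync("embeddings.json", "utf8"));
const results = topK(await embedNote("how does attention work"), cache);
```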
Why nomic-embed-text
Two properties matter for a local retrieval model:
Asymmetric retrieval. The query (“how does attention work”) and the document (“a transformer layer computes Q·K/√d…”) are semantically related but phrased differently. A good embedding model is trained with query/document pairs, not just document/document similarity. nomic-embed-text uses a contrastive training objective — it learns to pull (query, relevant-doc) pairs together and push (query, irrelevant-doc) pairs apart.
Runs locally. 768 dimensions, small model, Ollama-compatible. The alternative — sending notes to an API for embedding — would mean every /reindex leaks your entire second brain to a third party.
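One practical consequence of the asymmetric training: nomic’s model card recommends task prefixes on each input (search_query: for queries, search_document: for documents) so the model knows which side of the pair it is embedding. A sketch, reusing embedNote from above:

```ts
// nomic-embed-text is trained with task prefixes; per its model card, queries
// and documents should be embedded with different ones.
const docVec = await embedNote("search_document: " + "a transformer layer computes Q·K/√d…");
const queryVec = await embedNote("search_query: " + "how does attention work");
```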
Semantic Search vs Keyword Search
| | Keyword | Semantic |
|---|---|---|
| Query: “power of ten” | Finds notes containing the phrase | Finds notes about order-of-magnitude reasoning even when the exact phrase is absent |
| Query: “how rust handles memory” | Misses notes that say “ownership” not “memory” | Finds ownership, borrow checker, lifetimes — same concept, different words |
| Speed | O(n) text scan | O(n) dot products — same complexity, but parallelisable |
| Failure mode | Synonym blindness | Metaphor confusion, false positives on tangentially related topics |
Semantic search doesn’t replace keyword search — it complements it. For Zion’s use case (finding relevant notes to synthesise), semantic is the right default. For finding a specific slug or title, grep is faster.
The Limitation
Retrieval embeddings compress a document into one vector. That compression is lossy. A long note with multiple topics (say, a bench note that covers both circuit design and cost estimation) gets one vector — an average of its topics. It may score moderately on multiple queries rather than highly on any.
Chunking addresses this: split the document into paragraphs, embed each chunk separately, and store a chunk → parent mapping. When a query matches a chunk, you retrieve the parent note. More vectors, better retrieval. Zion currently embeds whole notes — a known tradeoff for simplicity.
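A sketch of what chunked indexing could look like (the blank-line splitter and chunkId scheme are illustrative, not Zion’s actual code; embedNote is the earlier sketch):

```ts
// One vector per paragraph, each remembering which note it came from.
interface ChunkEntry {
  parentSlug: string; // the bench note this chunk belongs to
  chunkId: number;    // position within the note
  vector: number[];
}

// Split a note into paragraph chunks and embed each separately.
// Splitting on blank lines is a naive stand-in for a real chunker.
async function indexChunks(slug: string, body: string): Promise<ChunkEntry[]> {
  const paragraphs = body.split(/\n\s*\n/).filter((p) => p.trim().length > 0);
  const entries: ChunkEntry[] = [];
  for (let i = 0; i < paragraphs.length; i++) {
    entries.push({ parentSlug: slug, chunkId: i, vector: await embedNote(paragraphs[i]) });
  }
  return entries;
}

// At query time: rank chunks exactly as before, then map the top chunks back
// to their parentSlug and return the (deduplicated) parent notes.
```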