← LOGBOOK LOG-317
EXPLORING · SOFTWARE ·
AI · MACHINE-LEARNING · EMBEDDINGS · WORD2VEC · VECTOR-SPACE · SEMANTICS · REPRESENTATION-LEARNING

Embeddings and the Geometry of Meaning

Word2Vec's 'king − man + woman = queen' was not a party trick. It was evidence that semantic relationships are encoded as geometric relationships in vector space. Modern language models extend this across all of language.

The Idea That Meaning Has Shape

Distributional semantics — the hypothesis that words that appear in similar contexts have similar meanings — has roots in structural linguistics dating to Firth’s 1957 observation: “You shall know a word by the company it keeps.” The computational formalization is simple: represent each word as a vector of counts over the contexts it appears in, and measure semantic similarity as the geometric similarity (cosine similarity) between vectors.
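A toy illustration of that count-vector formulation, using cosine similarity over hypothetical context counts (the words, contexts, and counts below are made up for the example, not drawn from any corpus):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical context-count vectors over the contexts ["drink", "cup", "hot", "loan"].
coffee = np.array([4, 3, 2, 0])
tea    = np.array([3, 4, 3, 0])
bank   = np.array([0, 0, 1, 5])

print(cosine_similarity(coffee, tea))   # high: similar context distributions
print(cosine_similarity(coffee, bank))  # low: different context distributions
```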

This is a useful approximation but a crude one. The resulting vectors are sparse (most context counts are zero), high-dimensional (one dimension per context word), and carry no structure — the direction of the vector is arbitrary, and the coordinates don’t have interpretable meaning.

The revolution was the discovery, around 2013, that dense low-dimensional vectors learned by neural networks to predict word context encode semantic and syntactic structure as geometric structure. Relationships between concepts map to directions in the embedding space. Geometry encodes meaning.

Word2Vec

Tomas Mikolov and colleagues at Google published Word2Vec in 2013. The training task is simple: given a word, predict its neighboring words in a text corpus (the skip-gram variant), or given the neighbors, predict the center word (the continuous bag-of-words, CBOW, variant). A shallow neural network trained on this task learns a dense vector representation for each word such that words that appear in similar contexts end up with similar vectors.
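A minimal training sketch, assuming the gensim library (version 4.x) rather than the original C implementation; the toy corpus and hyperparameters are illustrative only, and real training uses billions of tokens:

```python
from gensim.models import Word2Vec  # assumed library, not Mikolov's original tool

# A tiny, pre-tokenized corpus for illustration.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walked", "to", "the", "market"],
    ["the", "woman", "walked", "to", "the", "market"],
]

# sg=1 selects the skip-gram objective: predict context words from the center word.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

# Words that share contexts end up with similar vectors.
print(model.wv.most_similar("king", topn=3))
print(model.wv["queen"])  # the dense vector itself
```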

The demonstration that captured attention was analogy completion via vector arithmetic. The vector for “king” minus the vector for “man” plus the vector for “woman” equals, approximately, the vector for “queen.” Not just this pair — the male-female relationship is consistently represented as a direction in the embedding space. “Paris − France + Italy” ≈ “Rome.” “Walking − walked + swam” ≈ “swimming.” Grammatical relationships (present-past tense, singular-plural) and semantic relationships (country-capital, person-profession) are both encoded as consistent geometric directions.
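The arithmetic itself is just a nearest-neighbor query. A from-scratch sketch, assuming `vectors` is a dict of already-trained word vectors; the helper name and interface are mine, not part of Word2Vec:

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """Return the word closest (by cosine) to vec(b) - vec(a) + vec(c)."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # standard practice: never return one of the query words
        score = np.dot(vec, target) / np.linalg.norm(vec)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# With real trained embeddings loaded into `vectors`, this returns "queen":
# solve_analogy("man", "king", "woman", vectors)
```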

This was genuinely surprising. The network was trained to predict context words. It wasn’t told what “analogy” means or that “king” and “queen” are gender-related. It inferred the geometric structure of these relationships from the statistical patterns in co-occurrence across a large corpus. The semantic relationships weren’t put in; they emerged.

What Embeddings Are Doing

The Word2Vec result invited a theoretical question that’s still not fully resolved: why does predicting context words produce geometrically structured semantic representations?

Levy and Goldberg (2014) showed that Word2Vec’s skip-gram model with negative sampling (predict neighbors from center) implicitly factorizes a shifted matrix of Pointwise Mutual Information (PMI) between words and their contexts. PMI between a word w and a context c is high when w and c co-occur more often than chance would predict. Factorizing the PMI matrix with low-rank vectors produces dense embeddings that capture the dominant structure of the co-occurrence distribution.
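A sketch of the count-based route to the same kind of embedding: build a positive-PMI matrix from a toy corpus and factorize it with a truncated SVD. The corpus, window size, and dimensionality are arbitrary choices for illustration, and the shift and weighting details from the paper are omitted:

```python
import numpy as np
from collections import Counter

corpus = [
    "the king rules the kingdom".split(),
    "the queen rules the kingdom".split(),
    "the man walked to the river".split(),
    "the woman walked to the river".split(),
]

# Window-based (word, context) co-occurrence counts.
window = 2
pair_counts, word_counts, total = Counter(), Counter(), 0
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i == j:
                continue
            pair_counts[(w, sent[j])] += 1
            word_counts[w] += 1
            total += 1

vocab = sorted(word_counts)
idx = {w: k for k, w in enumerate(vocab)}

# Positive PMI: max(0, log [P(w, c) / (P(w) P(c))]), estimated from the counts above.
ppmi = np.zeros((len(vocab), len(vocab)))
for (w, c), n in pair_counts.items():
    ppmi[idx[w], idx[c]] = max(0.0, np.log(n * total / (word_counts[w] * word_counts[c])))

# Truncated SVD = low-rank factorization; rows of `embeddings` are dense word vectors.
U, S, _ = np.linalg.svd(ppmi)
dim = 4
embeddings = U[:, :dim] * S[:dim]
print(vocab[0], embeddings[0].round(2))
```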

The geometric structure follows from this. Words that share many contexts (and therefore have high PMI with similar words) get embedded near each other. Relationships that hold between many pairs (gender relationships, tense relationships) are represented as consistent directions because they produce consistent PMI structure across many word pairs. The geometry is a consequence of the statistical structure of language, not a design choice.

This is the distributional hypothesis instantiated in low-dimensional geometry: meaning is in the distribution, the distribution has structure, and dense vector spaces represent that structure geometrically.

Contextual Embeddings

Word2Vec and GloVe produce static embeddings — one vector per word, regardless of context. “Bank” has the same vector whether it’s a river bank or a financial institution. This is a significant limitation for words with multiple senses and for understanding meaning in context.

The transformer architecture produces contextual embeddings. The representation of each token at each layer is computed from that token’s interaction with all other tokens in the sequence, via self-attention. The embedding for “bank” in “I deposited money at the bank” is different from its embedding in “the fish swam to the bank.” The context is baked into the representation.
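A sketch of the effect, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint: the same surface word gets noticeably different vectors in different sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer  # assumed: Hugging Face transformers

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_word(sentence, word):
    """Last-layer hidden state for the (single-subtoken) occurrence of `word`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v_money = embed_word("i deposited money at the bank", "bank")
v_river = embed_word("the fish swam to the bank", "bank")
print(float(torch.cosine_similarity(v_money, v_river, dim=0)))  # well below 1.0
```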

BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2018) trained a transformer on two tasks: masked language modeling (predict a randomly masked token from its context) and next sentence prediction. The representations learned are deeply contextual — they capture syntactic structure, semantic roles, coreference, and discourse relationships, all encoded in the geometry of the embedding space at each layer.

Probing studies of BERT and similar models found that different layers encode different types of information. Early layers capture local syntactic structure (part-of-speech, dependency relations). Middle layers capture longer-range syntactic relationships. Later layers capture semantic content. The hierarchy of representations roughly mirrors the linguistic hierarchy from surface form to meaning — not because it was designed that way, but because the training objective selected for representations that carry this information.

The Embedding Space as Knowledge Representation

The embedding space of a large language model is, in some sense, a compressed representation of the statistical regularities in the training corpus — which means it is, in some sense, a compressed representation of human knowledge encoded in text.

Concepts that frequently co-occur in context are embedded near each other. Abstract relationships that hold consistently across many instances are encoded as geometric directions. The model’s “knowledge” about the world — the factual associations, semantic relationships, and conceptual structure — is distributed across the geometry of this high-dimensional space.

This has practical consequences. When you do retrieval-augmented generation (RAG) — giving a language model access to a knowledge base by retrieving relevant documents — the retrieval is typically done by embedding the query and the documents into the same vector space and finding nearest neighbors by cosine similarity. The embedding space is acting as a semantic index: proximity in the space corresponds to relevance in meaning.
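A minimal sketch of such a semantic index, assuming the sentence-transformers library and an off-the-shelf embedding model; the documents and the model name are placeholders:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding library

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint name

documents = [
    "Paris is the capital of France.",
    "The Seine flows through Paris.",
    "Word2Vec learns dense word vectors from co-occurrence statistics.",
]
# Embed documents once; unit-normalized vectors make dot product equal cosine similarity.
doc_vecs = model.encode(documents, normalize_embeddings=True)

def retrieve(query, k=2):
    """Rank documents by cosine similarity to the embedded query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q
    top = np.argsort(-scores)[:k]
    return [(documents[i], float(scores[i])) for i in top]

print(retrieve("What is the capital of France?"))
```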

It also has interpretability consequences. The knowledge of a language model is not stored in discrete, locatable facts — it is distributed across parameter matrices, encoding correlations between context patterns and output distributions. There is no row in a lookup table where the fact “Paris is the capital of France” is stored. It is encoded in the geometry of the embeddings for “Paris,” “France,” “capital,” and their relations. This distributed encoding is what makes language models simultaneously impressive (they can retrieve and combine information flexibly) and unreliable (they cannot verify a claim against ground truth, and they generate plausible-sounding fabrications).

Representation Learning as the Core Achievement

The broader significance of embedding research is what it says about learning. The systems described — Word2Vec, BERT, GPT — were not given structured knowledge. They were trained on raw text with simple self-supervised objectives (predict a masked word, predict the next word). The structure they learned — the geometric encoding of semantic relationships, the hierarchy of syntactic and semantic representations — emerged from the training objective applied to large amounts of data.

This is representation learning: learning what features of the input are relevant, what relationships to encode, what structure to extract — automatically, from data, without explicit supervision about what the representation should look like. The features are not hand-designed; they are discovered.

The success of deep learning is largely the success of representation learning. In image recognition, the feature engineering that dominated the 2000s (SIFT, HOG descriptors, hand-designed filters) was replaced by learned representations from convolutional networks. In NLP, the hand-crafted linguistic features of the 2000s (parse trees, part-of-speech tags, dependency relations) were replaced by contextual embeddings from transformers.

The Word2Vec result was an early, clean demonstration that distributed representations learned from raw data can encode meaningful structure. Everything since has been a matter of scaling that principle up with better architectures.