
Embeddings Explained Like You're Human

Similarity over meaning, and why search works until it doesn't.

#embeddings #search #meaning #similarity
[Image: Vector space visualization showing semantic clustering of points]

Key takeaways

  • Embeddings turn meaning into distance: closer vectors imply similar intent.
  • Similarity helps you find related items, not necessarily correct ones.
  • Bias in data becomes bias in vector space.
  • Embeddings drift as language and data change.

What is an embedding?

An embedding is a mathematical representation of a concept (such as a word, sentence, or image) as a high-dimensional vector. By converting qualitative meaning into quantitative coordinates, embeddings allow computers to calculate semantic similarity mathematically: concepts with similar meanings are located closer together in the vector space.
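
To make "closer together" concrete, here is a minimal sketch using cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors are invented for illustration; real models produce hundreds or thousands of dimensions, and the name toy_vectors is an assumption, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors, hand-written for this example (not model output).
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_dog_puppy = cosine_similarity(toy_vectors["dog"], toy_vectors["puppy"])
sim_dog_car = cosine_similarity(toy_vectors["dog"], toy_vectors["car"])

print(f"dog vs puppy: {sim_dog_puppy:.2f}")
print(f"dog vs car:   {sim_dog_car:.2f}")
```

With these made-up numbers, "dog" scores far higher against "puppy" than against "car", which is the whole idea: similarity in meaning becomes a number you can compare.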

In practice, clarity at boundaries reduces downstream errors more than late-stage tuning.

Act I: The fundamentals

Embedding space as distance

Imagine a library where books are not organized by author or title, but by the ideas they contain. Books about dogs would be in one corner, books about cats in another, and books about dog training would be somewhere in between. This is what embeddings do for language.

An embedding model takes a piece of text and maps it to a vector. The model is trained on vast amounts of text to learn the relationships between words. For example, the vectors for "king" and "queen" will be closer together than the vectors for "king" and "cabbage." More interestingly, the offsets between vectors can capture relationships, as in the famous example from word-level models such as word2vec: vector('king') - vector('man') + vector('woman') ≈ vector('queen'). The "≈" matters: the result lands near "queen," not exactly on it, and not every analogy works this cleanly.
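
The arithmetic can be sketched with toy numbers. The vectors below are hand-picked so the example works; they are not output from any real model, and the helper name analogy is hypothetical.

```python
import math

# Hypothetical 3-dimensional vectors. The dimensions loosely stand for
# (royalty, male, female); real models do not learn labeled axes like this.
toy_vectors = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "cabbage": [0.1, 0.2, 0.2],
}

def analogy(a, b, c, vectors):
    """Return the word nearest to vector(a) - vector(b) + vector(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in {a, b, c})
    # Euclidean distance to the target; the closest remaining word wins.
    return min(candidates, key=lambda word: math.dist(vectors[word], target))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

Excluding the three input words from the candidates is a common convention in analogy tests; without it, the nearest vector is often one of the inputs.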

[Figure: Embedding Space. A 2D space showing related words clustered together, like 'dog' and 'puppy', and 'cat' and 'kitten'. Point labels: dog, puppy, cat, kitten, car.]
In embedding space, proximity represents semantic similarity.

Act II: The modern paradigm

Semantic search in practice

This ability to capture meaning as proximity has revolutionized how we work with unstructured data. The primary application is semantic search. Instead of matching keywords, we can now search for meaning. A search for "small dog" can return documents containing the word "puppy," even if "small dog" is never explicitly mentioned.

This powers a wide range of applications:

  • Recommendation engines: Find items (products, articles, songs) similar to what a user has liked.
  • Classification: Categorize text by finding the closest known category in the embedding space.
  • Clustering: Identify groups of related documents in a large corpus without pre-defined labels.

These systems work by converting both the query and the documents into embeddings and then finding the "nearest neighbors" in the vector space.
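
The nearest-neighbor step can be sketched as a brute-force scan. This assumes the query and documents have already been embedded; the document IDs and vectors below are invented, and query_vector merely stands in for whatever a real model would return for "small dog".

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_neighbors(query_vector, index, k=2):
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query_vector, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical pre-computed document embeddings (not real model output).
index = {
    "puppy-care-guide": [0.85, 0.9, 0.1],
    "dog-training-tips": [0.9, 0.7, 0.15],
    "used-car-listings": [0.1, 0.15, 0.95],
}

query_vector = [0.8, 0.85, 0.1]  # stand-in for the embedding of "small dog"
results = nearest_neighbors(query_vector, index, k=2)

for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Scanning every document is fine for a sketch or a small corpus. Larger systems typically swap this loop for an approximate nearest-neighbor index, trading a little accuracy for speed.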

Act III: Principles in practice

Limits, bias, and drift

The key limitation of embeddings is that they model statistical similarity, not true understanding or factual accuracy. They learn from the text they are trained on, and they will faithfully reproduce its biases and associations. If the training data frequently associates "doctor" with "he" and "nurse" with "she," the embedding space will reflect that bias.
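
One simple way to probe for this kind of association is to compare how close an occupation word sits to "he" versus "she". The sketch below uses invented two-dimensional vectors that deliberately encode the bias described above; the helper name gender_lean is hypothetical, and a real audit would use actual model vectors and many more word pairs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical vectors, constructed to show a biased space (not model output).
toy_vectors = {
    "he": [1.0, 0.0],
    "she": [0.0, 1.0],
    "doctor": [0.7, 0.3],
    "nurse": [0.2, 0.8],
}

def gender_lean(word, vectors):
    """Positive: closer to "he". Negative: closer to "she". Near zero: no clear lean."""
    return (cosine_similarity(vectors[word], vectors["he"])
            - cosine_similarity(vectors[word], vectors["she"]))

doctor_lean = gender_lean("doctor", toy_vectors)
nurse_lean = gender_lean("nurse", toy_vectors)

print(f"doctor: {doctor_lean:+.2f}")
print(f"nurse:  {nurse_lean:+.2f}")
```

A score like this only measures association in the vector space. It says nothing about whether the association is appropriate for your use case; that judgment stays with you.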

Furthermore, context is critical. The word "bank" has very different meanings in "river bank" and "investment bank." Modern contextual models handle this better because they embed a word together with its surrounding text, but a single, static embedding per word collapses both senses into one vector and can still be a source of error. The embedding is an approximation of meaning, not meaning itself.

Another practical limit is domain mismatch. A general-purpose embedding model may perform well on broad internet language while underperforming on specialized legal, medical, or internal business vocabulary. The fix is to evaluate the model on queries and documents from your own domain before relying on it. When teams skip that evaluation, retrieval quality silently drops and the failure surfaces as "hallucination," even though the root cause is weak candidate retrieval.
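
A domain evaluation does not need to be elaborate. One common metric is recall@k: of the documents a human marked as relevant, how many showed up in the top k results? The sketch below assumes you already have a small labeled set; the document IDs and the eval_set structure are invented for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Hypothetical labeled examples: what the system returned vs. what a reviewer
# judged relevant for each query.
eval_set = [
    {"retrieved": ["contract-12", "memo-3", "contract-7"],
     "relevant": ["contract-12", "contract-7"]},
    {"retrieved": ["memo-9", "memo-3", "policy-1"],
     "relevant": ["policy-4"]},
]

scores = [recall_at_k(ex["retrieved"], ex["relevant"], k=3) for ex in eval_set]
mean_recall = sum(scores) / len(scores)

print(f"mean recall@3: {mean_recall:.2f}")
```

Here the first query finds both relevant documents and the second finds none. A low score on your own vocabulary is a signal to try a different model or a domain-adapted one before blaming the generation step.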

For production use, monitor drift directly: sample recurring queries, track nearest-neighbor quality over time, and re-index when corpus semantics change. Embeddings are not set-and-forget infrastructure; they are living representations of your language layer.
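
One way to track nearest-neighbor stability is to save the top results for a fixed set of recurring queries and compare them against a later snapshot. The sketch below uses Jaccard overlap as the comparison; the queries, document IDs, function names, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def neighbor_overlap(before, after):
    """Jaccard overlap between two result lists: 1.0 = identical sets, 0.0 = disjoint."""
    before_set, after_set = set(before), set(after)
    union = before_set | after_set
    if not union:
        return 1.0  # two empty result lists count as unchanged
    return len(before_set & after_set) / len(union)

def flag_drifted_queries(baseline, current, threshold=0.5):
    """Return (query, overlap) for queries whose neighbors changed more than allowed."""
    flagged = []
    for query, before in baseline.items():
        overlap = neighbor_overlap(before, current.get(query, []))
        if overlap < threshold:
            flagged.append((query, overlap))
    return flagged

# Hypothetical snapshots of top-3 results for two recurring queries.
baseline = {
    "reset password": ["kb-1", "kb-2", "kb-3"],
    "refund policy": ["kb-7", "kb-8", "kb-9"],
}
current = {
    "reset password": ["kb-1", "kb-2", "kb-4"],
    "refund policy": ["kb-20", "kb-21", "kb-7"],
}

flagged = flag_drifted_queries(baseline, current)
print(flagged)
```

A changed neighbor set is a prompt to look, not proof of degradation: new documents may simply be better matches. Pair a check like this with a periodic human review of the flagged queries.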

For related systems context, see Systems 001: Foundations and From Prompt to Production. To see how embeddings and semantic retrieval are validated in practice, explore our retrieval and grounding evaluation kit.

What this changes in practice

Use embeddings for finding things that are "like" each other, but do not mistake that similarity for a guarantee of factual correctness or conceptual understanding.

Proof Block

  • Core conceptual reference for similarity and search in vector space
  • Referenced in retrieval-augmented-generation-in-plain-terms.mdx

FAQ

What is an embedding in simple terms?

An embedding turns concepts (words, sentences, images) into lists of numbers called vectors. These vectors place the concept in a semantic space where distance approximates difference in meaning. Similar concepts end up close together.

Why does similarity search sometimes return wrong results?

Similarity search finds semantically related items based on training data patterns, not actual correctness. A question about a river "bank" might retrieve documents about finance if the training data contained far more financial contexts. Embeddings capture statistical regularities, not truth.

How do embeddings drift over time?

Language changes. New words emerge, meanings shift, and domain-specific usage evolves. When embeddings are built on outdated data, they no longer match current language patterns. Counter this by monitoring retrieval quality on recurring queries and re-indexing when it degrades.