Don't use cosine similarity carelessly
Cosine similarity is AI's duct tape—everyone uses it to compare vectors, but it often matches questions to questions instead of answers, and surfaces to surfaces instead of meaning. Here's how to actually measure what you care about.
TLDR
• Most models aren't trained on cosine similarity, so using it for comparison is arbitrary—you're measuring correlations the model never optimized for
• Even when models use cosine similarity, they learn the wrong kind: asking "What did I do with my keys?" matches "What did I do with my life?" (another question) over actual answers about key locations
• Best fix: train custom embeddings with Q/K matrices that transform vectors into task-specific spaces (queries vs keys) for asymmetric similarity
• Quick hacks: add context-setting prompts before embedding ("Nationality of {person}") or rewrite text to extract only relevant aspects before comparison
• The convenience of cosine similarity masks that you're often measuring surface patterns (writing style, question structure) rather than semantic relevance
In Detail
The author challenges the default practice of using cosine similarity to compare embeddings, arguing it's a convenient hack that often measures the wrong thing. The core issue is a triple mismatch: between what models were trained on, what cosine similarity measures, and what you actually care about. Most models use loss functions like cross-entropy on unnormalized vectors, not cosine similarity—so applying it afterward is arbitrary. Even when models ARE trained on cosine similarity, they optimize for whatever "similarity" appeared in training data, which might mean matching questions to questions rather than questions to answers.
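The mismatch is easy to see in code: cosine similarity is just a normalized dot product, so it discards exactly the magnitude information a model trained with an unnormalized loss may have been relying on. A minimal NumPy sketch (the example vectors are illustrative, not from the article):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b: a.b / (|a| |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vectors with very different magnitudes but the same direction score ~1.0:
# the normalization throws away the norm, which a model trained with an
# unnormalized loss (e.g. cross-entropy on raw logits) may actually use.
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])
print(cosine_similarity(a, b))  # ~1.0
```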
The "keys" example demonstrates this failure mode perfectly: "What did I do with my keys?" scores highest similarity to "What did I do with my life?" (another existential question) rather than "They are on the table" (an actual answer). This happens because embeddings capture surface patterns—question structure, writing style, even typos—as much as semantic meaning. The author presents a hierarchy of solutions: directly using LLM queries with structured prompts (best but expensive), training custom embeddings with separate query and key transformation matrices Q and K (creating asymmetric similarity spaces), or quick fixes like prompt engineering to bias context ("This is a country that produced {person}" to focus on nationality) and preprocessing text to extract only relevant aspects before embedding.
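The Q/K approach can be sketched as two separate projections—one for queries, one for keys—so similarity is computed in a shared space but is no longer symmetric. The matrices below are random placeholders; in a real system they would be trained (e.g. contrastively on question/answer pairs):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 4  # embedding dim and projection dim (illustrative sizes)

# Hypothetical learned projections: random matrices stand in for
# weights that would be trained on query/answer pairs.
Q = rng.normal(size=(k, d))  # projects query embeddings
K = rng.normal(size=(k, d))  # projects key/answer embeddings

def asymmetric_score(query_emb: np.ndarray, key_emb: np.ndarray) -> float:
    """sim(q, a) = (Qq) . (Ka) -- not symmetric, since swapping the
    arguments routes each vector through the other projection."""
    return float((Q @ query_emb) @ (K @ key_emb))

q = rng.normal(size=d)  # stand-in for an embedded question
a = rng.normal(size=d)  # stand-in for an embedded answer
print(asymmetric_score(q, a), asymmetric_score(a, q))  # generally unequal
```

Because questions and answers pass through different projections, a question no longer scores highest against other questions merely for sharing question structure.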
The key insight is that cosine similarity's mathematical elegance and bounded output (-1 to 1) create an illusion of objectivity while hiding that you're measuring arbitrary correlations. The solution isn't abandoning vector similarity but being intentional about what kind of similarity you need—then either training for it explicitly or engineering your inputs to focus on those aspects. The author's nationality experiment shows how the same person (Newton) can be "similar" to physicists or to British figures depending on how you frame the text before embedding, illustrating that similarity is always relative to how you construct the comparison space.
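The framing trick amounts to wrapping the text in a template before calling the embedding model. In the sketch below, `embed` is a deterministic toy stand-in (a hash-seeded random vector), not a real model; the nationality template is from the article, while the profession template is an illustrative counterpart:

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Deterministic toy stand-in for a real embedding model (demo only)."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

person = "Isaac Newton"
# The same entity, framed two ways: with a real embedding model, each
# template steers which aspect the vector encodes, so nearest neighbors
# differ (British figures vs. fellow physicists).
by_nationality = embed(f"This is a country that produced {person}")
by_profession = embed(f"This is the profession of {person}")
```

With a real model, the two vectors would sit in different neighborhoods of the embedding space even though both mention the same person—which is exactly the "comparison space" point the author is making.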