Embeddings in Plain English
A practical, human-friendly guide to embeddings: what they are, what they unlock (search, clustering, dedupe), how distance and similarity work, how to choose models, chunk text, store vectors, keep indexes fresh, evaluate quality, and control cost.

Embeddings sound like something that belongs in a research paper, but in practice they’re just this:
Take text (or code, or images), turn it into a list of numbers that capture its meaning, and then compare those lists.
Those lists of numbers are embeddings. They’re the backbone of semantic search, retrieval‑augmented generation (RAG), recommendation, and pretty much any workflow where you want machines to deal with meaning, not just exact keywords. OpenAI’s docs describe them as numeric representations of concepts that make it easy to compare pieces of content by similarity.
If you’ve already met prompts in our earlier article as “specifications for what a model should do”, think of embeddings as the coordinates of your knowledge—how your data is laid out in space so the model can find and use it.
This guide walks through embeddings step by step, in plain language:
- What embeddings are (without math anxiety)
- What they unlock: search, clustering, dedupe, tagging, recommendations
- How “distance” and “similarity” actually work
- How to choose embedding models
- Why chunking text matters
- Where to store vectors and when you really need a “vector DB”
- How to keep your index fresh when data or models change
- How to evaluate retrieval (recall@k, MRR, nDCG)
- Cost and latency tricks (batching, caching)
- A quick “search your own notes” mini‑exercise
Along the way, we’ll connect this to how a platform like PractiqAI can teach these skills on real tasks, not just theory.
The short answer
An embedding is:
A fixed‑length vector (a list of numbers) that represents the meaning of some input (text, code, etc.), such that similar meanings live close together in that vector space.
You send some text to an embedding model, you get back a vector like:
[0.021, -0.004, 0.113, ..., 0.045]

The vector itself is not “interpretable” by humans, but it has one superpower: if two texts are semantically similar, their vectors will be numerically similar.
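Getting that vector is a single API call. Here’s a minimal sketch using the OpenAI Python SDK (it assumes OPENAI_API_KEY is set in your environment; other providers work much the same way):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="I love dogs",
)

vector = resp.data[0].embedding
print(len(vector))  # 1536 numbers for text-embedding-3-small
print(vector[:5])   # first few components of the embedding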
Once you have those vectors, you can:
- Find similar items (semantic search)
- Group related items (clustering, topic discovery)
- Identify near‑duplicates
- Recommend related content
- Build retrieval systems that feed the right context into an LLM
The rest of the article is basically: how to use that superpower without hurting yourself.
Definition: numbers that capture meaning
Let’s make this concrete.
Imagine a 3D diagram where each point is a sentence:
- “I love dogs”
- “Puppies are adorable”
- “I’m debugging a memory leak in Go”
The first two sentences will land near each other; the third one will be somewhere far away in that space. Real embedding spaces aren’t 3D; they often have hundreds or thousands of dimensions, but the geometric intuition still holds.
Modern embedding models (like OpenAI’s text-embedding-3-small and text-embedding-3-large) are neural networks fine‑tuned to map text into these high‑dimensional vectors so that semantic relationships become geometric relationships.
Key properties:
- Fixed size – every piece of text (short or long) becomes a vector of the same length (e.g. 1536 or 3072 numbers).
- Orderless in spirit – they care about overall meaning more than exact word order.
- Task‑agnostic – the same vector can be used for search, clustering, classification, and more.
OpenAI’s own tagline: embeddings are sequences of numbers that capture the concepts inside content like text or code, powering tasks like retrieval and clustering.
Handy mental model
An embedding is like a semantic fingerprint: you can’t read it, but you can match it.
What they enable (search, clustering, dedupe, tagging)
Once your data has embeddings, your app suddenly gains “semantic senses” instead of just keyword eyes. OpenAI’s docs describe classic use cases: semantic search, clustering, recommendations, anomaly detection, classification, and more.
Let’s walk through the big ones.
Semantic search
Traditional search looks for overlapping words. If your user searches for “how to reset password”, you’ll miss an article titled “Account recovery steps” unless you add lots of synonyms and rules.
With embeddings, you:
- Embed all your documents once.
- Embed the user’s query.
- Find the documents whose vectors are closest to the query vector.
Because the space encodes meaning, “reset password”, “account access issue”, and “login problem” should all end up near the same help docs—even with zero exact word overlap.
This is also the standard first step in many RAG systems: embed chunks of your knowledge base, then retrieve the top‑k most similar chunks to feed into the LLM as context.
Clustering & topic discovery
If you put all your embeddings into a clustering algorithm, you get groups like:
- “Bug reports about performance”
- “Billing questions”
- “Feature requests about dark mode”
OpenAI’s original embeddings launch post explicitly calls out topic modeling and clustering as common tasks.
This is great when you:
- Don’t know what topics exist ahead of time.
- Want to explore a big archive (tickets, notes, docs) and see what naturally forms.
Deduplication and near‑duplicate detection
Raw text match will only catch exact duplicates. Embeddings let you spot near‑duplicates:
- Same document saved twice with minor edits
- StackOverflow answers that are 80% identical
- Support tickets that are almost copies of each other
You embed each item and then, for every new item, check whether there’s an existing vector that’s very close. High similarity → likely duplicate. This is similar to what OpenAI docs suggest for anomaly detection or outlier identification: vectors that are “too close” or “too far” can be flagged.
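A rough sketch of that check (the 0.9 threshold is an arbitrary starting point you’d tune on your own data):

```python
import numpy as np

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_near_duplicates(new_vector, existing, threshold=0.9):
    """Return ids of existing items whose embeddings are suspiciously close."""
    return [
        item_id
        for item_id, vec in existing.items()  # existing: {item_id: embedding}
        if cosine(new_vector, vec) >= threshold
    ]
```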
Tagging and classification
You can also use embeddings to label items, even with few or no training examples:
- Embed your item.
- Embed each tag’s description or a few examples of each class.
- Assign the tag whose embedding is closest.
OpenAI highlights classification and clustering as canonical embedding tasks, especially when you don’t want to fine‑tune a full classifier.
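A minimal sketch of that idea, assuming you’ve already embedded each tag’s description into tag_vectors (a dict of tag name → embedding):

```python
import numpy as np

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_tag(item_vector, tag_vectors):
    """Pick the tag whose description embedding is closest to the item."""
    return max(tag_vectors, key=lambda tag: cosine(item_vector, tag_vectors[tag]))
```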
In PractiqAI, for instance, you could imagine a course where the task is: “Write a prompt that makes the model auto‑tag incoming support tickets into categories using only embeddings and similarity search.” The judge model would validate whether tickets end up in the right buckets—teaching you both prompting and vector‑thinking at once.
Distance metrics & similarity intuition
Okay, but how do we actually say “these two embeddings are close”?
Mathematically, you use a similarity or distance function:
- Similarity → bigger is more similar
- Distance → smaller is more similar
The three most common choices are:
- Cosine similarity – compares the angle between vectors
- Dot product – like cosine, but also sensitive to vector length
- Euclidean distance – the straight‑line distance between two points
For text embeddings, cosine similarity (or normalized dot product) is extremely common because you mostly care about direction (meaning), not magnitude. This is exactly the intuition behind many vector databases and retrieval systems.
If you want a formula, cosine similarity between vectors a and b is:
cosine_similarity(a, b) = (a · b) / (‖a‖ · ‖b‖)
Where:
- a · b is the dot product (multiply each pair of components and sum)
- ‖a‖ and ‖b‖ are vector lengths (square root of sum of squares)
You don’t have to implement this yourself—every vector DB and numeric library has it. What matters intuitively:
- Angle ~ 0° → vectors point the same way → high similarity
- Angle ~ 90° → unrelated
- Angle ~ 180° → opposite concepts (rare in practice)
For most practical apps, you’ll pick a metric that your vector database or library supports (usually cosine or dot product) and stick to it. The rest is picking good thresholds and k (how many neighbors you retrieve).
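If you want to see the angle intuition in numbers, here’s a tiny sketch with made-up 3D vectors (real embeddings have far more dimensions, but the arithmetic is identical):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

dogs    = np.array([0.9, 0.1, 0.0])  # hypothetical "I love dogs"
puppies = np.array([0.8, 0.3, 0.1])  # hypothetical "Puppies are adorable"
goleak  = np.array([0.0, 0.2, 0.9])  # hypothetical "debugging a memory leak in Go"

print(cosine(dogs, puppies))  # close to 1.0 -> similar meaning
print(cosine(dogs, goleak))   # much lower  -> unrelated
```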
Choosing embedding models
Not all embeddings are equal. Different models trade off quality, dimension size, cost, and latency.
OpenAI’s current recommended models are:
- text-embedding-3-small – cheaper, fast, strong general performance; default for many retrieval tasks.
- text-embedding-3-large – higher quality, more dimensions (up to 3072), better on multilingual and demanding tasks.
From OpenAI’s own benchmarks:
- text-embedding-3-small dramatically improves multilingual retrieval vs. older text-embedding-ada-002, while being 5x cheaper per token.
- text-embedding-3-large has the best retrieval quality and can even be “shortened” by specifying fewer dimensions without fully losing its semantic properties.
When choosing a model, think through:
- Use case
  - RAG search over docs: 3-small is usually plenty.
  - High‑stakes ranking / recommendations: 3-large is safer.
  - Multilingual: check multilingual benchmarks; OpenAI’s docs emphasize improved multilingual performance for the new models.
- Cost vs. quality
  Embedding cost scales with tokens × price per 1k tokens × how often you re‑embed. If you have millions of documents, even a slight price difference matters. For most people:
  - Use 3-small as a default.
  - Upgrade specific collections to 3-large where it measurably helps.
- Vector dimension
Higher dimension → more expressive space, but:
- Vectors take more storage.
- Vector DB queries are slightly slower.
- Some systems cap dimension length.
OpenAI lets you truncate embeddings by requesting fewer dimensions (e.g. 1024 instead of 3072) while keeping most of the performance—a trick they call out as “shortening embeddings” (there’s a short example of this right after this list).
- Latency
Embedding calls are usually much faster than big generative models, but if you’re embedding large batches or very long texts, latency can still matter. OpenAI’s API reference shows you can send arrays of inputs in one request, which is key for performance.
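To make the dimension point concrete: the OpenAI embeddings endpoint accepts a dimensions parameter on the 3-series models, so you can request shorter vectors directly. A sketch (check the current API reference for the exact details):

```python
from openai import OpenAI

client = OpenAI()

resp = client.embeddings.create(
    model="text-embedding-3-large",
    input="How do I reset my password?",
    dimensions=1024,  # ask for a shortened embedding instead of the full 3072
)

vector = resp.data[0].embedding
print(len(vector))  # 1024
```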
Chunking text for better recall
Here’s a gotcha: long documents don’t make good single embeddings.
An embedding is one vector that tries to summarize everything inside. If you embed an entire 100‑page PDF as one vector, the representation will be blurry. OpenAI’s embedding cookbook explicitly recommends chunking long texts into smaller pieces before embedding.
Instead, you:
- Split documents into chunks – usually 200–500 tokens each with small overlaps.
- Embed each chunk separately.
- At query time, embed the query and retrieve the most similar chunks.
- Feed those chunks (and only those) into the LLM.
Why chunking works:
- It keeps each vector focused on one local topic.
- Retrieval is more precise—you pull in the exact paragraph that answers the question.
- You can mix chunks from different docs in one answer.
A few practical tips:
- Chunk by semantic boundaries where possible (sections, paragraphs, headings), not strictly by character count.
- Use overlap (e.g. 50 tokens) between chunks so that important sentences near boundaries don’t get “cut out”.
- Respect the embedding model’s max input length (e.g. ~8k tokens for OpenAI’s newer models); longer strings will error or be truncated.
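Here’s a rough sketch of a word-based chunker with overlap (real pipelines usually count tokens with a tokenizer such as tiktoken and split on headings first, but the shape of the logic is the same):

```python
def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into overlapping word-based chunks."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks
```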
In a PractiqAI task, you might be challenged to prompt an AI to infer a good chunking strategy for a messy document and then test how that affects retrieval quality—a very real workflow for anyone building RAG systems.
Storing vectors (vector DB vs. “just files”)
Once you’re generating embeddings, you need somewhere to put them.
Broadly you have two options:
- DIY storage – plain files, relational DBs, key‑value stores
- Dedicated vector database / vector store
OpenAI docs on file search and vector stores describe a hybrid approach: you upload files, they’re embedded into a managed vector store, and you query that store via an API.
When “just files” (or a simple DB) is enough
If you have:
- A small dataset (say, thousands of items, not millions)
- A simple “top‑k similarity” need
- Tight constraints on infrastructure
You can:
- Store embeddings as arrays in a relational DB (Postgres, SQLite) or even JSON files.
- Load them into memory at startup.
- Use a library like NumPy or a simple FAISS index to query them.
For hobby projects, personal note search, or small internal tools, this is totally fine.
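For that DIY path, a small in-memory index is only a few lines. Here’s a sketch using FAISS (pip install faiss-cpu) with L2-normalized vectors, so inner product behaves like cosine similarity:

```python
import numpy as np
import faiss

def build_index(embeddings):
    """Exact inner-product index over L2-normalized vectors (== cosine)."""
    vectors = np.array(embeddings, dtype="float32")
    faiss.normalize_L2(vectors)           # in-place normalization
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index

def top_k(index, query_embedding, k=5):
    query = np.array([query_embedding], dtype="float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, k)  # ids index into your original list
    return list(zip(scores[0], ids[0]))
```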
When you want a vector DB
Dedicated vector databases (or hosted vector stores) add features that matter at scale:
- Approximate nearest neighbor (ANN) indexes (e.g. HNSW, IVF) so searches stay fast even with millions of vectors.
- Metadata filters (e.g. “only search docs where language='en' and customer_tier='enterprise'”).
- Hybrid search (combining keyword and vector scores).
- Replication, durability, monitoring, and scaling as your data grows.
OpenAI’s own vector store API in the Assistants stack is effectively a managed vector DB: you upload files, they’re embedded, sharded, and indexed; you get back a vector store ID you can search.
If you’re building a system that:
- Has lots of data (hundreds of thousands of chunks or more),
- Needs low latency for many users,
- Or needs rich filters and metadata,
then using a vector DB (hosted or self‑managed) is usually easier than reinventing that wheel.
Updating indexes & re‑embedding
Your data changes. Models improve. What happens to your vectors?
Incremental updates
For everyday operations:
- When you add content, you embed it and insert the new vector into your store.
- When you update content, you re‑embed that item and update its vector.
- When you delete content, you remove it and its embedding.
Most vector stores (including managed ones like OpenAI’s) support this sort of “upsert and delete” workflow.
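In a toy in-memory store, that lifecycle is just a dict keyed by document id; real vector stores expose the same upsert/delete operations through their APIs. Here embed_fn stands in for whatever embedding call you use:

```python
vector_store = {}  # doc_id -> {"text": ..., "embedding": ...}

def upsert(doc_id, text, embed_fn):
    # Add new content, or re-embed changed content under the same id.
    vector_store[doc_id] = {"text": text, "embedding": embed_fn(text)}

def delete(doc_id):
    # Remove the content and its embedding together.
    vector_store.pop(doc_id, None)
```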
When to re‑embed everything
Occasionally, you may want to re‑embed your entire corpus:
- You switch from an older model (e.g. text-embedding-ada-002) to text-embedding-3-small or 3-large to gain quality and/or cost improvements.
- Your domain shifts dramatically (e.g. you add a new language or a completely new product line).
- Your evaluation metrics show the current embeddings underperform on new types of queries.
Re‑embedding millions of documents can be expensive. You can reduce the pain by:
- Doing it in waves: re‑embed the most frequently accessed or mission‑critical collections first.
- Using dimension shortening (especially on 3-large) so the new vectors cost less to store and query.
- Keeping the old index around briefly, so you can A/B test before flipping fully.
Don’t obsess over perfection
Embeddings are robust. You don’t need to re‑embed daily. With OpenAI’s newer models, migrations are more about gaining quality and cost wins than fixing catastrophic breakage. The Embeddings FAQ explicitly recommends the 3 series for most new applications and suggests switching when it improves your use case.
Evaluation (recall@k, MRR, nDCG)
“How do I know if my retrieval actually works?”
This is where ranking metrics come in. You don’t have to be a data scientist to use them; you just need the intuition.
Most of these metrics assume:
- You have a set of queries.
- For each query, you know which documents are relevant (“ground truth” labels).
- Your system produces a ranked list of documents for each query.
Recall@k – “Did we bring back the right stuff at all?”
Recall@k measures what fraction of all relevant items we managed to retrieve in the top k results. It’s simple and very interpretable: if there are 10 relevant docs and your top‑5 includes 4 of them, recall@5 = 0.4.
Use recall@k when:
- You care that relevant documents are somewhere in the top k, not necessarily at the top.
- You’re verifying “can this system find the right documents at all?”
MRR – “How high is the first good answer?”
Mean Reciprocal Rank (MRR) cares only about the position of the first relevant item in your results. For each query, you take 1 / (rank of first relevant document), then average across queries.
- If the first relevant doc is at position 1 → contribution = 1.0
- At position 2 → 0.5
- At position 10 → 0.1
- If there’s no relevant doc → 0
High MRR means users are likely to see something useful immediately.
Use MRR when:
- There’s typically one main answer you care about per query (e.g. Q&A).
- You want to optimize “the first click is good”.
nDCG – “How good is the whole ranking?”
nDCG (Normalized Discounted Cumulative Gain) looks at the entire ranked list and how relevant each position is, discounting lower ranks and normalizing so the score is always between 0 and 1.
Rough intuition:
- You assign higher “gain” to more relevant documents.
- The earlier they appear, the more they contribute.
- You compare the actual DCG to the ideal DCG (perfectly sorted), getting a normalized score.
Use nDCG when:
- You care about several relevant results, not just one.
- You want a holistic measure of ranking quality.
In practice, teams often track a small bundle of metrics (e.g. recall@5 and nDCG@10) to avoid over‑optimizing just one aspect. Retrieval blogs and vector DB vendors like Weaviate and Pinecone share good practical guides to these metrics with concrete examples.
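None of these need a framework to compute. Here’s a sketch of all three; relevant_ids and relevance are the ground-truth labels you’d assemble for your own queries, and MRR is just reciprocal_rank averaged over all queries:

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    # What fraction of all relevant docs showed up in the top k?
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    # 1 / position of the first relevant doc, 0 if none appear at all.
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevance, k):
    # relevance: dict mapping doc_id -> graded relevance (0 = irrelevant).
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```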
Cost/latency tips (batching, caching)
Embeddings are relatively cheap and fast, but with large data they can still surprise you on the bill or slow your pipeline down. OpenAI’s FAQ and API reference recommend a few best practices.
1. Batch inputs
Most embedding APIs accept an array of inputs in a single request. Instead of sending 1,000 separate calls for 1,000 sentences, send, say, 50 calls with 20 sentences each.
Benefits:
- Less HTTP overhead.
- Better throughput.
- Often more stable latency.
OpenAI’s embeddings reference explicitly shows how to embed multiple inputs in one request with arrays of strings.
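A minimal batching helper might look like this (the batch size of 100 is an arbitrary starting point; tune it against your rate limits and payload sizes):

```python
from openai import OpenAI

client = OpenAI()

def embed_in_batches(texts, batch_size=100, model="text-embedding-3-small"):
    """Embed a large list of strings in fixed-size batches."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        resp = client.embeddings.create(model=model, input=batch)
        # Results come back in the same order as the inputs.
        vectors.extend(d.embedding for d in resp.data)
    return vectors
```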
2. Cache aggressively
Embeddings are pure functions: same text + same model → same vector.
Therefore:
- Cache embeddings for all static content (docs, articles, product descriptions).
- Cache common queries if you do query embeddings (e.g. autocomplete, FAQ search).
- Store vectors next to raw text and metadata; your system should rarely re‑embed the same thing twice.
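A sketch of the simplest possible cache, keyed by model plus a hash of the text (swap the dict for Redis, SQLite, or a column in your database in real systems):

```python
import hashlib
from openai import OpenAI

client = OpenAI()
_embedding_cache = {}  # key -> embedding; use a persistent store in production

def embed_cached(text, model="text-embedding-3-small"):
    key = model + ":" + hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        resp = client.embeddings.create(model=model, input=[text])
        _embedding_cache[key] = resp.data[0].embedding
    return _embedding_cache[key]
```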
3. Pre‑compute vs. on‑the‑fly
For anything that doesn’t change often:
- Pre‑compute embeddings offline (e.g. nightly job).
- Store them in your vector store or DB.
Reserve real‑time embedding (inside user‑facing requests) for queries and genuinely dynamic content.
4. Trim and chunk wisely
Remember that cost is per token, not per string. OpenAI’s docs recommend managing input length carefully, and their cookbook details chunking strategies for long texts.
- Drop boilerplate text that doesn’t add meaning.
- Keep chunks compact but meaningful.
- Avoid embedding entire books when you just need the relevant chapters.
5. Use the right model
Finally, don’t automatically reach for the largest model. For many retrieval tasks, text-embedding-3-small gives great quality at a fraction of the cost of larger models. That’s exactly what OpenAI’s benchmarks show: better performance than previous generations at much lower prices.
Quick hands‑on: search your own notes
Let’s stitch this together with a tiny “thought experiment” you can turn into real code later.
You have:
- A directory of markdown notes.
- An OpenAI API key.
- Any scripting language you like (Python, Node, etc.).
Step 1: Collect and chunk your notes
Write a small script that:
- Reads all .md files.
- Splits them into chunks of ~200–300 tokens (or paragraphs).
- For each chunk, stores:
  - id
  - note_path
  - chunk_text
  - maybe tags, created_at, etc.
The OpenAI cookbook example on embedding long inputs gives practical chunking recipes you can adapt.
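A sketch of that collection step, splitting on blank lines as a stand-in for a proper token-based splitter:

```python
from pathlib import Path

def load_note_chunks(notes_dir):
    """Read every .md file and split it into paragraph-level chunks."""
    chunks = []
    for path in sorted(Path(notes_dir).rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        for i, para in enumerate(paragraphs):
            chunks.append({
                "id": f"{path.stem}-{i}",
                "note_path": str(path),
                "chunk_text": para,
            })
    return chunks
```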
Step 2: Embed and store
Call the embeddings API with batches of chunk_text, using text-embedding-3-small.
In pseudocode:
```python
from openai import OpenAI
import numpy as np  # used later for similarity search

client = OpenAI()

def embed_chunks(chunks):
    # Embed every chunk_text in a single batched request.
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=[c["chunk_text"] for c in chunks],
    )
    # Results come back in the same order as the inputs.
    for c, d in zip(chunks, resp.data):
        c["embedding"] = d.embedding
    return chunks
```

Store these in a simple structure:
- For a tiny project: a JSON file or SQLite table.
- For more data: a vector DB or OpenAI vector store.
Step 3: Query your knowledge
When you type a question like:
“Where did I write about salary negotiation tactics?”
Your script:
- Embeds the query.
- Computes cosine similarity between the query embedding and all chunk embeddings.
- Returns the chunks with highest similarity.
In Python‑ish pseudocode:
```python
def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    a = np.array(a)
    b = np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query, chunks, top_k=5):
    # Embed the query with the same model used for the chunks.
    q_emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=[query],
    ).data[0].embedding
    # Score every chunk against the query and keep the best matches.
    scored = [
        (cosine_similarity(q_emb, c["embedding"]), c)
        for c in chunks
    ]
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:top_k]
```

Now you have personal semantic search. If you want to go one step further, you can:
- Feed those top chunks into an LLM and ask it to synthesize an answer.
- Highlight where in your notes the answer came from (for transparency).
- Track recall@k over a small labeled set of queries to know if changes improve or harm quality.
This is essentially what production retrieval systems do, just with more data and more guardrails. OpenAI’s retrieval and file‑search guides show how to do this with managed vector stores instead of rolling your own infrastructure.
Where to go next
If you want to deepen this beyond one article, start here:
- OpenAI Embeddings Guide – core concepts, model list, and examples for tasks like search and clustering.
- Embeddings FAQ – practical questions on pricing, token limits, model recommendations, and migration guidance.
- OpenAI File Search & Vector Stores – how to create vector stores, upload files, and use them for retrieval in the Assistants API and ChatGPT.
- Embedding long inputs cookbook – hands‑on notebook showing truncation and chunking strategies for texts longer than the model’s context.
- IR metric primers (recall@k, MRR, nDCG) – guides from vector DB vendors and ranking‑metric blogs that give concrete examples you can implement by hand.
In PractiqAI, embeddings show up naturally once tasks move from “toy prompts” to real workflows: building your own knowledge search, powering RAG over technical docs, triaging support tickets, or auto‑tagging massive content libraries. The nice part is that, as with prompts, you learn them by doing—writing prompts, inspecting retrieved chunks, tweaking models and chunk sizes, measuring recall, and iterating until things feel right and the metrics agree.
Once you’re comfortable thinking of text as points in space, a whole new class of “AI features” suddenly feels straightforward instead of magical.

PractiqAI Team
PractiqAI designs guided drills and feedback loops that make learning with AI feel like muscle memory training. Follow along for product notes and workflow ideas from the team.