Embeddings

Embeddings are dense vector representations of text. They capture semantic meaning — similar texts have similar vectors. Embeddings are the foundation for semantic search, clustering, and retrieval-augmented generation (covered in Week 4).

What Are Embeddings?

An embedding model converts text into a fixed-length array of floating-point numbers (a vector). For example, a 1536-dimensional embedding looks like:

```python
[0.0023, -0.0147, 0.0381, ..., -0.0072]  # 1536 floats
```

The key insight: semantically similar texts produce vectors that are close together in the vector space.

```python
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list:
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

# Similar meanings → similar vectors
vec1 = get_embedding("The cat sat on the mat")
vec2 = get_embedding("A feline rested on a rug")
vec3 = get_embedding("Stock markets crashed today")
```

Cosine Similarity

Cosine similarity measures how close two vectors are, regardless of their magnitude:

```python
import numpy as np

def cosine_similarity(a: list, b: list) -> float:
    """Calculate cosine similarity between two vectors."""
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare embeddings
sim_1_2 = cosine_similarity(vec1, vec2)  # ~0.85 (similar meaning)
sim_1_3 = cosine_similarity(vec1, vec3)  # ~0.15 (different meaning)

print(f"Cat/Rug similarity: {sim_1_2:.3f}")
print(f"Cat/Stocks similarity: {sim_1_3:.3f}")
```
| Score Range | Interpretation |
|-------------|----------------|
| 0.9 – 1.0 | Near-identical meaning |
| 0.7 – 0.9 | Similar meaning |
| 0.4 – 0.7 | Somewhat related |
| 0.0 – 0.4 | Unrelated |
| < 0.0 | Opposing meaning (rare) |
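As a sanity check, the formula can be exercised on small hand-made vectors without an API key. The three-dimensional vectors below are illustrative stand-ins, not real embeddings, and this version of `cosine_similarity` uses only the standard library so it runs anywhere:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity without NumPy: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (illustrative only)
cat = [0.9, 0.1, 0.0]
feline = [0.8, 0.2, 0.1]
stocks = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, feline))  # high: vectors point the same way
print(cosine_similarity(cat, stocks))  # low: vectors are nearly orthogonal
```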
When to use embeddings vs. keywords
  • Keywords (BM25, grep): Exact matches, part numbers, names, code identifiers
  • Embeddings: Semantic search, finding "similar" content, recommendations, deduplication
  • Best of both: Combine them (covered in Week 4 — Hybrid Search)
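To make the contrast concrete, here is a deliberately naive keyword scorer (just word overlap, nothing like real BM25). It nails the exact-identifier case and scores zero on a paraphrase that an embedding would catch; the sample document and queries are made up for illustration:

```python
def keyword_score(query: str, doc: str) -> float:
    """Fraction of query words that appear verbatim in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

doc = "Order part number X-4512 shipped via FedEx"
print(keyword_score("part number X-4512", doc))             # 1.0 — exact terms match
print(keyword_score("shipping status of my component", doc))  # 0.0 — needs semantics
```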

Embedding Models Comparison

| Model | Dimensions | Max Input | Cost (per 1M tokens) |
|-------|------------|-----------|----------------------|
| text-embedding-3-small | 1536 | 8,191 tokens | $0.02 |
| text-embedding-3-large | 3072 | 8,191 tokens | $0.13 |
| text-embedding-ada-002 | 1536 | 8,191 tokens | $0.10 |
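The prices in the table make back-of-the-envelope budgeting easy. The corpus size and average token count below are assumptions picked for illustration:

```python
# Prices from the table above, per 1M tokens
PRICE_PER_1M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def embedding_cost(num_docs: int, avg_tokens: int, model: str) -> float:
    """Estimated cost of embedding a corpus once."""
    total_tokens = num_docs * avg_tokens
    return total_tokens / 1_000_000 * PRICE_PER_1M[model]

# Assumed corpus: 100k documents at ~250 tokens each = 25M tokens
print(f"${embedding_cost(100_000, 250, 'text-embedding-3-small'):.2f}")  # $0.50
print(f"${embedding_cost(100_000, 250, 'text-embedding-3-large'):.2f}")  # $3.25
```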

Dimension Reduction

The text-embedding-3 models accept a `dimensions` parameter that returns a shorter vector, trading a small amount of accuracy for faster search and less storage:

```python
# Get a smaller embedding (lower dimensions = faster search, slightly less accurate)
response = client.embeddings.create(
    input="Hello world",
    model="text-embedding-3-small",
    dimensions=256,  # Reduce from the default 1536
)
embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")  # 256
```
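If you already have full-length vectors stored and want to shrink them yourself rather than re-calling the API, truncating alone is not enough: the shortened vector should be rescaled to unit length so cosine comparisons stay meaningful. A minimal sketch (the 4-dimensional input is a stand-in for a real embedding):

```python
import math

def truncate_and_normalize(embedding: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components, then rescale to unit length."""
    cut = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in cut))
    return [x / norm for x in cut]

vec = [0.6, 0.8, 0.3, -0.1]      # stand-in for a full embedding
small = truncate_and_normalize(vec, 2)
print(sum(x * x for x in small))  # ~1.0: unit length restored
```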

Building a Simple Search Engine

```python
from openai import OpenAI
import numpy as np

client = OpenAI()

# Reuses get_embedding() and cosine_similarity() defined above

class SimpleSearchEngine:
    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model
        self.documents = []
        self.embeddings = []

    def add_documents(self, docs: list[str]):
        """Add documents to the search index."""
        self.documents.extend(docs)
        response = client.embeddings.create(input=docs, model=self.model)
        new_embeddings = [item.embedding for item in response.data]
        self.embeddings.extend(new_embeddings)

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Search for the most similar documents."""
        query_embedding = get_embedding(query, self.model)

        scores = []
        for i, doc_embedding in enumerate(self.embeddings):
            score = cosine_similarity(query_embedding, doc_embedding)
            scores.append((i, score))

        scores.sort(key=lambda x: x[1], reverse=True)
        return [
            {"document": self.documents[i], "score": round(score, 4)}
            for i, score in scores[:top_k]
        ]

# Usage
engine = SimpleSearchEngine()
engine.add_documents([
    "Python is a versatile programming language popular in data science.",
    "Docker containers package applications with their dependencies.",
    "Machine learning models learn patterns from training data.",
    "SQL databases store structured data in tables with relationships.",
    "Neural networks are inspired by the human brain's architecture.",
    "Git version control tracks changes in source code over time.",
])

results = engine.search("How do AI models learn?")
for r in results:
    print(f"[{r['score']:.3f}] {r['document']}")
```
Embeddings are not search

This simple engine stores all embeddings in memory and computes similarity against every document. For production use with millions of documents, you need a vector database (covered in Week 4). Vector databases use approximate nearest neighbor (ANN) algorithms for fast search.
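Even before reaching for a vector database, the Python loop above can be replaced by a single matrix-vector product, which NumPy computes far faster. A sketch, using toy unit vectors in place of real embeddings (real embeddings from the API are already unit-normalized, so the dot product equals cosine similarity):

```python
import numpy as np

def brute_force_search(query_vec, doc_matrix, top_k=5):
    """One matrix-vector product instead of a per-document loop.

    doc_matrix: (n_docs, dim) array of unit-length embeddings.
    query_vec:  (dim,) unit-length query embedding.
    """
    scores = doc_matrix @ query_vec           # cosine similarity, all docs at once
    top = np.argsort(scores)[::-1][:top_k]    # indices of highest scores first
    return [(int(i), float(scores[i])) for i in top]

# Toy 2-dimensional unit vectors standing in for real embeddings
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
print(brute_force_search(np.array([1.0, 0.0]), docs, top_k=2))  # [(0, 1.0), (2, 0.6)]
```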

Batch Embedding

For large datasets, use batch embedding to reduce API calls:

```python
def batch_embed(texts: list[str], batch_size: int = 100) -> list[list]:
    """Embed texts in batches to avoid rate limits."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(input=batch, model="text-embedding-3-small")
        all_embeddings.extend([item.embedding for item in response.data])
    return all_embeddings

# Embed 1000 documents efficiently
documents = [f"Document {i} content..." for i in range(1000)]
embeddings = batch_embed(documents)
print(f"Generated {len(embeddings)} embeddings")
```
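Even with batching, individual requests can still hit transient rate limits or network errors. A generic retry helper with exponential backoff and jitter is a common companion to `batch_embed`; this is a sketch, and the `with_retries` name and parameters are our own, not part of the OpenAI SDK:

```python
import random
import time

def with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry a flaky zero-argument callable with exponential backoff.

    The delay doubles each attempt, plus proportional random jitter so
    parallel workers don't all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)

# Usage: wrap the API call in a lambda so it can be retried
# embeddings = with_retries(lambda: client.embeddings.create(input=batch, model="text-embedding-3-small"))
```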