# Embeddings
Embeddings are dense vector representations of text. They capture semantic meaning — similar texts have similar vectors. Embeddings are the foundation for semantic search, clustering, and retrieval-augmented generation (covered in Week 4).
## What Are Embeddings?
An embedding model converts text into a fixed-length array of floating-point numbers (a vector). For example, a 1536-dimensional embedding looks like:
```python
[0.0023, -0.0147, 0.0381, ..., -0.0072]  # 1536 floats
```
The key insight: semantically similar texts produce vectors that are close together in the vector space.
```python
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Embed a single text and return its vector."""
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

# Similar meanings → similar vectors
vec1 = get_embedding("The cat sat on the mat")
vec2 = get_embedding("A feline rested on a rug")
vec3 = get_embedding("Stock markets crashed today")
```
## Cosine Similarity
Cosine similarity measures how close two vectors are, regardless of their magnitude:
```python
import numpy as np

def cosine_similarity(a: list, b: list) -> float:
    """Calculate cosine similarity between two vectors."""
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare embeddings
sim_1_2 = cosine_similarity(vec1, vec2)  # ~0.85 (similar meaning)
sim_1_3 = cosine_similarity(vec1, vec3)  # ~0.15 (different meaning)

print(f"Cat/Rug similarity: {sim_1_2:.3f}")
print(f"Cat/Stocks similarity: {sim_1_3:.3f}")
```
As a rough guide (exact thresholds vary by model and domain):

| Score Range | Interpretation |
|---|---|
| 0.9 - 1.0 | Near identical meaning |
| 0.7 - 0.9 | Similar meaning |
| 0.4 - 0.7 | Somewhat related |
| 0.0 - 0.4 | Unrelated |
| < 0.0 | Opposing meaning (rare) |
## Embeddings vs. Keyword Search

Neither approach dominates; choose based on the query type:

- Keywords (BM25, grep): Exact matches, part numbers, names, code identifiers
- Embeddings: Semantic search, finding "similar" content, recommendations, deduplication
- Best of both: Combine them (covered in Week 4 — Hybrid Search); see the sketch below
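To make the trade-off concrete, here is a minimal sketch reusing the `get_embedding` and `cosine_similarity` helpers from above. The corpus and part number are made up for illustration:

```python
docs = [
    "Order replacement part #X9-4417 for the coolant pump",
    "Spare impeller for the coolant pump",
]

# Keyword search: an exact substring match finds the part number reliably
keyword_hits = [d for d in docs if "X9-4417" in d]

# Embedding search: ranks by meaning, even when the query shares no words
query_vec = get_embedding("pump spare parts")
ranked = sorted(
    docs,
    key=lambda d: cosine_similarity(query_vec, get_embedding(d)),
    reverse=True,
)
```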
## Embedding Model Comparison
| Model | Dimensions | Max Input | Cost (per 1M tokens) |
|---|---|---|---|
| text-embedding-3-small | 1536 | 8,191 tokens | $0.02 |
| text-embedding-3-large | 3072 | 8,191 tokens | $0.13 |
| text-embedding-ada-002 (legacy) | 1536 | 8,191 tokens | $0.10 |
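The price differences matter at corpus scale. A back-of-the-envelope estimate using the cl100k_base tokenizer (used by the text-embedding-3 models); the corpus here is a placeholder:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

docs = ["Document content ..."] * 10_000  # placeholder corpus
total_tokens = sum(len(enc.encode(d)) for d in docs)

# $0.02 per 1M tokens for text-embedding-3-small (see table above)
cost = total_tokens / 1_000_000 * 0.02
print(f"{total_tokens:,} tokens ≈ ${cost:.4f}")
```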
## Dimension Reduction
You can reduce embedding dimensions for faster search with minimal quality loss:
```python
# Get a smaller embedding (lower dimensions = faster search, slightly less accurate)
response = client.embeddings.create(
    input="Hello world",
    model="text-embedding-3-small",
    dimensions=256,  # reduce from the default 1536
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")  # 256
## Building a Simple Search Engine
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

class SimpleSearchEngine:
    """In-memory semantic search over a small document collection."""

    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model
        self.documents = []
        self.embeddings = []

    def add_documents(self, docs: list[str]):
        """Add documents to the search index."""
        self.documents.extend(docs)
        # The API accepts a list of inputs, so one call embeds the whole batch
        response = client.embeddings.create(input=docs, model=self.model)
        new_embeddings = [item.embedding for item in response.data]
        self.embeddings.extend(new_embeddings)

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Return the top_k documents most similar to the query."""
        # Uses get_embedding() and cosine_similarity() defined earlier
        query_embedding = get_embedding(query, self.model)
        scores = []
        for i, doc_embedding in enumerate(self.embeddings):
            score = cosine_similarity(query_embedding, doc_embedding)
            scores.append((i, score))
        scores.sort(key=lambda x: x[1], reverse=True)
        return [
            {"document": self.documents[i], "score": round(score, 4)}
            for i, score in scores[:top_k]
        ]

# Usage
engine = SimpleSearchEngine()
engine.add_documents([
    "Python is a versatile programming language popular in data science.",
    "Docker containers package applications with their dependencies.",
    "Machine learning models learn patterns from training data.",
    "SQL databases store structured data in tables with relationships.",
    "Neural networks are inspired by the human brain's architecture.",
    "Git version control tracks changes in source code over time.",
])

results = engine.search("How do AI models learn?")
for r in results:
    print(f"[{r['score']:.3f}] {r['document']}")
```
This simple engine stores all embeddings in memory and computes similarity against every document. For production use with millions of documents, you need a vector database (covered in Week 4). Vector databases use approximate nearest neighbor (ANN) algorithms for fast search.
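Even before reaching for a vector database, the in-memory version can be sped up considerably: stack the embeddings into one NumPy matrix and score every document with a single matrix-vector product (still exact search, not ANN). A sketch; since OpenAI embeddings come back normalized to unit length, the dot product already equals cosine similarity:

```python
import numpy as np

def search_vectorized(query_vec: list[float], doc_matrix: np.ndarray, top_k: int = 5):
    """Score all documents in one matrix-vector product.

    doc_matrix has one unit-length embedding per row,
    e.g. np.array(engine.embeddings).
    """
    scores = doc_matrix @ np.array(query_vec)  # cosine similarity per row
    top = np.argsort(scores)[::-1][:top_k]     # indices of the best matches
    return [(int(i), float(scores[i])) for i in top]
```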
## Batch Embedding
For large datasets, use batch embedding to reduce API calls:
```python
def batch_embed(texts: list[str], batch_size: int = 100) -> list[list]:
    """Embed texts in batches to avoid rate limits."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(input=batch, model="text-embedding-3-small")
        all_embeddings.extend([item.embedding for item in response.data])
    return all_embeddings

# Embed 1000 documents efficiently
documents = [f"Document {i} content..." for i in range(1000)]
embeddings = batch_embed(documents)
print(f"Generated {len(embeddings)} embeddings")
```