LLM Extraction
One of the most valuable applications of LLMs is extracting structured data from unstructured text. Whether the target is named entities in news articles, financial figures in reports, or tables buried in PDFs, LLMs excel at this task when paired with explicit schemas and validation.
Named Entity Extraction
Extract structured entities from unstructured text:
```python
import instructor
from pydantic import BaseModel, Field
from typing import List, Optional
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class Entity(BaseModel):
    name: str = Field(description="The exact name as it appears in the text")
    type: str = Field(description="Category: PERSON, ORG, LOCATION, DATE, PRODUCT, EVENT")
    description: str = Field(description="Brief description of the entity in context")

class ExtractionResult(BaseModel):
    entities: List[Entity]
    summary: str = Field(description="One-sentence summary of the text")
    language: str = Field(description="Detected language of the text")

text = """Apple CEO Tim Cook announced the new Vision Pro headset at the
WWDC 2024 event in Cupertino, California. The device, priced at $3,499,
will be available in the United States starting February 2, 2024.
Samsung also announced competing products at their Seoul headquarters."""

result = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=ExtractionResult,
    messages=[{"role": "user", "content": f"Extract entities from:\n\n{text}"}],
)

for entity in result.entities:
    print(f"[{entity.type}] {entity.name}: {entity.description}")
```
LLMs can disambiguate entities that traditional NER tools struggle with. "Apple" in the text above is an organization, not a fruit. LLMs use context to make this distinction naturally.
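Rather than describing the allowed categories in a `Field` description, the `type` field can be declared as a `Literal` so out-of-set labels fail validation outright (a sketch; it redeclares the `Entity` model from above with a stricter type):

```python
from pydantic import BaseModel, Field, ValidationError
from typing import Literal

class Entity(BaseModel):
    name: str = Field(description="The exact name as it appears in the text")
    type: Literal["PERSON", "ORG", "LOCATION", "DATE", "PRODUCT", "EVENT"]
    description: str = Field(description="Brief description of the entity in context")

# A valid category passes validation
apple = Entity(name="Apple", type="ORG", description="Technology company")

# An out-of-set category is rejected; with instructor, a validation
# failure like this triggers the retry loop so the model can self-correct
try:
    Entity(name="Apple", type="FRUIT", description="...")
except ValidationError as e:
    print("rejected:", e.errors()[0]["type"])
```

Because instructor feeds validation errors back to the model on retry, tightening the schema this way tends to improve label consistency with no prompt changes.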
Extracting Tables from Text
Tables embedded in text are notoriously hard to parse with regex. LLMs handle them elegantly:
```python
class TableRow(BaseModel):
    model: str
    accuracy: float = Field(description="Accuracy as a decimal, e.g. 0.95")
    f1_score: float = Field(description="F1 score as a decimal")
    training_time_minutes: Optional[int] = None

class TableExtraction(BaseModel):
    table_title: str
    rows: List[TableRow]
    notes: Optional[str] = None

text = """
Model Performance Comparison (Test Set)

| Model | Accuracy | F1 Score | Training Time |
|-------|----------|----------|---------------|
| BERT-base | 92.3% | 0.918 | 45 min |
| RoBERTa | 94.1% | 0.936 | 62 min |
| DistilBERT | 89.7% | 0.889 | 18 min |

Note: All models trained on 50k samples.
"""

result = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=TableExtraction,
    messages=[{"role": "user", "content": f"Extract the table:\n{text}"}],
)

for row in result.rows:
    print(f"{row.model}: accuracy={row.accuracy:.3f}, f1={row.f1_score:.3f}")
```
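Models sometimes echo the source formatting ("92.3%") instead of the decimal the schema asks for. A `field_validator` can accept either form and normalize before type coercion (a sketch; the validator name is illustrative):

```python
from pydantic import BaseModel, Field, field_validator
from typing import Optional

class TableRow(BaseModel):
    model: str
    accuracy: float = Field(description="Accuracy as a decimal, e.g. 0.95")
    f1_score: float = Field(description="F1 score as a decimal")
    training_time_minutes: Optional[int] = None

    @field_validator("accuracy", "f1_score", mode="before")
    @classmethod
    def percent_to_decimal(cls, v):
        # Accept "92.3%" (percent string) as well as a plain decimal
        if isinstance(v, str) and v.strip().endswith("%"):
            return float(v.strip().rstrip("%")) / 100
        return v

row = TableRow(model="BERT-base", accuracy="92.3%", f1_score=0.918)
print(row.accuracy)
```

This keeps the schema forgiving of formatting drift without loosening the field types themselves.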
Extracting from PDFs
PDFs require an extra step — converting them to text before LLM extraction:
```shell
uv add pymupdf
```
```python
import fitz  # PyMuPDF
import instructor
from pydantic import BaseModel
from typing import List
from openai import OpenAI

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract all text from a PDF file."""
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

class InvoiceItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    date: str
    vendor_name: str
    customer_name: str
    items: List[InvoiceItem]
    subtotal: float
    tax: float
    total: float

client = instructor.from_openai(OpenAI())
pdf_text = extract_text_from_pdf("invoice_2024_03.pdf")

invoice = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Invoice,
    max_retries=3,
    messages=[
        {
            "role": "system",
            "content": "Extract all invoice data accurately. Use null for any missing fields.",
        },
        {
            "role": "user",
            "content": f"Extract from this invoice:\n\n{pdf_text[:4000]}",
        },
    ],
)

print(f"Invoice #{invoice.invoice_number}")
print(f"Total: ${invoice.total:.2f}")
```
PyMuPDF extracts text but loses formatting. Tables may become jumbled text. For complex PDFs with tables or forms, consider:
- pdfplumber — Better table extraction
- marker — PDF-to-markdown conversion that preserves structure
- Vision LLMs — Send PDF pages as images to GPT-4o for visual understanding
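For the vision route, PyMuPDF can render a page to PNG bytes, which are then sent as a base64 data URL in the image message format of the OpenAI chat API (a sketch: `to_data_url`, `render_pdf_page`, and `extract_invoice_from_image` are illustrative helpers, and the last one reuses the `client` and `Invoice` defined above):

```python
import base64

def to_data_url(png_bytes: bytes) -> str:
    """Wrap raw PNG bytes in a base64 data URL for the chat API."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"

def render_pdf_page(pdf_path: str, page_number: int = 0) -> bytes:
    """Render one PDF page to PNG bytes at 2x resolution for legibility."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    pix = doc[page_number].get_pixmap(matrix=fitz.Matrix(2, 2))
    doc.close()
    return pix.tobytes("png")

def extract_invoice_from_image(pdf_path: str):
    """Send a rendered page to a vision-capable model for structured extraction.

    Assumes `client` and `Invoice` from the section above.
    """
    url = to_data_url(render_pdf_page(pdf_path))
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=Invoice,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the invoice data from this image."},
                {"type": "image_url", "image_url": {"url": url}},
            ],
        }],
    )
```

This costs more per page than text extraction but sidesteps the jumbled-table problem entirely, since the model sees the original layout.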
Batch Extraction Pattern
For processing many documents, use a batch pattern:
```python
import json
from pathlib import Path

def process_documents(directory: str) -> List[dict]:
    results = []
    for pdf_path in Path(directory).glob("*.pdf"):
        text = extract_text_from_pdf(str(pdf_path))
        try:
            invoice = client.chat.completions.create(
                model="gpt-4o-mini",
                response_model=Invoice,
                max_retries=2,
                messages=[{"role": "user", "content": f"Extract:\n\n{text[:4000]}"}],
            )
            results.append(invoice.model_dump())
        except Exception as e:
            print(f"Failed to process {pdf_path.name}: {e}")
            results.append({"file": pdf_path.name, "error": str(e)})
    return results

# Save results
results = process_documents("./invoices/")
with open("extracted_invoices.json", "w") as f:
    json.dump(results, f, indent=2)
```
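The loop above is API-latency bound, so large batches benefit from running extractions in a thread pool. A generic helper can keep the same record-failures-in-place behavior as `process_documents` (a sketch; `map_concurrently` is an illustrative name, and the default worker count is an assumption chosen to stay under typical rate limits):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List

def map_concurrently(fn: Callable, items: List, max_workers: int = 4) -> List:
    """Apply fn to every item in parallel, preserving input order.

    Failures are recorded in place rather than aborting the batch.
    """
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn, item): i for i, item in enumerate(items)}
        for future in as_completed(futures):
            i = futures[future]
            try:
                results[i] = future.result()
            except Exception as e:
                results[i] = {"item": str(items[i]), "error": str(e)}
    return results

# Usage with the invoice pipeline might look like:
# pdfs = list(Path("./invoices/").glob("*.pdf"))
# results = map_concurrently(lambda p: process_documents_one(p), pdfs)
```

Threads (not processes) are the right fit here because the work is network-bound; keep `max_workers` modest to avoid provider rate limits.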