LLM Extraction
One of the most valuable applications of LLMs is extracting structured data from unstructured text. Whether the target is named entities in news articles, financial figures in reports, or tables buried in PDFs, LLMs excel at this task when paired with explicit schemas and validation.
Named Entity Extraction
Extract structured entities from unstructured text:
```python
import instructor
from pydantic import BaseModel, Field
from typing import List, Optional
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class Entity(BaseModel):
    name: str = Field(description="The exact name as it appears in the text")
    type: str = Field(description="Category: PERSON, ORG, LOCATION, DATE, PRODUCT, EVENT")
    description: str = Field(description="Brief description of the entity in context")

class ExtractionResult(BaseModel):
    entities: List[Entity]
    summary: str = Field(description="One-sentence summary of the text")
    language: str = Field(description="Detected language of the text")

text = """Apple CEO Tim Cook announced the new Vision Pro headset at the
WWDC 2024 event in Cupertino, California. The device, priced at $3,499,
will be available in the United States starting February 2, 2024.
Samsung also announced competing products at their Seoul headquarters."""

result = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=ExtractionResult,
    messages=[{"role": "user", "content": f"Extract entities from:\n\n{text}"}],
)

for entity in result.entities:
    print(f"[{entity.type}] {entity.name}: {entity.description}")
```
LLMs can disambiguate entities that traditional NER tools struggle with. "Apple" in the text above is an organization, not a fruit. LLMs use context to make this distinction naturally.
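Rather than describing the allowed categories in a `Field` description, the `type` field can be declared as a `Literal` so out-of-set labels fail validation outright (a sketch; it redeclares the `Entity` model from above with a stricter type):

```python
from pydantic import BaseModel, Field, ValidationError
from typing import Literal

class Entity(BaseModel):
    name: str = Field(description="The exact name as it appears in the text")
    type: Literal["PERSON", "ORG", "LOCATION", "DATE", "PRODUCT", "EVENT"]
    description: str = Field(description="Brief description of the entity in context")

# A valid category passes validation
apple = Entity(name="Apple", type="ORG", description="Technology company")

# An out-of-set category is rejected; with instructor, a validation
# failure like this triggers the retry loop so the model can self-correct
try:
    Entity(name="Apple", type="FRUIT", description="...")
except ValidationError as e:
    print("rejected:", e.errors()[0]["type"])
```

Because instructor feeds validation errors back to the model on retry, tightening the schema this way tends to improve label consistency with no prompt changes.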
Extracting Tables from Text
Tables embedded in text are notoriously hard to parse with regex. LLMs handle them elegantly:
```python
class TableRow(BaseModel):
    model: str
    accuracy: float = Field(description="Accuracy as a decimal, e.g. 0.95")
    f1_score: float = Field(description="F1 score as a decimal")
    training_time_minutes: Optional[int] = None

class TableExtraction(BaseModel):
    table_title: str
    rows: List[TableRow]
    notes: Optional[str] = None

text = """
Model Performance Comparison (Test Set)

| Model | Accuracy | F1 Score | Training Time |
|-------|----------|----------|---------------|
| BERT-base | 92.3% | 0.918 | 45 min |
| RoBERTa | 94.1% | 0.936 | 62 min |
| DistilBERT | 89.7% | 0.889 | 18 min |

Note: All models trained on 50k samples.
"""

result = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=TableExtraction,
    messages=[{"role": "user", "content": f"Extract the table:\n{text}"}],
)

for row in result.rows:
    print(f"{row.model}: accuracy={row.accuracy:.3f}, f1={row.f1_score:.3f}")
```
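Models sometimes echo the source formatting ("92.3%") instead of the decimal the schema asks for. A `field_validator` can accept either form and normalize before type coercion (a sketch; the validator name is illustrative):

```python
from pydantic import BaseModel, Field, field_validator
from typing import Optional

class TableRow(BaseModel):
    model: str
    accuracy: float = Field(description="Accuracy as a decimal, e.g. 0.95")
    f1_score: float = Field(description="F1 score as a decimal")
    training_time_minutes: Optional[int] = None

    @field_validator("accuracy", "f1_score", mode="before")
    @classmethod
    def percent_to_decimal(cls, v):
        # Accept "92.3%" (percent string) as well as a plain decimal
        if isinstance(v, str) and v.strip().endswith("%"):
            return float(v.strip().rstrip("%")) / 100
        return v

row = TableRow(model="BERT-base", accuracy="92.3%", f1_score=0.918)
print(row.accuracy)
```

This keeps the schema forgiving of formatting drift without loosening the field types themselves.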
Extracting from PDFs
PDFs require an extra step — converting them to text before LLM extraction:
```shell
uv add pymupdf
```
```python
import fitz  # PyMuPDF
import instructor
from pydantic import BaseModel
from typing import List
from openai import OpenAI

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract all text from a PDF file."""
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

class InvoiceItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    date: str
    vendor_name: str
    customer_name: str
    items: List[InvoiceItem]
    subtotal: float
    tax: float
    total: float

client = instructor.from_openai(OpenAI())
pdf_text = extract_text_from_pdf("invoice_2024_03.pdf")

invoice = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Invoice,
    max_retries=3,
    messages=[
        {
            "role": "system",
            "content": "Extract all invoice data accurately. Use null for any missing fields.",
        },
        {
            "role": "user",
            "content": f"Extract from this invoice:\n\n{pdf_text[:4000]}",
        },
    ],
)

print(f"Invoice #{invoice.invoice_number}")
print(f"Total: ${invoice.total:.2f}")
```
PyMuPDF extracts text but loses formatting. Tables may become jumbled text. For complex PDFs with tables or forms, consider:
- pdfplumber — Better table extraction
- marker — PDF-to-markdown conversion that preserves structure
- Vision LLMs — Send PDF pages as images to GPT-4o for visual understanding
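For the vision route, PyMuPDF can render a page to PNG bytes, which are then sent as a base64 data URL in the image message format of the OpenAI chat API (a sketch: `to_data_url`, `render_pdf_page`, and `extract_invoice_from_image` are illustrative helpers, and the last one reuses the `client` and `Invoice` defined above):

```python
import base64

def to_data_url(png_bytes: bytes) -> str:
    """Wrap raw PNG bytes in a base64 data URL for the chat API."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"

def render_pdf_page(pdf_path: str, page_number: int = 0) -> bytes:
    """Render one PDF page to PNG bytes at 2x resolution for legibility."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    pix = doc[page_number].get_pixmap(matrix=fitz.Matrix(2, 2))
    doc.close()
    return pix.tobytes("png")

def extract_invoice_from_image(pdf_path: str):
    """Send a rendered page to a vision-capable model for structured extraction.

    Assumes `client` and `Invoice` from the section above.
    """
    url = to_data_url(render_pdf_page(pdf_path))
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=Invoice,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the invoice data from this image."},
                {"type": "image_url", "image_url": {"url": url}},
            ],
        }],
    )
```

This costs more per page than text extraction but sidesteps the jumbled-table problem entirely, since the model sees the original layout.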
Batch Extraction Pattern
For processing many documents, use a batch pattern:
```python
import json
from pathlib import Path

def process_documents(directory: str) -> List[dict]:
    results = []
    for pdf_path in Path(directory).glob("*.pdf"):
        text = extract_text_from_pdf(str(pdf_path))
        try:
            invoice = client.chat.completions.create(
                model="gpt-4o-mini",
                response_model=Invoice,
                max_retries=2,
                messages=[{"role": "user", "content": f"Extract:\n\n{text[:4000]}"}],
            )
            results.append(invoice.model_dump())
        except Exception as e:
            print(f"Failed to process {pdf_path.name}: {e}")
            results.append({"file": pdf_path.name, "error": str(e)})
    return results

# Save results
results = process_documents("./invoices/")
with open("extracted_invoices.json", "w") as f:
    json.dump(results, f, indent=2)
```
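The loop above is API-latency bound, so large batches benefit from running extractions in a thread pool. A generic helper can keep the same record-failures-in-place behavior as `process_documents` (a sketch; `map_concurrently` is an illustrative name, and the default worker count is an assumption chosen to stay under typical rate limits):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List

def map_concurrently(fn: Callable, items: List, max_workers: int = 4) -> List:
    """Apply fn to every item in parallel, preserving input order.

    Failures are recorded in place rather than aborting the batch.
    """
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn, item): i for i, item in enumerate(items)}
        for future in as_completed(futures):
            i = futures[future]
            try:
                results[i] = future.result()
            except Exception as e:
                results[i] = {"item": str(items[i]), "error": str(e)}
    return results

# Usage with the invoice pipeline might look like:
# pdfs = list(Path("./invoices/").glob("*.pdf"))
# results = map_concurrently(lambda p: process_documents_one(p), pdfs)
```

Threads (not processes) are the right fit here because the work is network-bound; keep `max_workers` modest to avoid provider rate limits.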