Document Parsing

Real-world data is trapped in PDFs, Word docs, and messy HTML.

Unstructured.io

A unified API for parsing PDFs, PPTX, HTML, etc., into clean JSON.

LlamaParse

Specifically designed to parse complex PDFs with tables and charts into Markdown for LLM ingestion.

Surya OCR

An open-source, multilingual OCR toolkit whose published benchmarks report higher accuracy than Tesseract.

HTML to Markdown

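Convert HTML to Markdown before feeding it to an LLM. Dropping tags, scripts, styles, and navigation chrome saves tokens and removes noise, while headings, lists, and links survive as lightweight Markdown.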
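To make the idea concrete, here is a minimal sketch of an HTML-to-Markdown pass using only Python's standard library. The class name MarkdownSketch, the helper html_to_markdown, and the choice of which tags to skip or convert are all assumptions made for illustration; it handles only a handful of tags and is not a substitute for a maintained converter.

```python
import re
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Hypothetical, deliberately tiny HTML-to-Markdown converter."""

    # Elements whose entire contents are treated as noise (an assumption).
    SKIP = {"script", "style", "nav", "footer"}
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # output fragments, joined at the end
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.href = None      # target of the link currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag in ("ul", "ol"):
            self.parts.append("\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag in ("em", "i"):
            self.parts.append("*")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag in ("em", "i"):
            self.parts.append("*")
        elif tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None
        elif tag in ("ul", "ol"):
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse whitespace runs; HTML treats them as a single space.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    """Convert a small subset of HTML to Markdown-flavoured text."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most one blank line
    return text.strip()


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<h1>Title</h1>"
    "<p>Hello <strong>world</strong>, see "
    "<a href='https://example.com'>docs</a>.</p>"
    "<ul><li>one</li><li>two</li></ul>"
    "<script>alert(1)</script></body></html>"
)
markdown = html_to_markdown(sample)
print(markdown)
print(len(sample), "characters in,", len(markdown), "characters out")
```

The character counts printed at the end give a rough feel for the savings; actual token counts depend on the model's tokenizer. For real pages, a maintained package such as markdownify or html2text is likely a better fit than this sketch, since it will cover tables, nested lists, images, and malformed markup.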