Document Parsing
Real-world data is trapped in PDFs, Word docs, and messy HTML.
Unstructured.io
A unified API for parsing PDFs, PPTX, HTML, etc., into clean JSON.
LlamaParse
Specifically designed to parse complex PDFs with tables and charts into Markdown for LLM ingestion.
Surya OCR
An open-source, multilingual OCR model that its authors report outperforms Tesseract on their published benchmarks.
HTML to Markdown
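Always convert HTML to Markdown before feeding it to an LLM. Dropping tags, attributes, scripts, and styles saves tokens and removes noise, while headings, links, and lists survive as lightweight Markdown syntax.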
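
To make the HTML-to-Markdown step concrete, here is a minimal sketch that uses only Python's standard library. The names HTMLToMarkdown and html_to_markdown are hypothetical, chosen for illustration. It covers only a handful of tags (headings, paragraphs, links, bold, italics, list items) and discards script and style content. A real pipeline would more likely rely on a dedicated converter library, so treat this as an illustration of the idea rather than a complete converter.

```python
import re
from html.parser import HTMLParser


class HTMLToMarkdown(HTMLParser):
    """Hypothetical, deliberately small HTML-to-Markdown converter."""

    # Elements whose content is noise for an LLM and is dropped entirely.
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []          # collected Markdown fragments
        self.skip_depth = 0    # > 0 while inside a skipped element
        self.href = None       # target of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif re.fullmatch(r"h[1-6]", tag):
            # <h2> becomes "## ", <h3> becomes "### ", and so on.
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("ul", "ol"):
            self.out.append("\n")
        elif tag == "li":
            # Simplification: ordered lists are also rendered as bullets.
            self.out.append("\n- ")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self.out.append(f"]({self.href or ''})")
            self.href = None
        elif tag in ("ul", "ol"):
            self.out.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse whitespace runs, as a browser would when rendering.
            self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    """Convert an HTML string to Markdown using the sketch above."""
    parser = HTMLToMarkdown()
    parser.feed(html)
    parser.close()
    text = "".join(parser.out)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most one blank line
    return text.strip()


if __name__ == "__main__":
    page = """
    <html><head><style>body { color: red; }</style></head>
    <body>
      <h1>Title</h1>
      <p>Read the <a href="https://example.com">docs</a> <b>now</b>.</p>
      <ul><li>one</li><li>two</li></ul>
      <script>track();</script>
    </body></html>
    """
    markdown = html_to_markdown(page)
    print(markdown)
    print(f"{len(page)} characters of HTML -> {len(markdown)} of Markdown")
```

Character count is only a rough proxy for token count, but the reduction it shows comes from the same source: markup and script content that carries no meaning for the model is gone, while the document's structure is kept.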