Week 6 — Media Processing & Vision
This week dives into the multimodal side of AI engineering. You'll learn how to process images, detect objects, transcribe audio, and generate visual content — skills that are essential for building AI applications that interact with the physical world.
Pages
| # | Page | Description |
|---|---|---|
| 1 | Vision Models | GPT-4o Vision, Gemini Flash/Pro, LLaVA |
| 2 | Image Processing | Preprocessing, annotation, bounding boxes, OpenCV |
| 3 | Grounding DINO | Open-vocabulary object detection with Python |
| 4 | Audio Processing | Whisper transcription, speaker diarization |
| 5 | Image Generation | SDXL, FLUX, ControlNet |
| 6 | Grounding DINO Tiny | Lightweight detection for low-cost inference |
| 7 | ColPali (Multimodal RAG) | Retrieve directly over PDF page images |
| 8 | Polars | Fast DataFrames for modern data prep |
| 9 | DuckDB + Parquet | SQL analytics layer over local files |
Key Concepts
- Computer Vision: Teaching machines to interpret visual information
- Object Detection: Finding and classifying objects within images
- Speech-to-Text: Converting spoken language to written text
- Generative Models: Creating new images from text descriptions
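To make the object-detection concept concrete: detectors return bounding boxes, and boxes are compared using Intersection over Union (IoU) — the standard overlap metric used for evaluating detections and for de-duplicating them. A minimal sketch, assuming the common `(x_min, y_min, x_max, y_max)` box format:

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes in (x_min, y_min, x_max, y_max) format."""
    # Coordinates of the intersection rectangle (empty if the boxes don't overlap).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

Identical boxes score 1.0, disjoint boxes score 0.0, and partial overlaps fall in between — e.g. two 10×10 boxes offset by 5 pixels in each direction share a 25-pixel intersection over a 175-pixel union.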
Prerequisites
- Basic understanding of neural networks
- Python and NumPy fundamentals
- Familiarity with APIs (Week 2)
Lab Connection
Lab 5 — Signature Detection applies the object detection skills from this week. You'll use Grounding DINO to locate signatures in documents.
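A minimal sketch of what the lab's detection step can look like, assuming the Hugging Face `transformers` zero-shot object detection pipeline (which supports Grounding DINO checkpoints such as `IDEA-Research/grounding-dino-tiny`); the image path, prompt, and threshold below are illustrative, not the lab's exact values:

```python
from typing import Dict, List


def keep_confident(detections: List[Dict], threshold: float = 0.4) -> List[Dict]:
    """Drop low-score boxes and return the rest sorted best-first."""
    kept = [d for d in detections if d["score"] >= threshold]
    return sorted(kept, key=lambda d: d["score"], reverse=True)


def detect_signatures(image_path: str, threshold: float = 0.4) -> List[Dict]:
    """Open-vocabulary signature detection on one page image (sketch).

    Assumes the `transformers` zero-shot-object-detection pipeline; each
    result is a dict with "label", "score", and "box" keys.
    """
    from transformers import pipeline  # lazy import: loading the model is heavy

    detector = pipeline(
        "zero-shot-object-detection",
        model="IDEA-Research/grounding-dino-tiny",
    )
    results = detector(image_path, candidate_labels=["a handwritten signature"])
    return keep_confident(results, threshold)
```

Because Grounding DINO is prompted with free text, the same code can locate stamps, logos, or any other document element just by changing `candidate_labels` — no retraining needed.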