Week 6 — Media Processing & Vision

This week dives into the multimodal side of AI engineering. You'll learn how to process images, detect objects, transcribe audio, and generate visual content. These skills are essential for building real-world AI applications that interact with the physical world.

Pages

#  Page                      Description
1  Vision Models             GPT-4o Vision, Gemini Flash/Pro, LLaVA
2  Image Processing          Preprocessing, annotation, bounding boxes, OpenCV
3  Grounding DINO            Open-vocabulary object detection with Python
4  Audio Processing          Whisper transcription, speaker diarization
5  Image Generation          SDXL, FLUX, ControlNet
6  Grounding DINO Tiny       Lightweight detection for low-cost inference
7  ColPali (Multimodal RAG)  Retrieve directly over PDF page images
8  Polars                    Fast DataFrames for modern data prep
9  DuckDB + Parquet          SQL analytics layer over local files

Key Concepts

  • Computer Vision: Teaching machines to interpret visual information
  • Object Detection: Finding and classifying objects within images
  • Speech-to-Text: Converting spoken language to written text
  • Generative Models: Creating new images from text descriptions
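
To make the object detection concept above concrete, here is a minimal sketch of how detector output is commonly represented: a label, a confidence score, and a bounding box per object. The `Detection` class, the `filter_detections` helper, the sample labels, and the 0.5 threshold are all hypothetical names and values chosen for illustration; they are not the output format of Grounding DINO or any other specific library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical container, not a library type)."""
    label: str                               # predicted class or text phrase
    score: float                             # confidence in [0, 1]
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels


def filter_detections(detections: List[Detection], min_score: float = 0.5) -> List[Detection]:
    """Keep detections at or above min_score, highest confidence first."""
    kept = [d for d in detections if d.score >= min_score]
    return sorted(kept, key=lambda d: d.score, reverse=True)


def box_area(box: Tuple[float, float, float, float]) -> float:
    """Area of an (x_min, y_min, x_max, y_max) box; degenerate boxes give 0."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


# Made-up sample output for a scanned document page (illustrative values only).
raw = [
    Detection("signature", 0.91, (120.0, 640.0, 380.0, 720.0)),
    Detection("stamp", 0.34, (400.0, 600.0, 470.0, 670.0)),
    Detection("signature", 0.58, (90.0, 100.0, 200.0, 140.0)),
]

kept = filter_detections(raw, min_score=0.5)
for d in kept:
    print(d.label, d.score, box_area(d.box))
```

Real detectors usually return many low-confidence candidates, so a score threshold like this is typically the first post-processing step; the right cutoff depends on the model and the task.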

Prerequisites

  • Basic understanding of neural networks
  • Python and NumPy fundamentals
  • Familiarity with APIs (Week 2)

Lab Connection

Lab 5 — Signature Detection applies the object detection skills from this week. You'll use Grounding DINO to locate signatures in documents.