Week 06 — Web Scraping & Data Processing

"Data is the new oil, but you have to drill for it."

AI applications are only as good as the data they consume. This week focuses on extracting data from the messy, real-world web and structuring it for LLMs.

Topics Covered

  1. Playwright & Selenium
  2. Crawl4AI
  3. Firecrawl & Apify
  4. Scrapy
  5. Anti-bot Patterns
  6. Scheduled Scraping
  7. Document Parsing
  8. DuckDB + Parquet
  9. Firestore Database
  10. Vision Models for Scraping
  11. Image Processing Pipeline
  12. Speech AI — TTS & STT
  13. Video Understanding
  14. LLM Architecture

Hands-On Labs & Capstones

  • Job Posting Scraper & Tracker (Capstone)
  • AI Signature Detection & Cropper (Capstone)
  • Live Multilingual Travel Translator (Capstone)
  • Scheduled Scraper with GitHub Actions (Lab)

Get ready to deal with CAPTCHAs, broken DOMs, and massive datasets.