Week 06 — Web Scraping & Data Processing
"Data is the new oil, but you have to drill for it."
AI applications are only as good as the data they consume. This week focuses on extracting data from the messy, real-world internet and structuring it for LLMs.
Topics Covered
- Playwright & Selenium
- Crawl4AI
- Firecrawl & Apify
- Scrapy
- Anti-bot Patterns
- Scheduled Scraping
- Document Parsing
- DuckDB + Parquet
- Firestore Database
- Vision Models for Scraping
- Image Processing Pipeline
- Speech AI — TTS & STT
- Video Understanding
- LLM Architecture
Hands-On Labs & Capstones
- Job Posting Scraper & Tracker (Capstone)
- AI Signature Detection & Cropper (Capstone)
- Live Multilingual Travel Translator (Capstone)
- Scheduled Scraper with GitHub Actions (Lab)
Get ready to deal with CAPTCHAs, broken DOMs, and massive datasets.
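As a small taste of the extract-then-structure workflow, here is a minimal sketch that pulls job titles out of deliberately malformed HTML and emits JSON records. It uses only the Python standard library rather than any of the tools listed above, and the markup, the `title` class name, and the `JobTitleParser` and `extract_titles` helpers are all hypothetical, made up for illustration.

```python
import json
from html.parser import HTMLParser


class JobTitleParser(HTMLParser):
    """Collect the text of every <h2 class="title"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def extract_titles(html):
    """Return the cleaned title strings found in the given HTML."""
    parser = JobTitleParser()
    parser.feed(html)
    parser.close()
    return [t.strip() for t in parser.titles if t.strip()]


# Made-up sample: the first <div> and both <p> tags are never closed.
SAMPLE_HTML = """
<div class="job"><h2 class="title">Data Engineer</h2><p>Remote
<div class="job"><h2 class="title"> ML Engineer </h2><p>Berlin</div>
"""

records = [{"title": t} for t in extract_titles(SAMPLE_HTML)]
print(json.dumps(records))
```

The event-based parser does not raise on unclosed tags, which is why it copes with this input. Real pages usually need a browser-driven or more robust tool, which is what the topics above cover.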