Scheduled Scraping

Data is only useful if it's fresh.

GitHub Actions Cron

GitHub Actions can run a scraper on a schedule at no cost for public repositories; private repositories draw on a monthly allowance of free minutes.

yaml

name: Daily Scraper
on:
  schedule:
    - cron: '0 0 * * *' # Every day at 00:00 UTC (runs can start late under load)

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python scraper.py
      # Commit changes back to the repo (needs contents: write permission) or push to a DB

Deduplication

A daily run will fetch many records it has already seen, so don't save the same data twice.

Hash the content or use a unique ID (like an article URL).
Check against your database before inserting.
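
The two steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a prescribed implementation: the SQLite store, the `items` table, the item fields (`url`, `text`, `title`), and the helper names `content_key`, `open_db`, and `save_new` are all assumptions made for the example.

```python
import hashlib
import sqlite3


def content_key(item: dict) -> str:
    """Return a stable key: the URL if present, else a SHA-256 of the text."""
    if item.get("url"):
        return item["url"]
    return hashlib.sha256(item["text"].encode("utf-8")).hexdigest()


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the (hypothetical) store; the primary key is a safety net."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (key TEXT PRIMARY KEY, title TEXT)"
    )
    return conn


def save_new(conn: sqlite3.Connection, items: list) -> int:
    """Insert only items whose key is not stored yet; return the number added."""
    added = 0
    for item in items:
        key = content_key(item)
        # Check against the database before inserting.
        seen = conn.execute(
            "SELECT 1 FROM items WHERE key = ?", (key,)
        ).fetchone()
        if seen:
            continue
        conn.execute(
            "INSERT INTO items (key, title) VALUES (?, ?)",
            (key, item.get("title", "")),
        )
        added += 1
    conn.commit()
    return added
```

Calling `save_new` twice with the same batch adds rows only the first time; the second call returns 0. In a scheduled workflow the database would need to live somewhere that survives between runs (a file committed back to the repo or an external database), since each run starts on a fresh runner.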
