DuckDB + Parquet

DuckDB is an in-process analytics database that makes SQL on files fast. Parquet is the columnar file format that dominates modern ML + analytics.

Together, they let you build “local warehouse” workflows with zero server management.

Learning goals

  • Query Parquet directly (no import step)
  • Build reproducible feature tables with SQL
  • Use DuckDB as a prep layer for RAG corpora

Query Parquet directly

```sql
-- No database server required
SELECT
    user_id,
    count(*) AS n_events
FROM read_parquet('data/events/*.parquet')
WHERE event = 'purchase'
GROUP BY user_id
ORDER BY n_events DESC
LIMIT 20;
```

Python integration

```python
import duckdb

con = duckdb.connect("analytics.duckdb")
con.execute(
    "CREATE VIEW IF NOT EXISTS events AS "
    "SELECT * FROM read_parquet('data/events/*.parquet')"
)

top = con.execute(
    "SELECT country, count(*) AS n FROM events GROUP BY country ORDER BY n DESC"
).df()
print(top.head())
```

DuckDB for RAG prep

Common use cases:

  • deduplicate documents
  • compute document metadata (language, length)
  • build training/eval sets from logs

Mini-lab (optional)

  • Build a small Parquet dataset
  • Create a DuckDB view over it
  • Produce a features.parquet table for downstream modeling