# DuckDB + Parquet
DuckDB is an in-process analytical database that makes running SQL directly on files fast. Parquet is the columnar file format that dominates modern ML and analytics tooling.
Together, they let you build “local warehouse” workflows with zero server management.
## Learning goals
- Query Parquet directly (no import step)
- Build reproducible feature tables with SQL
- Use DuckDB as a prep layer for RAG corpora
## Query Parquet directly
```sql
-- No database server required: the Parquet glob is scanned directly
SELECT
    user_id,
    count(*) AS n_events
FROM read_parquet('data/events/*.parquet')
WHERE event = 'purchase'
GROUP BY user_id
ORDER BY n_events DESC
LIMIT 20;
```
## Python integration
```python
import duckdb

# A database file persists views and macros across sessions.
con = duckdb.connect("analytics.duckdb")

# Expose the Parquet files as a view; no data is copied or imported.
con.execute(
    "CREATE VIEW IF NOT EXISTS events AS "
    "SELECT * FROM read_parquet('data/events/*.parquet')"
)

top = con.execute(
    "SELECT country, count(*) AS n FROM events GROUP BY country ORDER BY n DESC"
).df()  # .df() materializes the result as a pandas DataFrame
print(top.head())
```
## DuckDB for RAG prep
Common use cases:
- deduplicate documents
- compute document metadata (language, length)
- build training/eval sets from logs
## Mini-lab (optional)
- Build a small Parquet dataset
- Create a DuckDB view over it
- Produce a `features.parquet` table for downstream modeling