Polars

Polars is a modern DataFrame engine that’s fast by default and encourages reproducible pipelines via lazy execution.

Learning goals

  • Know when to use Polars instead of pandas
  • Write lazy pipelines that push down filters/projections
  • Read/write Parquet efficiently

Eager vs lazy

  • Eager: runs immediately (pl.DataFrame, pl.read_parquet)
  • Lazy: builds a query plan that is only optimized and executed when you call .collect() (pl.scan_parquet(...).filter(...).select(...))

python
import polars as pl

# Lazy scan (does not read whole file yet)
df = (
    pl.scan_parquet("data/events.parquet")
    .filter(pl.col("event") == "purchase")
    .group_by("user_id")
    .agg(pl.len().alias("purchases"))  # count rows per user
)

result = df.collect()  # executes the optimized plan

Why lazy is huge

Lazy execution enables predicate pushdown and column pruning — the main reason Polars feels “unfairly fast”.
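
To see the optimization for yourself, print the query plan. A minimal sketch, reusing the events.parquet path from above:

python
import polars as pl

lazy = (
    pl.scan_parquet("data/events.parquet")
    .filter(pl.col("event") == "purchase")
    .select("user_id", "event")
)

# The optimized plan shows the filter and projection pushed into the
# Parquet scan itself, so only matching rows and two columns are read.
print(lazy.explain())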

Common patterns

Joins

python
users = pl.scan_parquet("data/users.parquet")
events = pl.scan_parquet("data/events.parquet")

joined = events.join(users, on="user_id", how="left").select(
    "user_id", "country", "event", "timestamp"
)
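
The join is still lazy at this point. A small sketch of materializing or writing it, assuming the files above exist (the output path is a placeholder):

python
# Execute the optimized plan; only the four selected columns are read.
joined_df = joined.collect()

# Or stream the result straight to a Parquet file instead of collecting it.
joined.sink_parquet("data/joined.parquet")  # placeholder output path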

Window functions

python
ranked = (
    pl.scan_parquet("data/events.parquet")
    .with_columns(
        pl.col("timestamp").rank("dense").over("user_id").alias("event_rank")
    )
)
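
A common follow-up is keeping only each user's first event by that rank; a short sketch building on ranked above:

python
# Keep the earliest event per user (rank 1) and execute the plan.
first_events = ranked.filter(pl.col("event_rank") == 1).collect()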

Interop with DuckDB

Polars + DuckDB is a common 2026 stack:

  • Polars: fast ETL + feature engineering
  • DuckDB: SQL analytics + joins across many Parquet files

See: DuckDB + Parquet.
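
A minimal sketch of that division of labor, assuming duckdb is installed and the Parquet paths used above exist; DuckDB can query a Polars DataFrame that is in scope by name and hand results back as Polars via .pl():

python
import duckdb
import polars as pl

# Polars does the feature engineering (lazy scan, aggregate, collect).
features = (
    pl.scan_parquet("data/events.parquet")
    .group_by("user_id")
    .agg(pl.len().alias("n_events"))
    .collect()
)

# DuckDB runs SQL over the in-memory Polars DataFrame and a Parquet file,
# then returns the result as a Polars DataFrame.
top_countries = duckdb.sql("""
    SELECT u.country, SUM(f.n_events) AS events
    FROM features AS f
    JOIN 'data/users.parquet' AS u USING (user_id)
    GROUP BY u.country
    ORDER BY events DESC
""").pl()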

Mini-lab (optional)

Rewrite a pandas ETL to Polars:

  • report runtime and peak memory for both versions
  • write the output to Parquet
  • validate that row counts and key aggregates match (see the sketch below)
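
A minimal validation sketch for the last step, assuming both pipelines wrote Parquet files (the paths and the revenue column are placeholders):

python
import pandas as pd
import polars as pl

# Placeholder outputs from the pandas and Polars pipelines.
pdf = pd.read_parquet("out/pandas_etl.parquet")
pldf = pl.read_parquet("out/polars_etl.parquet")

# Row counts must match exactly.
assert len(pdf) == pldf.height

# Key aggregates should agree up to floating-point tolerance
# ("revenue" is a placeholder column name).
assert abs(pdf["revenue"].sum() - pldf["revenue"].sum()) < 1e-6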