How to Get Started with Polars in Python
Last updated Apr 22, 2026

Polars is a Python DataFrame library that does the same job as pandas: read files, filter rows, aggregate numbers, join tables. The difference is speed and memory efficiency. In benchmarks, a join across two 10-million-row DataFrames took 2.1 seconds in Polars and 18.7 seconds in pandas. On a 12GB clickstream file, Polars finished with 2GB of peak memory, while the same operation in pandas raised a MemoryError on a 16GB machine. This guide covers setup, basic operations, lazy evaluation, and when Polars is worth using over pandas.
Why Polars Is Faster
Pandas stores data in NumPy arrays and processes many operations sequentially on a single CPU core. Polars uses Apache Arrow's columnar memory format, runs operations in parallel across all available CPU cores, and includes a query optimizer that rewrites your code into a more efficient execution plan before running it.
The speed difference varies by operation. Joins benefit the most: roughly 9x faster (2.1 seconds versus 18.7 seconds) in benchmarks across 10-million-row datasets. Aggregations run about 2.6x faster, and filtering runs about 4.6x faster on large files. String-heavy regex operations are one area where Polars is currently slower than pandas, by roughly 40%, so workloads dominated by regex extraction may still favor pandas.
Polars 1.0 shipped in July 2024, marking the library's first stable API. The expression-based syntax has stayed consistent since then.
Installing Polars
Install Polars with pip. No additional dependencies are required:
pip install polars
Verify the install:
import polars as pl
print(pl.__version__)
Polars ships self-contained, with all Rust binaries bundled in the package. Unlike pandas, it does not depend on NumPy.
Reading Data Files
Polars reads CSV, Parquet, and JSON files directly. The function names are similar to pandas.
CSV:
df = pl.read_csv("sales_data.csv", try_parse_dates=True)
print(df.shape)
print(df.head())
The try_parse_dates=True argument automatically detects date columns. In pandas, date parsing requires an explicit parse_dates list.
Parquet:
df = pl.read_parquet("transactions.parquet")
Parquet is a columnar format that pairs well with Polars. If your data source allows it, converting large CSVs to Parquet before loading significantly cuts read times.
Filtering, Selecting, and Grouping
The key syntax difference from pandas is that Polars uses pl.col("column_name") expressions rather than bracket notation.
Filter rows:
high_revenue = df.filter(pl.col("revenue") > 10000)
q4_high = df.filter(
    (pl.col("revenue") > 10000) & (pl.col("quarter") == 4)
)
Select columns:
subset = df.select(["product_name", "revenue", "date"])
Groupby and aggregation:
summary = df.group_by("product_category").agg(
    pl.col("revenue").sum().alias("total_revenue"),
    pl.col("units_sold").mean().alias("avg_units"),
    pl.len().alias("transaction_count")
)
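All three aggregations are computed in a single group_by pass, and Polars can evaluate the expressions in parallel. pandas supports named aggregations through .agg() as well, but custom logic often falls back to .apply() or lambda functions, which run sequentially in Python.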
Joining two tables:
joined = customers.join(orders, on="customer_id", how="left")
Join types supported: inner, left, right, full, semi, anti, cross.
Lazy Evaluation: Reading Only What You Need
Polars has two execution modes: eager and lazy.
Eager mode executes each operation immediately and returns a result. It is simpler to read and debug. Lazy mode builds a query plan across multiple operations, optimizes the plan, and executes everything at once when you call .collect().
result = (
    pl.scan_csv("large_file.csv")            # scan, do not load yet
    .filter(pl.col("status") == "active")    # add to plan
    .group_by("region")                      # add to plan
    .agg(pl.col("revenue").sum())            # add to plan
    .collect()                               # execute the optimized plan
)
pl.scan_csv() does not load the file into memory up front. The optimizer pushes column selection and row filters down into the scan, so only the columns and rows your query actually needs are materialized. This is why Polars can process a 12GB file with 2GB of peak memory: the full dataset is never held in memory at once.
For simple analyses on files under 1GB, eager mode is easier to work with. Switch to lazy mode when you hit memory limits or need to chain five or more operations on a large file.
When to Use Polars vs Pandas
Polars is the better choice when:
- Your dataset is larger than 1GB
- You need to run joins on millions of rows
- You are hitting memory limits with pandas
- Your pipeline chains multiple filters and aggregations
Stick with pandas when:
- Your data is under 1GB and fits comfortably in memory
- You rely on libraries or existing code built around pandas DataFrames (common in scikit-learn, matplotlib, and seaborn workflows)
- Your workflow includes complex regex operations across many rows
- You have an existing pandas codebase and the migration cost outweighs the speed gain
The two libraries are not directly interchangeable. Polars uses expression-based syntax throughout, which means porting existing pandas code requires rewriting, not just renaming functions.
Converting Between Polars and Pandas
When working with libraries that require pandas DataFrames, conversion is straightforward. Both directions go through the Apache Arrow format, which keeps them fast. These conversions require pandas and pyarrow to be installed alongside Polars:
# Polars to pandas
pandas_df = polars_df.to_pandas()
# Pandas to Polars
polars_df = pl.from_pandas(pandas_df)
A Complete Example
This workflow loads a CSV, applies filters, aggregates, and writes a Parquet output. The entire pipeline runs in lazy mode so the file is never fully loaded into memory:
import polars as pl

report = (
    pl.scan_csv("sales_2025.csv", try_parse_dates=True)
    .filter(pl.col("region") == "North America")
    .filter(pl.col("date").dt.year() == 2025)
    .group_by("product_category")
    .agg([
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("revenue").mean().alias("avg_revenue"),
        pl.col("customer_id").n_unique().alias("unique_customers")
    ])
    .sort("total_revenue", descending=True)
    .collect()
)
report.write_parquet("north_america_summary.parquet")
print(report)
Polars scans the CSV, applies filters before reading the full file, runs the aggregation in parallel, and writes a compressed Parquet output.
Summary
Polars installs in one command and reads CSV, Parquet, and JSON directly. The expression-based API covers filtering, groupby, joins, and multi-column aggregations. Lazy mode handles large files without loading them into memory. In the 10-million-row benchmark, Polars ran joins about 9x faster than pandas, and it is significantly more memory-efficient at scale. For datasets under 1GB or codebases already built on pandas, the migration cost often outweighs the benefit. For anything larger, Polars is usually the faster choice.
FAQ
What is Polars in Python?
Polars is a Python DataFrame library written in Rust. It reads CSV, Parquet, and JSON files and supports filtering, groupby, joins, and aggregation — the same analytical operations as pandas — but significantly faster on large datasets. It uses Apache Arrow's columnar memory format and runs operations in parallel across all CPU cores. Polars 1.0 shipped in July 2024 with a stable API.
Is Polars faster than pandas?
Yes, for most analytical workloads. In benchmarks across 10-million-row DataFrames, Polars completes joins in 2.1 seconds versus pandas at 18.7 seconds — a 9x difference. Filtering is about 4.6x faster; aggregation is about 2.6x faster. String-heavy regex operations are one exception where pandas can be faster by roughly 40%. For datasets under 1GB, the difference is often negligible.
How do I install Polars in Python?
Run `pip install polars` in your terminal. Polars ships self-contained with no additional dependencies — it does not require NumPy or any other package. After installing, import it with `import polars as pl` and verify with `print(pl.__version__)`.
What is lazy evaluation in Polars?
Lazy evaluation is Polars' mode where operations are not executed immediately. Instead, Polars builds a query plan, optimizes it, and executes everything at once when you call `.collect()`. Use `pl.scan_csv()` instead of `pl.read_csv()` to activate lazy mode. The main benefit is memory efficiency: Polars reads only the columns and rows your query needs, rather than loading the entire file.
When should I use Polars instead of pandas?
Use Polars when your dataset is larger than 1GB, when you need to run joins across millions of rows, or when pandas runs out of memory. Stick with pandas for datasets under 1GB, for workflows that rely on libraries that only accept pandas DataFrames (like scikit-learn or matplotlib), and for existing codebases where rewriting the expression syntax would be too disruptive.


