Should you ditch Spark for DuckDB or Polars?


There’s been a lot of excitement lately about single-machine compute engines like DuckDB and Polars. With the recent release of pure Python Notebooks in Microsoft Fabric, the excitement about these lightweight native engines has risen to a new high. Out with Spark and in with the new, cool animal-themed engines: is it time to finally migrate your small and medium workloads off of Spark?

With folks in the community recently posting their own benchmarks highlighting the power of these lightweight engines, I felt it was finally time to roll up my sleeves and explore whether I should abandon everything I know and become a DuckDB and/or Polars convert. No APIs or semi-structured data to make things too complex: just the typical operations you’d run if Parquet files were being delivered as a starting place and the goal was to build a dimensional model to support reporting and ad-hoc queries. Since the performance differences for VACUUM, OPTIMIZE, and ad-hoc/interactive queries tend to be overshadowed by longer-running ELT processes, here’s an isolated view of the 10GB, 4-vCore benchmark highlighting how much faster DuckDB and Polars (with delta-rs) are for these workloads.


