Should you ditch Spark for DuckDB or Polars?
There’s been a lot of excitement lately about single-machine compute engines like DuckDB and Polars. With the recent release of pure Python Notebooks in Microsoft Fabric, interest in these lightweight native engines has reached a new high. Out with Spark and in with the new and cool animal-themed engines: is it time to finally migrate your small and medium workloads off Spark?
With folks in the community recently posting their own benchmarks highlighting the power of these lightweight engines, I felt it was finally time to roll up my sleeves and explore whether I should abandon everything I know and become a DuckDB and/or Polars convert. No APIs or semi-structured data to make things too complex, just the typical operations you’d run if Parquet files were delivered as a starting point and the goal was to build a dimensional model to support reporting and ad-hoc queries. Since the performance difference for VACUUM, OPTIMIZE, and ad-hoc/interactive queries tends to be overshadowed by longer-running ELT processes, here’s an isolated view of the 10GB 4-vCore benchmark highlighting how much faster DuckDB and Polars (with delta-rs) are for these workloads.
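To make the workload concrete, here’s a minimal sketch (not the benchmark code itself) of the kind of step being measured: reading raw Parquet with Polars, shaping a small dimension table, writing it out as a Delta table via delta-rs, and running an equivalent ad-hoc aggregation in DuckDB. The file paths and column names are hypothetical.

```python
# A minimal sketch of a Parquet-to-dimensional-model ELT step.
# Paths and column names are hypothetical, not from the benchmark.
import duckdb
import polars as pl
from deltalake import write_deltalake

# Read the raw Parquet files delivered as the starting point.
orders = pl.read_parquet("raw/orders/*.parquet")

# Build a simple customer dimension: deduplicate and add a surrogate key.
dim_customer = (
    orders
    .select(["customer_id", "customer_name", "customer_segment"])
    .unique(subset=["customer_id"])
    .with_row_index(name="customer_key", offset=1)
)

# Persist the dimension as a Delta table via delta-rs.
write_deltalake("tables/dim_customer", dim_customer.to_arrow(), mode="overwrite")

# An equivalent ad-hoc query in DuckDB, reading the same Parquet directly.
con = duckdb.connect()
segments = con.sql(
    "SELECT customer_segment, COUNT(*) AS customers "
    "FROM read_parquet('raw/orders/*.parquet') "
    "GROUP BY customer_segment"
).pl()  # return the result as a Polars DataFrame
```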