
The Birth of Parquet

By Julien Le Dem

On the other end, there was Vertica, a state-of-the-art Massively Parallel Processing database, leveraging columnar storage and vectorization to achieve low-latency results. One benefit of columnar storage is that when you need to retrieve only a subset of the columns, which is very common, you can scan them from disk much more efficiently in big contiguous chunks rather than doing a lot of small seeks. I took inspiration from the existing formats I could find (TFile, RCFile, CIF, Trevni) and from the context of schema definition at Twitter (Thrift and Pig) and, over the summer of 2012, I started implementing the column-splitting algorithm described in the Dremel paper.
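To make the scan-versus-seek point concrete, here is a minimal sketch in Python. It is a toy illustration, not Parquet's or Vertica's actual on-disk format: the table, the fixed-width int64 encoding, and all function names are assumptions made up for this example. It stores the same small table in row-major and column-major order and counts how many separate reads are needed to fetch a single column.

```python
import struct

# Hypothetical toy table: three int64 columns, four rows.
ROWS = [
    (1, 10, 100),
    (2, 20, 200),
    (3, 30, 300),
    (4, 40, 400),
]
NUM_COLS = 3
VALUE_SIZE = 8  # bytes per little-endian int64


def encode_row_major(rows):
    """Lay values out row by row: r0c0, r0c1, r0c2, r1c0, ..."""
    return b"".join(struct.pack("<q", v) for row in rows for v in row)


def encode_column_major(rows):
    """Lay values out column by column: r0c0, r1c0, ..., then r0c1, ..."""
    return b"".join(struct.pack("<q", v) for col in zip(*rows) for v in col)


def read_column_row_major(buf, col, num_rows):
    """Fetch one column from a row-major buffer: one small read per row."""
    values, reads = [], 0
    for r in range(num_rows):
        offset = (r * NUM_COLS + col) * VALUE_SIZE  # jump to this row's value
        values.append(struct.unpack_from("<q", buf, offset)[0])
        reads += 1
    return values, reads


def read_column_column_major(buf, col, num_rows):
    """Fetch one column from a column-major buffer: a single contiguous read."""
    start = col * num_rows * VALUE_SIZE
    chunk = buf[start:start + num_rows * VALUE_SIZE]
    values = list(struct.unpack(f"<{num_rows}q", chunk))
    return values, 1


if __name__ == "__main__":
    n = len(ROWS)
    print(read_column_row_major(encode_row_major(ROWS), 1, n))
    print(read_column_column_major(encode_column_major(ROWS), 1, n))
```

In the row-major layout the number of reads grows with the number of rows, while the column-major layout needs one read per requested column regardless of row count. Real columnar formats add compression, encoding, and chunking on top of this basic idea.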
