Get the latest tech news
Memory Efficient Data Streaming to Parquet Files
How to efficiently stream data into Parquet files.
While 1 GB of ram isn’t a big deal for the average engineer’s laptop, it’s actually quite a bit of overhead for a connector that would otherwise only really care about a single row at a time. The core of this methodology involves “transposing” incoming streaming data from a row-oriented to a column-oriented structure using an intermediate scratch file that is stored on disk rather than in memory. One thing that’s obvious from this strategy is that we need to be able to interact with Parquet files at a relatively low level to have direct access to reading and writing column chunks and row groups.
Or read this on Hacker News