Fast columnar JSON decoding with arrow-rs
JSON is the most common serialization format used in streaming pipelines, so it pays to be able to deserialize it fast. This post covers in detail how the arrow-json library works to perform very efficient columnar JSON decoding, and the additions we've made for streaming use cases.
In general, operations are much more efficient if we can perform them along the grain of columns; in other words, we want to decode all of the ips, then all of the identities, then all of the user_ids, as this allows us to avoid repeatedly downcasting (interpreting as a concrete type) our generically-typed arrays.

To start decoding, we push a Value token onto the parse stack, which indicates we're looking for a valid JSON Value: one of a null, true, or false literal, a string, a number, an array, or an object. These are all distinguishable by looking at a single character.

Because we've identified all of the components of the document and we know the schema, we can determine up-front the tape indices of all of the data that will go into a particular array. This means we can process them all at once, in a tight, efficient loop, and only need to downcast the array a single time.
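The two ideas above, single-character dispatch and column-at-a-time decoding from a tape, can be illustrated with a small standard-library-only sketch. This is not arrow-json's actual code: `ValueKind`, `TapeElement`, `classify`, `field_positions`, and `decode_i64_column` are hypothetical names invented for this example, the tape owns its strings for simplicity, and only flat (non-nested) objects are handled.

```rust
/// Kind of JSON value, decided from its first non-whitespace byte.
/// (Hypothetical type for illustration; not part of arrow-json.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ValueKind { Null, True, False, String, Number, Array, Object }

/// Every JSON value kind is distinguishable from a single character.
pub fn classify(first: u8) -> Option<ValueKind> {
    match first {
        b'n' => Some(ValueKind::Null),
        b't' => Some(ValueKind::True),
        b'f' => Some(ValueKind::False),
        b'"' => Some(ValueKind::String),
        b'-' | b'0'..=b'9' => Some(ValueKind::Number),
        b'[' => Some(ValueKind::Array),
        b'{' => Some(ValueKind::Object),
        _ => None, // not the start of a valid JSON value
    }
}

/// Simplified tape entry (an assumption of this sketch): objects record the
/// index of their matching delimiter; scalars keep their text undecoded.
#[derive(Debug, Clone, PartialEq)]
pub enum TapeElement {
    StartObject(usize), // tape index of the matching EndObject
    EndObject(usize),   // tape index of the matching StartObject
    String(String),     // field name or string value
    Number(String),     // number text, parsed later per column
    Null,
}

/// For each top-level row object, find the tape index of `field`'s value.
/// Assumes flat objects, so keys and values strictly alternate.
pub fn field_positions(tape: &[TapeElement], field: &str) -> Vec<Option<usize>> {
    let mut positions = Vec::new();
    let mut i = 0;
    while i < tape.len() {
        if let TapeElement::StartObject(end) = &tape[i] {
            let end = *end;
            let mut found = None;
            let mut j = i + 1;
            while j < end {
                if let TapeElement::String(name) = &tape[j] {
                    if name.as_str() == field {
                        found = Some(j + 1); // the value follows its key
                    }
                }
                j += 2; // skip to the next key
            }
            positions.push(found);
            i = end + 1; // jump to the next row
        } else {
            i += 1;
        }
    }
    positions
}

/// Decode one whole column in a single tight loop. The output type is
/// chosen once, up front, rather than once per value.
pub fn decode_i64_column(
    tape: &[TapeElement],
    positions: &[Option<usize>],
) -> Vec<Option<i64>> {
    positions
        .iter()
        .map(|&pos| match pos.map(|p| &tape[p]) {
            Some(TapeElement::Number(text)) => text.parse::<i64>().ok(),
            _ => None, // missing field, null, or wrong type
        })
        .collect()
}
```

In this sketch, a caller would run `field_positions` once per schema field and then hand each position list to the decoder for that field's type, so the per-row work reduces to an index lookup and a parse. Missing fields and type mismatches both become nulls here, which is a simplification; a real decoder would likely report a schema error for the latter.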