From string to AST: parsing (2019)


Whether you are dealing with data in the form of CSV or JSON, a full-blown programming language like C, JavaScript, or Scala, or a query language like SQL, you always transform some sequence of characters (or binary values) into a structured representation. What you do with that representation depends on your domain and business goals, and is quite often the core value of whatever you are doing. With a plethora of tools doing the parsing for us (including the error handling), we might easily overlook how complex and interesting a process it is.

First of all, most input formats that we handle follow some formal definition, which tells us, for example, how key-value pairs are organized (JSON), how column names and values are separated (CSV), or how projections and conditions are expressed (SQL). If you are interested in the process of implementing regular expressions and building finite state machines from the regexp format, I recommend getting a book like Compilers: Principles, Techniques, and Tools by Aho, Lam, Sethi, and Ullman. It should also explain why certain languages have weird rules regarding class/method/function/variable names: since tokenization takes place at the very beginning, it has to reliably classify each piece of code unambiguously as a terminal symbol. Some tools alternatively allow you to use regular expressions directly in a parser-defining syntax.
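To make the tokenization step more concrete, here is a minimal sketch of a regex-based tokenizer in Python. The token classes (`NUMBER`, `IDENT`, `OP`, `SKIP`) and the tiny arithmetic-like language are assumptions made up for illustration; they do not come from the article or from any particular parser library.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str  # name of the terminal symbol, e.g. "NUMBER"
    text: str  # the exact characters that were matched


# Hypothetical token classes for a tiny arithmetic-like language.
# Order matters: the first alternative that matches at a position wins.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP", r"[-+*/=()]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(source: str) -> Iterator[Token]:
    """Yield tokens left to right, raising SyntaxError on unknown input."""
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(
                f"unexpected character {source[pos]!r} at offset {pos}"
            )
        pos = match.end()
        # Whitespace is recognized but not passed on to the parser.
        if match.lastgroup != "SKIP":
            yield Token(match.lastgroup, match.group())
```

With these rules, `list(tokenize("1x"))` produces a `NUMBER` token for `1` followed by an `IDENT` token for `x`, never a single name. This is one way to see why many languages forbid identifiers that start with a digit: the tokenizer has to decide what each piece is before any parsing happens.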
