Get the latest tech news
From string to AST: parsing (2019)
Whether you have to do with data in form of CSV, JSON or a full-blooded programming language like C, JavaScript, Scala, or maybe a query language like SQL, you always transform some sequence of characters (or binary values) into a structured representation. Whatever you’ll do with that representation depends on your domain and business goals, and is quite often the core value of whatever you are doing. With a plethora of tools doing the parsing for us (including the error-handling), we might easily overlook how complex and interesting process it is.
First of all, most input formats that we handle follow some formal definition, telling e.g. how key-values are organized (JSON), how do you separate column names/values (CSV), how do you express projections and conditions (SQL). If you are interested about the process of implementing regular expressions and building finite state machines out of the regexp format I recommend getting a book like Compilers: Principles, Techniques, and Tools by Aho, Lam, Sethi, and Ullman. (It should also explain why certain languages have weird rules regarding class/method/function/variable names - since tokenization takes place in the very beginning, it has to reliable classify each piece of code unambiguously as a terminal symbol), alternatively allow you to use regular expressions directly in a parser-defining syntax.
Or read this on Hacker News