Roam Parser Writeup

How to use

Click on “Load V1 Syntax” to load the default syntax that I designed for the parser.

You can change the syntax (EBNF) and/or the message to be parsed. When you click “Parse” the syntax is used to build a parser to parse the message. The result is displayed up top as a serialized parse tree (along with an auxillary debug tree). There is also an attempt to render the parse tree as html. The total time required to parse the message is also shown.

A few test cases are provided as “Test Messages”. You can select them from the dropdown menu and click “Parse” to view their parse trees. Alternatively, click “Test All Cases” to test your syntax against all provided test cases. Not all tests pass Nested refs doesn't render properly for now since I haven't written a function to transform the raw tree into html properly for this case. .

The provided syntax/grammar is by no means complete but demonstrates the simpler, declarative nature of using parser generators. If full-backwards compatibility with existing formatted documents is not an issue, I would start off with a well-defined subset of Markdown like StrictMark. This will doubtless handle many more cases that I did not have the time to think of or test for.

Design Notes

Why a parser generator?

This is implemented using a parser generator (Instaparse) in Clojurescript. A parser generator has a few advantages:

It’s easier to understand when the grammar is declared declaratively. A lot of complexity in parsing Markdown comes from ambiguous constructs.
Also easier to maintain since there is a lot less code. For comparison, the hand-written CommonMark C parser is 10KLoC.
Speed (hand-written parsers are notoriously difficult to make fast) though there are many serious programming languages with hand-rolled parsers (C, C++ in GCC and Clang, Golang) .

One advantage of a hand-rolled parser is greater control over error codes and failure modes. This isn’t a big issue when parsing Roam syntax since input that is not parsed correctly can be corrected by the user iteratively.

Benchmarking

I did some basic benchmarking on a test input with 100 lines of 100 characters each. Each line has one ref and one bolded section of text. The parser took 220ms for a complete parse + some text replacements. With 100 lines of 100 characters of pure text (no formatting/refs etc.) it took 50ms.

It is unclear if this approach will continue scaling well in terms of speed when the rest of the grammar is implemented. While implementing grammars, I ran into several ways of defining grammars which produced similar output but with wildly varying runtimes. This is an area that will require more thought if we want to keep the parse time low.

Extensions

Complex tasks like Latex, Insect and code block rendering are best done by their own separate modules. One idea is to wrappers around these modules for use with insta/transform. Alternatively, these blocks could be parsed in separate passes.

Process Notes

The result, to my biased eye, looks relatively simple. I spent about 50% of my time on this (est. 10 hours in many small chunks) learning re-frame, reagent, and instaparse. I spent the remainder designing the EBNF grammar and thinking of edge cases. I am using Instaparse to do the heavy lifting. I use insta/transform to replace nodes (e.g. do computations for insect blocks, though I actually just do a js/eval and do not call the actual insect library).